ethernet switches (more ethernet)

outline
    architectural considerations
        repeater (hub) - layer 1
        bridge/switch - layer 2
        router - layer 3
    bridges
    STP
    vlans

------------------------------------------------------

1. network design

architectural considerations
    repeater (hub) - layer 1
    bridge/switch - layer 2
    router - layer 3

hw level considerations
    chassis-based or fixed ports?
    number and kind of ports?
        note switch to switch interconnects vs switch to router/host
    if a big 'un, does it have power redundancy?
        number of slots?
        cpu card and cpu card redundancy?
    how big rack-wise?
    how hungry power-wise?
    upstream aggregation done how? bigger pipe? channel-bonding?

2. bridge 101

broadcast domain
    where broadcast can go
collision domain
    where collisions can occur
    collisions create RUNTS, aka "shrapnel" (at least by me)
    GIANTS may be created as well (or by other problems):
        1 MTU packet glued to another MTU sized packet.

classical repeaters extend the collision domain
switches MAY or MAY not extend the collision domain
    if store & forward, may not
    cut-thru could
classical bridges are store and forward devices, a switch may NOT be (cut-thru).

Cisco recommended rule of thumb for hosts per lan
    10BASE5 - 100
    10BASE2 - 30
So: how many hosts in a broadcast domain? 1 may be too many :->

80/20 rule for bridges
    80% of traffic on local segment, 20% crosses over.
    He's dead Jim ... (this doesn't apply as well as it used to)

Switches
    Cisco purchased Kalpana in 1994.
    purchased Grand Junction Networks in 95 (1900s)
    hint: cisco switch CLI may not be same as router CLI

    switch is multiport bridge.
    simultaneous thruputs, therefore bus/backplane max speed is important

learning or transparent bridging algorithm
    might start here:
    read a packet
    learn NEW src address or refresh aging timer
        if NEW, add to bridging table
        if not NEW, update aging timer
    acc. to dst (note: unicast may or may not be in bridging table)
        multicast/broadcast or unknown unicast (not in table)?
            if so, flood
        are src/dst on same i/f?
            if so, drop
            if not, forward to other i/f

in 802.1d we have 5 processes
    learning
    flooding
    filtering
    forwarding
    aging

learning
    learn acc. to above algorithm
    what happens if:
        A pings B, and B does not exist ...
        A pings B, and A and B are on the same 10BASE broadcast domain
            (single upstream switch port)
        A pings B, and A and B are on different ports

flooding
    if dst B is broadcast, or B is unicast and NOT in the bridge table,
    flood it ... out all ports but input port

regarding cut-thru vs store/forward.
    It is possible that a 3rd mode MAY exist ... "fragment-free".
    The basic idea is that you drop a packet if there is an error in the 1st 64 bytes.
    cut-thru only looks at DA, 1st 6 bytes.
    2900s, 5500s do store and forward, may have "adaptive cut-thru"
    1900 does fragment-free.

    question: if you believe your cabling is causing hw level errors (bit rot ...),
    would you be better off with cut-thru or store and forward?
    note: FCS or framing errors may indicate hw level signal problems.

----------------------------------

VLAN notes:

with traditional bridges, we had a pretty good idea where the broadcast domain ENDED ... now we may not ...
a VLAN could easily cross buildings, and go over a wide area

study question: why do we need a shared broadcast domain for devices to communicate "directly" on an ethernet?

802.1Q: IEEE vlans
    Cisco came up with 1 proprietary way to do VLANS (ISL), and so did other vendors. IEEE standard emerged.
    can have shared vlan (svlan) or independent vlan (ivlan)
    svlan: mac address can only appear once in entire switch vlan tables; i.e., can only appear once in set of vlans.
    ivlan: not the case
    (at this point we consider Sun hosts: two MACs (interfaces), one mac address.)
    another failure possibility: router link bridges some protocols and routes others.
        Question is: what does router do with node's MAC address?
        If it leaves it alone (bridge), it might seem to a switch that the host moved!
    802.1Q only supports 1 spanning tree for a bunch of vlans. This is not good ...
        Better: 1 vlan, 1 spanning tree.

what are they for:
    physical link drives which network. not anymore ...
        You may not need to move a user, as opposed to reconfigure 1 to N switches.
    security: in the sense of limiting arp broadcast spoofing, separate vlans may have a use.
    security: you can't even capture somebody else's broadcast traffic
    fault-tolerance: limitation of broadcast traffic. appletalk makes a lot of it. so does usoft.
        you can contain contagion.
        broadcasts are useful and evil. evil in the sense that they CANNOT be filtered out by NICs,
        and a layer 3 decision is typically necessary to determine a packet is not wanted.

what they are not for:
    cross-campus VLANS. why? (broadcast domains and spanning-trees)

protocols exist for distributed switch management of vlans
    assume you have VLAN 51 ... do you want it everywhere (by magic), or do you want to manage
    from one point (as opposed to logging in to all those switches), and control where it goes.

vlan/STP interaction:
    if 1 vlan, 1 spanning tree, and vlan 51 goes everywhere by default, a STP reset can upset it.
    This is much worse if there are multiple vlans, and only 1 spanning tree.

----------------------------------

STP notes:

we'll assume 1 vlan, 1 Span Tree until we know better.

"In fact, STP often accounts for more than 50% of the configuration, troubleshooting, and maintenance headaches in real-world campus networks" ... Cisco LAN book.

why STP?
    we want redundancy, and ease of hook-up
    and we do not want a loop broadcast meltdown

consider:

    switch 1
     |  |        two links, same broadcast domain
     |  | --- host 1
    switch 2

    host 1 sends one broadcast ping (or a packet to an unknown unicast dst)

why loops?
    1. oops
    2. redundancy
    3. load-balancing (may I have more please)

STP is a layer 2 protocol that gives us a loop-free acyclic single broadcast domain.
It uses flooded broadcast "hellos" (multicast actually) combined with a finite-state machine
that terminates with a port either forwarding or blocking.

important point: layer 2 loops are more dangerous than layer 3 loops. why?
    when does it stop:
        1. the end of time
        2. you power-cycle the switch or break the link
    exponential growth in broadcasts may occur.
    "I have witnessed a single ARP filling two OC-12 ATM links for 45 minutes".... "this is bad".
    another flaw: unicast pkt to no known dst. bridge table entry for sender will flop back and forth.
    it won't work until it times out.

two key spanning-tree ideas:
    0. lowest MAC always wins.
    1. bridge id.
    2. path cost

bridge id
    8 bytes, 2 bytes of bridge priority + 6 bytes of MAC
    bridge priority is not the same as port priority.
    default bridge priority is 32768.

path cost
    cost was 1000 mbits divided by bandwidth in mbits.
    bridges use link speed to decide who is closer/farther away from the "root" (the tree root).
    however they never considered gigabit anything.
    E.g., 10BASE link has a cost of 100 (1000/10). 100BASE would be 10.
    One option is to use 1 for >= gigabit, but that is flawed.
    IEEE decided to make cost non-linear.

        speed       stp cost
        4 mbps      250
        10 mbps     100
        100 mbps    19
        1g          4
        10g         2

    this will last for awhile. (lookup table to implement)
    Values were chosen so old and new schemes will interoperate.
    -> lower costs are better

how it works:
    4 step decision process is used to tie break
        1. lowest root BID
        2. lowest path cost to root bridge
        3. lowest sender BID
        4. lowest port id
    keep a BPDU for each port.
    choose the best one as we converge over time to a 'winner' (a root),
    considering all BPDUs received on that port, including the one we send.
    note: we start out promoting self as root, but stop doing that if we get a better root
    (however, we always send something on a port that is not non-designated).
    we restart if we don't hear for 20 seconds (i.e., 10 missed hellos).

protocol: 3 steps
    1. elect root bridge
    2. elect root ports
    3. elect designated ports

1. elect root bridge
    aka root war
    assume you are the root
    lowest BID wins (with default priorities, the mac address decides)
        bid = priority.MAC
    BPDUs are sent every 2 seconds. they have the bid in them.
    topology BPDU:
        root BID | root path cost | sender BID | sender port id
    forwarding system:
        puts root BID in acc. to who it thinks is root
        sender BID is always self
    note: a new bridge with a lower BID will cause the STP algorithm to be recalculated.
    root path cost is cumulative (2 100BASE == 19 + 19)

2. elect root ports
    every non-root bridge will select one root port.
    the root port is the one that is closest to the root.
    consider

          A        the root
        B   C

    C will know BPDUs sent to it from B are not as good as those from A
    because of 2X the path cost; it will therefore ignore them.
    The C to A port will be the root port.

3. elect designated ports
    each segment in the bridged network can have only one designated port
    e.g., above B and C connect to the same segment. One of them must drop out.
    Assume that B has the designated port, and is the designated bridge for that segment.
    path cost is chosen 1st (ok B has better cost, I'm C ... I'll drop out).
    if a tie, then the determination process is used:
        lowest root BID, then lowest path cost, then lowest sender BID, then lowest port ID

result:
    root and designated ports forward
    non-designated ports do not forward

----------------------

convergence: all ports are state == { blocking, forwarding }

There exists a state machine, because things may change:
    reboot
    rewire
    port shutdown
    backhoe
    idiot in the wiring closet

----------------------

state
    forwarding - send/recv data
    learning   - building bridging table
    listening  - building active topo, we believe the tree does not exist (e.g., new root, etc.),
                 send/recv BPDUs only
    blocking   - receives BPDUs, but does not do anything else
    disabled   - administratively down

roughly:
    blocking -> listening -> learning -> forwarding (or not)
                 15 sec       15 sec
    + 20 seconds (max age)
    barring topo change notification that you are toast and must start over

ports start out as blocking, and work their way up to forwarding (or not).
    e.g., we are blocked, we hear a NEW root bpdu, we go to listening

listening state:
    is where previous convergence algorithm occurs
    if you are in listen state for 15 seconds and are not non-designated,
    you may go to learning state.

learning state:
    another 15 seconds here please. Here you try and build a bridge table
    in order to be efficient. this is an anti-flooding step.
    then may go to forwarding.

so if we fail, how long can it take?
    30 to 50 seconds ... 50 because you may have to wait 20 seconds in addition
    for the "root" BPDU to timeout, or you may learn that a new root war is taking place
    or know that your own port failed.

STP timers can be modified only from the root bridge as timer fields exist in the BPDU.
Modifying them at non-root bridges may have no effect.

timers:
    max age: 20 seconds (time before we start all over again)
    hello:   2 sec
    forward: 15 sec (state transition time)

--------------

BPDU packet format:

there are two kinds of packet
    hello
    topo change notice (shriek! start over!!!)

in 802.3 wrapper; i.e., ssap/dsap/unnumbered frame (ui), then data

data in 35 bytes:
----------------
    protocol id = 0
    version = 0
    type - hello or topo change notice
        0 - config
        0x80 - topo change
    flags - may have import
    root bid
    root path cost
    sender bid
    port id, must be unique per sender
    message age
    max age
    hello
    forward delay

topo change packet ends at type field.
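
The field list above can be sketched as a small parser. This is only an illustration, not any vendor's code: the function names (parse_bpdu, build_config_bpdu, split_bid) are made up for these notes, the field widths are the usual 802.1D ones that add up to the 35 bytes quoted above, and the 1/256-second timer unit is taken from 802.1D rather than from these notes. The input is assumed to be the bytes that follow the ssap/dsap/ui header.

```python
import struct

# Field widths are an assumption that matches the 35-byte total in the notes:
# proto(2) ver(1) type(1) flags(1) root bid(8) root cost(4) sender bid(8)
# port id(2) message age(2) max age(2) hello(2) forward delay(2)
CONFIG_FMT = ">HBBB8sI8sHHHHH"   # ">" = network byte order, no padding
CONFIG_LEN = struct.calcsize(CONFIG_FMT)

TYPE_CONFIG = 0x00
TYPE_TCN = 0x80


def split_bid(bid):
    """Split an 8-byte bridge id into (priority, MAC string)."""
    priority = int.from_bytes(bid[:2], "big")
    mac = ":".join(f"{b:02x}" for b in bid[2:])
    return priority, mac


def parse_bpdu(data):
    """Parse the bytes after the LLC header into a dict (hypothetical helper)."""
    if len(data) < 4:
        raise ValueError("too short for any BPDU")
    proto, version, bpdu_type = struct.unpack(">HBB", data[:4])
    out = {"protocol_id": proto, "version": version, "type": bpdu_type}
    if bpdu_type == TYPE_TCN:
        return out  # topo change packet ends at the type field
    if len(data) < CONFIG_LEN:
        raise ValueError("truncated config BPDU")
    (_, _, _, flags, root_bid, root_cost, sender_bid, port_id,
     msg_age, max_age, hello, fwd_delay) = struct.unpack(
        CONFIG_FMT, data[:CONFIG_LEN])
    out.update(
        flags=flags,
        root_bid=split_bid(root_bid),
        root_path_cost=root_cost,
        sender_bid=split_bid(sender_bid),
        port_id=port_id,
        # assumption from 802.1D: timers are carried in units of 1/256 second
        message_age=msg_age / 256,
        max_age=max_age / 256,
        hello=hello / 256,
        forward_delay=fwd_delay / 256,
    )
    return out


def build_config_bpdu(root_bid, root_cost, sender_bid, port_id,
                      message_age=0.0, max_age=20.0, hello=2.0,
                      fwd_delay=15.0, flags=0):
    """Pack a config BPDU using the default timers from the notes."""
    return struct.pack(
        CONFIG_FMT, 0, 0, TYPE_CONFIG, flags, root_bid, root_cost,
        sender_bid, port_id, int(message_age * 256), int(max_age * 256),
        int(hello * 256), int(fwd_delay * 256))
```

Building a packet with build_config_bpdu and feeding it back through parse_bpdu is a quick way to check the layout; a 4-byte packet with type 0x80 parses as a topo change notice with no further fields.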

--------------

why topo change?
    because bridging table timeout is typically 300 seconds (therefore you may be in for a long wait).
    topo change packet alerts other switches that we must start over,
    and this is also why listening state can be useful.

--------------

why worry about who is the root bridge?

consider:

    B C A D Z E F

It could be that Z will be the root, and packets from A to E, otherwise directly connected, always go thru Z.

meta-question: why is spanning tree "timeout" mechanism so conservative?
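
On the question of who ends up as root: the election and the 4 step tie-break from the STP notes boil down to "lowest tuple wins", which a short sketch can make concrete. Everything named here is hypothetical (BridgeId, Bpdu, elect_root, best_bpdu, and the MAC addresses given to bridges A..F and Z); the cost table is the non-linear one quoted earlier. It is a toy model of the comparison only, not a full STP implementation.

```python
from typing import NamedTuple

# Non-linear IEEE costs as quoted in the notes (link speed in mbps -> cost).
STP_COST = {4: 250, 10: 100, 100: 19, 1000: 4, 10000: 2}


class BridgeId(NamedTuple):
    priority: int   # 2 bytes, default 32768
    mac: int        # 6-byte MAC held as an integer


class Bpdu(NamedTuple):
    # Field order is the tie-break order, so plain tuple comparison works:
    # lowest root BID, lowest root path cost, lowest sender BID, lowest port id.
    root_bid: BridgeId
    root_path_cost: int
    sender_bid: BridgeId
    port_id: int


def elect_root(bridges):
    """Return the name of the bridge with the lowest BID."""
    return min(bridges, key=bridges.get)


def best_bpdu(bpdus):
    """Pick the best BPDU seen on a port: the lowest tuple wins."""
    return min(bpdus)


# Made-up MACs: Z happens to have the lowest one.
bridges = {
    "A": BridgeId(32768, 0x00000C00000A),
    "B": BridgeId(32768, 0x00000C00000B),
    "C": BridgeId(32768, 0x00000C00000C),
    "D": BridgeId(32768, 0x00000C00000D),
    "E": BridgeId(32768, 0x00000C00000E),
    "F": BridgeId(32768, 0x00000C00000F),
    "Z": BridgeId(32768, 0x00000C000001),
}

root_by_default = elect_root(bridges)        # "Z": all priorities equal, lowest MAC

# Lowering the priority on a bridge we prefer overrides the MAC lottery.
bridges["A"] = bridges["A"]._replace(priority=4096)
root_after_tuning = elect_root(bridges)      # "A"
```

With default priorities the root is whichever box has the lowest MAC, which may be the worst-placed one; setting a lower priority on the bridge you actually want is how the outcome is controlled. best_bpdu applies the same comparison to what one port hears, e.g. a BPDU straight from the root (cost 0) beats one relayed by a neighbour (cost 19).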