ethernet switches (more ethernet)

outline

architectural considerations
	repeater (hub) - layer 1
	bridge/switch - layer 2
	router - layer 3
	
bridges
STP
vlans

------------------------------------------------------
1. network design architectural considerations

	repeater (hub) - layer 1
	bridge/switch - layer 2
	router - layer 3

hw level considerations
	chassis-based or fixed ports?
	number and kind of ports?
	note switch to switch interconnects vs
		switch to router/host
	if a big 'un, does it have power redundancy?
	number of slots?
	cpu card and cpu card redundancy?
	how big rack-wise?
	how hungry power-wise?
	upstream aggregation done how?
		bigger pipe?
		channel-bonding?

2. bridge 101

broadcast domain
	where broadcast can go
collision domain
	where collisions can occur

collisions create RUNTS, aka "shrapnel" (at least by me)
GIANTS may be created as well (or by other problems)
	1 MTU packet glued to another MTU sized packet.

classical repeaters extend the collision domain

switches MAY or MAY not extend the collision domain
	if store & forward may not
	cut thru could

	classical bridges are store and forward devices,
	a switch may NOT be (cut-thru).

Cisco recommended rule of thumb for hosts per lan
	10BASE5 - 100
	10BASE2 - 30

So: how many hosts in a broadcast domain?
	1 may be too many :->

80/20 rule for bridges

	80% of traffic on local segment, 20% crosses over.
	He's dead Jim ... (this doesn't apply as well as it
	used to)

Switches

	Cisco purchased Kalpana in 1994.

	Cisco purchased Grand Junction Networks in 1995 (source of the
	Catalyst 1900s)

	hint: cisco switch CLI may not be same as router CLI

	switch is multiport bridge.
	simultaneous thruputs, therefore bus/backplane max speed is
		important

learning or transparent bridging
 
algorithm might start here:
	
	read a packet
	learn NEW src address or refresh aging timer
		if NEW, add to bridging table
		if not NEW, update aging timer
	look at dst
		note: unicast may or may not be in bridging table
	is dst multicast/broadcast or unknown unicast (not in table)?
	if so
		flood
	otherwise, are src/dst on same i/f?
	if so
		drop
	if not
		forward to other i/f
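
The algorithm above can be sketched in a few lines of Python. This is a minimal sketch, not any vendor's implementation: the class name LearningBridge, the port numbers, and the 300 second aging value are assumptions made for illustration.

```python
import time

BROADCAST = "ff:ff:ff:ff:ff:ff"
AGING_SECONDS = 300  # assumed aging time for table entries


class LearningBridge:
    def __init__(self, ports):
        self.ports = list(ports)
        self.table = {}  # MAC address -> (port, time last seen)

    @staticmethod
    def is_group(mac):
        # multicast/broadcast addresses have the low bit of the 1st octet set
        return int(mac.split(":")[0], 16) & 1 == 1

    def receive(self, in_port, src, dst, now=None):
        """Process one frame; return the list of ports it is sent out of."""
        now = time.monotonic() if now is None else now
        # learning: add a NEW src, or refresh the aging timer of a known one
        self.table[src] = (in_port, now)
        # aging: forget entries that have not been refreshed recently
        self.table = {mac: (port, seen)
                      for mac, (port, seen) in self.table.items()
                      if now - seen <= AGING_SECONDS}
        entry = self.table.get(dst)
        if self.is_group(dst) or entry is None:
            # flooding: out all ports but the input port
            return [p for p in self.ports if p != in_port]
        if entry[0] == in_port:
            return []  # filtering: src and dst are on the same i/f
        return [entry[0]]  # forwarding: one known output port
```

Feeding it "A pings B" on different ports shows the first frame flooded and the reply forwarded to a single port, since A was learned on the way in.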

in 802.1d we have 5 processes

	learning
	flooding
	filtering
	forwarding
	aging

learning

	learn acc. to above algorithm
	what happens if:
		A pings B, and B does not exist ...
		A pings B and A and B are on the same 10BASE
			broadcast domain (single upstream switch port)
		A pings B, and A and B are on different ports

flooding
	if dst B is broadcast/multicast, or is unicast and B is NOT in
		the bridge table,

	flood it ... out all ports but input port

regarding cut-thru vs store/forward.  It is possible
	that a 3rd mode MAY exist ...  "fragment-free".
	The basic idea is that you drop a packet if there is
	an error in the 1st 64 bytes.

	cut-thru only looks at DA, 1st 6 bytes.
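
A rough way to compare the three modes is to ask how many bytes of the frame must arrive before the switch can start sending it out. The sketch below assumes the byte counts given above (6 for cut-thru, 64 for fragment-free, the whole frame for store and forward) and ignores the preamble and any internal switch latency; the function names are made up for illustration.

```python
def bytes_before_forwarding(mode, frame_len):
    """Bytes that must be received before forwarding can begin."""
    if mode == "cut-thru":
        return 6                      # destination address only
    if mode == "fragment-free":
        return min(64, frame_len)     # long enough to rule out a runt
    if mode == "store-and-forward":
        return frame_len              # whole frame, so the FCS can be checked
    raise ValueError("unknown mode: " + mode)


def wait_microseconds(mode, frame_len, mbps):
    """Time spent receiving those bytes at the given link speed."""
    return bytes_before_forwarding(mode, frame_len) * 8 / mbps


for mode in ("cut-thru", "fragment-free", "store-and-forward"):
    print(mode, wait_microseconds(mode, 1518, 10), "usec")
```

For a 1518 byte frame on a 10 mbit link the waits come out at 4.8, 51.2 and 1214.4 microseconds.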

2900s, 5500s do store and forward, may have "adaptive cut-thru"
1900, does fragment-free.

question:  if you believe your cabling is causing hw level errors
	(bit rot ...),  would you be better off with cut-thru or
	store and forward?  note: FCS or framing errors may indicate
	hw level signal problems.

----------------------------------
VLAN notes:

	with traditional bridges, we had a pretty good idea
	where the broadcast domain ENDED ...

	now we may not ...   a VLAN could easily cross buildings,
	and go over a wide area

study question:
	why do we need a shared broadcast domain for devices to
		communicate "directly" on an ethernet?

802.1Q: IEEE vlans

	Cisco came up with 1 proprietary way to do VLANS (ISL),
	and so did other vendors.  An IEEE standard emerged.

	can have shared vlan learning (svl) or independent vlan learning (ivl)
	svl: mac address can only appear once in entire switch
		vlan tables; i.e., can only appear once in set of vlans.
	ivl: not the case (at this point we consider Sun hosts:
		two MACs, one mac address).
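
The difference comes down to what the bridging table is keyed on. A minimal sketch, with made-up class names, port numbers and vlan ids: the shared table is keyed by MAC address alone, the independent one by (vlan, MAC address).

```python
class SharedTable:
    """One table for all vlans: a MAC address can appear only once."""

    def __init__(self):
        self.entries = {}

    def learn(self, vlan, mac, port):
        self.entries[mac] = port  # vlan is ignored, last sighting wins

    def lookup(self, vlan, mac):
        return self.entries.get(mac)


class IndependentTable:
    """One table per vlan: the same MAC address may appear in each vlan."""

    def __init__(self):
        self.entries = {}

    def learn(self, vlan, mac, port):
        self.entries[(vlan, mac)] = port

    def lookup(self, vlan, mac):
        return self.entries.get((vlan, mac))


# a host with two interfaces that share one MAC address,
# attached to vlan 10 on port 1 and vlan 20 on port 7
mac = "08:00:20:00:00:01"
for table in (SharedTable(), IndependentTable()):
    table.learn(10, mac, 1)
    table.learn(20, mac, 7)
    print(type(table).__name__, table.lookup(10, mac), table.lookup(20, mac))
```

With the shared table the second sighting overwrites the first, so vlan 10 traffic for that host would head for port 7.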

	another failure possibility: router link bridges some protocols
		and routes others.
		Question is: what does router do with node's MAC
		address.  If it leaves it alone (bridge), it might
		seem to a switch that the host moved!

	802.1Q only supports 1 spanning tree
		for a bunch of vlans.  This is not good ...
		Better: 1 vlan, 1 spanning tree.

what are they for:

	physical link used to drive which network a host is on.  not anymore ...
	You may not need to physically move a user; instead you reconfigure
	1 to N switches.

	security: in the sense of limiting arp broadcast spoofing,
		separate vlans may have a use.

	security: you can't even capture somebody else's broadcast
		traffic

	fault-tolerance: limitation of broadcast traffic.  appletalk
		makes a lot of it.  so does usoft.  you can contain
		contagion.

	broadcasts are useful and evil.  evil in the sense that
	they CANNOT be filtered out by NICs, and a layer 3 decision
	is typically necessary to determine a packet is not wanted.

what they are not for:

	cross-campus VLANS.  why?  (broadcast domains and spanning-trees)
	
protocols exist for distributed switch management of vlans

	assume you have VLAN 51 ...  do you want it everywhere (by magic),
	or do you want to manage from one point (as opposed to logging in to
	all those switches), and control where it goes.

	vlan/STP interaction:  if 1 vlan, 1 spanning tree, and vlan 51 goes
	everywhere by default, a STP reset can upset it.  This is much
	worse if there are multiple vlans, and only 1 spanning tree.
	
----------------------------------

STP notes:

we'll assume 1 vlan, 1 Span Tree until we know better.

"In fact, STP often accounts for more than 50% of the configuration,
troubleshooting, and maintenance headaches in real-world campus
networks" ...  Cisco LAN book.
 
why STP?

	we want redundancy, and ease of hook-up and we do not
	want a loop broadcast meltdown

	consider:

	switch 1

	|      |	two links, same broadcast domain    
	|      | --- host 1	

	switch 2

host 1, sends one broadcast ping (or a packet to an unknown unicast dst)

why loops?

	1. oops
	2. redundancy
	3. load-balancing (may I have more please)

STP is a layer 2 protocol that gives us a loop-free acyclic
single broadcast domain.  

It uses flooded "hellos" (multicast, not broadcast, actually) combined
with a finite-state machine that terminates with each port either
forwarding or blocking.

important point: layer 2 loops are more dangerous than layer 3 loops.
	why?

when does it stop:

	1. the end of time
	2. you power-cycle the switch
	or break the link

exponential growth in broadcasts may occur.
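
A toy count of how one broadcast multiplies between two switches with no spanning tree. This is a sketch under simplifying assumptions: the two switches are joined only by N parallel links, every frame that arrives on one link is flooded out all the others, and host ports are ignored. The function name is made up.

```python
def frames_in_flight(parallel_links, hops):
    """Copies of one broadcast on the inter-switch links after each hop."""
    # the first switch floods the host's broadcast: one copy per link
    frames = [1] * parallel_links
    history = [sum(frames)]
    for _ in range(hops):
        total = sum(frames)
        # a copy arriving on link i is flooded out every link except i,
        # so link i next carries all the copies that arrived elsewhere
        frames = [total - frames[i] for i in range(parallel_links)]
        history.append(sum(frames))
    return history


print(frames_in_flight(2, 5))
print(frames_in_flight(3, 5))
```

With two links the two copies circulate forever without growing; with three links the count doubles every hop.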

"I have witnessed a single ARP filling two OC-12 ATM links
for 45 minutes".... "this is bad".

another flaw:
	unicast pkt to no known dst.  bridge table entry for sender
	will flop back and forth.  it won't work until it times out.

key spanning-tree ideas:

	0. lowest always wins (lowest MAC, if priorities are equal).
	1. bridge id.  
	2. path cost

bridge id
	8 bytes,  2 bytes of bridge priority + 6 bytes of MAC

	bridge priority is not the same as port priority.
	default bridge priority is 32768.
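
A small sketch of the bridge id as an 8 byte value. The helper name make_bid and the MAC addresses are assumptions for illustration. Since priority comes first, comparing the bytes compares priority and then MAC.

```python
DEFAULT_PRIORITY = 32768


def make_bid(priority, mac):
    """Build the 8 byte bridge id: 2 bytes of priority + 6 bytes of MAC."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be 6 bytes")
    if not 0 <= priority <= 0xFFFF:
        raise ValueError("priority must fit in 2 bytes")
    return priority.to_bytes(2, "big") + mac_bytes


a = make_bid(DEFAULT_PRIORITY, "00:00:0c:aa:aa:aa")
b = make_bid(DEFAULT_PRIORITY, "00:00:0c:bb:bb:bb")
c = make_bid(4096, "00:00:0c:ff:ff:ff")

print(a < b)  # same priority: the lower MAC is the lower (better) bid
print(c < a)  # a lowered priority beats a lower MAC
```

Lowering the priority on one bridge is how you pick the root on purpose instead of leaving it to whichever MAC happens to be lowest.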

path cost

	cost was 1000 mbits divided by bandwidth in mbits.

	bridges use link speed to decide who is closer/farther 
	away from the "root".  (the tree root).
	however they never considered gigabit anything.
	E.g., 10BASE link has a cost of 100,  1000/10.
	100BASE, would be 10.  One option is to use 1
	for >= gigabit, but that is flawed.

	IEEE decided to make cost non-linear.

	speed		stp cost
	4 mbps		250
	10 mbps		100
	100 mbps	19
	1 gbps		4
	10 gbps		2

this will last for awhile.  (lookup table to implement)
Values were chosen so old and new schemes will interoperate.
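
The two cost schemes side by side, as a sketch. The table values are the ones listed above; the function names, and clamping the old formula at 1 for gigabit and faster, are assumptions made to show the flaw.

```python
# speed in mbps -> non-linear stp cost, from the table above
STP_COST = {4: 250, 10: 100, 100: 19, 1000: 4, 10000: 2}


def old_cost(mbps):
    """Original linear scheme: 1000 / bandwidth, clamped at 1."""
    return max(1, 1000 // mbps)


def new_cost(mbps):
    """Revised scheme: a lookup table."""
    return STP_COST[mbps]


for speed in (4, 10, 100, 1000, 10000):
    print(speed, old_cost(speed), new_cost(speed))
```

Under the old scheme 1g and 10g both cost 1, so a bridge cannot prefer the faster link; the table keeps them apart.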

-> lower costs are better <--

how it works:

	4 step decision process is used to tie break

	1. lowest root BID
	2. lowest path cost to root bridge
	3. lowest sender bid
	4. lowest port id

keep a BPDU for each port.  choose the best one as we converge
over time to a 'winner' (a root), considering all BPDUs received
on that port, including the one we send.
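
The 4 step decision process maps neatly onto tuple comparison, since tuples compare field by field from the left. A minimal sketch: the Bpdu type and the (priority, MAC) pairs standing in for bridge ids are assumptions for illustration.

```python
from collections import namedtuple

# field order matches the 4 steps: root BID, path cost, sender BID, port id
Bpdu = namedtuple("Bpdu", "root_bid root_path_cost sender_bid port_id")


def best_bpdu(bpdus):
    """Lowest wins; later fields only matter when earlier ones tie."""
    return min(bpdus)


root = (32768, 0x0A)  # (priority, MAC) pair standing in for a bridge id

a = Bpdu(root, 19, (32768, 0x0B), 1)
b = Bpdu(root, 19, (32768, 0x0C), 1)           # loses to a at step 3
c = Bpdu(root, 4, (32768, 0x0D), 2)            # beats a at step 2
d = Bpdu((4096, 0x0F), 100, (32768, 0x0E), 3)  # beats everything at step 1

print(best_bpdu([a, b]))
print(best_bpdu([a, b, c]))
print(best_bpdu([a, b, c, d]))
```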

note: we start out promoting self as root, but
stop doing that if we hear of a better root (we still send BPDUs
on any port that is designated, however).  we restart if
we don't hear from the root for 20 seconds (max age; i.e., 10 missed
hellos at 2 seconds each).

protocol:

	3 steps
	1. elect root bridge
	2. elect root ports
	3. elect designated ports

1. elect root bridge aka root war

	assume you are the root

	lowest BID wins (in effect the lowest mac address, if priorities
	are left at the default)

	bid  = priority.MAC

	BPDUs are sent every 2 seconds.  they have the bid in them.

topology BPDU:
	root BID | root path cost | sender BID | sender port id 

	forwarding system: puts root BID in as acc. to who it thinks
		is root

	sender BID is always self

	note: a new bridge with a lower BID than the current root will
	cause the STP algorithm to be recalculated.  

	root path cost is cumulative (2 100BASE == 19 + 19)

2. elect root ports

	every non-root bridge will select one root port.
	the root port is the one that is closest to the root.


	consider

				A the root

		B				C

	C will know BPDUs sent to it from B are not as good
	as those from A because of 2X the path cost; it will therefore
	ignore them.  The C to A port will be the root port.
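
The triangle worked out from C's point of view, as a sketch. It assumes all three links are 100BASE (cost 19) and that A advertises a root path cost of 0. The function name and port numbers are made up, and ties are broken on local port number only, which is simpler than the real 4 step process.

```python
def elect_root_port(received):
    """received: {local_port: (advertised_root_path_cost, link_cost)}.

    Returns (root_port, root_path_cost) for a non-root bridge.
    """
    # cost via a port = what the neighbor advertises + cost of our link to it
    totals = {port: advertised + link
              for port, (advertised, link) in received.items()}
    root_port = min(totals, key=lambda port: (totals[port], port))
    return root_port, totals[root_port]


# bridge C: port 1 faces A (the root), port 2 faces B
c_received = {
    1: (0, 19),   # A is the root, so it advertises cost 0
    2: (19, 19),  # B is already 19 away from the root
}
print(elect_root_port(c_received))
```

Via A the cost is 19, via B it is 38, so port 1 wins.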

3. elect designated ports

	each segment in a rooted tree can have only one designated
	port

	e.g., above B and C connect to the same segment.  One of
	them must drop out.

	Assume that B has the designated port, and is the designated
	bridge for that segment.

	path cost is chosen 1st (ok B has better cost, I'm C ... I'll
	drop out).  if a tie, then determination process is used.

	lowest root bid < lowest path cost < lowest sender BID <
		lowest port ID

result:
	root and designated ports forward
	non-designated ports do not forward
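
The B versus C contest for the shared segment, as a sketch. It assumes both bridges already agree on the root (step 1 of the tie break), so only path cost, sender BID and port id are compared. The function names and the (priority, MAC) pairs standing in for bridge ids are assumptions for illustration.

```python
def elect_designated(candidates):
    """candidates: {bridge_name: (root_path_cost, bridge_id, port_id)}.

    Returns the name of the bridge whose port is designated on the segment.
    """
    return min(candidates, key=lambda name: candidates[name])


def port_forwards(role):
    """Root and designated ports forward; non-designated ports do not."""
    return role in ("root", "designated")


# B and C are both 19 away from the root, so the lower bridge id decides
segment = {
    "B": (19, (32768, 0x0B), 2),
    "C": (19, (32768, 0x0C), 2),
}
winner = elect_designated(segment)
for name in segment:
    role = "designated" if name == winner else "non-designated"
    print(name, role, "forwards" if port_forwards(role) else "blocks")
```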

----------------------
convergence:
	all ports are state == { blocking, forwarding }

	There exists a state machine, because things may change.

	reboot
	rewire
	port shutdown
	backhoe
	idiot in the wiring closet

----------------------

state

forwarding - send/recv data
learning - building bridging table
listening - building active topo, we believe the tree does
	not exist (e.g., new root, etc.), send/recv BPDUs only
blocking - receives BPDUs, but does not do anything else
disabled - administratively down


roughly:

	blocking -> listening -> learning -> forwarding (or not)
		     15 sec        15 sec

	+ 20 seconds barring topo change notification that
		you are toast and must start over
ports start out as blocking, and work their way up to forwarding
	(or not).

e.g., we are blocked, we hear a NEW root bpdu, we go to listening

listening state: is where previous convergence algorithm occurs

if you are in listen state for 15 seconds and are a root or designated
	port, you may go to learning state.

learning state: another 15 seconds here please.  Here you try and
	build a bridge table in order to be efficient.
	this is an anti-flooding step.

then may go to forwarding.
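
The walk through the states can be sketched as a function of time since the port left blocking. This follows the simplified description in these notes, not the full 802.1D state machine; the names State and state_after are made up.

```python
from enum import Enum

FORWARD_DELAY = 15  # seconds spent in each of listening and learning


class State(Enum):
    BLOCKING = "blocking"
    LISTENING = "listening"
    LEARNING = "learning"
    FORWARDING = "forwarding"


def state_after(seconds, role):
    """Port state a given number of seconds after leaving blocking.

    role is "root", "designated" or "non-designated".
    """
    if seconds < FORWARD_DELAY:
        return State.LISTENING      # root war and port elections happen here
    if role == "non-designated":
        return State.BLOCKING       # lost the election: back to blocking
    if seconds < 2 * FORWARD_DELAY:
        return State.LEARNING       # build the bridge table, forward nothing
    return State.FORWARDING


for t in (0, 15, 30):
    print(t, state_after(t, "designated").value,
          state_after(t, "non-designated").value)
```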

so if we fail, how long can it take.  30 to 50 seconds ...
50 because you may have to wait 20 seconds in addition for
the "root" BPDU to timeout, or you may learn that a new root
war is taking place or know that your own port failed.

STP timers can be modified only from the root bridge 
as timer fields exist in the BPDU.  Modifying them at non-root
bridges may have no effect.

timers:
max age: 20 seconds (time before we start all over again)
hello: 2 sec
forward: (state transition time) 15
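
The 30 to 50 second figure falls straight out of these three timers. A small sketch, using the default values above; the function name is made up.

```python
MAX_AGE = 20        # seconds before a stale root BPDU is thrown away
HELLO = 2           # seconds between BPDUs
FORWARD_DELAY = 15  # seconds in each of listening and learning


def convergence_seconds(wait_for_max_age):
    """Time for a blocked port to reach forwarding after a failure."""
    total = 2 * FORWARD_DELAY  # listening + learning
    if wait_for_max_age:
        total += MAX_AGE       # failure noticed only when BPDUs stop arriving
    return total


print(convergence_seconds(False))  # own port failed, noticed at once
print(convergence_seconds(True))   # had to wait for the root BPDU to time out
print(MAX_AGE // HELLO)            # hellos missed before max age expires
```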

--------------
BPDU packet format:
	there are two kinds of packet
	hello 
	topo change notice (shriek! start over!!!)

in 802.3 wrapper; i.e., ssap/dsap/unnumbered frame (ui)
	then data

data in 35 bytes:
----------------
protocol id = 0
version = 0
type - hello or topo change  notice
		0 - config
		0x80 - topo change
flags - may be important (topology change and topology change ack bits)
root bid
root path cost
sender bid
port id, must be unique per sender, 
message age
max age
hello
forward delay 

topo change packet ends at type field.
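
The field list above packs into exactly 35 bytes with Python's struct module. This is a sketch of the payload only (no 802.3 or LLC wrapper); the function names are made up. The timer fields are carried in units of 1/256 of a second, which is why the seconds are scaled before packing.

```python
import struct

# protocol id, version, type, flags, root bid, root path cost,
# sender bid, port id, message age, max age, hello, forward delay
CONFIG_FMT = ">HBBB8sI8sHHHHH"
TCN_FMT = ">HBB"  # topo change notice ends at the type field
TYPE_CONFIG = 0x00
TYPE_TCN = 0x80
FIELDS = ("protocol_id", "version", "type", "flags", "root_bid",
          "root_path_cost", "sender_bid", "port_id",
          "message_age", "max_age", "hello", "forward_delay")


def to_wire(seconds):
    return int(seconds * 256)  # timers are in 1/256 second units


def pack_config(root_bid, root_path_cost, sender_bid, port_id, flags=0,
                message_age=0, max_age=20, hello=2, forward_delay=15):
    return struct.pack(CONFIG_FMT, 0, 0, TYPE_CONFIG, flags,
                       root_bid, root_path_cost, sender_bid, port_id,
                       to_wire(message_age), to_wire(max_age),
                       to_wire(hello), to_wire(forward_delay))


def pack_tcn():
    return struct.pack(TCN_FMT, 0, 0, TYPE_TCN)


def unpack_config(data):
    return dict(zip(FIELDS, struct.unpack(CONFIG_FMT, data)))


bid = (32768).to_bytes(2, "big") + bytes.fromhex("00000caaaaaa")
bpdu = pack_config(bid, 19, bid, 0x8001)
print(len(bpdu), len(pack_tcn()))
print(unpack_config(bpdu)["root_path_cost"])
```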

--------------

why topo change?

	because bridging table timeout is typically 300 seconds;
	without it you may be in for a long wait

topo change packet alerts other switches that we must start over,
	and this is also why listening state can be useful.

--------------
why worry about who is the root bridge?

consider:

		B		C	
	A		D		Z
		E		F

It could be that Z will be the root, and packets from A to E,
otherwise directly connected, always go thru Z.

meta-question:
	why is spanning tree "timeout" mechanism so conservative?