IP routing - supplemental points

contents:
--------
ethernet address
arp - 
switches - 
routing table algorithms
other ip address semantics
routing table entries a la Cisco
CIDR notes
routing protocol comparison chart
end system discovery of router

-----------------------------------------------
ethernet addresses
	48 bits
	unicast - one node to another
			000102030405
	broadcast - one to many
			ffffffffffff
	multicast - one to many mapping
		01 prefix (low-order bit of the first byte is set)

	note: ethernet addresses have IP address mapping
		unicast - traditional IP address
		broadcast - two kinds
		multicast - 224.*.*.*

	broadcast/multicast are fundamental ideas
	often used for All Points Bulletin
		ARP
			arp request (since you don't know the other party's
				address)
		RIP
			every packet
		OSPF  etc ...  
			hello

	multicast can be viewed here as an optimization on broadcast
		since parties who do not care (IPX router) can be left out
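
The three kinds of ethernet address can be told apart by inspection.
A minimal sketch, assuming addresses are written as 12 hex digits
(optionally with ":" or "-" separators); classify_mac is a made-up name
for illustration:

```python
def classify_mac(mac: str) -> str:
    """Classify a 48-bit ethernet address given as 12 hex digits."""
    digits = mac.replace(":", "").replace("-", "").lower()
    if len(digits) != 12:
        raise ValueError("expected 48 bits (12 hex digits)")
    octets = bytes.fromhex(digits)
    if octets == b"\xff" * 6:
        return "broadcast"
    # Group bit: the low-order bit of the first byte marks multicast.
    if octets[0] & 0x01:
        return "multicast"
    return "unicast"
```

Note that broadcast is tested first, since all ones also has the group
bit set.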
-----------------------------------------------
arp - 

% ping X where X is on same link
when the IP layer wants to send a packet to the next hop,
	it has the IP dst, but needs the MAC address
map IP address to ethernet address courtesy of BROADCAST
same layer as IP; i.e., ethernet header contains ARP or IP

arp request/arp reply

request effectively 4-tuple (src IP, src MAC, dst IP, dst MAC=?)
unicast reply 

reply simply contains dst MAC to complete 4-tuple

note that sender typically caches response
listeners may cache broadcast
sender may send gratuitous arp at i/f boot where IP == self, MAC == self,
	with two purposes:
	1. learn if anyone else answers, and therefore learn if
	IP addresses are locally misconfigured (duplicate IP)
	2. rewrite any IP address/MAC address mappings so listeners
	can refresh a (possibly wrong) cached mapping of IP/MAC

note anyone can reply (proxy)
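
The request/reply exchange and the caching behavior can be sketched as
a toy simulation. This is only an illustration: the Host class, the
function names, and the "link" (a plain list of hosts standing in for
the broadcast domain) are all assumptions, not a real ARP stack.

```python
class Host:
    def __init__(self, ip, mac):
        self.ip, self.mac = ip, mac
        self.arp_cache = {}              # IP -> MAC

    def handle_request(self, src_ip, src_mac, dst_ip):
        # Listeners may cache the sender's mapping seen in the broadcast.
        self.arp_cache[src_ip] = src_mac
        # Only the owner of dst_ip replies here (a proxy could reply too).
        return self.mac if dst_ip == self.ip else None


def arp_resolve(sender, dst_ip, link):
    """Broadcast a request on link; return the list of MACs that replied."""
    if dst_ip in sender.arp_cache:       # cached from an earlier exchange
        return [sender.arp_cache[dst_ip]]
    replies = []
    for host in link:
        if host is not sender:
            mac = host.handle_request(sender.ip, sender.mac, dst_ip)
            if mac is not None:
                replies.append(mac)
    if replies:
        sender.arp_cache[dst_ip] = replies[0]
    return replies


def gratuitous_arp(sender, link):
    """Ask for our own IP; any reply means a duplicate address."""
    replies = [h.handle_request(sender.ip, sender.mac, sender.ip)
               for h in link if h is not sender]
    return any(r is not None for r in replies)
```

Nothing authenticates a reply, which is the lack of security noted in
the summary.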

overview/summary:
	1. IP to MAC address mapping
	2. some sneaky possibilities (including lack of security)
	3. broadcast is fundamental to learning about existence
	4. no idea of subnetting here but ARP + IP subnet is basis
	of link-layer routing
	(if subnet indicates same link, then use ARP, else send to
	a default router)
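
Point 4 above (same subnet means ARP directly, otherwise hand off to a
default router) can be sketched with the Python standard ipaddress
module. The function name and the example addresses are assumptions
made for illustration:

```python
import ipaddress

def first_hop(dst, iface_ip, netmask, default_router):
    """Decide whose MAC address we must ARP for."""
    iface = ipaddress.ip_interface(f"{iface_ip}/{netmask}")
    if ipaddress.ip_address(dst) in iface.network:
        return ("arp-for-destination", dst)       # same link
    return ("arp-for-router", default_router)     # off link
```

For example, with interface 129.95.1.2 and mask 255.255.255.0, a packet
to 129.95.1.77 is delivered directly, while a packet to 9.3.4.5 goes to
the default router.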
-----------------------------------------------
switches
	packet switch vs circuit switch

	packet switch - switch on IP dst; i.e., per packet,
		not per connection
		hence 2 packets sent to the same IP dst may take
			different paths

	circuit switch
		sets up "connection" which involves routing ... path across
			circuit switches

	once route is setup on circuit switch, packets flow according
		to internal mapping of
		(input port X, output port Y)

	notion of "fate sharing" is deemed bad here.

	if one *intermediate* switch fails, is end to end circuit hosed?
-----------------------------------------------
other IP routing algorithms:

1. classful:
	given: IP dst
		routing tuple == (IP dst, gateway, i/f, metric)

	if host-specific match
		send to next hop address
	else if IP dst subnet match and we have i/f on that subnet
		send to "network" (use arp)
	else if network match 
		send to next hop that leads to that network
	else no match
		if I have default route/s
		choose one and send to that gateway
		else
			drop packet
			if intermediate router
				send ICMP unreachable message
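
The classful decision order above can be sketched as follows. This is
a simplified illustration, not the BSD code: the table shapes (plain
dicts and a list), the function names, and the example addresses are
assumptions, and class D/E destinations are not handled.

```python
import ipaddress

def classful_network(ip):
    """Class A/B/C network containing ip, judged by the first octet."""
    first = int(ip.split(".")[0])
    prefix = 8 if first < 128 else 16 if first < 192 else 24
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)

def classful_lookup(dst, host_routes, interfaces, net_routes, defaults):
    """
    host_routes: {dst ip: gateway}
    interfaces:  {i/f name: "ip/prefixlen" of that i/f}
    net_routes:  {classful network such as "9.0.0.0/8": gateway}
    defaults:    list of default gateways
    Returns (action, target), or None meaning drop.
    """
    if dst in host_routes:                        # host-specific match
        return ("gateway", host_routes[dst])
    addr = ipaddress.ip_address(dst)
    for name, cidr in interfaces.items():         # subnet we are attached to
        if addr in ipaddress.ip_interface(cidr).network:
            return ("direct", name)               # use arp for dst itself
    net = str(classful_network(dst))              # network match
    if net in net_routes:
        return ("gateway", net_routes[net])
    if defaults:                                  # default route/s
        return ("gateway", defaults[0])
    return None    # drop; an intermediate router sends ICMP unreachable
```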

notes:
	1. typically (IP dst, route entry) cached to save cost
		of next lookup

	2. the classful algorithm's prioritization (host, then subnet,
		then network, then default) prefers the routing table DST
		with the most matching bits; this leads to the classless/CIDR
		point of view:
		more bits before fewer bits (default route is fewest bits)

	3. next hop gateway means of course IP dst in IP part NOT changed,
	we simply send to "mac" address in ethernet unicast fashion

2. classless

	assumption:
		routing table entries:
			(IP dst, subnet mask, gateway, metric, interface)

	host entry has mask with all ones.
	default has mask with all 0s.
	subnet has appropriate mask
			9.0.0.0  255.0.0.0 (/8)
			129.95.0.0  255.255.0.0 (/16)
			129.95.1.0  255.255.255.0 (/24)
			0.0.0.0   0.0.0.0 (/0) 

algorithm is simple:
	search routing table for LONGEST prefix match
	in parallel:
		(ip dst in packet) AND (each entry's subnet mask),
			and compare with ip dst in entry

example: try these with routing table above; i.e., which
	entry do we get if we assume the IP dst == the following
	9.3.4.5
	1.2.3.4
	129.95.1.3

	Of course this is simple.  It becomes more interesting
	in the presence of routing aggregation and more specific
	routes that "override" an aggregate.
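
A linear-scan sketch of longest prefix match, useful for checking the
exercise above by hand. Assumptions: the first table entry is taken to
be network 9.0.0.0 with mask 255.0.0.0, the /24 entry uses mask
255.255.255.0, and the gateway names G0..G3 are invented labels. A real
router does not scan linearly.

```python
import ipaddress

def to_int(dotted):
    return int(ipaddress.ip_address(dotted))

# (ip dst, subnet mask, gateway) -- gateway names are hypothetical
ROUTES = [
    ("9.0.0.0",    "255.0.0.0",     "G1"),
    ("129.95.0.0", "255.255.0.0",   "G2"),
    ("129.95.1.0", "255.255.255.0", "G3"),
    ("0.0.0.0",    "0.0.0.0",       "G0"),
]

def longest_prefix_match(dst, routes=ROUTES):
    """Return the matching entry whose mask has the most 1 bits."""
    d = to_int(dst)
    best, best_len = None, -1
    for net, mask, gw in routes:
        m = to_int(mask)
        if (d & m) == to_int(net):            # entry matches
            length = bin(m).count("1")        # number of mask bits
            if length > best_len:
                best, best_len = (net, mask, gw), length
    return best
```

The default entry matches everything but has the fewest bits, so it
only wins when nothing more specific matches.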

classless routing algorithm allows SUPERNETS and specific overrides
	of supernets to work

	CIDR - supernet means route aggregation acc. to power of 2.

	consider:
	200.1.*.*/16 (255.255.0.0) can stand for
		200.1.0  to 200.1.255

	so we can have TWO routing table entries:

		To 200.1.0.0/255.255.0.0  via G1
		To 200.1.2.3/255.255.255.255  

	therefore:
	supernetting (part of CIDR) can allow less specific routes and give
		us route aggregation as well (fewer routing table entries)

	supernetting means: mask contains FEWER bits than the classful (natural) mask
		result is an aggregate typically of many class C networks

	another example:  an ISP has been given the CIDR block:
		192.24.0.0/13
			192.24.0.0 to 192.31.255.255

	how many addresses do they have?
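
	A /13 leaves 32 - 13 = 19 host bits, so 2^19 addresses. A quick
	check with the Python standard ipaddress module, assuming the block
	meant is 192.24.0.0/13 (the one matching the 192.24 to 192.31 range):

```python
import ipaddress

block = ipaddress.ip_network("192.24.0.0/13")

num_addresses = block.num_addresses              # 2 ** (32 - 13)
first, last = block[0], block[-1]                # ends of the range
class_c_nets = 2 ** (24 - block.prefixlen)       # /24s inside the /13

print(num_addresses, first, last, class_c_nets)
```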

BSD trie algorithm:
	See Wright/Stevens, TCP/IP Illustrated, vol.2 pp 559-
		1995 Addison/Wesley ISBN 0-201-63354-X

	in BSD 4.2/3 we had a classful routing algorithm with two tables
	1. host addresses
	2. net addresses

	used hash/linear search to go through both in order.
	ignore subnet masks (those were bound to interfaces)

	hierarchy implied
	problem with this host?

	le0:9.1.2.3/255.0.0.0    ...    le1:192.1.2.3/255.255.255.0

in 4.4 BSD improved using trie or Patricia tree structure.
	algorithm is address-family independent (can be used with OSI)

basic ideas:
	routing table organized as binary tree.
	supports classless lookup, each entry has associated net mask.

	-> entry matches search key if search key ANDed with mask of
		entry equals the entry itself

	internal structure:
			default route has mask: 0.0.0.0  (ip dst & 0 == 0 ... hence match)
			host entries have implicit all 1's match

	tree = nodes + leaves 

	test bits on/off left to right, 

					node (internal)  1st bit

				| (off)    |  (on)
				V          V

	backtracking is used when host leaves are found and do not actually
			match the ip dst key

	very efficient and allows large routing tables

	Sklower 1991 showed that the radix tree was 4 times faster than
		the previous hash mechanism

	Sklower, K. 1991 "A Tree-Based Packet Routing Table for Berkeley
		Unix",  USENIX, Dallas Texas.
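
A much simplified sketch of the idea: a plain binary trie, one bit per
level. It is NOT the BSD radix/Patricia code. There is no path
compression, and instead of backtracking it remembers the last route
seen on the way down, which gives the same longest-match answer. Class
and method names are invented for illustration.

```python
import ipaddress

class Node:
    def __init__(self):
        self.child = [None, None]   # [bit off, bit on]
        self.route = None           # set if a prefix ends at this node

class BinaryTrie:
    def __init__(self):
        self.root = Node()

    def insert(self, cidr, route):
        net = ipaddress.ip_network(cidr)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):          # test bits left to right
            b = (bits >> (31 - i)) & 1
            if node.child[b] is None:
                node.child[b] = Node()
            node = node.child[b]
        node.route = route                      # /0 lands on the root

    def lookup(self, dst):
        bits = int(ipaddress.ip_address(dst))
        node, best = self.root, self.root.route
        for i in range(32):
            node = node.child[(bits >> (31 - i)) & 1]
            if node is None:
                break
            if node.route is not None:
                best = node.route               # longer match found
        return best
```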

--------------------------------
Cisco routing table organizational ideas:
From Inside Cisco IOS Software Architecture, Bollapragada,
	Murphy, White, Cisco Press, 2000

Depending on hw/sw system/organization ... how do we switch packets
	quickly?

Cisco IOS traditionally an embedded os with tasks.  No memory protection.  

Some Cisco switching buzzwords include:

1. Process switching
2. fast switching
3. Cisco Express Forwarding (CEF)

-------
1. Cisco Process switching - universal, brute force, ...

On input:
	network interface put pkt in I/O memory (input Q)
	interrupts  CPU
	ip_input task must run

	ip_input makes packet forwarding decisions

		makes routing lookup (assume CIDR/not clear on 
				routing lookup alg.)
		must get address of next hop router
			and implicit i/f (note this is src MAC)
		must get ARP/MAC address for NHR
		rewrites MAC header in IO memory

	queues packet for transmission on output Q on output i/f
		assuming there is memory
	post transmission, IO memory is freed

	note 3 key pieces of info needed:

		1.next hop IP

		2.next i/f

		3.Mac hdr info, especially next hop dst

con:
	slow ...

	task must switch in
	routing table lookup per packet
		question of performance as routing table size increases!
	memory copies from nic memory to cpu memory may be slow/costly
		may have in memory data copy
	CPU processing of packet switching can interfere with
		cache building (ok, we aren't to that one yet).

If the routing table supports > 1 path to the same destination
	(possibly ECMP, or unequal with EIGRP), process switching
	can load share

con:
	can result in out of order packets
--------------------------------

Question: what can we cache here?

	popular arp/next hop based on ip dst
	in theory, we may be able to "forward" during receive phase
		if packet forwarding info is available in cache
		and not bump upstairs to ip task. 

Call this fast switching:

	ip_input
		after lookup, adds step to cache info in "fast cache"
	
	interrupt software now 1st searches fast cache, else
		Qs packet for ip_input

	if fast cache found,
		rewrite MAC header, and do output function
		ip_input not involved

call this: route once, forward many times

1st packet is process switched, others may not be.

We may now have cache coherency problems.
		arp table may change
		routing table may change
		new packets we haven't seen before (120k routes
			in core routing table)
		cache thrashing possible
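
"Route once, forward many times" can be sketched as a dictionary in
front of the slow path. This only illustrates the idea, it is not how
IOS is written: the Router class, the slow_lookup callback, and the
whole-cache invalidation are assumptions.

```python
class Router:
    def __init__(self, slow_lookup):
        self.slow_lookup = slow_lookup   # dst -> (out i/f, MAC header)
        self.fast_cache = {}             # filled as a side effect
        self.process_switched = 0        # packets that took the slow path

    def forward(self, dst):
        entry = self.fast_cache.get(dst)
        if entry is None:                # miss: bump upstairs
            self.process_switched += 1
            entry = self.slow_lookup(dst)
            self.fast_cache[dst] = entry
        return entry                     # hit: rewrite MAC hdr and output

    def table_changed(self):
        # arp or routing table changed: cached entries may be stale.
        self.fast_cache.clear()
```

After table_changed() the next packet to each destination is process
switched again, which is the coherency cost noted above.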

Cisco ios show command for cache

# show ip cache verbose

Construction of fast cache data structure

1. first implemented as hash structure

	hash ip dst to

	hash table:
		1..6 ip prefix/length,  pointer to pre-formed MAC hdr
		1..6 ip prefix/length,  pointer to pre-formed MAC hdr
		1..6 ip prefix/length,  pointer to pre-formed MAC hdr
		...

2. hash table replaced with 2-way radix tree a la Sklower

			0
		     v      v  (go left for 0, right for 1)
		   0	    1
		0     1   0   1

we simply search down the tree one bit at a time, to look
	up the ip dst as a binary bit string

(Assumptions: assume N bits, e.g., 7 is 0111, and leaf nodes
	are where we store numbers)

3. problem: when we maintain the cache, we can't distinguish
	between overlapping address ranges; e.g.,

	131.252/16
	131.252.1.2/32

We can't solve this by storing all possible IP dst addresses.   
Certain heuristic rules are used to solve this; e.g.,
		if equal cost path, cache with /32

		if major network with subnets, cache with
			biggest major network/prefix

cons:
	cannot easily support load sharing.
	1st packet is process switched.  Nth packets are not.
		They all go to the same cache entry.

Newer switching mechanisms (CEF) support this.
--------------------------------
Optimum Switching:

	fast switching with optimizations:

	fast switching is generic, optimum is optimized for certain CPUs

	cache here is accessed via 256-way multiway tree

	Each parent node can have 256 descendants; e.g.,

	root
	| .... |
	1      256

	each level of the tree is indexed by one byte of the 4 byte IP
		dotted decimal address

	root
	1 2 3 4 5 6. 7. 8. 9. 10.
			      |
			      v
			      10.1  10.2 10.3  

	etc.

	4 levels max ...  a.b.c.d
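
	A sketch of a 256-way, 4-level lookup tree. This is an illustration
	of the data structure only, not Cisco's implementation: the class
	names are invented, and prefixes that do not end on a byte boundary
	are handled by filling every slot they cover (an assumption of this
	sketch).

```python
import ipaddress

class MNode:
    def __init__(self):
        self.entry = [None] * 256    # per slot: (prefix length, route)
        self.child = [None] * 256    # per slot: next level node

class MTrie:
    def __init__(self):
        self.root = MNode()
        self.default = None          # the /0 route, kept outside the tree

    def insert(self, cidr, route):
        net = ipaddress.ip_network(cidr)
        plen = net.prefixlen
        if plen == 0:
            self.default = route
            return
        octets = net.network_address.packed
        depth = (plen - 1) // 8              # level where the prefix ends
        node = self.root
        for level in range(depth):           # walk/create the full bytes
            b = octets[level]
            if node.child[b] is None:
                node.child[b] = MNode()
            node = node.child[b]
        used = plen - 8 * depth              # 1..8 bits used at this level
        start = octets[depth]
        for slot in range(start, start + (1 << (8 - used))):
            old = node.entry[slot]
            if old is None or old[0] <= plen:    # keep the longer prefix
                node.entry[slot] = (plen, route)

    def lookup(self, dst):
        best, node = self.default, self.root
        for b in ipaddress.ip_address(dst).packed:   # at most 4 steps
            if node.entry[b] is not None:
                best = node.entry[b][1]
            node = node.child[b]
            if node is None:
                break
        return best
```

	The lookup cost is bounded by 4 array indexings, one per byte a.b.c.d.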
	
--------------------------------
CEF:  Cisco Express Forwarding (takes hw support)
	in IOS 12.0

	cons up to now:

	1. no support for overlap CIDR ranges
	2. change in route table/arp table  causes invalidation
	of large parts of cache
	3. 1st packet must be process switched
	4. load balancing may not be available

The above CONS may be ok in the local enterprise, probably not
	in transit router.

Cisco command
# show ip cef summary
# show adjacency 

CEF overview:
	builds own structures that directly mirror routing table/arp table

	CEF table
		view as routing table, implemented as 256-way trie table
		(as with optimum)
	adjacency table, contains MAC hdr, and other info
		each trie table leaf, say 10.0.0.1, points to
		an adj. entry, view as arp table

Process switching is not part of the picture; that is, the
	1st packet is not process switched.

why? because the CEF tables are built and maintained along with the
	route/arp tables.  Not as a side effect.

Therefore packets are forwarded during recv. interrupts.

Load share problem is "solved": 

CEF uses the principle of indirection in that 
	an adjacency entry can be replaced by a load-share entry
	(multiple MAC) which can in turn point to adj. entries.

	per-destination load balancing (default)
		per ip src, ip dst, all packets go one path

	need many pairs for load balancing to be effective

	per-packet load balancing (not default)
		traditional round-robin per packet
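
The two load-sharing modes can be contrasted in a few lines. This is
a sketch only: the hash used here (CRC32 over the address pair) is an
assumption chosen for determinism, not the hash CEF actually uses, and
the function names are invented.

```python
import itertools
import zlib

def per_destination_path(src, dst, paths):
    """Same (ip src, ip dst) pair always maps to the same path."""
    key = f"{src}->{dst}".encode()
    return paths[zlib.crc32(key) % len(paths)]

def per_packet_paths(paths):
    """Round-robin: successive packets alternate across the paths."""
    return itertools.cycle(paths)
```

Per-destination keeps each pair's packets in order but needs many
pairs to spread load; per-packet spreads load evenly but can reorder
packets within a flow.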
	
--------------------------------
other address semantics
	private addresses
		see RFC 1918
		(A) 10.0.0.0/8
		(B) 172.16.0.0/12  (to 172.31.255.255)
		(C) 192.168.0.0/16 (to 192.168.255.255)
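
	A small membership test for the three private blocks listed above,
	using the Python standard ipaddress module (the function name is
	made up for illustration):

```python
import ipaddress

PRIVATE_BLOCKS = [ipaddress.ip_network(n) for n in
                  ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def in_private_block(ip):
    """True if ip falls inside one of the three private ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in block for block in PRIVATE_BLOCKS)
```

	Note the /12 boundary: 172.31.255.255 is private, 172.32.0.0 is not.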

	basis of NAT
		typically map say a public class C range onto addresses in say 10.0.0.0
		e.g., NAT router rewrites
				src 10.1.2.3 -> 200.1.2.3 on the way out
				(and dst 200.1.2.3 -> 10.1.2.3 on the way in)

	localhost
		127.0.0.1

	variable length subnet masks
		not possible with RIP v1
		must send (IP route/mask) in dynamic routing protocols
			(e.g., RIP II, not RIP I)
		subnet masks ARE contiguous (no hole), but we may be able
		to use different lengths in an intranet
			(be careful... e.g., SunOS can't handle it)

--------------------------------
routing table entries:

from one point of view, one could claim that there are four kinds
	of routes in routing table (e.g., Cisco router)

	1. connected routes, directly connected to i/fs
	Ethernet0, Ethernet1, Serial0, Serial1

	2. static routes, inserted by hand because YOU know better
	(may be all you need after all)

	3. interior routing protocol routes inserted dynamically by
		RIP/OSPF/EIGRP

	4. external routes - inserted by BGP4

	Cisco has multiple logical views on
	routing tables (e.g., BGP and EIGRP) and
	we may have the same competing routes for one dst but one
	RIB or unified routing table (routing info base).

	Also consider we may have interfaces go down... that should invalidate
	associated routing table entries.

	connected routes typically subnet associated with interface
	Ethernet0 is 129.95.1.2,  then probably 129.95.1.0 associated with that i/f

	static route may be used for default route or to point to router
		that does not speak dynamic routes for whatever reason

	ip route 0.0.0.0 0.0.0.0 Ethernet0  (default route)

	Can add a distance value (the trailing 10 below) to lower the preference
		of a route, so that it normally stays unused but is used if the
		other normal route goes away (because interface fails)

	ip route 0.0.0.0 0.0.0.0 Ethernet0
	ip route 0.0.0.0 0.0.0.0 Serial0 10

	normally don't use Serial0 (ISDN backup), but used if Ethernet0
		goes away
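
	The backup behavior amounts to "lowest distance among routes whose
	interface is up". A sketch under assumptions: the function name is
	invented, and the unqualified route is assumed to carry distance 1.

```python
def pick_default(routes, up_interfaces):
    """routes: list of (interface, distance); lower distance preferred."""
    usable = [r for r in routes if r[0] in up_interfaces]
    return min(usable, key=lambda r: r[1]) if usable else None

# Distance 1 for the Ethernet0 route is an assumption of this sketch.
DEFAULTS = [("Ethernet0", 1), ("Serial0", 10)]
```

	With both interfaces up, Ethernet0 wins; if Ethernet0 goes down, its
	route is withdrawn and the Serial0 route takes over.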

CIDR - a few other points

Some interesting urls:

1. bgp routing table may be growing too fast again:

	CIDR report (2005 - 150k routes or so)
		http://bgp.potaroo.net/cidr


1. given CIDR-ization; i.e., classless network POV
	.all routing protocols must change to support it.
	.by definition RIP v1 can't ...  thus RIP v2
	.route table entry must become (ip dst, netmask) and
	routing protocols must support shipping that info (RIP v1
	and Cisco IGRP don't).
	.routing protocols that support CIDR include
			BGP-4 (up from BGP-3)
			OSPF
			RIPv2
			EIGRP (new cisco igrp)

2. end systems SHOULD (but old ones will not) support CIDR
	in routing algorithm and routing table

3. The CIDR discovery is pre-IPnextgen ...  but allocation of
class C address range is still only 1/8th of space, therefore
IPng addresses should address that problem.  Schemes for class A
allocation exist.

4. address renumbering ?  implicit with provider-based allocation.
If you change providers, you will have to change internal addressing.

auto-discovery problem
	starts with own IP address
	(IP address, subnet mask, local default router, broadcast)

	own IP address provided by
		rarp
		bootp/dhcp

bootp/dhcp
	can learn default router

ICMP router discovery message
	advertisement/solicitation
	(used and extended in Mobile-IP to discover agent)
	 
dynamic routing protocols
	RIP sends default
	con: may be undesirable to have this info on leaf link
		as end systems are passive (not routers)
		and may be stupid (you don't want to run rip)