IP routing - supplemental points

contents:
--------
    ethernet address
    arp - switches - routing table algorithms
    other ip address semantics
    routing table entries a la Cisco
    CIDR notes
    routing protocol comparison chart
    end system discovery of router

-----------------------------------------------
ethernet addresses

    48 bits
    unicast - one node to another
        010203040506
    broadcast - one to all
        ffffffffffff
    multicast - one to many
        mapping 01 prefix

    note: ethernet addresses have IP address mapping
        unicast - traditional IP address
        broadcast - two kinds
        multicast - 224.*.*.* through 239.*.*.* (class D)

    broadcast/multicast are fundamental ideas
        often used for All Points Bulletin
            ARP         arp request (since you don't know the other party's address)
            RIP         every packet
            OSPF etc    ... hello
        multicast can be viewed here as an optimization on broadcast,
        since parties who do not care (IPX router) can be left out

-----------------------------------------------
arp - % ping X where X is on same link

    when the IP layer wants to send a packet to the next hop, it has the
    IP dst and needs the MAC address
    map IP address to ethernet address courtesy of BROADCAST
    same layer as IP; i.e., the ethernet frame carries either ARP or IP
    (selected by the type field in the ethernet header)

    arp request/arp reply
        request is effectively a 4-tuple
            (src IP, src MAC, dst IP, dst MAC=?)
        unicast reply
            reply simply contains dst MAC to complete the 4-tuple

    note that the sender typically caches the response
    listeners may cache the broadcast

    sender may send grat. arp at i/f boot
        where IP == self, MAC == self
        desire to learn if anyone else answers, therefore learn if
        1. IP addresses are locally misconfigured
        2. also rewrite any IP address/MAC address mappings so listeners
           can refresh cache mapping of IP/MAC (possibly wrong)

    note anyone can reply (proxy)

    overview/summary:
    1. IP to MAC address mapping
    2. some sneaky possibilities (including lack of security)
    3. broadcast is fundamental to learning about existence
    4.
       no idea of subnetting here, but ARP + IP subnet is the basis of
       link-layer routing (if subnet indicates same link, then use ARP,
       else send to a default router)

-----------------------------------------------
switches

    packet switch vs circuit switch

    packet switch - switch on IP dst; i.e., per packet, not per connection
        hence 2 packets sent to the same IP dst may take different paths

    circuit switch sets up a "connection", which involves routing
        ... path across circuit switches
        once the route is set up on a circuit switch, packets flow
        according to an internal mapping of (input port X, output port Y)

    notion of "fate sharing" is deemed bad.
        if one *intermediate* switch fails, is the end to end circuit hosed?

-----------------------------------------------
other IP routing algorithms:

1. classful:

    given: IP dst
    routing tuple == (IP dst, gateway, i/f, metric)

    if host-specific match
        send to next hop address
    else if IP dst subnet match and we have i/f on that subnet
        send to "network" (use arp)
    else if network match
        send to next hop that leads to that network
    else no match
        if I have default route/s
            choose one and send to that gateway
        else
            drop packet
            if intermediate router
                send ICMP unreachable message

    notes:
    1. typically (IP dst, route entry) is cached to save the cost of the
       next lookup
    2. the classful algorithm's prioritization of the most matching bits
       in the routing table DST leads to the classless/CIDR point of view:
       more bits first, then fewer bits (default route is the fewest bits)
    3. next hop gateway means of course that the IP dst in the IP header
       is NOT changed; we simply send to the gateway's "mac" address in
       ethernet unicast fashion

2. classless

    assumption: routing table entries:
        (IP dst, subnet mask, gateway, metric, interface)

    host entry has mask with all ones.
    default has mask with all 0s.
    subnet has appropriate mask

        9.3.4.5     255.0.0.0      (/8)
        129.95.0.0  255.255.0.0    (/16)
        129.95.1.0  255.255.255.0  (/24)
        0.0.0.0     0.0.0.0        (/0)

    algorithm is simple:
        search routing table for LONGEST prefix match
        in parallel: AND the ip dst in the packet with each entry's
        subnet mask and compare the result with the ip dst in the entry

    example: try these with the routing table above; i.e., which entry do
    we get if we assume the IP dst == the following

        9.3.4.5
        1.2.3.4
        129.95.1.3

    Of course this is simple. It becomes more interesting in the presence
    of routing aggregation and more specific routes that "override" an
    aggregate.

    classless routing algorithm allows SUPERNETS and specific overrides
    of supernets to work

    CIDR - supernet means route aggregation acc. to power of 2.

    consider: 200.1.*.*/16 (255.255.0.0) can stand for the class C
    networks 200.1.0 to 200.1.255,
    so we can have TWO routing table entries:

        To 200.1.0.0/255.255.0.0 via G1
        To 200.1.2.3/255.255.255.255

    therefore: supernetting (part of CIDR) can allow less specific routes
    and give us route aggregation as well (fewer routing table entries)

    supernetting means: mask contains FEWER bits than the natural
    (classful) prefix
        result is an aggregate, typically of many class C networks

    another example:
        an ISP has been given the CIDR block: 192.24.0.0/13
            192.24.0.0 to 192.31.255.255
        how many addresses do they have?

BSD trie algorithm:

    See Wright/Stevens, TCP/IP Illustrated, vol.2 pp 559-
    1995 Addison/Wesley ISBN 0-201-63354-X

    in 4.2/4.3 BSD we had a classful routing algorithm
    with two tables
        1. host addresses
        2. net addresses
    used hash/linear search to go through both in order.
    ignored subnet masks (those were bound to interfaces)
    hierarchy implied

    problem with this host?
        le0:9.1.2.3/255.0.0.0 ... le1:192.1.2.3/255.255.255.0

    in 4.4 BSD improved using trie or Patricia tree structure.
    algorithm is address-family independent (can be used with OSI)

    basic ideas:
        routing table organized as binary tree.
        supports classless lookup, each entry has associated net mask.
        -> entry matches search key if search key ANDed with mask of
           entry equals the entry itself

    internal structure:
        default route has mask: 0.0.0.0 (ip dst & 0 == 0 ... hence match)
        host entries have implicit all 1's match
        tree = nodes + leaves
        test bits on/off left to right

                node (internal)
                    1st bit
                 |           |
               (off)        (on)
                 V           V

        backtracking is used when host leaves are found
        and do not actually match the ip dst key

    very efficient and allows large routing tables
    Sklower 1991 showed that the radix tree was 4 times faster than the
    previous hash mechanism

    Sklower, K. 1991 "A Tree-Based Packet Routing Table for Berkeley Unix",
    USENIX, Dallas Texas.

--------------------------------
Cisco routing table organizational ideas:

    From Inside Cisco IOS Software Architecture, Bollapragada, Murphy,
    White, Cisco Press, 2000

    Depending on hw/sw system/organization ... how do we switch packets
    quickly?

    Cisco IOS is traditionally an embedded os with tasks. No memory
    protection.

    Some Cisco switching buzzwords include:
    1. Process switching
    2. fast switching
    3. Cisco Express Forwarding (CEF)

-------
1. Cisco Process switching - universal, brute force, ...

    On input:
        network interface puts pkt in I/O memory (input Q)
        interrupts CPU
        ip_input task must run
        ip_input makes packet forwarding decisions
            makes routing lookup (assume CIDR/not clear on routing
            lookup alg.)
            must get address of next hop router and implicit i/f
            (note this is src MAC)
            must get ARP/MAC address for NHR
            rewrites MAC header in IO memory
        queues packet for transmission on output Q on output i/f,
        assuming there is memory
        post transmission, IO memory is freed

    note 3 key pieces of info needed:
        1. next hop IP
        2. next i/f
        3. MAC hdr info, especially next hop dst

    con:
        slow ... task must switch in
        routing table lookup per packet
            question of performance as routing table size increases!
        memory copies from nic memory to cpu memory may be slow/costly
            may have in-memory data copy
        CPU processing of packet switching can interfere with cache
        building (ok, we aren't to that one yet).

    If the routing table supports > 1 path to the same destination
    (possibly ECMP, or unequal with EIGRP), process switching can load share
        con: can result in out of order packets

--------------------------------
Question: what can we cache here?

    popular arp/next hop based on ip dst

    in theory, we may be able to "forward" during the receive phase if
    packet forwarding info is available in a cache, and not bump the
    packet upstairs to the ip task.

Call this fast switching:

    ip_input, after lookup, adds a step to cache info in the "fast cache"

    interrupt software now 1st searches the fast cache, else Qs the
    packet for ip_input
        if fast cache entry found, rewrite MAC header, and do output
        function
        ip_input not involved

    call this: route once, forward many times
    1st packet is process switched, others may not be.

    We may now have cache coherency problems.
        arp table may change
        routing table may change
        new packets we haven't seen before
            (120k routes in core routing table)
            cache thrashing possible

    Cisco ios show command for cache
        # show ip cache verbose

    Construction of fast cache data structure

    1. first implemented as hash structure

        hash ip dst to hash table:
            1..6    ip prefix/length, pointer to pre-formed MAC hdr
            1..6    ip prefix/length, pointer to pre-formed MAC hdr
            1..6    ip prefix/length, pointer to pre-formed MAC hdr
            ...

    2. hash table replaced with 2-way radix tree a la Sklower

                     0
                v         v       (go left for 0, right for 1)
                0         1
              0   1     0   1

        we simply search down the tree based on binary prefixes,
        looking up the ip dst as a binary bit string
        (Assumptions: assume N bits, e.g., 7 is 0111, and leaf nodes are
        where we store numbers)

    3. problem: when we maintain the cache, we can't distinguish between
       overlapping address ranges; e.g.,

            131.252/16
            131.252.1.2/32

       We can't solve this by storing all possible IP dst addresses.
       Certain heuristic rules are used to solve this; e.g.,
            if equal cost path, cache with /32
            if major network with subnets, cache with biggest major
            network/prefix

    cons:
        cannot easily support load sharing. 1st packet is process
        switched. Nth packets are not. They all go to the same cache
        entry. Newer switching mechanisms (CEF) support this.

--------------------------------
Optimum Switching:

    fast switching with optimizations:
        fast switching is generic, optimum is optimized for certain CPUs

    cache here is accessed via a 256-way multiway tree
    Each parent node can have 256 descendants; e.g.,

                root
            |   ....   |
            1         256

    each level is used for one byte of the 4-byte dotted-decimal IP
    address

        root
        1  2  3  4  5  6.  7.  8.  9.  10.
                                        |
                                        v
                                   10.1  10.2  10.3  etc.

    4 levels max ... a.b.c.d

--------------------------------
CEF: Cisco Express Forwarding (takes hw support) in IOS 12.0

    cons up to now:
    1. no support for overlapping CIDR ranges
    2. change in route table/arp table causes invalidation of large
       parts of cache
    3. 1st packet must be process switched
    4. load balancing may not be available

    The above CONS may be ok in the local enterprise, probably not in a
    transit router.

    Cisco commands
        # show ip cef summary
        # show adjacency

    CEF overview:
        builds its own structures that directly mirror the routing
        table/arp table

        CEF table
            view as routing table, implemented as 256-way trie table
            (as with optimum)
        adjacency table
            contains MAC hdr, and other info
            each trie table leaf, say 10.0.0.1, points to an adj. entry
            view as arp table

    Process switching is not part of the picture; that is, the 1st packet
    is not process switched. why? because the CEF tables are built and
    maintained along with the route/arp tables. Not as a side effect.
    Therefore packets are forwarded during recv. interrupts.

    Load share problem is "solved":
        CEF uses the principle of indirection in that an adjacency entry
        can be replaced by a load-share entry (multiple MAC) which can in
        turn point to adj. entries.
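
The CEF ideas above (a 256-way trie keyed on one address byte per level, leaves pointing to adjacency entries, and a load-share entry standing in for an adjacency) can be sketched roughly as follows. This is only an illustration, not how IOS implements it: the class names, the slot-expansion insert, and the XOR-based path choice are all assumptions made up for the sketch.

```python
import ipaddress

class Adjacency:
    """One next hop: output interface plus a pre-formed MAC header."""
    def __init__(self, iface, mac_hdr):
        self.iface = iface
        self.mac_hdr = mac_hdr

class LoadShare:
    """Stands in for an adjacency and points at several real ones."""
    def __init__(self, adjacencies):
        self.adjacencies = adjacencies

    def resolve(self, src, dst):
        # Assumption: choose by (src, dst) pair so one pair keeps one path.
        key = int(ipaddress.IPv4Address(src)) ^ int(ipaddress.IPv4Address(dst))
        return self.adjacencies[key % len(self.adjacencies)]

class Node:
    def __init__(self):
        self.child = [None] * 256   # next level down, one slot per byte value
        self.entry = [None] * 256   # (prefix_len, target) for prefixes ending here

class Trie256:
    def __init__(self):
        self.root = Node()

    def insert(self, prefix, plen, target):
        octets = [int(x) for x in prefix.split(".")]
        node, level = self.root, 0
        while plen > (level + 1) * 8:          # descend through whole bytes
            b = octets[level]
            if node.child[b] is None:
                node.child[b] = Node()
            node = node.child[b]
            level += 1
        bits = plen - level * 8                # significant bits in this byte
        span = 1 << (8 - bits)                 # number of slots the prefix covers
        start = octets[level] & ~(span - 1) & 0xFF
        for b in range(start, start + span):
            old = node.entry[b]
            if old is None or old[0] <= plen:  # longer prefix wins the slot
                node.entry[b] = (plen, target)

    def lookup(self, dst):
        node, best = self.root, None
        for b in (int(x) for x in dst.split(".")):   # at most 4 levels: a.b.c.d
            if node.entry[b] is not None:
                best = node.entry[b][1]        # deeper levels hold longer prefixes
            node = node.child[b]
            if node is None:
                break
        return best

def forward(trie, src, dst):
    """Trie lookup, then one extra hop of indirection for load-share entries."""
    target = trie.lookup(dst)
    if isinstance(target, LoadShare):
        target = target.resolve(src, dst)
    return target

# Hypothetical table: interface names and MAC header strings are placeholders.
eth0 = Adjacency("Ethernet0", "mac-hdr-A")
ser0 = Adjacency("Serial0", "mac-hdr-B")
ser1 = Adjacency("Serial1", "mac-hdr-C")
cef = Trie256()
cef.insert("0.0.0.0", 0, eth0)
cef.insert("10.0.0.0", 8, ser0)
cef.insert("10.1.0.0", 16, LoadShare([ser0, ser1]))
```

Because the table is built up front from the routes rather than filled in by the first packet, forward() never has to fall back to a slow path; overlapping prefixes such as 10.0.0.0/8 and 10.1.0.0/16 coexist because the deeper level is consulted last.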
        per-destination load balancing (default)
            per ip src, ip dst, all packets go one path
            need many pairs for load balancing to be effective
        per-packet load balancing (not default)
            traditional round-robin per flow

--------------------------------
--------------------------------
other address semantics

    private addresses
        see RFC 1918
        (A) 10.0.0.0/8
        (B) 172.16.0.0/12 (to 172.31.255.255)
        (C) 192.168.0.0/16 (to 192.168.255.255)

        basis of NAT
        typically map say a class C range into say 10.0.0.0
        e.g., NAT router does 200.1.2.3 -> 10.1.2.3 out

    localhost
        127.0.0.1

    variable length subnet masks
        not possible with RIP I
        must send (IP route/mask) in dynamic routing protocols
        (e.g., RIP II, not RIP I)

        subnet masks ARE contiguous (no holes), but we may be able
        to use different lengths in an intranet (be careful... e.g.,
        SunOS can't handle it)

--------------------------------
routing table entries:

    from one point of view, one could claim that there are four kinds
    of routes in a routing table (e.g., Cisco router)

    1. connected routes, directly connected to i/fs
        Ethernet0, Ethernet1, Serial0, Serial1

    2. static routes, inserted by hand because YOU know better
        (may be all you need after all)

    3. interior routing protocol routes
        inserted dynamically by RIP/OSPF/EIGRP

    4. external routes - inserted by BGP4

    Cisco has multiple logical views on routing tables (e.g., BGP and
    EIGRP), and we may have competing routes for the same dst, but there
    is one RIB or unified routing table (routing info base). Also consider
    that interfaces may go down... that should invalidate the associated
    routing table entries.
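
A small sketch of the last point: several sources may offer competing routes for one dst, one of them is chosen for the unified table, and an interface going down invalidates the routes bound to it. The Route and Rib names and the preference numbers are hypothetical, chosen only for illustration; they are not Cisco's actual values or data structures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    prefix: str       # e.g. "0.0.0.0/0"
    source: str       # "connected", "static", "interior", or "external"
    iface: str        # interface the route depends on
    preference: int   # lower wins; illustrative numbers only

class Rib:
    def __init__(self):
        self.candidates = []   # every route offered by any source

    def offer(self, route):
        self.candidates.append(route)

    def interface_down(self, iface):
        # An interface failure invalidates every route bound to it.
        self.candidates = [r for r in self.candidates if r.iface != iface]

    def best(self, prefix):
        # Among competing routes for the same dst, keep the most preferred.
        options = [r for r in self.candidates if r.prefix == prefix]
        return min(options, key=lambda r: r.preference, default=None)

rib = Rib()
rib.offer(Route("0.0.0.0/0", "static", "Ethernet0", 1))
rib.offer(Route("0.0.0.0/0", "static", "Serial0", 10))
rib.offer(Route("129.95.1.0/24", "connected", "Ethernet0", 0))
```

With both interfaces up, best("0.0.0.0/0") returns the Ethernet0 route; after interface_down("Ethernet0") the Serial0 route takes over and the connected 129.95.1.0/24 route disappears.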
    connected routes
        typically the subnet associated with an interface
        if Ethernet0 is 129.95.1.2, then probably 129.95.1.0 is
        associated with that i/f

    static route
        may be used for a default route or to point to a router that
        does not speak dynamic routes for whatever reason

            ip route 0.0.0.0 0.0.0.0 Ethernet0      (default route)

        Can add metrics to change the preference of a route, so that it
        stays unused normally but is used if the other normal route goes
        away (because its interface fails)

            ip route 0.0.0.0 0.0.0.0 Ethernet0
            ip route 0.0.0.0 0.0.0.0 Serial0 10

        normally don't use Serial0 (ISDN backup), but it is used if
        Ethernet0 goes away

--------------------------------
CIDR - a few other points

    Some interesting urls:
    1. bgp routing table may be growing too fast again:
        CIDR report 2005 - 150k or so
        http://bgp.potaroo.net/cidr

    1. given CIDR-ization; i.e., classless network POV
        . all routing protocols must change to support it.
        . by definition RIP v1 can't ... thus RIP v2
        . route table entry must become (ip dst, netmask) and
          routing protocols must support shipping that info
          (RIP v1 and Cisco IGRP don't).
        . routing protocols that support CIDR include
            BGP-4 (up from BGP-3)
            OSPF
            RIPv2
            EIGRP (new cisco igrp)

    2. end systems SHOULD (but old ones will not) support CIDR
       in routing algorithm and routing table

    3. The CIDR discovery is pre-IPnextgen ... but allocation of the
       class C address range is still only 1/8th of the space, therefore
       IPng addresses should address that problem. Schemes for class A
       allocation exist.

    4. address renumbering ?
        implicit with provider-based allocation. If you change
        providers, you will have to change internal addressing.
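
Once a route table entry is (ip dst, netmask), the classless lookup from the routing algorithm notes reduces to "AND with each mask, compare, keep the longest match". A minimal sketch of that follows, using a linear scan rather than a trie. The table is loosely based on the examples in these notes; the 9.0.0.0 network entry and all gateway names (G0..G5) are assumptions for illustration.

```python
import ipaddress

# (ip dst, netmask, gateway) entries; gateway names are hypothetical.
TABLE = [
    ("9.0.0.0",    "255.0.0.0",       "G1"),
    ("129.95.0.0", "255.255.0.0",     "G2"),
    ("129.95.1.0", "255.255.255.0",   "G3"),
    ("200.1.0.0",  "255.255.0.0",     "G4"),  # aggregate (supernet) of class Cs
    ("200.1.2.3",  "255.255.255.255", "G5"),  # host route overriding the aggregate
    ("0.0.0.0",    "0.0.0.0",         "G0"),  # default route, all-zero mask
]

def lookup(dst):
    """Return (prefix, gateway) for the longest matching entry, or None."""
    key = int(ipaddress.IPv4Address(dst))
    best = None
    for net, mask, gateway in TABLE:
        m = int(ipaddress.IPv4Address(mask))
        n = int(ipaddress.IPv4Address(net))
        if key & m == n:                      # dst ANDed with mask equals entry
            plen = bin(m).count("1")          # number of one bits in the mask
            if best is None or plen > best[0]:
                best = (plen, f"{net}/{plen}", gateway)
    return None if best is None else best[1:]

for dst in ("9.3.4.5", "1.2.3.4", "129.95.1.3", "200.1.2.3", "200.1.9.9"):
    print(dst, "->", lookup(dst))
```

The default route matches everything (anything ANDed with 0 is 0) but has the shortest prefix, so it only wins when nothing more specific matches; the /32 host route wins over the /16 aggregate for 200.1.2.3 only.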
--------------------------------
auto-discovery problem

    starts with own IP address
        (IP address, subnet mask, local default router, broadcast)

    own IP address provided by
        rarp
        bootp/dhcp

    bootp/dhcp
        can learn default router

    ICMP router discovery message
        advertisement/solicitation
        (used and extended in Mobile-IP to discover agent)

    dynamic routing protocols
        RIP sends default
        con: may be undesirable to have this info on a leaf link, as end
        systems are passive (not routers) and may be stupid (you don't
        want to run rip)
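
Once an end system has its (IP address, subnet mask, local default router, broadcast) tuple, the first-hop decision is small: if the subnet says the dst is on the same link, ARP for the dst itself, else ARP for the default router. A sketch under those assumptions; the function names and the sample addresses are made up for illustration.

```python
import ipaddress

def local_link(own_ip, netmask):
    """Derive the attached subnet from the host's own address and mask."""
    return ipaddress.ip_network(f"{own_ip}/{netmask}", strict=False)

def next_hop(own_ip, netmask, default_router, dst):
    """Return the IP address whose MAC address we need to ARP for."""
    if ipaddress.ip_address(dst) in local_link(own_ip, netmask):
        return dst              # same link: deliver directly
    return default_router       # off link: hand the packet to the router

# Hypothetical host configuration, as it might be learned via bootp/dhcp.
OWN_IP, MASK, ROUTER = "129.95.1.2", "255.255.255.0", "129.95.1.1"
print(local_link(OWN_IP, MASK).broadcast_address)
print(next_hop(OWN_IP, MASK, ROUTER, "129.95.1.77"))
print(next_hop(OWN_IP, MASK, ROUTER, "9.3.4.5"))
```

In both cases the IP dst in the packet stays the same; only the MAC address that the frame is sent to differs.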