chap 22: Protocol Control Blocks (pcb)

22.1 intro

there is an ip (Internet) pcb. udp has no pcb of its own, as it just uses
the ip one. tcp uses the ip pcb too, and has its own structure as well
per connection (the tcp control block).

IP pcb contains common info for udp/tcp/raw ip. not used by the ip layer.
It's a transport structure.
  foreign/local ip, foreign/local port.
  ip header prototype.
  ip options.
  pointer to routing table entry (cache).

route cache: very common caching technique. But it means the bottom must
notify the top ... if say a route changes due to an interface going down.

functions mostly start with in_pcb...()

Figure 22.1
  note that udp and tcp pcbs are chained together in a udp list, and a tcp list.
  WHY?
  note in tcp, pointer called inp_ppcb points to tcp control block.
  tcp/udp global ptrs called tcb and udb ...
  each has the next available port stored in it.

22.2 code

netinet/in_pcb.h    inpcb structure
netinet/in_pcb.c    pcb functions

# vmstat -m     shows kernel's memory alloc stats
# netstat -m    mbuf stats

22.3 inpcb structure

ports/ip addresses are in network byte order.

note that an ICMP redirect causes a scan of all pcbs, some of which may
hold a cached route that is now invalid (just changed).
(the route is marked invalid and has to be looked up again next time it is used)

Figure 22.5 inp_flags

ip header is stored, but only two parts are used: ttl/tos.

22.4 in_pcballoc and in_pcbdetach

when a socket is created, the pcb is allocated by tcp/udp/raw ip.
PRU_ATTACH is issued by socket. ... in udp case, we get
  in_pcballoc(so, &udb)

Figure 22.6 in_pcballoc function
  malloc
  zero out
  set pointers
  stick in queue
  glue to socket

Figure 22.7 in_pcbdetach
  dequeue and free etc.

22.5 binding, connecting, and demultiplexing

binding of local ip address and port number

Figure 22.8 shows six different combos of ip address/local port number
that a process can specify in a call to bind.
  1st 3: well-known port
  last 3: ephemeral port
  note 0 means: kernel chooses (ftp uses that)
  note a multicast bind with port N means: only packets to that group and
  that port, say 224.1.2.3, port 1111, not any port at that group address.
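
As a rough illustration of the bind combos in Figure 22.8, the sketch below classifies an (ip, port) pair. Everything named here is made up for these notes: the function describe_bind, the wording of each description, and the choice of sample addresses. Treating the six combos as three address kinds (unicast/broadcast, multicast, wildcard) times two port kinds (nonzero, 0) is my reading of the figure, not something the notes spell out.

```python
import ipaddress


def describe_bind(local_ip: str, local_port: int) -> str:
    """Describe what a bind() of (local_ip, local_port) asks for (hypothetical helper)."""
    addr = ipaddress.IPv4Address(local_ip)
    if addr.is_unspecified:        # 0.0.0.0, the wildcard
        where = "any local interface"
    elif addr.is_multicast:        # 224.0.0.0/4
        where = "one multicast group"
    else:                          # unicast or broadcast address
        where = "one local interface"
    # port 0 means "kernel, choose an ephemeral port"
    if local_port == 0:
        port = "kernel-chosen ephemeral port"
    else:
        port = f"port {local_port}"
    return f"{where}, {port}"


# Sample addresses are illustrative only.
for ip in ("140.252.1.29", "224.1.2.3", "0.0.0.0"):
    for port in (1111, 0):
        print(f"{ip:>15} {port:>5} -> {describe_bind(ip, port)}")
```

The first address kind with a nonzero port is the usual server case; the wildcard with port 0 leaves both choices to the kernel.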

what happens if port 666 is in use, and another (root) process wants it?
  error EADDRINUSE
  definition of "in use" is relative.
    1. a pcb exists.
    2. different for udp/tcp.

SO_REUSEADDR option: process may bind a port that is already in use, but
not an ip address (including 0, the wildcard) that is already bound to
that port.
SO_REUSEPORT: reuse both IP address and port, but each binding must
specify this option.
  one goal with the latter: two multicast procs can get pkts for the same
  port, group.

Connecting a UDP socket
  udp socket may call connect and specify peer ip, port pair.
  socket can then only exchange data with that peer.

Figure 22.9 3 different states of UDP socket
  1. connected to 1 peer
  2. bind local IP
  3. bind local port only

Demux of TCP input

Figure 22.10 state of 3 telnet server sockets
  when TCP gets a segment that has port 23 as target,
  it searches the pcbs with the in_pcblookup routine.

in_pcblookup - 4 possible numbers
  prefers smallest number of wildcards
  local port numbers must match
  number of wildcard matches can be
    0,
    1 (local ip, or foreign ip)
    2 (both above)

Demux of UDP input
  UDP input can come to broadcast/multicast too.
  if we have sockets that can have identical local ip/port, then what?
    1. broadcast/multicast: sent to ALL matches, no best match
    2. incoming unicast udp datagram: only goes to one.

Figure 22.13 4 udp sockets with the same local port 577
Figure 22.14 pkt to 140.252.13.63 (bcast) sent to TWO sockets
Figure 22.15 unicast example

22.6 in_pcblookup

4 purposes:
  1. BH (bottom half): in_pcblookup scans pcb list for matching tcb.
     protocol layer demux ... this gives us the socket.
  2. TH (top half): bind to assign local ip/local port, make sure
     address pair not in use.
  3. TH: bind ... request ephemeral port, check that the port chosen is
     not already in use.
  4. TH: connect ... implicit or explicit, verify uniqueness of socket pair.

two options make things more confusing:
  1. SO_REUSEADDR
  2. SO_REUSEPORT

Figure 22.16: in_pcblookup()
  note args
  loop thru the list provided
    if local port doesn't match, continue
    set wildcard to 0
    figure out if we should set wildcard to some value 1 or 2
    See Figure 22.17 for 4 possible situations with local wildcards
    Figure 22.18 foreign wildcards
    Check if wildcard match allowed
    438 check if caller allows wildcard lookup
    remember best match so far ... which means FEWEST wildcard matches

Example - demux of received tcp segment

Figure 22.19 pkt from 140.252.1.11, port 1500 to 140.252.1.29, port 23
  note: they *all* match
  last match is best (most definite match is best)

22.7 in_pcbbind function

called from:
  1. bind for tcp
  2. bind for udp
  3. connect for tcp socket
  4. listen for tcp socket, if socket not yet bound to nonzero port
  5. in_pcbconnect if local ip/local port not set yet, typical for udp client.
1,2 are called explicit binds. rest are implicit binds.

servers MAY use ephemeral ports; e.g., NFS servers, as they tell
the portmapper (port 111) what port they got.

Figure 22.20 bind a local address and port, section 1/3
  if no ip addresses, fail
  if already set, invalid
  if not using REUSE* and (rest doesn't work ...)
    set wild to wildcard value
  note comment in text ...

Figure 22.22 process optional nam argument, section 2/3
  nam is set for explicit bind calls
  nam contains sockaddr_in in mbuf
  See Figure 22.21, 4 cases here:
  cast socket ptr
  check length
  #ifdef notdef means the code is ifdef'ed out, but left "commented in"
  if multicast
    if reuseaddr is set, set reuseport too
  else if not wildcard
    check that address actually belongs to one of our interfaces
  if lport set (not wildcard)
    check reserved port semantic
      < 1024 belong to root
      THIS IS NOT IETF, BUT BSD/UNIX semantic. Microsoft might not care.
      (historically a very sticky wicket here for firewall setups as a result)
    in_pcblookup called to check if ok.
      note: 2nd arg is basically faddr set to 0 (wildcard)
            3rd arg is 0 (foreign port set to any)
      this causes in_pcblookup to ignore peer ip/port.
      only the local port/ip are checked.
    reuse check, which returns an error if the option is not on and the
    port is already allocated
  caller's value for local ip address is stored in pcb.
  this may be the wildcard address (any).

Figure 22.23 choose an ephemeral port, 3/3
  note: next port number per protocol is maintained in the head of the
        pcb list (in host byte order!)
  if lport is 0
    loop
      start at 1024 and increment by 1 until port 5000 is reached,
      then go back to 1024.
    repeat until the port is not in the head list ... check with in_pcblookup
  put the port in inp->inp_lport

SO_REUSEADDR examples
  note: with TCP port 23, and one server running already,
  and with SO_REUSEADDR set, still an error.
  we could however start ok with two different local ip addresses.

Figure 22.24 rules for SO_REUSEADDR
  note that it is useful in one fringe case.
  SO_REUSEPORT works if all the servers use it.

22.8 in_pcbconnect

specify foreign ip and foreign port for pcb/socket.

called from 4 places:
  1. connect for tcp (normally tcp client)
  2. connect for udp (could be done by udp client or udp server)
  3. sendto ... udp
  4. from tcp_input when a SYN shows up.

common for local ip/port to NOT be known at this time.
in_pcbconnect will allocate them as a side effect.

Figure 22.25, verify args, check foreign ip, section 1/4
  nam contains the foreign socket
  can't connect to port 0 ... (local wildcard ...)
  if we have ip addresses
    if connecting to 0.0.0.0
      use ip of 1st interface
    if connecting to 255.255.255.255
      use directed bcast of 1st interface
  NOTE: this is why a udp sendto to 255.255.255.255 may not go out the
  i/f you want. you can either
    1. send to directed broadcast (which implicitly specs the i/f)
    2. use bpf to latch onto i/f and send 255.255.255.255 by bypassing stack

Figure 22.26, local ip not yet specified
  if local ip is 0
    if we have a route (cached in pcb) and
       dst is NOT the input foreign address or DONTROUTE
         free the route
    if no route yet
      try to get one ...
      based on foreign ip lookup
    if we got a route, use that route's address
    goal: ia pts to interface address
    if no address, try matching ip with interface
    or 1st ip in list

Figure 22.27 dst is multicast
  if multicast and we have options
    get ifp from multicast options
    try to find address for that ifp
  set ifaddr from ia ...

Figure 22.28 verify that socket pair is unique
  call in_pcblookup to verify that socket pair is unique
    note: we either had local address already or value from ifaddr
    local port can be 0 ... ephemeral port chosen later
    note: this test prevents two tcp connections with the exact same 4-tuple
    also prevents two "long duration" udp sockets from going to
    the same foreign socket from the same local socket
  if local ip is wildcard (implicit bind case)
    if local port is 0
      call in_pcbbind to get it
    local ip set to interface ip
  set foreign ip
  set foreign port

ip src vs outgoing interface address
  inp_laddr is used by tcp/udp as src address.
  can be set to ip address for any interface by bind
  in_pcbconnect assigns the local ip only if it was a wildcard,
  and in that case it is the outgoing interface's ip
  therefore an outgoing pkt may have an ip src that does not match
  the outgoing interface ip

22.9 in_pcbdisconnect

udp sockets are disconnected via this function.
removes any foreign address association.
foreign socket set to 0.
pcb is released when there is no longer a file table reference (SS_NOFDREF)

22.10 in_setsockaddr/in_setpeeraddr

getsockname - returns local ip/port
getpeername - returns foreign ip/port
in_setsockaddr and in_setpeeraddr do the work.

Figure 22.30 in_setsockaddr
  return socket from pcb
Figure 22.31 in_setpeeraddr
  return socket from pcb

22.11 in_pcbnotify, in_rtchange, in_losing functions

in_pcbnotify - called when ICMP error received.
notify process of error.
e.g., ICMP src quench must slow TCP down.

Figure 22.32 summary of processing of ICMP errors
  redirects are handled differently.
  handed to tcp/udp both as they may have cached routes.
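
The redirect case can be pictured with a toy model: scan the pcb list, and for every pcb whose foreign address matches, drop the cached route so it is looked up again on next use. This is only a sketch of the idea, not the kernel code. The class Pcb, the function redirect_received, the route strings, and the address 10.0.0.9 are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pcb:
    faddr: str                     # foreign ip this socket talks to
    route: Optional[str] = None    # stand-in for the cached routing entry


def redirect_received(pcbs: List[Pcb], dst: str) -> int:
    """Drop the cached route of every pcb whose foreign address is dst.

    Ports are deliberately not compared: a redirect affects every
    socket sending to dst. Returns the number of pcbs that matched.
    """
    matched = 0
    for pcb in pcbs:
        if pcb.faddr != dst:
            continue
        pcb.route = None           # next send must look the route up again
        matched += 1
    return matched


pcbs = [
    Pcb("140.252.1.11", "via gw A"),
    Pcb("140.252.1.11", "via gw A"),
    Pcb("10.0.0.9", "via gw B"),   # hypothetical unrelated destination
]
count = redirect_received(pcbs, "140.252.1.11")
print(count, [p.route for p in pcbs])
```

Only the two pcbs talking to the redirected destination lose their route; the third keeps its cache.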
protocol defines control input function: pr_ctlinput in protosw
  tcp: tcp_ctlinput
  udp: udp_ctlinput

Figure 22.33 in_pcbnotify function.
  called by tcp: 1st arg is address of the tcb, final arg is address of
  function tcp_notify.
  for udp: 1st arg is address of udb, and last arg is udp_notify.
  sanity check args
    dst should be faddr, if 0, return.
  if error is redirect ...
    ports are nulled so that the for loop that follows skips the port
    comparison
    notify function is in_rtchange
    redirect note: we want to select pcbs based only on foreign address
  global: inetctlerrmap maps protocol-independent error code to UNIX errno.
  loop thru pcbs
    if faddr doesn't match ETC
      continue
    it matched ... advance inp
    call notify function

Figure 22.34 in_rtchange ... invalidate route
  if we have a route
    free it

Redirects and raw sockets
  raw socket code does not have control input function.
  cannot be notified about routing redirect.
  cached route is not released.

icmp errors and udp sockets
  udp must connect for icmp errors to make it up to app layer.
  if process has NOT connected, pcb inp_faddr/inp_fport are both zero,
  therefore in_pcbnotify cannot call notify function.
  why: for one thing ... if you don't register the foreign socket,
  and you sent 3 pkts to 3 different destinations, and you merely
  get back errno ... can't tell which one had the error.

in_losing function

Figure 22.36 in_losing function: invalidate cached route info
  TCP calls this when the retransmit timer has expired four or more
  times in a row.
  we have pcb; if we have a route, then discard that route.
    use rt_addrinfo to fill in with info on route that is failing.
    rt_missmsg called to generate message to routing socket of
    LOSING type, which can be used by routing daemon.
    rtrequest deletes old route.
    cached route released. next use should allocate new route

22.12 implementation refinements

tcp/udp both maintain a ptr to the last pcb found, to speed up pcb lookup
hash table would be a better idea

22.13 summary

pcbs used with every socket tcp/udp/raw.
contains addresses, ports, cached route.
in_pcblookup called to map address to socket, taking wildcards into account.
in_pcbbind binds local address, port to a socket.
in_pcbconnect sets foreign address, port.

Figure 22.37 summary of in_pcbbind/in_pcbconnect
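
To make the wildcard counting of in_pcblookup concrete, here is a small Python model of the rules noted in 22.6: local port must match, each wildcard on either side costs one, and the pcb with the fewest wildcards wins. It is a sketch, not the kernel function. The names Pcb, pcb_lookup, and allow_wildcard are invented, addresses are plain strings, and the three pcbs plus the client 140.252.1.54 are an assumed setup. Only the segment from 140.252.1.11, port 1500 to 140.252.1.29, port 23 is taken from the Figure 22.19 note.

```python
from dataclasses import dataclass
from typing import List, Optional

ANY = "0.0.0.0"    # wildcard address


@dataclass
class Pcb:
    laddr: str
    lport: int
    faddr: str = ANY
    fport: int = 0


def pcb_lookup(pcbs: List[Pcb], faddr: str, fport: int,
               laddr: str, lport: int,
               allow_wildcard: bool = True) -> Optional[Pcb]:
    """Return the pcb matching with the fewest wildcards, or None."""
    best, best_wild = None, 3          # 3 is worse than any real count (0..2)
    for pcb in pcbs:
        if pcb.lport != lport:         # local ports must match exactly
            continue
        wild = 0
        # local address: a wildcard on either side counts as one wildcard
        if pcb.laddr != ANY:
            if laddr == ANY:
                wild += 1
            elif pcb.laddr != laddr:
                continue
        elif laddr != ANY:
            wild += 1
        # foreign address and port: same idea
        if pcb.faddr != ANY:
            if faddr == ANY:
                wild += 1
            elif pcb.faddr != faddr or pcb.fport != fport:
                continue
        elif faddr != ANY:
            wild += 1
        if wild and not allow_wildcard:
            continue                   # caller asked for exact matches only
        if wild < best_wild:           # remember best match so far
            best, best_wild = pcb, wild
            if wild == 0:
                break                  # cannot do better than an exact match
    return best


listen_any = Pcb(ANY, 23)                                   # 2 wildcards
listen_local = Pcb("140.252.1.29", 23)                      # 1 wildcard
connected = Pcb("140.252.1.29", 23, "140.252.1.11", 1500)   # 0 wildcards
pcbs = [listen_any, listen_local, connected]

# Segment from the Figure 22.19 note: all three match, the last is best.
print(pcb_lookup(pcbs, "140.252.1.11", 1500, "140.252.1.29", 23))
# Hypothetical new client: falls back to the listener bound to that local ip.
print(pcb_lookup(pcbs, "140.252.1.54", 1234, "140.252.1.29", 23))
```

Calling it with faddr set to the wildcard, fport 0, and allow_wildcard=False mimics the bind-time check: only the local ip/port are compared, and any pcb that would need a wildcard to match is skipped.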