chap 15: Socket Layer

15.1 intro
  4.2BSD sockets; Net/3 sockets based on 4.3BSD Reno
  1. protocol is declared with the socket() call,
  2. but the socket interface itself is protocol-independent
  3. read/write work on a network connection under UNIX, but not under Windows,
     even though Windows has winsock
  read/write work (with TCP); compare sendmsg/recvmsg, which can be
    seen as the more complex lower-level forms of "read/write"
  focus here: socket implementation at the top, as system calls
  keep in mind:
    protocol "independence"
    odd support of non-blocking i/o even though no sane party cares about it
    protocols must still *work* in spite of it all
  note: layout of book ... need sockets (e.g., pcbs) before transport layers
  Figure 15.1
    socket system calls seen as library stubs
    layer in kernel to support that
    actual functionality thru pr_usrreq etc.
    SPP is the old XNS stream transport
    TP4 is one of the ISO stream transports
  splnet processing
    many calls to splnet/splx ... blocking out the L4/L3 splnet side, not the L2 side.

15.2 code introduction
  files:
    sys/socketvar.h - socket structure defs
    kern/uipc_syscalls.c - system call implementation
    kern/uipc_socket.c - socket-layer functions

15.3 socket structure
  socket: one end of a communication link, which holds
    protocol state info for the protocol, including addressing
    queues of arriving connections (e.g., TCP)
    data buffer queues in/out
    option flags
  Figure 15.4 socket structure
    so_type: SOCK_STREAM, etc.
    so_options: collection of flags; see Figure 15.5
      SO_DONTROUTE - bypass routing table
      SO_REUSEADDR/SO_REUSEPORT - very curious
      SO_USELOOPBACK - routing socket, get your own messages
    so_linger - time in seconds to hang around for data to drain
      whilst closing connection
    so_state - internal state, see Figure 15.6
      by default, calls block
        read - block in top-half until bottom-half gives data
        write - block in top-half if no buffer space
      SS_NBIO - non-blocking ...
        return with EWOULDBLOCK error if blocking would occur
      SS_ASYNC - signal-driven i/o; sends SIGIO signal to so_pgid when
        status changes because:
        . a connection request has completed
        . disconnect has been initiated
        . disconnect completed
        . half of data connection shutdown
        . data arrived
        . data has been sent, freeing up buffer space
        . async error has occurred
    so_pcb - pcb that is protocol specific; each protocol defines its own pcb,
      but so_pcb is a generic ptr
      Figure 15.7 protocol control blocks
        UDP - struct inpcb
        TCP - struct inpcb, struct tcpcb
        ICMP, IGMP, raw IP - struct inpcb
        Route - struct rawcb
    so_q0 - e.g., TCP connections, but connection not yet complete
    so_q - established connections
    so_qlimit - usually 5
      Figure 15.8 2 queues illustrated
      note: accept "harvests" connections on so_q queue
    so_timeo is "wait channel" used during accept/connect/close
    so_error holds error code until it can be reported (might be async generated)
    so_oobmark ids point in input data stream at which OOB data was received
    so_upcall, so_upcallarg are used by NFS
      process-level app moved into kernel
      triggers process-level action on arriving data
      NFS executes within kernel ...

15.4 system calls
  1st basic system call intro
  exactly how we get user/kernel mode switch is machine dependent (trap mechanism)
  system calls are numbered. mechanism must pass index into system call jump table
    from user mode to kernel mode
  syscall is the function that gets the index arg; it indexes the sysent table
    struct sysent {
        int sy_narg;        /* number of args to function */
        int (*sy_call)();   /* implementing function */
    } sysent[];
  Figure 15.9 socketvar.h/uipc_socket2.c summary
    so_pcb - pcb pointer
    so_proto - protocol handle (protosw pointer)
    so_head - connection accept socket
    so_q0 - partial connections
    so_q - incoming connections
    ...
    buffer/queues for i/o
      struct sockbuf
        sb_mb - mbuf chain
      one for recv/one for snd
    upcall pointer ?!
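
The jump-table dispatch described in 15.4 can be sketched in user space. This is only an illustration of the shape (index into a table, call through sy_call, get back 0 or an errno value, result in rval[0]). The names proc_sketch, sysent_sketch, sysent_table, add_args_sketch, the two toy calls, and syscall_sketch are all made up for this sketch and are not Net/3 code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct proc. */
struct proc_sketch { int p_pid; };

/* Same calling shape as the notes: (p, args, rval) -> 0 or errno value. */
typedef int (*sy_call_t)(struct proc_sketch *p, void *args, int *rval);

struct sysent_sketch {
    int       sy_narg;   /* number of args to function */
    sy_call_t sy_call;   /* function that implements the call */
};

/* Argument structure for the toy "add" call (plays the role of uap). */
struct add_args_sketch { int a; int b; };

/* Toy call 0: return the caller's pid in rval[0]. */
int sys_getpid_sketch(struct proc_sketch *p, void *args, int *rval)
{
    (void)args;
    rval[0] = p->p_pid;
    return 0;
}

/* Toy call 1: add two ints taken from the argument structure. */
int sys_add_sketch(struct proc_sketch *p, void *args, int *rval)
{
    struct add_args_sketch *uap = args;
    (void)p;
    rval[0] = uap->a + uap->b;
    return 0;
}

/* The jump table, indexed by system call number. */
const struct sysent_sketch sysent_table[] = {
    { 0, sys_getpid_sketch },
    { 2, sys_add_sketch },
};

#define NSYSENT_SKETCH (sizeof(sysent_table) / sizeof(sysent_table[0]))

/* Dispatcher: bounds-check the index, call through the table, and
 * hand rval[0] back only when the call reported no error. */
int syscall_sketch(struct proc_sketch *p, size_t code, void *args, int *result)
{
    int rval[2] = { 0, 0 };
    int error;

    assert(p != NULL && result != NULL);
    if (code >= NSYSENT_SKETCH)
        return ENOSYS;
    error = (*sysent_table[code].sy_call)(p, args, rval);
    if (error == 0)
        *result = rval[0];
    return error;
}
```

Calling syscall_sketch(&p, 1, &args, &result) with args = {2, 3} leaves 5 in result; an out-of-range code returns ENOSYS without touching result.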
  note: recvmsg/sendmsg are the real system calls; other relatives (send/sendto)
    are handled in libc
  syscall copies args into kernel space, allocates array to hold results,
    returns when system call is done
      error = (*callp->sy_call)(p, args, rval);
    p is ptr to proc table entry
    rval is array of 2 32-bit words to hold the return value
  "system call" in this context from now on means
    "function in the kernel that handles the system call"
  syscall functions must return 0 if no error, else return errno value
    ... which must be carefully shoved back into user space (libc global int errno)
  appl-call convention is
    rc = read()
    returns -1 if error
  socket prototype
    int socket(int domain, int type, int protocol)
    struct socket_args ...
    internally socket call is
      socket(struct proc *p, struct socket_args *uap, int *retval)
  Figure 15.11 networking system calls
    fcntl is normally used for file i/o, e.g., dups/locking, but here
      can be used for setting O_ASYNC or O_NONBLOCK
    note: getsockname/getpeername ... essentially getting pcb info
      (addresses/ports) for an open socket
      getsockname - get local info
      getpeername - get remote/peer info

15.5 processes/descriptors/socket
  Figure 15.13
    proc points to filedesc table, which has device switch appropriate to descriptor type
    files/f_ops --> socket i/o handlers as opposed to file i/o handlers
    PRINCIPLE OF INDIRECTION LURKS HERE
      it used to be files, now files/sockets
      thus read can work in both worlds
    socket itself has device driver setup, i.e., protosw
  any system call -- 1st arg is p or proc pointer
  p->p_fd points to filedesc structure, which manages per-process
    dynamically sized descriptor table
  file - open file table shared between processes
    f_ops/f_data
    f_ops - list of function pointers (socket ops in this case)
    f_data points to socket structure associated with the descriptor

15.6 socket system call itself
  int sd = socket(domain, type, protocol);
    AF_INET, SOCK_DGRAM, particular protocol (AF_INET has UDP)
  before the system call: structure defined to pass args
    socket_args
  See Figure 15.14 socket call
    system calls have 3 args:
      p - current process (very very top-half)
      uap - pointer to arg structure
      retval - value/result arg that points to return value for syscall
    falloc allocates file table entry plus slot in fd_ofiles
      note fp/fd
    then set it up to point to socketops
    socreate inits socket structure
    cleanup on failure
    note f_data set to point to socket
    *retval set to fd, which is index into process descriptor table
  Figure 15.15 fileops structure as set up with socket functions
    soo_read
    soo_write
    soo_ioctl
    soo_select
    soo_close
  Figure 15.16 socreate function
    note socket does "high-level" work ... socreate does low-level work
      thus socreate can be called elsewhere inside kernel
      e.g., NFS, which needs sockets too
      this is separation of policy and mechanism
    args:
      dom - family
      aso - return new socket ptr ... this is why **
      type - socket type
      proto - proto type
    note *p set to curproc, ironically
    look up protosw: try pffindproto, or pffindtype if no proto (0)
    malloc socket
    zero it
    set type
    set proto
    protocol-specific request
      pr_usrreq exists to handle requests from socket layer
      call pr_usrreq for protocol to do PRU_ATTACH
    set *aso to so
  See Figure 15.17 for pr_usrreq request list
    PRU_ATTACH means: "alert! alert! a new socket has been created"
      e.g., one thing udp does here is allocate its PCB structure
      that holds ports/ip addresses.
  Figure 15.18 function/privilege table
    e.g., setting an IP address takes root permission.
    raw sockets take root permission.
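
Seen from user level, the socket call in 15.6 looks like the sketch below. Passing protocol 0 asks the kernel to pick the default protocol for the type, which corresponds to the "no proto (0)" lookup case in the socreate notes. socket_type_roundtrip is a hypothetical helper name; the sketch assumes a POSIX system with AF_INET available.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Create an AF_INET socket of the given type with protocol 0, ask the
 * kernel what type it recorded for the descriptor (the user-level view
 * of so_type), then close it. Returns the type, or -1 on any failure. */
int socket_type_roundtrip(int type)
{
    int sd;
    int got = -1;
    socklen_t len = sizeof(got);

    sd = socket(AF_INET, type, 0);   /* sd is a small descriptor index */
    if (sd < 0)
        return -1;

    if (getsockopt(sd, SOL_SOCKET, SO_TYPE, &got, &len) < 0)
        got = -1;
    else
        assert(len == sizeof(got));  /* value/result length comes back */

    close(sd);                       /* release descriptor and socket */
    return got;
}
```

socket_type_roundtrip(SOCK_DGRAM) should hand back SOCK_DGRAM, i.e., the kernel found a datagram protocol (UDP for AF_INET) without being told which one.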

15.7 getsock and sockargs functions
  helper functions:
    getsock - map descriptor to file table entry
    sockargs - copy args from process to newly allocated mbuf
  Figure 15.19 getsock
    given fdes/fdp, return fpp
    checking for errors
  Figure 15.20 sockargs
    note: type of structure (mbuf type) is passed in
    get mbuf
    copyin args from user space
    if no error, set length as appropriate
    if address (MT_SONAME), set socket address length
  sockargs e.g., used by bind to copy sockaddr_in into mbuf

15.8 bind system call
  bind - associate local address with socket.
    clients don't usually care. servers do usually care.
  TCP specifies foreign address (peer) with connect.
  UDP does it implicitly, usually via sendto, although connect is possible.
  Figure 15.21 bind
    call getsock with filedes/handle and get struct file *fp
    call sockargs with 2nd arg/3rd arg, and get back nam ... stored address
    call sobind to do internal work
    free nam buffer
  sobind, Figure 15.22
    splnet
    call pr_usrreq PRU_BIND, handing it nam
    splx
    so udp would set server port, IP address for server (which is non-trivial)

15.9 listen system call, Figure 15.23
  get fp
  call solisten to do work
  solisten function, Figure 15.24
    PRU_LISTEN to do work
      udp has no listen call
    if no q set up, mark to accept connections
    set backlog to 0 if negative (idiot at the wheel)
    set qlimit to min(backlog, 5)

15.10 tsleep/wakeup
  top-half may have to block because kernel resource not available
  tsleep(wait-channel ...

  frobish-kabosh# ps -alx
    UID   PID  PPID CPU PRI NI   VSZ  RSS WCHAN  STAT TT     TIME COMMAND
      0     0     0   0 -18  0     0    0 sched  DLs  ??  0:00.00 (swapper)
      0     1     0   0  10  0   556  324 wait   SLs  ??  0:00.01 /sbin/init --
      0     2     0   0 -18  0     0    0 psleep DL   ??  0:00.02 (pagedaemon)
      0     3     0   0  18  0     0    0 psleep DL   ??  0:00.00 (vmdaemon)
      0     4     0   0 -18  0     0    0 psleep DL   ??  0:00.08 (bufdaemon)
      0     5     0   0  -2  0     0    0 vlruwt DL   ??  0:00.07 (vnlru)
      0     6     0   0  18  0     0    0 syncer DL   ??  0:00.67 (syncer)
      0    26     1 120  18  0   212   88 pause  Is   ??  0:00.00 adjkerntz -i
      0    52     1   1   2  0   456  324 select Is   ??  0:00.11 pccardd -f /etc/defaults/pccard.conf
      0    76     1   0   2  0   972  680 select Ss   ??  0:00.18 /usr/sbin/syslogd -s
      0    83     1 117   2  0  1056  688 select Is   ??  0:00.00 /usr/sbin/inetd -wW
      0    87     1  59   2  0  2084 1524 select Is   ??  0:00.27 /usr/local/sbin/sshd
      0   139     1   0   2  0   920  520 select Ss   ??  0:18.14 moused -p /dev/psm0 -t auto
      0   174     1 124   2  0   972  704 select Is   ??  0:00.03 lpd
      0   203   202   4   2  0 10128 9172 select S    ??  5:00.66 X :0 -nolisten tcp
      0   547     1   0   2  0  1168  844 select Ss   ??  0:00.00 mipd
      0   211   206   0   3  0  1332  964 ttyin  Is+  p0  0:00.06 csh
      0   217     1   0   2  0  3976 3164 select S    p1- 0:06.17 xterm

  mipd/X server/inetd blocked on select ... waiting to read
  select()
    error = tsleep((caddr_t)&selwait, PSOCK | PCATCH, "select", timo);
  select .... multi-descriptors
  read on descriptor
  tsleep(caddr_t chan, int pri, char *mesg, int timo)
    chan - wait channel - which event are we waking for,
      usually a function or buffer address
      called by wakeup
        wakeup(caddr_t chan)
      all procs wake up and "contend" for the channel ... meaning the next
      one in line in the RTR (ready-to-run) queue gets the cpu
    pri - wakeup at this priority, can also return when signal arrives
    mesg - string id of the call to sleep, for debug; shows up in ps
    timo - wakeup in N clock ticks
  return codes: see Figure 15.25
    EWOULDBLOCK means here "you timed out ... before the event"
    EINTR - e.g., alarm timer will cause this, or control-c SIGINT for that matter
  tsleep calls appear within a loop
    must recheck condition ... and sleep again if not satisfied
    race condition with multiple processes contending for it
  Example: we might have multiple processes reading from the same UDP socket (NFS)
    nfsd(4) ... 4 processes
    each server calls recvfrom
    blocks in tsleep until data shows up
    bottom-half calls wakeup
    1st to run gets datagram
    others call tsleep again

15.11 accept server call
  accept is tricky!
    TCP - 3-way handshake is complete ... then returns.
    TP4 - any connection request has arrived.
      process-level must get involved and do some i/o.
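
The TCP case above (accept returns once the 3-way handshake is complete) can be watched from user level over loopback. A minimal sketch, assuming a POSIX system with the loopback interface up; accept_loopback_demo is a hypothetical helper name. The client's blocking connect returns when the kernel has finished the handshake, before the server ever calls accept, and accept then collects the already-completed connection and reports the peer's address.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Listen on 127.0.0.1 (kernel-chosen port), connect to it from the same
 * process, then accept. On success returns 0 and stores the client's
 * local port and the peer port reported by accept (host byte order). */
int accept_loopback_demo(unsigned short *client_port, unsigned short *peer_port)
{
    struct sockaddr_in addr, local, peer;
    socklen_t len;
    int lsd = -1, csd = -1, asd = -1, rc = -1;

    lsd = socket(AF_INET, SOCK_STREAM, 0);
    if (lsd < 0)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* let the kernel pick a port */
    if (bind(lsd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    if (listen(lsd, 5) < 0)                  /* backlog 5, as in the notes */
        goto out;

    len = sizeof(addr);                      /* find out which port we got */
    if (getsockname(lsd, (struct sockaddr *)&addr, &len) < 0)
        goto out;

    csd = socket(AF_INET, SOCK_STREAM, 0);
    if (csd < 0)
        goto out;
    /* Handshake is completed by the kernel; no accept call is needed yet. */
    if (connect(csd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    len = sizeof(local);
    if (getsockname(csd, (struct sockaddr *)&local, &len) < 0)
        goto out;

    len = sizeof(peer);                      /* value/result argument */
    asd = accept(lsd, (struct sockaddr *)&peer, &len);
    if (asd < 0)
        goto out;
    assert(len == sizeof(peer));

    *client_port = ntohs(local.sin_port);
    *peer_port = ntohs(peer.sin_port);
    rc = 0;
out:
    if (asd >= 0) close(asd);
    if (csd >= 0) close(csd);
    if (lsd >= 0) close(lsd);
    return rc;
}
```

The port accept reports for the peer should equal the client's own local port, since both describe the same end of the connection.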
  Figure 15.26 accept
    args - remember accept is call by value/result ... we return peer address
      info on successful return
    copyin to uap structure if name set
    use getsock to get fp
    set splnet
    make sure listen was called ...
    if non-blocking and qlen is 0 (no connections), return EWOULDBLOCK
    sleep while no connections
      accept is not restarted by default after a signal (EINTR is returned)
      now netcon is "accept"
    on return there is a connection on so_q (so_qlen is non-zero)
    falloc to get new descriptor set up
    soqremque removes socket from conn. queue
    set up the "new" connected one
    soaccept is called to do protocol processing on the new one,
      get name of foreign socket
    ***************************
    note so, nam associated with new one
    if input address in call was non-zero,
      copy out socket address and length
    free nam
    note: one mbuf here limits size of socket address structure.
      UNIX socket address must fit.
  soaccept function
    basically PRU_ACCEPT call all over again
    after pr_usrreq returns, nam contains name of foreign socket
    p. 461

15.12 sonewconn and soisconnected functions
  remember in socket structure:
    so_q0 - e.g., TCP connections, but connection not yet complete
    so_q - established connections
  Figure 15.28 incoming tcp connection processing
    accept processes requests through so->so_q
    3-way handshake NOT complete ... connection in so_q0
    when complete, in so_q
      accept "harvests it"
    sonewconn creates new socket
    soisconnected updates new socket on final ACK of handshake,
      moves connection to so_q, issues wakeup for blockers
  Figure 15.29 sonewconn function
    incoming initial TCP SYN from potential peer
    head is socket pointer
    connstatus for TCP == 0
    fudge factor limit on connections ...
    alloc new socket
    fill in ... options from setsockopt may be inherited
      note head field in general inherited (socket clone function of accept)
    soqinsque ... inserts in head at so_q0 because connstatus is 0
    PRU_ATTACH called ...
    wakeup could occur here if connstatus is non-zero, not true for TCP
  Figure 15.30 soisconnected with TCP ...
    on input side, driven by final ACK
    ISCONNECTING state off
    ISCONNECTED on
    move from so_q0 to so_q
    issue read-side wakeup (sorwakeup)
      for select on connection (which is done with read bits)
    2nd wakeup for accept blocking
    in 2nd case, head is null ... we are doing connect
      wakeup blocker in connect
      sorwakeup/sowwakeup ... wake up any selects read/write
      wake them all ...

15.13 connect system call
  with TCP ... initiate 3-way handshake ... get peer address info
  kernel must choose addresses (IP/local port), if this wasn't done with bind.
  UDP/ICMP connect records the foreign address, which can have some use.
    E.g., you can then use write
  Figure 15.31 functions for connect processing
    LHS/UDP RHS/TCP
    with UDP, note no traffic on the wire.
    with TCP
      . start 3-way handshake
      . sleep
      . wakeup when done
      soisconnected called here
  Figure 15.32 connect system call
    note non-blocking mechanism shows you call connect
      until you don't get EALREADY ...
    copy in sockargs
    call soconnect to do the real work
    non-blocking may return EINPROGRESS if not done, to avoid blocking
    sleep loop until connected
  Figure 15.33 soconnect function
    listen has been called
      THEREFORE if you are accepting you shouldn't have come here (EOPNOTSUPP)
    if we are connecting or connected
      try to disconnect
      return EISCONN if connection-oriented or the disconnect fails
    else call pr_usrreq PRU_CONNECT
    obscure socket feature:
      note: UDP ... if connected, can break connection by calling connect
      with invalid name such as ptr to structure filled with 0s.

15.14 shutdown call
  close:
    1. the write side
    2. the read side
    3. both sides
    (socket not terminated though)
  call close: to destroy socket and release file descriptor
  Figure 15.34 shutdown
    1. getsock
    2. call soshutdown to do the work, passing in uap->how and the socket itself
  Figure 15.35.
    values of how/how++
  Figure 15.36 soshutdown function
    read-shutdown done by sorflush(so)
    write side: pr_usrreq PRU_SHUTDOWN
    note: both might be done
    note: socket layer does this work
  Figure 15.37 sorflush function
    overall: discard data on read-side and disable read functionality
    non-interruptible
    sblock ... is socket buffer lock ... lock read buffer
    block bottom half utterly
    call socantrcvmore on socket
    unlock
    save data in asb for obscure unix-side rights stuff
    zero out buffer ptr ...
    do rights stuff ...
    chain was in asb ... now free it
  TCP does shutdown of write-half of connection via PRU_SHUTDOWN
    TCP sends all queued data, and does a FIN

15.15 close system call
  object-specific close is called ... i.e., soo_close
  Figure 15.38 soo_close
    calls soclose!
    sets socket ptr in file land to zero
  Figure 15.39 soclose function
    traverse connection queues and call soabort in each case
    if no pcb, go to discard
    if we are connected
      if we are not disconnecting
        call sodisconnect
      note: nonblocking socket (normal case) will go straight to drop code,
        assuming LINGER is not set (probably normal case)
      if LINGER option set
        note: this means we are waiting for async disconnect to occur
        while connected
          sleep ...
    if socket was not connected we are at
    drop:
      if pcb, call PRU_DETACH
        PRU_DETACH: formally break socket/protocol connectivity ...
        do whatever has to be done.
    discard:
      sofree frees socket
  Figure 15.40 sofree function
    bail if pcb still exists or we still have file descriptor
    if socket on connection queue ... remove it
    discard buffers in send queue (sbrelease)
    discard buffers in recv queue (sorflush)
    release socket itself
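
The difference between shutdown (15.14) and close (15.15) can be seen from user level with a connected pair of stream sockets. This is a minimal sketch, not the kernel path: it uses socketpair(AF_UNIX, ...) instead of TCP so that it is self-contained, which means no FIN is involved, only the same user-visible half-close behaviour. half_close_demo is a hypothetical helper name.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Shut down the write side of one end of a connected stream pair and
 * check that (1) queued data still arrives, (2) the peer then reads EOF,
 * (3) the other direction still works. Returns 0 if all three hold. */
int half_close_demo(void)
{
    int sv[2];
    char buf[8];
    ssize_t n;
    int ok = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* Queue some data, then close only the write half of sv[0]. */
    if (write(sv[0], "hi", 2) != 2)
        goto out;
    if (shutdown(sv[0], SHUT_WR) < 0)
        goto out;

    /* Data sent before the shutdown is still delivered ... */
    n = read(sv[1], buf, sizeof(buf));
    if (n != 2)
        goto out;
    /* ... and after that the peer sees end-of-file. */
    n = read(sv[1], buf, sizeof(buf));
    if (n != 0)
        goto out;

    /* The socket is not terminated: the reverse direction is still open. */
    if (write(sv[1], "ok", 2) != 2)
        goto out;
    n = read(sv[0], buf, sizeof(buf));
    if (n != 2)
        goto out;
    assert(buf[0] == 'o' && buf[1] == 'k');
    ok = 0;
out:
    /* close is what finally releases the descriptors and the sockets. */
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

Only the two close calls at the end release the descriptors; until then sv[0] remains a valid, readable socket even though its write side is gone.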