Chapter 16 Socket I/O

read/write data
  write/writev/sendto/sendmsg
  read/readv/recvfrom/recvmsg
socket layer i/o core: sosend/soreceive
  handle i/o between socket and protocol layers

16.2 code introduction

files:
  sys/socket.h
  sys/socketvar.h
  sys/uio.h
  kern/uipc_syscalls.c
  kern/uipc_socket.c
  kern/sys_generic.c - select
  kern/sys_socket.c - select processing for sockets
variables:
  selwait - wait channel for select
  nselcoll - flag used to avoid races in select
  sb_max - max. number of bytes to allocate for a socket send/recv buffer

16.3 socket buffers

See Figure 15.3: each socket has a send buffer and a recv buffer.
Figure 16.3 sockbuf structure
  sb_mb points to mbufs
  sb_cc counts the data bytes held in the mbufs
  sb_hiwat and sb_lowat regulate flow control
  sb_mbcnt - total amount of memory allocated to mbufs
  sb_mbmax - max size
  per-protocol limits are set when PRU_ATTACH is issued; the high/low water
  marks may be modified by the process, as long as the kernel hard limit
  of 262144 bytes (sb_max) is not exceeded.
Figure 16.4 default settings for IP protocols
  sb_sel - structure used by select
Figure 16.5 sb_flags values
  sb_timeo - timeout measured in clock ticks (hz); if 0, process waits forever
    may be changed with SO_SNDTIMEO or SO_RCVTIMEO socket options
socket macros/functions - Figure 16.6
Figure 16.7 - macros/functions for socket buffer allocation and manipulation

16.4 write-side calls

write/writev/sendto/sendmsg
  write/writev/sendto built on top of sendmsg - the most general call
Figure 16.8 general layout of writing
  sendmsg calls sendit
  sendit/soo_write call sosend
  note: sosend uses PRU_SEND via pr_usrreq to talk to the protocol
Figure 16.9 write system calls
  note: writev/sendmsg ... are both scatter/gather
Figure 16.10 iovec structure
  fill in and pass for multiple writes ...
  With datagram protocol: e.g., can pass in one packet *in order* with different buffers
  With datagram protocol: can call connect, then call write etc.
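
A minimal sketch of the gather-write note above, assuming a POSIX system with AF_UNIX datagram socketpairs. The helper name gather_roundtrip is made up for illustration and is not from the book. It sends two separate buffers with one writev call and reads them back as a single datagram, in order.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send hdr and body as ONE datagram via writev,
 * then read that datagram back into out.
 * Returns the number of bytes received, or -1 on error. */
ssize_t gather_roundtrip(const char *hdr, const char *body,
                         char *out, size_t outlen)
{
    int sv[2];
    struct iovec iov[2];
    ssize_t sent, got;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    /* Two buffers, gathered in array order into one message. */
    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len = strlen(hdr);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = strlen(body);

    sent = writev(sv[0], iov, 2);
    got = (sent < 0) ? -1 : read(sv[1], out, outlen);

    close(sv[0]);
    close(sv[1]);
    return got;
}
```

Calling gather_roundtrip("HDR:", "payload", buf, sizeof buf) should hand back both pieces from a single read, since the socket is a datagram socket and the writev produced one message.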
sendxxx control flags, Figure 16.12
sendmsg can take "control info", Figure 16.13 msghdr structure

16.5 sendmsg system call

sendmsg is the most general (has the most features) and is also the biggest pain to call ...
% man sendmsg
  ssize_t send(int s, const void *msg, size_t len, int flags);
  ssize_t sendto(int s, const void *msg, size_t len, int flags,
                 const struct sockaddr *to, socklen_t tolen);
  ssize_t sendmsg(int s, const struct msghdr *msg, int flags);
Figure 16.16 sendmsg args
  socket descriptor
  pointer to msghdr structure
  flags
basically the sendmsg function makes sure msg is copied in, including the iov part, then calls sendit

16.6 sendit function

uiomove function:
  int uiomove(caddr_t cp, int n, struct uio *uio);
  move n bytes between a kernel buffer referenced by cp and the multiple buffers
  specified by the iovec array in uio
  a very powerful scatter/gather primitive.
Figure 16.17 uio structure
  uio_iov - ptr to an array of iovec structures
  uio_offset - counts the number of bytes xferred by uiomove
  uio_resid - number of bytes remaining to be xferred
  uio_rw - direction of xfer
  uio_segflg - where the buffers are in terms of user/kernel space
    buffers may be in:
      user data space
      user instruction space (loading code)
      kernel data space
Figure 16.18 uiomove operation
  note READ means from kernel buffer to user space
  note WRITE means the opposite
  READ/WRITE at this level are a point of view or a vista ... but at base just a copy
  note that kernel space to kernel space is possible:
    read: from a kernel buffer to multiple buffers
    write: from multiple kernel buffers to a kernel buffer
Figure 16.19 before: we do uiomove with a partial N (1st buffer, part of 2nd user buffer)
  a copy from one kernel buffer to multiple user space positions
Figure 16.20 after: some of the potential i/o is done.
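
The before/after bookkeeping of Figures 16.19 and 16.20 can be modeled in user space. This is a toy sketch, not the kernel's uiomove: the toy_ names are invented here, everything lives in one address space (so there is no uio_segflg and memcpy stands in for copyin/copyout), and only the field updates described above are reproduced.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

enum toy_rw { TOY_READ, TOY_WRITE };   /* READ: cp -> iovecs; WRITE: iovecs -> cp */

struct toy_uio {
    struct iovec *uio_iov;    /* current iovec in the array */
    int           uio_iovcnt; /* iovecs left, including the current one */
    size_t        uio_offset; /* bytes transferred so far */
    size_t        uio_resid;  /* bytes still to transfer */
    enum toy_rw   uio_rw;     /* direction of the copy */
};

/* Move up to n bytes between the flat buffer cp and the scattered buffers
 * described by uio. Returns 0 on success, -1 if the iovec array runs out
 * while uio_resid still claims there is room. */
int toy_uiomove(char *cp, size_t n, struct toy_uio *uio)
{
    while (n > 0 && uio->uio_resid > 0) {
        struct iovec *iov;
        size_t cnt;

        if (uio->uio_iovcnt <= 0)
            return -1;
        iov = uio->uio_iov;
        cnt = iov->iov_len;
        if (cnt == 0) {               /* this buffer is used up: next one */
            uio->uio_iov++;
            uio->uio_iovcnt--;
            continue;
        }
        if (cnt > n)
            cnt = n;                  /* partial fill of this buffer */
        if (uio->uio_rw == TOY_READ)
            memcpy(iov->iov_base, cp, cnt);
        else
            memcpy(cp, iov->iov_base, cnt);

        /* Advance every cursor by the amount just copied. */
        iov->iov_base = (char *)iov->iov_base + cnt;
        iov->iov_len -= cnt;
        uio->uio_resid -= cnt;
        uio->uio_offset += cnt;
        cp += cnt;
        n -= cnt;
    }
    return 0;
}
```

With two target buffers of 4 and 8 bytes and n = 6, the first buffer is filled completely and the second receives 2 bytes, leaving uio_resid at 6. That is the "1st buffer, part of 2nd buffer" situation from the figure.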
sendit code: Figure 16.21
  getsock gets us fp, a file
  we init struct uio auio
    note WRITE means: copy from user space into the kernel
  length of xfer is calculated in a for loop, sum saved in uio_resid
    iov_len cannot be negative
    uio_resid cannot overflow (signed integer)
  sockargs makes a copy of the dst address if provided (copied into to)
  sockargs again makes a copy of the control info if any
  note: to/control may be 0
  call sosend to do the real work
  sosend releases control, sendit releases to

16.7 sosend function

all the write calls eventually end up here, therefore it is complex.
sosend in very rough form (detail-free):
  while resid not exhausted (note this is byte-oriented)
    copy data from process
    make it into mbufs
    pr_usrreq via PRU_SEND
E.g., we could be talking to this from netinet/udp-land.
Note the third field. It is udp's job to map the socket structure
(udp as an example proto) to a pcb. It is not udp's job to make mbufs.
------------------------------------------------------
static int
udp_send(struct socket *so, int flags, struct mbuf *m, struct sockaddr *addr,
         struct mbuf *control, struct proc *p)
{
        struct inpcb *inp;

        inp = sotoinpcb(so);
        if (inp == 0) {
                m_freem(m);
                return EINVAL;
        }
        return udp_output(inp, m, addr, control, p);
}
------------------------------------------------------
The interpretation of sb_hiwat and sb_lowat depends on whether the underlying protocol is reliable or unreliable.
reliable protocol buffering
  if reliable, the send buffer holds both data sent but not yet acked, and data not yet sent
  usually 0 <= sb_cc <= sb_hiwat
    sb_cc is the number of bytes in the send buffer
  sosend must make sure there is enough space in the send buffer
  if PR_ATOMIC is set as a protocol flag, sosend must preserve message boundaries
    must wait for space to send the entire message
    when there is space, data is put in mbufs and passed on
  if PR_ATOMIC is not set, sosend passes the message to the protocol one mbuf at a time
    and may pass a partial mbuf to avoid exceeding the high-water mark (as with TCP)
    so the socket layer may actually split a TCP message (e.g., into clusters), and then TCP
    will split it again into MTU-sized chunks.
  sosend will not pass data until the space available is > sb_lowat, 2K for TCP.
unreliable protocol buffering
  sb_hiwat is 9K for UDP. This size is large enough for an NFS write, which is 8K,
  but dubious for efficiency with gigabit ethernet.

sosend code
  sosend args:
    so - socket ptr
    addr - a ptr to a dest address in an mbuf
    uio - guess ...
    top - data to be sent as an mbuf chain (remember it was 0 in the previous function)
      NFS can use this, normally 0
    control - control info in an mbuf
    flags - write options
  lock the send buffer - 1 at a time for top-half please
  sosend will wait for space if necessary, and go to restart to start over
    Figure 16.24
  343-350 ... case where we have mbufs as a param, and uio is NULL
  else fill a single mbuf or mbuf chain, Figure 16.25
    again if TCP we pass a single mbuf here
  397-413 pass mbuf chain down
  if our buffers are full, we will loop ...
  cleanup
    unlock and free as appropriate
Figure 16.23 init of sosend
  make sure local resid variable is set
  set dontroute flag (note: protocol must be atomic ...
  UDP yes, TCP no)
  if (control) set length
Figure 16.24 sosend function, error/resource checking
  if closed down, return EPIPE error
  if we have an error (ICMP error came back), return that as a UNIX error
  check to see if we are connected and should be connected
    may return ENOTCONN for TCP
    EDESTADDRREQ if udp sendto and no dst
  sbspace computes the amount of free space
  if ATOMIC and message too big, error is EMSGSIZE
  if not enough space in the send buffer
    unlock and block
    goto restart (because we have all of these error conditions)
  we can send with some data ...
Figure 16.25
  line 396: loop while atomic (note says space test is irrelevant)
  351-360, literally:
    if top was 0
      get hdr ... and init pkthdr parts
    else
      get mbuf
  book points out:
    if atomic set, get packet header on 1st iteration, and normal mbufs after that.
    if atomic not set, always allocate pkt hdr, as top is always cleared before entering the loop
  basically try for a cluster for data
  if that fails: limited to the smallest of:
    1. space in mbuf
    2. number of bytes in message
    3. the space in the socket buffer
  call uiomove to copy the data into the mbuf
  update local resid variable etc.
  note: mp = &top (address of top)
    previous mp ptr set to top so deref of top is not suicidal
Figure 16.26
  summarize: call pr_usrreq with PRU_SEND in the normal case, neglect OOB
  rather paranoid blocking at splnet whilst that function is called ...
  top has data, addr has dst, control may have control (probably not)
  note: udp will immediately send the data on, no socket queue.
    tcp will queue on the socket as we may have to resend (for one thing)
summary: ugggggly ...
performance considerations
  top-half mbuf fill-in and pass-on in parallel with device xmit may give some parallelism
  send buffer should be larger than the bandwidth-delay product; i.e., it should not be the bottleneck
  e.g., tcp discovers the connection can hold 20 segments before an ack comes back ...
    send buffer should be able to hold 20 segments!
  two things:
    net.inet.tcp.sendspace: 32768
    net.inet.tcp.recvspace: 57344
  ttcp -b ...
  buffer size option is implemented how?
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sockbufsize,
  or put more crudely ... don't cramp TCP's window style

16.8 read/readv/recvfrom/recvmsg

once again it boils down to a couple of central routines that really only call one routine
Figure 16.28
We look at recvmsg/recvit/soreceive

16.9 recvmsg system call

% man recvmsg
  ssize_t recv(int s, void *buf, size_t len, int flags);
  ssize_t recvfrom(int s, void *buf, size_t len, int flags,
                   struct sockaddr *from, socklen_t *fromlen);
  ssize_t recvmsg(int s, struct msghdr *msg, int flags);
note: msg is call by value/result. e.g., the udp peer address can show up in there.
args are:
  socket descriptor s
  msghdr msg
  flags
copy msghdr structure into kernel with copyin
  note: aiov on stack for the iov vector
  malloc space for the iov array if the iovlen param is bigger than the stack-based array
    (welcome to TLV in syscalls)
  now copy in the iov array itself
  now call recvit
  updated msg copied back when done

16.10 recvit function

Figure 16.30
  getsock maps the process file descriptor, int s, to fp (file table entry) as output
  auio initialized ... compute resid value
Figure 16.31
  call soreceive, passing it the socket (fp->f_data), address, uio vector, and various mbufs ...
  *retsize is the number of bytes returned
  if any error, and some data was xferred, we ignore the error and report the data
  if "output" in the sense that we have an address (recvfrom ... case), copy that out
  same for control data, e.g., input interface
  free from mbufs ...
    we copied 'em out
  free control mbufs

16.11 soreceive function

xfer data from socket recv buffer to process buffer
address: protocol may specify an address for the peer, copy that too
Figure 16.32 kernel flags that may be passed in
  MSG_DONTWAIT - nonblocking
  MSG_OOB - out of band data please
  MSG_PEEK - get the data without consuming it
  MSG_WAITALL - wait for data to fill buffers before returning
Figure 16.33 flags that can be set on return in msghdr structure (recvmsg only, obviously)
  MSG_CTRUNC - more control data than buffer space
  MSG_EOR - end of logical record
  MSG_OOB - out of band data
  MSG_TRUNC - message was truncated

Out-of-band data
  we can skip this
  BUT it should be noted that OOB may or may not be in-line with data.
  There is really only one data stream, with some bytes/records tagged as "special" high priority.
  OOB is still protocol-independent. udp does not have it, tcp does have it.
  There are two mechanisms for handling it
    1. synchronization (read until flag)
    2. tagging.
  send ... use flags MSG_OOB
  recv OOB:
    we can put it in the socket oob buffer. process can get it by using MSG_OOB to read it.
    OR process can set SO_OOBINLINE socket option to get it with normal reads. MSG_OOB not used.
  Note: it is possible that OOB data has its own encoding: e.g., some sort of
  byte-stuffing-like encoding. E.g., you are reading ASCII, but you can
  get bytes > 0x7f (127), with some byte like 0xff indicating OOB status.
  Plus reads return either OOB data or not. msghdr flag set to tell the difference between them.
  recv. can also issue ioctl with SIOCATMARK after each read to determine
  whether the socket is at the OOB mark. (boolean)
Example, p. 506
Figure 16.34 two methods of recv. OOB data:
  1. top: out of band data is on the side, and the mark shows where it should go.
     note mark is AFTER.
     note: if read request > 9 bytes, socket layer only returns A-I (9 bytes)
  2. bottom: process can read only bytes A-I, then can read on.
  Ironically: BSD TCP's urgent pointer points AFTER the urgent data byte.
  All together now: Sigh.
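
Going back to the input and output flags of Figures 16.32 and 16.33, here is a small sketch of three of them as seen from user space: MSG_PEEK, MSG_DONTWAIT, and MSG_TRUNC coming back in msg_flags. It assumes a POSIX system and an AF_UNIX datagram socketpair; the struct and function names are invented for illustration, and exact truncation behavior may vary by platform.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

struct flag_demo {
    ssize_t peeked;      /* bytes returned by the MSG_PEEK read */
    ssize_t received;    /* bytes returned by the consuming recvmsg */
    int     truncated;   /* nonzero if MSG_TRUNC was set in msg_flags */
    int     empty_after; /* nonzero if a MSG_DONTWAIT read found no data */
};

/* Hypothetical demo: send one 9-byte datagram, peek at it, then consume
 * it into a 4-byte buffer. Returns 0 on success, -1 on setup failure. */
int recv_flags_demo(struct flag_demo *r)
{
    int sv[2];
    char small[4];
    struct iovec iov;
    struct msghdr msg;
    ssize_t n;

    memset(r, 0, sizeof *r);
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;
    if (send(sv[0], "ABCDEFGHI", 9, 0) != 9) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* MSG_PEEK: copy data out but leave the record in the receive buffer. */
    r->peeked = recv(sv[1], small, sizeof small, MSG_PEEK);

    /* Consuming read via recvmsg; the record is larger than the buffer,
     * so the kernel reports MSG_TRUNC in the returned msg_flags. */
    memset(&msg, 0, sizeof msg);
    iov.iov_base = small;
    iov.iov_len = sizeof small;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    r->received = recvmsg(sv[1], &msg, 0);
    r->truncated = (msg.msg_flags & MSG_TRUNC) != 0;

    /* MSG_DONTWAIT: the record is gone, so this fails instead of blocking. */
    n = recv(sv[1], small, sizeof small, MSG_DONTWAIT);
    r->empty_after = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));

    close(sv[0]);
    close(sv[1]);
    return 0;
}
```

The peek and the consuming read both see the same leading bytes, and the rest of the datagram is dropped with the record. This matches the "process one record per call" rule for protocols with message boundaries.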
Other Receive options
  MSG_PEEK - look at the data without consuming it.
  MSG_WAITALL - give me everything I asked for!
    NFS uses MSG_WAITALL
  MSG_DONTWAIT
Receive buffer organization: message boundaries
  if protocol supports message boundaries, then one message per mbuf chain (m_nextpkt)
  protocol processing layer adds data to socket recv. queue.
  socket layer removes data from recv. queue.
    guess which is top-half ... bottom-half ...
  if not PR_ATOMIC (TCP), proto layer shoves as much data in the buffer as possible
    TCP discards data if out of room ... (ouch)
  UDP discards if not enough buffer space
    (presumably because the read-side process is out having a smoke?)
  PR_ADDR - tuple order (address, control, data ... data ... NULL)
  See Figure 16.35
    udp note: that dest address is put first.
    note: number 3 with all fields present.
Receive buffer organization: no message boundaries
  Ahem: TCP
  data trimmed to fit
  sb_lowat puts a lower bound on the number of bytes returned by a read system call
  TCP does not support control.
Control info and OOB data
  sbinsertoob puts oob data after any other oob data but ahead of non-oob data.
  oob oob oob (say that fast 3 times)
  See Figure 16.37: the point here is probably generality ... not protocol specificity.

16.12 soreceive code

rule: process one record per call and try to return the number of bytes requested.
Figure 16.38 function overview
  arguments:
    so - socket ptr
    **paddr - ptr to recv. address info
    uio OR mp0 == NULL
      if uio not-null, ptr to uio structure
      if mp0, xfer recv buffer to mbuf chain
    **controlp - ptr to control info
    *flagsp - Figure 16.33 output flags
  pr is set to protocol switch structure
  init ...
  OOB processing (gray patch)
  restart:
    lock the buffer, wait if not available
    m points to the recv mbuf queue
    wait for data to arrive (gray patch)
      if we wait here we jump back to restart:
      jumps to dontblock when there is data to satisfy the request
    nextrecord points to the 2nd chain in the receive buffer
    process addr info and control
    set up data xfer
    mbuf data xfer loop
  release:
    cleanup, e.g., unlock the buffer
    restore priority
Figure 16.39 OOB data
  get mbuf and call pr_usrreq PRU_RCVOOB
  copy out with uiomove
Figure 16.40 ignore ...
Figure 16.41 soreceive function, enough data?
  can the read system call be made happy by the data we already have?
  OR in general, wait for enough data to satisfy the entire read
  However: we may sometimes return with less than asked for. tcp quite often does this.
  if any of the following is true, sleep:
    1. no data, m == 0
    2. not enough data to satisfy the entire read
         sb_cc < uio_resid (which is the cc we want)
       AND the minimum amt of data is not available
         sb_cc < sb_lowat
       AND we can put more data in (not ATOMIC)
    3. not enough data to satisfy the entire read ... and MSG_WAITALL indicates we should wait
p. 516 Figure 16.42 wait for more data?
  soreceive must wait for more data
  if error code, return it
  if read-half shutdown
    if m (we still have data), return the data now, don't wait
      (dontblock means don't wait!)
    else return 0
  make oob check to prioritize it
  if we are not connected and we should be, it's an error
  if resid is 0, rc should be 0: bail
  bail on nonblocking
  unlock the buffer and block on the rcv side
  signal check ... or error check ...
Figure 16.43 return address information
  dontblock:
  maintain nextrecord so that we can reattach the mbuf chain when this mbuf set has been "read"
  if we have an address to return (UDP)
    ignore PEEKING
    get that address off of the chain
    if we have a ptr to it, set up p in which it is returned (the addr)
    if p is null, toss.
Figure 16.44 process any control buffers
  note there is a loop here ...
  > 1 control buffer
Figure 16.45 mbuf xfer setup
Figure 16.46 xfer loop variables
Figure 16.47 1st part of mbuf xfer loop
  while we have mbufs AND data cc count and no error
  1st if: if data type (OOB or not) changes, break the loop (applies to TCP) ...
  if mp was NOT set, use uiomove to copy out data
    note: protocol processing is allowed during uiomove
  else just pass back mbuf and adjust count
Figure 16.48 soreceive: update ptrs and offsets for the next mbuf
  note: the mbuf may be discarded in lines 640/641,
  although in lines 647-657 we may not have consumed all the data;
  mbuf data and len are updated accordingly
  note: len was how much we got out of it
Figure 16.49: out of band data mark ...
  mostly ignore this, except note that if we are at the mark
  we set the socket flag SS_RCVATMARK
Figure 16.50 soreceive function: MSG_WAITALL processing
  stay until we get all the data (non-atomic)
  no data in the recv buffer (nextrecord is null, m is null)
Figure 16.51: cleanup
  control returns to line 600, Figure 16.47
  if ATOMIC and recv buffer was too small, drop the record
  if not a peek and m was used up, link to nextrecord
  notify protocol, PRU_RCVD
    TCP uses this to update the receive window for a connection
  if all of that, and nothing was actually done, go back to restart
  return any flags set
Analysis
  whew.

16.13 select

Figure 16.52 select: socket events
  note: errors are selectable events (makes sense)
Figure 16.53, 1st half of select:
  struct select_args
    nd - number of descriptors
    fd_set ...
      read/write/oob
    timer
  zero out ibits/obits on stack
  make sanity checks on nd
  convert nd (an int) to ni; ni is the number of bytes needed to store a bit mask with nd bits
  copyin in/out/ex bitsets
  if we have a timer, copyin its contents
    round up to resolution of kernel hw clock with itimerfix
    compute the number of clock ticks (hz) until the timeout, stored in timo
Figure 16.54 select, 2nd half
  retry: loop until select is done
  nselcoll and P_SELECT are set for TH/BH communication
    if changed, then we selscan again
  scan the fds, looking for an event, in selscan
  if error or *retval (any descriptor ready), goto done (return)
  if we have a timer, and we timed out, goto done
  something changed ... try again
  block with selwait as wait channel (string is "select")
    note that the timeout can wake us up too, as well as a bottom-half event
    e.g., ps output for processes sleeping on the "select" wait channel:
      0 52 1 10 2 0 456 324 select Is ?? 0:00.14 pccardd -f /etc/default
      0 76 1 0 2 0 972 680 select Ss ?? 0:00.17 /usr/sbin/syslogd -s
      0 83 1 114 2 0 1056 688 select Is ?? 0:00.00 /usr/sbin/inetd -wW
  done:
    clear P_SELECT
    deal with signals
    copy out bits in/out/ex
Figure 16.55, p. 529, selscan function
  for every bit set in a bitmask set ... compute the fd descriptor
  and call the associated fo_select function
  for sockets this is soo_select ...
  said function is a boolean ... if "TRUE" set the bit in the mask
  outer loop: loop thru the 3 masks
    loop thru 32 bits at a time
      while loop ... convert bits to the bit actually set with ffs
Figure 16.56 soo_select
  checks only one descriptor status
  case FREAD/FWRITE/0 (oob) ...
  readable basically means there is something in the socket input Q
    or read-half is closed
    or any connections pending
    or any error
  writeable, p. 531
    for UDP can always write; for TCP, need free space > 2k (space in so_snd > sb_lowat)
    logic there is sort of:
      if TCP and > 2k, ok
      if UDP, ok
      if any error conditions exist ...
        ok (you will get error)
  read/write call selrecord
    pass sb_sel, struct selinfo
Figure 16.57
  used to indicate > 1 process selecting on an event
  si_flags & SI_COLL (call this a collision)
  call this function when we find a descriptor that is NOT ready.
  This allows the BH to wake up the process when ready.
Figure 16.58 selrecord, p. 532
  if there is a pid (and it can't be us ... due to previous line),
    somebody else is waiting ... set collision
  else store pid
selwakeup function
Figure 16.59 ... various callers of selwakeup (read/write wakeups, etc.)
E.g., in udp_input
  if (sbappendaddr(&inp->inp_socket->so_rcv, append_sa, m, opts) == 0) {
          udpstat.udps_fullsock++;
          goto bad;
  }
  sorwakeup(inp->inp_socket);
append to the recv queue and kick the upstairs proc/s...
What happens if packets are coming in "infinitely" fast?
---------------------------------------------------------
Figure 16.60 selwakeup
  if collision ... wakeup on selwait (wakeup 1..n procs)
  find proc pointer
  turn OFF pid
  make that process runnable