2. Mbufs: Memory buffers

2.1 intro

history: 128 byte buffers designed on the VAX, as the VAX did not have much memory.
It has been suggested that they are overly complex, and that simpler buffer
schemes make better sense on more memory-rich systems.

write/encapsulate and prepend headers
read/decapsulate and remove headers

efficiency note: explain about zero-copy ...
    and about how device drivers MAY work in terms of memory
note: # netstat -m
note: open sockets cause allocation of mbufs

mbufs hold: data to/from userland, socket addresses, options,
    note use in setsockopt e.g., and other places

Figure 2.1
4 kinds of mbufs
note 20 byte header, 108 bytes data possible in mbuf itself
    mbuf + data
note mbuf may NOT use data portion, point to cluster instead
note mbuf chain may exist + pointer may allow mbuf chains
    to be glued together as > 1 packet

1. m_flags = 0
    108 bytes of data
    m_data may point anywhere, and may be moved to decapsulate
    m_len specifies how much data, m_data + m_len is end ...

2. m_flags has M_PKTHDR set, therefore 1st mbuf in mbuf chain.
    pkthdr itself is first chunk of data
    m_pkthdr.len/m_pkthdr.rcvif
    m_pkthdr.len = total length of packet
    m_len = length of data in this mbuf

3. cluster in use; that is, M_EXT is set.
    Note: 2048 is big enough for an ethernet MTU and is commonly used.
    Mogul performance discovery: 1024 for TCP/MTU
    seen as memory vs performance tradeoff.

4. cluster in use: pkt header.
    m_flags = M_PKTHDR | M_EXT

cluster performance question: if at all possible, do not copy mbuf data IN to
user-space, swap page pointers. Output however could be very tricky.
... (we need to prepend headers).

note: udp allows 0 length data, therefore m_len == 0 is possible.
note: m_ext.ext_buf = address of buffer
      m_ext.ext_size = ... size ...
note: m_next - mbuf chain
      m_nextpkt - packet chain (1 pkt == 1 or more mbufs)

Look at Figure 2.2

---------------------------------------------------------------
2.2 Code Introduction

sys/mbuf.h
kern/uipc_mbuf.c

global variable mbstat ...
and variables

sysctl -a shows:
    kern.ipc.nmbclusters: 2000
    kern.ipc.nmbufs: 8000

kernel uses mbstat with fields to track various usage items.
internally it increments/decrements (i++, i--) or assigns (i = j)
process uses namelist and random access thru /dev/mem to get at
kernel structure to read it out.
Linux tends to do this kind of thing thru /proc ...; BSD use is now more limited.

2.3 mbuf definitions

/usr/include/sys/mbuf.h
/usr/include/machine/param.h
    actually sys/param.h
    point out "problem" in this regard.

2.4 mbuf structure

note macros at the bottom ... intended to simplify union mess

m_flags:
    M_BCAST ... link-level broadcast flag
    M_EOR - used by XNS, not tcp
    M_EXT - external cluster
    M_MCAST - multicast
    M_PKTHDR - 1st mbuf in possible chain
    M_COPYFLAGS ... used if mbuf chain copied

figure 2.10 - m_type ... data stored in mbufs is typed.
actually all data dynamically allocated is typed by kernel malloc.

    MT_CONTROL - extra-data protocol message
    MT_DATA - data!
    MT_FTABLE - fragment reassembly header
    MT_HEADER - packet header
    MT_SONAME - socket name
    MT_SOOPTS - socket options (setsockopt)

vmstat -s

2.5 simple mbuf macros/functions

some "functions" are macros to make them in-line. space vs. time tradeoff.

m_get
    nowait argument = M_WAIT or M_DONTWAIT
    M_DONTWAIT: if you don't have space, return immediately.
    note drivers will use this ... they can't block.
    socket layer uses M_WAIT, as it can block the process.
    note: even M_WAIT can fail. ENOBUFS ...

MGET
m_retry
    per protocol "drain" function
    drain ... ip e.g., can discard fragments awaiting reassembly.
    tcp doesn't do anything. udp has no drain func.
    MGET called again.

mbuf locking
    MALLOC
    splimp at beginning, blocks drivers
    splx at end

MCLALLOC/MCLFREE cluster alloc/free do something similar

2.6 m_devget/m_pullup

m_pullup used to guarantee the tcp header, ip header, etc.
is in the mbuf you are looking at, not somehow spread over into the next mbuf.
if not, copied into new mbuf.

m_devget
figure 2.14.
left-hand-side: data between 0..84 (tcp control)
    16 bytes left because ethernet header MAY be stored in there
    and ip header is word aligned.
case 2: no room for ethernet header.
case 3: 2 mbufs because of extra data.
case 4: cluster (normally only thing done at this point)

mtod/dtom macros
    mbuf to data, data to mbuf
    mtod - pointer to mbuf data
    dtom - given data ptr, give me ptr to mbuf itself
    dtom cannot be used with clusters.

m_pullup function and contiguous protocol headers:
1. used to make sure data in mbuf >= protocol headers size (ip, etc.)
note failure possibilities include 2:
    1. not enough data (packet too short, although not likely to be
       physically too short)
    2. no mbufs and needs to do pullup

m_pullup and ip fragmentation/reassembly
    used in ip reassembly and tcp reassembly
    fragments kept in doubly linked list, using ip src/dst in ip hdr
    to hold forward/backward list pointers
    what if cluster used though, then ip src/dst in cluster, not mbuf
    dtom cannot be used, no back pointer from cluster to mbuf hdr/mbuf
    m_pullup always called ... forces ip header into its own mbuf
    always moves 40 bytes (ip + tcp hdr)

    tcp avoids doing this.
    tcp data is either BIG, MTU sized (ftp/email/web), or small,
    segments with 10 bytes of data (telnet/slogin/web)
    mbuf pointer stored in tcp header ...

2.7 summary of mbuf macros/functions

MCLGET - get a cluster. set data pointer of m (mbuf hdr at least) to cluster
    note: mbuf is needed a priori
MFREE - free single mbuf, if M_EXT set ...
    note: cluster has reference count which is decremented,
    but cluster is not freed until count == 0
    note m_freem - frees entire packet (mbuf chain)
MGETHDR - allocate mbuf and init as packet hdr.
M_PREPEND - make room for len bytes in front of mbuf data
    if room exists, then we just manipulate ptr, else
    allocate new mbuf and fix pointers
    used e.g., if encapsulation done
dtom/mtod

figure 20: mbuf functions

m_adj - if len positive, trim bytes from FRONT of data, else trim from end.
m_cat - glue one chain to another
m_copy - give me a copy of a chain
m_copydata - give me a copy of len of data
m_copym - normal mbuf copy, with offset for start ...
m_devget - create new mbuf chain with pkt header, and return ptr to chain.
    device driver (old) might use this.

2.8 summary of net3 networking data structures

1. mbuf chain. linked thru m_next
2. packet list ... linked thru m_nextpkt
    e.g., socket send/recv buffer. ip input.
3. above as linked list, in queue (head/tail)
4. doubly linked, circular list
    ip fragmentation/reassembly, pcbs, tcp's out of order segment queue use this
    insque/remque used here. These were actually CISC primitives
    in the machine instruction set for the VAX.

2.9 m_copy and cluster reference counts.

clusters pros include:
1. 1500 byte packet would need lots of small mbufs ...
2. allow sharing of data
3. allow page ptrs to be moved ... so data need not be copied.

assume app writes 4k to tcp socket.
one cluster filled with 2k of data.
tcp send routine appends mbuf to send buffer
calls tcp_output
tcp must prepend small mbuf with room for
    ethernet hdr,
    ip, (remember that there is an ip pseudo-hdr function)
        i.e., tcp *routes* and performs its own checksum
    tcp hdr
    tcp checksum
so tcp asks interface to queue data ... if it sends, it tries to delete the mbuf chain
    hdr deleted, but not cluster with data ...
tcp_output actually does m_copy to copy the data
m_copy prepends a header but cluster is shared, not copied.
ethernet driver calls m_freem post transmit, HOWEVER ...
    this only releases the prepended header part, decrements cluster ref. count
tcp has to decide via returned ack that data has been acked ...
    and then do a free
note: on write-side of stack, only need to prepend one mbuf.

TBD? where exactly is cluster reference count?
TIMEOUT: for story

2.10 alternatives/criticism

mbufs designed when memory was scarce ... no longer the case.
this is why modern drivers will allocate a cluster for everything
take note of what socket allocation code does when we get there ...

2.11 summary

4 kinds of mbufs, depending on whether M_PKTHDR/M_EXT used.
1. no packet header, 108 bytes of data
2. packet header, 100 bytes of data
3. no packet header, with cluster
4. packet header, with cluster

------------------------------------------------------------------
OK, explain this:

4.7 BSD: /sys/dev/wi/if_wi.c ... Lucent/prism2 802.11b device driver:

loadable modules (device drivers) are now placed in /sys/dev,
stored in /modules, and can be loaded with kldload at boot.
This one is lucent/orinoco/prism2 802.11 device driver.

# kldload
# kldstat

Can also be statically linked into kernel.

---------------------------------------------------------
wi_rxeof(sc)
...
    /* first allocate mbuf for packet storage */
    MGETHDR(m, M_DONTWAIT, MT_DATA);
    if (m == NULL) {
        ifp->if_ierrors++;
        return;
    }
    MCLGET(m, M_DONTWAIT);
    if (!(m->m_flags & M_EXT)) {
        m_freem(m);
        ifp->if_ierrors++;
        return;
    }
    m->m_pkthdr.rcvif = ifp;

    /* now read wi_frame first so we know how much data to read */
    if (wi_read_data(sc, id, 0, mtod(m, caddr_t),
        sizeof(struct wi_frame))) {
        m_freem(m);
        ifp->if_ierrors++;
        return;
    }
...
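
---------------------------------------------------------
sketch: the wi_rxeof allocation pattern, modeled in userland

The driver fragment above follows the usual two-step pattern: get a packet-header
mbuf, try to attach a cluster, then test M_EXT to see whether the cluster
actually arrived. The sketch below models that pattern, the four kinds of
mbufs, and the shared-cluster reference count from 2.9 as ordinary userland C.
It is not kernel code. Every toy_ name is hypothetical, the flag values are
illustrative rather than copied from sys/mbuf.h, and keeping the reference
count inside the cluster is an assumption made only for this model (the notes
leave its real location as a TBD). The sizes 128, 20, 108, 100 and 2048 come
from the notes.

```c
#include <assert.h>
#include <stdlib.h>

/* Sizes from the notes: 128-byte mbuf with a 20-byte header. */
#define MSIZE    128
#define MLEN     (MSIZE - 20)   /* 108 data bytes, no packet header */
#define MHLEN    (MLEN - 8)     /* 100 data bytes after the packet header */
#define MCLBYTES 2048           /* cluster size */

/* Illustrative flag values, not copied from sys/mbuf.h. */
#define M_EXT    0x1            /* data lives in an external cluster */
#define M_PKTHDR 0x2            /* first mbuf of a packet */

/* Hypothetical cluster: the count sits next to the data (an assumption). */
struct toy_cluster {
    int  refcnt;
    char buf[MCLBYTES];
};

struct toy_mbuf {
    struct toy_mbuf    *m_next;    /* next mbuf in this chain */
    char               *m_data;    /* start of valid data */
    int                 m_len;     /* bytes of data in this mbuf */
    int                 m_flags;   /* M_EXT and/or M_PKTHDR */
    int                 pkt_len;   /* total packet length (M_PKTHDR only) */
    struct toy_cluster *ext;       /* cluster (M_EXT only) */
    char                dat[MLEN]; /* internal data area */
};

/* Like MGET/MGETHDR with M_DONTWAIT: returns NULL instead of blocking. */
struct toy_mbuf *toy_get(int pkthdr)
{
    struct toy_mbuf *m = calloc(1, sizeof(*m));
    if (m == NULL)
        return NULL;
    if (pkthdr)
        m->m_flags = M_PKTHDR;
    /* A packet header uses up the first MLEN - MHLEN bytes of the data area. */
    m->m_data = pkthdr ? m->dat + (MLEN - MHLEN) : m->dat;
    return m;
}

/* Like MCLGET: sets M_EXT only on success, so the caller tests the flag. */
void toy_clget(struct toy_mbuf *m)
{
    struct toy_cluster *c = malloc(sizeof(*c));
    if (c == NULL)
        return;
    c->refcnt = 1;
    m->ext = c;
    m->m_data = c->buf;
    m->m_flags |= M_EXT;
}

/* Room in this one mbuf: 108, 100, or 2048 bytes. */
int toy_capacity(const struct toy_mbuf *m)
{
    if (m->m_flags & M_EXT)
        return MCLBYTES;
    return (m->m_flags & M_PKTHDR) ? MHLEN : MLEN;
}

/* Like the cluster case of m_copy: new mbuf header, same cluster. */
struct toy_mbuf *toy_share(const struct toy_mbuf *m)
{
    struct toy_mbuf *n;
    assert(m->m_flags & M_EXT);
    n = toy_get(m->m_flags & M_PKTHDR);
    if (n == NULL)
        return NULL;
    n->ext = m->ext;
    n->ext->refcnt++;              /* data is shared, not copied */
    n->m_data = m->m_data;
    n->m_len = m->m_len;
    n->pkt_len = m->pkt_len;
    n->m_flags |= M_EXT;
    return n;
}

/* Like m_freem: frees the whole chain; a cluster goes away only when its
 * count reaches 0. Returns how many clusters were actually released. */
int toy_freem(struct toy_mbuf *m)
{
    int released = 0;
    while (m != NULL) {
        struct toy_mbuf *next = m->m_next;
        if (m->m_flags & M_EXT) {
            assert(m->ext->refcnt > 0);
            if (--m->ext->refcnt == 0) {
                free(m->ext);
                released++;
            }
        }
        free(m);
        m = next;
    }
    return released;
}
```

To mirror wi_rxeof, a caller would do toy_get(1), then toy_clget(m), then
check (m->m_flags & M_EXT) and call toy_freem(m) on failure. To mirror the
tcp_output story in 2.9, call toy_share(m) for the copy handed to the driver:
freeing that copy only drops the count, and the cluster is released when the
original in the send buffer is freed as well.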