Chapter 14: Network Drivers

network device must register itself and xmit and recv packets
network devices do not appear in /dev, although they may put something in /proc
sockets upstairs, network device downstairs
    may have many sockets to one device
block devices: interrupt because I/O was scheduled
network devices:
    output: interrupt because I/O was scheduled on *output* (transmit)
    input: interrupt because async I/O came in on receive
big functions: send/recv
misc functions modify misc. device characteristics
    full-duplex for ethernet
    ad hoc wireless channel for 802.11
    track errors (for netstat -in)
    support promiscuous mode -- hw
    support multicast addresses in HW filter
look at simplified device called snull
    note: loopback device is at drivers/net/loopback.c

How snull is designed
    ip specific
    assigning ip numbers
        device has two sub-devices, on xmit, the other one gets the pkt
            sn0 and sn1
        we want dst ip to be the other interface, but NOT somehow sent to localhost
        we must modify src/dst ip to make this device work.
            This is not normal for drivers (it is normal for tunnel drivers like ipip.o).
        snull toggles least significant bit of 3rd octet of ip addr
        if we send pkts to net A, they appear on sn1 as net B. See Figure 14-1
        symbolic names for ip addresses:
            sn0:
                snullnet0 is sn0 ip/class C network address
            sn1:
                snullnet1 is sn1 ip/class C network address
                these addresses differ only in LSB of 3rd octet
            local0: sn0 ip
            local1: sn1 ip
                above 2 differ in both LSB of 3rd octet, and last byte.
            remote0: host in snullnet0, 4th octet same as local1
            remote1: host in snullnet1, 4th octet same as local0
            send a packet to remote0, ends up as input on local1/sn1
        note: those addresses in book are private class C addresses.
        note ifconfig setup and trial ping in book.
    physical transport of packets
        snull will belong to ethernet class
        devices have classes; i.e., what kind of header.
        work can be shared sometimes (e.g., ethernet headers)
        packet sniffing mechanism needs to know that too, as headers are NOT stripped
        or arp needs the MAC addresses
        lots of help for ethernet devices, note that parallel device tells kernel it is ethernet (it is NOT)
        you can use tcpdump -n

connecting to the kernel
    as usual a device has a certain set of registration functions
    note: isa-skeleton.c and pci-skeleton.c or 3c509.c as a "simple" (no such thing) device.
    module loading
        e.g., isa device might probe, request port region, interrupt ...
        driver also inserts a dev structure into global list of such things
            struct net_device *dev ... (kmalloc the structure)
            linux/netdevice.h
        in net_device we have name field: "sn0" or "eth0"
        drivers must register their own devices ...
            fill in some function pointers in net_device
            call when ready: register_netdev(*dev)
    init function (also probe function)
        my_dev_init(*dev)
            probe for the device. ... sw thinks it exists ... does it really exist
            if so, initialize it
            ideally avoid alloc ports/irq until "open" time ... may not be the case
        after kernel calls dev->init, dev is expected to be filled in well enough to work
        see p. 432 for code for snull
        priv structure allocated at init time, not open time
            priv structure is for private data for driver
            includes stats info ...
    modular and nonmodular drivers
        drivers/net/Space.c contains list of probes ... turned on via kernel build time CONFIG statements
        Space.c/net_init.c are of importance
        at boot or module insertion we need to call probes and link up net_devices

net_device struct in detail
    can view fields as inited at compile time or dynamically
        visible (compile) vs invisible (dynamic)
    visible
        name
        rmem_start/rmem_end, mem_start/mem_end - device memory
        base_addr - ports
        irq - ISA irq
        dma - dma channel
        state - device state, drivers do not manipulate directly. use utility functions.
        *next - next device in global linked list of devs
        init - init function
    hidden fields
        question to ask about a given field: is it driver internal or kernel/driver api
        ether_setup() is a general function that does much work
            fills out quite a few of the fields
        hard_header_len: how many L2 bytes. ethernet == 14, 6+6+2
        mtu
        tx_queue_len: 100 set via ether_setup. max tx queue size
        type: type of device, ARP uses it. ARPHRD_ETHER
        addr_len = 6
        broadcast[] = 0xffffffffffff
        dev_addr[] = mac addr
        flags - interface flags
            flags ... see book
            IFF_UP means open called
            most of the flags are based on BSD IFF flags, but not all
    device methods
        open - ifconfig calls this to activate the device. Should take o.s. resources here ...
            # ifconfig eth0 ip (or up)
        stop - stop the if, release o.s. resources
            # ifconfig eth0 down
        hard_start_xmit (sk_buff, and dev (packet/device))
            xmit a packet
        hard_header - build the hardware header from MAC addresses
        rebuild_header - rarely used in 2.4 kernel
        tx_timeout - driver function called when pkt transmission fails
            this is a classic watchdog timer hook
        get_stats - method called to provide network driver stats (returns struct net_device_stats).
            e.g., netstat -in called
        set_config - entry point for driver configuration. idea was to change mac/irq. may not be needed.
        do_ioctl - ioctl handler.
        set_multicast_list - set multicast addresses (hw filter list).
        set_mac_address - set device MAC
        change_mtu
        header_cache - fill in the hh_cache structure with results of an ARP query.
            drivers can use the default eth_header_cache implementation
        header_cache_update - update dest MAC in hh_cache in response to a change.
            Ethernet drivers use eth_header_cache_update
        hard_header_parse - Ethernet drivers use eth_header_parse
    utility fields
        net_device data fields can contain useful status info.
        some fields used by ifconfig/netstat.
        unsigned long trans_start/last_rx
            hold jiffies value (kernel clock timer int)
            trans_start = when you sent it
            last_rx = when you got a packet
        watchdog_timeo - in jiffies. how long to wait before deciding that tx_timeout needs to be called
        priv - pointer to private data.
        mc_list/mc_count - list of multicast addresses and count of same
        spinlock_t xmit_lock - spinlock used to avoid multiple simultaneous calls to driver's hard_start_xmit function
        int xmit_lock_owner - CPU that got the lock
            driver does not touch these fields
        owner - module that owns driver

opening/closing
    driver can probe at module load time, kernel boot time, or during open.
    open happens when we do
        # ifconfig
    ifconfig does:
        1. assign address with SIOCSIFADDR, then
            this is device independent ... upper kernel layers do it
        2. turns interface on, with IFF_UP flag via SIOCSIFFLAGS
            driver must turn ON
    if shutdown occurs:
        1. SIOCSIFFLAGS to clear IFF_UP
        2. stop function called
    open tasks
        allocate bus resources if needed
        turn on "hw" (interrupts)
        enable tx ... netif_start_queue(*dev)
        copy MAC address from hw to sw representation
    snull open like so:
        int snull_open
            mod use count
            setup addressing
            start xmit capability
        snull_release
            netif_stop_queue(dev)
            mod use decrement count

packet transmission
    kernel calls hard_start_xmit function to send
    pkt contained in "socket buffer"
        pointer is skb (socket buffer)
    skb handed to hard_start_xmit has *L2* headers in it already
        interface does not need to mod it
    skb->data points to data
    skb->len is length of data
    snull_tx (see book. p. 445)
    controlling transmission concurrency
        hard_start_xmit protected from concurrency problems by spinlock
            can't be called > 1 time whilst in progress
            (hw may not be able to stand it ... to say nothing of queues ...)
        real hw may have limited storage (memory) and ptr chains of its own
        netif_stop_queue "stops the queue"
        optimal performance: many pkts queued ... internally somehow
    transmission timeouts
        hw may fail and fail to interrupt.
        you sent a pkt ... you should get at least one interrupt sooner or later
        in jiffies set value in watchdog_timeo field ...
        if time exceeds that, top layers call tx_timeout method
        goal of tx_timeout is to knock hw on head and get it going again
            (reset or deal with missed interrupt)
        snull_tx_timeout function can be used to play with this functionality
        see sample code p. 448

packet reception
    must allocate sk_buff and hand off to higher layers
    get interrupt ... allocate buffer ... fill it in ... hand it off
    snull_rx gets ptr to data in kernel memory
        see p. 448
        note call to dev_alloc_skb
        note call to netif_rx(skb)
        ip_summed is interesting ... hw may do csum calculation ...

interrupt handler
    interrupt reasons (events):
        1. we transmitted a packet
        2. we received a packet
        3. something else (stats are ready)
    typically we check a status register to determine what kind of event occurred
        switch (status)
            TX
            RX
            other
            error
    snull interrupt handler is software driven ...
    snull ... p. 450-451
        get status
        if TX
            free memory associated with xmitted buffer

changes in link state
    outside network may fail ... can we detect that?
        with pt/pt circuits, yes. with ethernet ... may not be so clear
        (with shared media, no, with pt/pt yes)
    netif_carrier_on()
    netif_carrier_off()

socket buffers
    important fields
        rx_dev - device recv. buffer
        dev - device sending buffer
        union ... h/nh/mac
            h - L4 headers
            nh - L3 headers
            mac - L2 headers
            e.g., TCP ports in skb->h.th
        head/data/tail/end
            head - pointer to beginning of allocated space
            data - pointer to beginning of data
            tail - end of valid data
            end - max address tail can point to
            available buffer space is: end-head
            current used space is: tail-data
        len - the length of data
        ip_summed, checksum policy for pkt
        pkt_type - PACKET_HOST (for me) PACKET_BROADCAST ...
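
The pointer arithmetic in the head/data/tail/end notes above can be tried out in user space. What follows is only a sketch under stated assumptions: struct skb_model and every model_* function are hypothetical names made up for illustration, not kernel API. model_reserve and model_put imitate what the kernel's skb_reserve and skb_put do to the pointers, nothing more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for the four sk_buff pointers. */
struct skb_model {
    unsigned char *head;  /* beginning of allocated space */
    unsigned char *data;  /* beginning of valid data */
    unsigned char *tail;  /* end of valid data */
    unsigned char *end;   /* max address tail can point to */
    size_t len;           /* length of valid data */
};

/* Allocate size bytes; the buffer starts empty (data == tail == head). */
int model_alloc(struct skb_model *skb, size_t size)
{
    skb->head = malloc(size);
    if (skb->head == NULL)
        return -1;
    skb->data = skb->head;
    skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    return 0;
}

void model_free(struct skb_model *skb)
{
    free(skb->head);
    skb->head = NULL;
}

/* available buffer space: end - head */
size_t model_total_space(const struct skb_model *skb)
{
    return (size_t)(skb->end - skb->head);
}

/* currently used space: tail - data */
size_t model_used_space(const struct skb_model *skb)
{
    return (size_t)(skb->tail - skb->data);
}

/* free space in front of the data: data - head */
size_t model_headroom(const struct skb_model *skb)
{
    return (size_t)(skb->data - skb->head);
}

/* free space after the data: end - tail */
size_t model_tailroom(const struct skb_model *skb)
{
    return (size_t)(skb->end - skb->tail);
}

/* Imitates skb_reserve: move data and tail forward on an empty buffer. */
void model_reserve(struct skb_model *skb, size_t n)
{
    assert(skb->len == 0 && n <= model_tailroom(skb));
    skb->data += n;
    skb->tail += n;
}

/* Imitates skb_put: grow the valid data at the end, return where to write. */
unsigned char *model_put(struct skb_model *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    assert(n <= model_tailroom(skb));
    skb->tail += n;
    skb->len += n;
    return old_tail;
}
```

Usage idea: after model_alloc(&skb, 1536), calling model_reserve(&skb, 2) and then model_put(&skb, 14) leaves tail 16 bytes past head, so whatever follows a 14-byte ethernet header starts at offset 16.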
    socket buffer functions
        alloc_skb
        dev_alloc_skb - driver uses this
        kfree_skb
        dev_kfree_skb - driver uses this
        skb_put - add data to end
        __skb_put
        skb_push - add data to front
        __skb_push
        skb_tailroom - amount of space available
        skb_headroom - amount in front available
        skb_reserve - increments both data and tail
            most ethernet drivers reserve 2 bytes in front of the pkt (before the mac hdr),
            so that IP hdr is aligned on longword (ethernet hdr is 14 bytes long)
        skb_pull - remove data from head of pkt

mac address resolution
    how is arp handled in terms of ethernet drivers?
    dev->dev_addr, and dev->addr_len need to be set at open time
    ether_setup does the open time work, by assigning methods to
    dev->hard_header, and dev->rebuild_header
    hard_header is called to lay out the info needed for the L2 (TBD: by whom?)
    if you don't use arp, you need to override the hard_header function with your own
        snull_header ... goal is to fill out ethernet header
    eth_type_trans() called on input:
        extract ethernet type
        assign skb->mac.raw
        remove hw header
        set skb->pkt_type (PACKET_HOST etc)
            so this is important for promisc. mode functionality
            if set to PACKET_OTHERHOST
                meaning I got it, but it wasn't for me acc. to MAC of incoming interface
            acc. to book: "netif_rx will drop any packet of type PACKET_OTHERHOST".
    non-ethernet headers
        e.g., slip/ppp ... perhaps you don't want a header, because you are point to point.

custom ioctl commands
    ioctl has a general structure for passing info as a 2nd/3rd function arg.
    in driver
        ioctl(*dev, struct ifreq *ifr, int cmd)
        defined in
    if ioctl not recognized by protocol layer, then passed to driver
    struct ifreq *, defined in
    plip (parallel ip) allows ioctl to be used to modify internal timers
    16 possible commands using: SIOCDEVPRIVATE to SIOCDEVPRIVATE+15.
    dev->do_ioctl is called ...
        uses switch to dispatch command
        ifr points to kernel buffer
        after do_ioctl returns, structure copied back to user space

stats
    driver needs method for get_stats
    stats are in dev structure
    real work distributed throughout driver
        rx_packets - packet count
        tx_packets - packet count
        rx_bytes - byte count
        tx_bytes - byte count
        rx_errors
        tx_errors
        rx_dropped
        tx_dropped
        collisions
        multicast - # of multicast received

multicasting
    pt/pt interfaces don't really care (meaningless)
    takes extra work ...
        we need to keep a set of HW multicast addresses for multicast filtering to work
    old ethernet i/fs may not be able to deal with multicast
        IFF_MULTICAST is not set at open time
    some other broken interfaces need sw detection to filter out multicast pkts not wanted by us.
    goal: do it in hw if at all possible.
    kernel support for multicasting
        set_multicast_list - if hw list changes, this function is called from above
            also called whenever dev->flags changed ... as we may need to reinit the hw
        dev_mc_list - list of multicast addresses in dev
            the list can be used to do software filtering if necessary.
            ideally we push it into the hw.
        dev->mc_count - count of multicast addresses in list
        IFF_MULTICAST - we do it
            if not set, won't be asked to do it.
        IFF_ALLMULTI - can do prom. mode multicast. recv. all multicast messages.
            multicast routers must do this. this is tricky.
        IFF_PROMISC - every packet should be received, including ones not for us.
        See p. 462 for dev_mc_list
    typical implementation
        see book.

summary:
    what exactly did we learn about the function of the packet socket, bpf, promiscuous mode?
    answer: we set IFF_PROMISC to do something here, but apparently the receive side functionality is upstairs.
        we have to chase netif_rx().
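
As a footnote to the multicasting notes, the flag handling a set_multicast_list method has to do can be sketched in user space. This is a rough model under assumptions, not driver code. The MODEL_* constants, enum rx_filter_mode, choose_rx_filter and mc_list_contains are all hypothetical names. hw_table_size stands for however many addresses an imagined hw filter can hold. Falling back to "all multicast" when the list does not fit in the hw table is one plausible policy, assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the interface flags (values are illustrative). */
#define MODEL_IFF_PROMISC   0x100u
#define MODEL_IFF_ALLMULTI  0x200u
#define MODEL_ETH_ALEN      6

/* What the imagined hw receive filter should be programmed to do. */
enum rx_filter_mode {
    RX_PROMISC,        /* receive every packet, even ones not for us */
    RX_ALL_MULTICAST,  /* receive all multicast packets */
    RX_OWN_ONLY,       /* no multicast wanted at all */
    RX_HW_LIST         /* load the multicast list into the hw filter */
};

/* Decide the filter mode from the device flags and multicast list size. */
enum rx_filter_mode choose_rx_filter(unsigned int flags, size_t mc_count,
                                     size_t hw_table_size)
{
    if (flags & MODEL_IFF_PROMISC)
        return RX_PROMISC;
    /* asked for all multicast, or too many addresses for the hw table */
    if ((flags & MODEL_IFF_ALLMULTI) || mc_count > hw_table_size)
        return RX_ALL_MULTICAST;
    if (mc_count == 0)
        return RX_OWN_ONLY;
    return RX_HW_LIST;
}

/* Software filtering fallback: is addr one of the wanted multicast MACs? */
int mc_list_contains(const unsigned char (*list)[MODEL_ETH_ALEN], size_t count,
                     const unsigned char *addr)
{
    size_t i;

    assert(addr != NULL && (list != NULL || count == 0));
    for (i = 0; i < count; i++) {
        if (memcmp(list[i], addr, MODEL_ETH_ALEN) == 0)
            return 1;
    }
    return 0;
}
```

The order of the checks matters in this sketch: promiscuous mode wins over everything else, and the per-address list is only consulted when neither promiscuous nor all-multicast reception applies.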