chapter 12: loading block drivers

1. introduction

in addition to net/char drivers, we have block drivers
primarily for hard disk. provides access to block device of course.

[root@zibble root]# ls -l /dev/hda*
brw-rw----    1 root     disk       3,   0 Aug 30  2001 /dev/hda
brw-rw----    1 root     disk       3,   1 Aug 30  2001 /dev/hda1
brw-rw----    1 root     disk       3,  10 Aug 30  2001 /dev/hda10
brw-rw----    1 root     disk       3,  11 Aug 30  2001 /dev/hda11
brw-rw----    1 root     disk       3,  12 Aug 30  2001 /dev/hda12
brw-rw----    1 root     disk       3,  13 Aug 30  2001 /dev/hda13
brw-rw----    1 root     disk       3,  14 Aug 30  2001 /dev/hda14
brw-rw----    1 root     disk       3,  15 Aug 30  2001 /dev/hda15
brw-rw----    1 root     disk       3,  16 Aug 30  2001 /dev/hda16
brw-rw----    1 root     disk       3,   2 Aug 30  2001 /dev/hda2
brw-rw----    1 root     disk       3,   3 Aug 30  2001 /dev/hda3
brw-rw----    1 root     disk       3,   4 Aug 30  2001 /dev/hda4
brw-rw----    1 root     disk       3,   5 Aug 30  2001 /dev/hda5
brw-rw----    1 root     disk       3,   6 Aug 30  2001 /dev/hda6
brw-rw----    1 root     disk       3,   7 Aug 30  2001 /dev/hda7
brw-rw----    1 root     disk       3,   8 Aug 30  2001 /dev/hda8
brw-rw----    1 root     disk       3,   9 Aug 30  2001 /dev/hda9

Note major device: 3

[root@zibble root]# df
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/hda3             27585276   3719476  22464532  15% /
/dev/hda2                46668      9053     35206  21% /boot
none                    127672         0    127672   0% /dev/shm

(not a lot of partitions actually used)
...
# fdisk /dev/hda
Command (m for help): p

Disk /dev/hda: 255 heads, 63 sectors, 4865 cylinders
Units = cylinders of 16065 * 512 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/hda1   *         1      1305  10482381    c  Win95 FAT32 (LBA)
/dev/hda2          1306      1311     48195   83  Linux
/dev/hda3          1312      4800  28025392+  83  Linux
/dev/hda4          4801      4865    522112+   f  Win95 Ext'd (LBA)
/dev/hda5          4801      4865    522081   82  Linux swap

[root@zibble proc]# mount
/dev/hda3 on / type ext3 (rw)
none on /proc type proc (rw)
usbdevfs on /proc/bus/usb type usbdevfs (rw)
/dev/hda2 on /boot type ext3 (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
none on /dev/shm type tmpfs (rw)

probe messages something like this:

Nov 13 14:42:14 zibble kernel: hda: WDC WD400BB-32CLB0, ATA DISK drive
Nov 13 14:42:14 zibble kernel: blk: queue c03c9f40, I/O limit 4095Mb (mask 0xffffffff)
Nov 13 14:42:14 zibble kernel: hdc: CD-RW CRX100E, ATAPI CD/DVD-ROM drive
Nov 13 14:42:14 zibble kernel: ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Nov 13 14:42:14 zibble kernel: ide1 at 0x170-0x177,0x376 on irq 15

linux block device code is messy
    1. important and hard to change
    2. needs to be optimized and run fast ... need for speed is important

example drivers in book:
    1. sbull - ram disk driver
    2. spull - variation with partition tables

real drivers will have interrupts ... and other messy details (dma)

block - block of data as determined by OS, say 1 block == 1k
    block size is a power-of-2 multiple of the sector size.
sector - 512 bytes

registering the driver

see code, p. 322.
register_blkdev() - similar to how char driver is registered.

note struct block_device_operations:
    char devices needed file_operations
    blk devices need block_device_operations

like so:
    open - as with char devices
    release - as with char devices
    ioctl - as with char devices
    check_media_change - later
    revalidate - later

no owner field ... block devices must do their own usage count

NOTE: no read or write, because you must register a single routine
and a queue to deal with that.
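
sketch: the "do their own usage count" point above, as a user-space
model only (not kernel code). everything named toy_* is hypothetical and
stands in for the real ops table and driver methods; the only thing
taken from the notes is that the table has open/release but no
read/write, and that the driver tracks its own openers.

```c
#include <assert.h>

#define TOY_NDEVS 2                 /* hypothetical number of devices */

/* Hypothetical stand-in for an ops table: note there is no read or write. */
struct toy_blk_ops {
    int (*open)(int minor);
    int (*release)(int minor);
};

int toy_usage[TOY_NDEVS];           /* usage counts kept by the driver itself */

/* open: validate the minor number, then count one more opener. */
int toy_open(int minor)
{
    if (minor < 0 || minor >= TOY_NDEVS)
        return -1;                  /* no such device */
    toy_usage[minor]++;
    return 0;
}

/* release: undo one open; refuse a release that has no matching open. */
int toy_release(int minor)
{
    if (minor < 0 || minor >= TOY_NDEVS || toy_usage[minor] == 0)
        return -1;
    toy_usage[minor]--;
    assert(toy_usage[minor] >= 0);  /* the count can never go negative */
    return 0;
}

const struct toy_blk_ops toy_ops = {
    .open    = toy_open,
    .release = toy_release,
};
```

a caller would go through the table, e.g. toy_ops.open(0), and the
driver could refuse to unload while any toy_usage[] entry is non-zero.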
in general: i/o performed on block devices is buffered ... a cache exists
for access

    system calls
    ------------
    buffer cache
    ------------
    drivers

although there are administrative commands like fsck that seem to act
directly on so-called raw devices (raw devices ... rant goes here.
Note that linux has /dev/raw as a device-independent mechanism
so device drivers need not implement it). more info on raw devices in chapter 13

read/write routine called "request" routine. must register it.
see code. p. 323.

we register:
    1. function to do request
    2. our request queue

init function sets up queue/request/ops ...

the block device queue is maintained by the kernel and indexed by major number
(mount /dev/hda3 /tmp ... is how the kernel maps and inits the device
so that it knows that access to /tmp goes thru the driver)

misc additional globals that should be set: index is major number.
    blk_size[][] - major and minor used to access. size of each device in kbytes.
    blksize_size[][] - size of the block used by each device in bytes
    hardsect_size[][] - 512 bytes usually
    read_ahead[] (read ahead at block level)
    max_readahead[][] - number of sectors to be read ahead on file access
        (read ahead at file level)
    max_sectors[][] - max size of a single request

sbull device sets these load time variables:
(which are used in turn to load the above globals)
    size=2048, 2megs ...
    blksize=1024, software block size in bytes
    hardsect=512
    rahead=2

see code for init on p. 327
note call to register_disk which is used to setup partition table,
but sbull doesn't have that feature ...

cleanup calls: fsync_dev to flush cache to "hardware"

header file blk.h

include <linux/blk.h>
MAJOR_NR must be declared before you use the header (see code on p.
330)

    MAJOR_NR - the major number
    DEVICE_NAME - string
    DEVICE_NR (kdev_t device) - probably the minor device number,
        but it should represent the device (volume), not the partition
    DEVICE_INTR - points to bottom-half handler that is currently being used
        (might change if different interrupt handlers for different kinds
        of interrupts exist)
    DEVICE_ON - macro for performing processing before xfer
    DEVICE_OFF - after xfer
        e.g., turn floppy motor on/off
    DEVICE_NO_RANDOM - end_request() contributes to system entropy;
        if you don't think your device should do that, define it.
    DEVICE_REQUEST - specify request handler - don't need it now.

p. 330, sbull examples for above.

insert TH-versus-BH rant here

linux TH enqueues the block and makes optimizations (elevator
and/or clustering) in block queue to minimize seek times.
there exists a special disk runtime queue.

BH "executes" device request queue. May choose to do it different ways,
although the way the book suggests does seem to be normal.
BH is linux BH as well as interrupt-side. kswapd or other mechanisms
may arrange to drive the "tasklets" in a special task queue.

linux has buffer cache, and Top-half system call code (read ...) may
find the buffer in the cache, and not need to bother the driver.

TH processes sleep ... waiting for disk i/o completion.

handling requests: a simple introduction

request queue
    kernel top-half wants to xfer data: it puts a request on the driver's
    queue and "starts" the driver (calls its request function)

request function does:
    1. check validity of request. e.g., make sure request is NOT
       off end of block size array. common macro INIT_REQUEST defined
       in blk.h is used here.
    2. perform the actual data xfer (or start it ...)
       CURRENT is used and is pointer to struct request
    3. clean up the request just processed ...
       performed by end_request, which is defined in blk.h ...
       does wakeups, and manages CURRENT variable, making sure it points
       to next request in request queue.
    4. loop back to beginning ...

see example, p.
331 for request function, which does not xfer any data.

INIT_REQUEST performs return if queue is empty.

request must be atomic, and must not sleep. It can be called
at interrupt time or via a tasklet.
note: request is not running in the context of any particular process.

**********************************************************************

Performing the Actual Data Xfer

struct request
    CURRENT used to access these fields
    (CURRENT *request*, not the current TH process)

    kdev_t rq_dev - device, has minor number.
    int cmd - READ, WRITE
    unsigned long sector - 1st sector to be transferred
    unsigned long current_nr_sectors - # of sectors to xfer.
    char *buffer - pointer to where buffer data should be written or read from
    struct buffer_head *bh (every buffer data area has a head pointer
        for linkage in the buffer cache)

code p. 333,
    sbull_locate_device() determines internal device structure
    sbull_transfer makes one transfer

p. 334, sbull_transfer routine
    ptr is which device/which sector * sector_size
    memcpy is used as this is a ram disk ...

handling requests: the detailed view

more details ...

i/o request queue
    per device
    seeking is slow, xfer is fast
    optimize seek if possible
    clustering contiguous blocks is another possibility

    elevator algorithm used to optimize seeks
        go in/out (up/down) with seek heads
        but add a new request in such a way to minimize seek time if possible
        fairness is an issue here

    request_queue structure has pointers to functions that may make
    queue operations (e.g., elevator algorithms) plus driver request function

buffer cache
    hashed lists ... buffer head for each buffer data area

    buffer_head fields include:
        b_data -> to data region
        b_size - size of data region
        b_rdev - device holding the block
        b_rsector - sector on device
        b_reqnext - linked list of buffer head structures
        b_end_io - who to call when i/o on this buffer is complete

    blocks passed to driver are in buffer cache ...
    (or are made to look like it)

request queue manipulation
    functions for manipulating the request queue

i/o request lock
    queue is classic case of potential TH/BH race conditions
    as well as SMP race conditions
    request queues are protected with io_request_lock
    must hold lock and disable interrupts
    kernel calls request function with io lock held ...
    expensive lock to hold ... good if you can drop it quickly

blk.h macros/functions

INIT_REQUEST
    1. makes consistency checks on the request queue
    2. returns if request queue is empty

end_request
    driver has done one buffer request and then must call this function
    1. complete i/o processing, call b_end_io to wakeup any TH process
       waiting on the buffer or on events like "no buffers available"
    2. remove buffer from request list, and update request structure
       fields as well
    3. call add_blkdev_randomness to update entropy
    4. release finished request back to system (io_request_lock is
       needed here)

clustered requests
    if requests are for adjacent blocks, we might merge them
    this is not the default but the driver may choose to do so
    e.g., linux floppy driver tries to write an entire track in a single operation
    high-performance disk controllers can do scatter/gather i/o

active queue head
    kernel leaves active queue head alone -- won't put anything in front of it
    driver may remove request before it processes it
    if so, it can tell the kernel it is ok to modify the queue head
        blk_queue_headactive(queue, 0);

multiqueue block drivers
    a driver may have > 1 disk, and therefore need > 1 queue.
    driver must define its own request queues, e.g., sbull could put
    in its per-device structure:
        request_queue_t queue;
        int busy;
    request queues must be initialized ... see code p. 343
        blk_queue_headactive()
    also puts these queues in the global blkdev structure so they can be found
    driver must also implement queue mapping function to help kernel
    find a particular queue
    p.
344 example of multiqueue function

doing without the request queue

request queue basically allows for TH/BH asynchronous operation.
and chance at optimization for slow seek devices
but you might not need it; e.g., a ram disk doesn't need it.
memcpy might as well just be done.

sbull has request queue and processes it synchronously (from the
request point of view), but doesn't need it.

block i/o requests are placed on queue by call to __make_request, ...
queue has that as default routine, but can override with
blk_queue_make_request if so desired.

make_request must:
    1. arrange to xfer the block
    2. see that b_end_io is called when xfer is done.

kernel does NOT hold io_request_lock when calling this function,
so function must acquire the lock itself if it is for some reason
using the request queue.

raid device: has multiple devices, needs to map i/o to particular
devices "sub" device driver.

if make_request returns non-zero, kernel will try again,
thus device remapping can occur. RAID can make layer of indirection:
    call RAID device 1st time, then call real device 2nd time

see code p. 347

how mounting and unmounting work

block devices are mounted on the FS.
kernel mounts partition on directory in FS.
    makes sure previous directory access now goes to partition "/" (sub-root)
    for future inode pathname lookups.
    does open on device driver
        passes filp with f_mode, which is either:
            FMODE_READ
            FMODE_READ and FMODE_WRITE both set

mkfs/fsck can call driver directly (or dd for that matter)

release method called at unmount

see code. p. 349

the ioctl method

    BLKGETSIZE - return # of sectors.
    BLKFLSBUF - flush buffers, see code
    BLKRRPART - reread partition table
    BLKRAGET/BLKRASET - get/set block-level read-ahead value
    BLKFRAGET/BLKFRASET - get/set filesystem read-ahead value
    BLKROSET/BLKROGET - read-only flag for the device
    BLKSECTGET/BLKSECTSET - retrieve and set maximum number of sectors per request
    BLKSSZGET - returns sector size of this block device
    BLKPG - add/delete partitions, implemented in a general way.
    BLKELVGET/BLKELVSET - elevator request tweaking, again, implemented
        in a general way
    HDIO_GETGEO - get disk geometry

it is likely that only:
    1. BLKGETSIZE and
    2. HDIO_GETGEO
need to be implemented in the driver itself.
other calls can be passed to blk_ioctl()

see code p. 351

removable devices

block_device_operations has support for removable media
    1. check_media_change
       has device changed since last access?
    2. revalidate
       reinit driver's status after disk change

sbull: if you leave device unmounted long enough ... the disk disappears
after 30 seconds. next access allocates a new memory area.

check_media_change code, return 1 if device has changed or
it may have changed.

revalidation: called if change occurs. sbull: creates disk area if none found.

extra care
    when a mount occurs, the system calls check_disk_change
    to check for any change,
    however opens can occur without calling mount (fsck)
    therefore: if device is removable, call check_disk_change() yourself
    at open time.

partitionable devices

devices are usually divided up into partitions. fdisk used for this.
spull: demonstrates partitions

typical setup:
    /dev/hda access entire disk
    /dev/hda1 etc are sub-partitions
    /dev/hda2 ...

spull has: /dev/pd and /dev/pda thru /dev/pdd,
    a/b/c/d as 4 whole devices (units)

minor number:
    least significant 4 bits are partition
    most significant 4 bits are unit number
    therefore 4 entire devices, each with 16 minor numbers
    (the whole device plus up to 15 partitions)

generic hard disk

device needs to understand its own partition setup in terms
of how many blocks on this partition, etc.
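
sketch: the minor-number split described above (low 4 bits partition,
high 4 bits unit), as plain user-space C. the toy_* names and the
TOY_SHIFT constant are hypothetical; only the 4/4 bit layout comes from
the notes, and partition 0 is assumed to mean the whole unit.

```c
#include <assert.h>

#define TOY_SHIFT     4                        /* partition bits per unit */
#define TOY_PART_MASK ((1 << TOY_SHIFT) - 1)   /* 0x0f */

/* Unit (whole device) number: the bits above the partition field. */
int toy_unit(int minor)
{
    return minor >> TOY_SHIFT;
}

/* Partition number within the unit: the low TOY_SHIFT bits. */
int toy_partition(int minor)
{
    return minor & TOY_PART_MASK;
}

/* Build a minor number from a unit and a partition. */
int toy_minor(int unit, int part)
{
    assert(unit >= 0);
    assert(part >= 0 && part <= TOY_PART_MASK);
    return (unit << TOY_SHIFT) | part;
}
```

under this layout minor 0x23 is unit 2, partition 3, and the whole
second unit is minor 16 (unit 1, partition 0).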
kernel offers generic support for all drivers
    can hide partition details from the driver

struct gendisk structure describes layout of the disk,
kernel maintains global list of such structures

struct gendisk
    int major - major number for device
    char *major_name, e.g., "hd"
    int minor_shift - bit shift needed to extract drive number
        from minor number, e.g., 4.
    int max_p - max. number of partitions
    struct hd_struct *part - decoded partition table for device,
        used to find sector range
    int *sizes - array of ints with same info as blk_size array
    int nr_real - number of units that exist
    void *real_devices - private area for device
    struct gendisk *next - next gendisk structure in the list
    struct block_device_operations *fops; - ptr to block device ops
        for this device

many of these fields are setup at init time

partition detection
    at init time, must set things up for partition detection
    driver must fill in partition table
    info is viewable in /proc/partitions
    register_disk - handles job of reading partition table

partition detection using initrd
    at boot linux offers initrd
    initrd idea: load a root ram disk and run programs from it at boot.
    e.g., load hard-disk driver *module* which is not part of kernel
    goal: 2-phase boot using kernel with minimal set of drivers,
    then possibly custom set of modules in initrd.

device method for spull
    ignore this ...

interrupt-driven block drivers
    request routine starts i/o on one block
    interrupt handler finishes i/o and then calls request
    see code example

afterwards can have a new root
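
sketch: the interrupt-driven pattern noted above (request routine
starts i/o on one block and returns; the interrupt handler finishes it
and calls request again), modelled in user space. this is not the
book's code: the toy_* names, the fixed-size array queue, and calling
toy_interrupt() by hand in place of a real hardware interrupt are all
assumptions made for illustration.

```c
#include <assert.h>

#define TOY_QLEN 8                  /* hypothetical queue capacity */

int toy_queue[TOY_QLEN];            /* pending sector numbers */
int toy_head, toy_tail;             /* pending requests are [toy_head, toy_tail) */
int toy_busy;                       /* non-zero while a transfer is in flight */
int toy_completed;                  /* number of finished requests */

/* Top half: put a request on the queue; -1 if the queue is full. */
int toy_enqueue(int sector)
{
    if (toy_tail == TOY_QLEN)
        return -1;
    toy_queue[toy_tail++] = sector;
    return 0;
}

/* Request routine: start i/o on the head request if idle. Never waits. */
void toy_request(void)
{
    if (toy_busy || toy_head == toy_tail)
        return;                     /* already transferring, or nothing to do */
    toy_busy = 1;                   /* a real driver would program the hardware here */
}

/* Stand-in for the interrupt handler: finish the request, then restart. */
void toy_interrupt(void)
{
    if (!toy_busy)
        return;                     /* spurious "interrupt" */
    assert(toy_head < toy_tail);    /* a busy device always has a head request */
    toy_head++;                     /* the head request is done */
    toy_completed++;
    toy_busy = 0;
    toy_request();                  /* kick off the next request, if any */
}
```

with two requests queued, one toy_request() call starts the first, and
each toy_interrupt() completes one and starts the next until the queue
drains.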