Chapter Three - Char Drivers

goal: a modular char driver. not a module as such, though it could be a module!

"scull": Simple Character Utility for Loading Localities.
    stores data in memory. no real hw.

design of scull

scull[0-3]. 4 char devices.
    each has a memory area that is global and persistent.
    persistent simply means the data stays in memory across open/close calls.
    can access it with cp/cat/shell i/o redirection.
scullpipe0 to scullpipe3. 4 fifo devices.
    one process reads whilst another writes.
scullsingle, scullpriv, sculluid, scullwuid
    similar to scull0 but with limits on opens. policy devices:
    scullsingle - only one process at a time can use the driver
    scullpriv - console specific.
    sculluid/scullwuid - only one user at a time, but can be opened multiple times.
        uid - returns "device busy" if another user tries to use it.
        wuid - implements a blocking open.

only scull[0-3] are covered in this chapter

major/minor numbers

char devices are accessed thru /dev files.
    "special files"; i.e., hooks into kernel operations.
for char devices, ls -l shows a "c" file type. block devices have a "b" file type.
ls -l also shows two numbers: the major device number and the minor device number.
open on /dev/something lets the kernel discover it is a special file,
and then determine the driver thru a table lookup. the minor number is passed to the driver.

(major, minor)
    major - which driver
    minor - which instance of the driver

e.g., ls -l of
    /dev/null gives us 1,3
    /dev/zero gives us 1,5
this means the same driver (memory). 3 tells it to throw things away.
5 tells it to return zeroed bytes on a read.

    cp /dev/null here                        (gives us a zero-length file)
    dd if=/dev/zero of=glorp bs=1 count=1    (gives us 1 one-byte block of zeros;
                                              without count, dd does not stop on its own)

what would this do:
    # dd if=/dev/zero of=glorp
    (let it run to completion)
    # rm glorp

drivers may have MULTIPLE devices. classic reason:
    /dev/tty1
    /dev/tty2
same serial driver, two serial ports.

2.4 introduces "devfs", the device file system, which is optional.
it is different from the traditional maj/min /dev system.

internal representation of device numbers

dev_t has the major/minor value in it.
    32 bits: 12 bits for major, 20 bits for minor.
use these macros to get the major/minor values:
    MAJOR(dev_t dev);
    MINOR(dev_t dev);
    MKDEV(int major, int minor) -> makes a dev_t
kernels before 2.6 were limited to 255 major numbers and 255 minor numbers.

alloc/free of device numbers

to get a particular device number:
    int register_chrdev_region(dev_t first, unsigned int count, char *name);
    first is the first major/minor pair, count is the total number of
    device numbers, name is the name of the device.
    name will appear in /proc/devices and sysfs.
    returns 0 if ok, negative if not ok.
to get any device number: p. 45
    alloc_chrdev_region
    dev is *output*
unregister_chrdev_region when done

note: the old methodology involves using mknod(8) to make the device node, typically in /dev

dynamic allocation of major numbers

static allocation of major numbers is listed in Documentation/devices.txt in the kernel source.
new numbers are not being assigned, hence future drivers should dyn. allocate numbers.
need to avoid randomly picked hardwired numbers.
for dyn. allocation, read /proc/devices to find out what the number is ...
See scull_load script, p. 47.
    script loads the driver, reads /proc/devices to get the numbers, and then calls mknod ...
authors suggest using dynamic allocation, but allowing a number to be passed in
as a parameter at load time; see p. 48 for an example.

File Operations

3 structures:
    file_operations - char driver sets up the connection between important
        system calls and itself in terms of a functional i/f.
    file/inode - used by the file system. file allows sharing; inode is the
        disk file system's principal file data structure.

open devices internally have a struct file * structure.
"an open file" has a file_operations struct, reached via its f_op field.
consider it an object, but really it is a set of function calls internally
associated with system calls and often called by the kernel internally.

you open file "foo" ... read from file "foo" ...
there will be internal open/read calls into a block driver (or a char driver if it is /dev/foo and is a char i/o special file).

ops may be NULL, meaning not needed, do nothing... (what the kernel does for a NULL entry differs per op)
Look at the ops list on p. 50-51.

struct file_operations examples
    llseek - change read/write offset in file
    read - read i/o, kernel to app
    write - write i/o, app to kernel
    readdir - read directory entry, only used with file systems
    poll - used by poll(2) and select(2): is the device readable/writeable?
        (so the app can avoid blocking on read and do something else)
    ioctl - device specific commands. e.g., format trk on floppy.
        the kernel handles some ioctls itself, without calling the driver.
    mmap - map device memory directly into the process's address space.
    open - open the device, 1st call
    flush - only used by NFS. invoked when a process closes its copy of the
        file descriptor. note: invoked on every close.
    release - release the file structure. invoked on last close.
    fsync - fsync system call, drivers not likely to use it
    fasync - change in FASYNC flag, async notification mechanism
    lock - file locking, not usual for device drivers
    readv - scatter-gather i/o
    writev - scatter-gather i/o
    struct module *owner - pointer to the module that owns the structure;
        the kernel uses it to maintain a reference count on modules

p. 53 scull example:
    struct file_operations scull_fops = {
        .owner = THIS_MODULE,
        .llseek = scull_llseek,
        etc.
    };
uses tagged structure initialization (which GNU C supports ... as an extension; it is also standard C since C99)

file structure

struct file is defined in <linux/fs.h>. a ptr to it in the kernel might be called filp.
This is not FILE * in user programs, but struct file * in the kernel. not the same.
kernel open creates a struct file. loads an inode ... file points to it.
file contains the "offset" too.
kernel keeps a ref. count, releases it on last close.
fd = open indirectly points to it.
struct file fields:
    mode_t f_mode - the mode of the file; read/write bits, checked or set with
        FMODE_READ, FMODE_WRITE.
        the kernel checks this before calling your read/write driver functions
    loff_t f_pos - the offset (64-bit value)
    unsigned int f_flags - file flags used on open: O_RDONLY, O_NONBLOCK, O_SYNC, etc.
    struct file_operations *f_op - operations associated with the file
    void *private_data - used by drivers to maintain private data across calls
    struct dentry *f_dentry - directory entry associated with the file.

drivers do NOT create file/filp structures, they use them.

inode structure

internal representation of a file. stored on disk. loaded by the 1st open.
    dev_t i_rdev - if this is a device file, this field contains the device #.
    struct cdev *i_cdev - kernel ptr to the cdev device structure
    unsigned int iminor/imajor macros on p. 55 should be used instead of
    reading i_rdev directly.

char device registration

include <linux/cdev.h> to get the cdev structure
see p. 55 bottom for one way to allocate
more likely use: p. 56 top
    cdev_init function - init it with fops
    cdev_add - tell the kernel to use it (it is baked), add it to a cdev list
DO NOT CALL CDEV_ADD if init is not done.

device registration in scull

each scull device is represented by a scull_dev, see p. 56 bottom.
note the init function.

The older way

old code in 2.6 does not use the newer cdev interface. p. 57
    register_chrdev() - for major, name, here is the fops structure
    unregister_chrdev()

open/release

the open method

driver: do any init you need when the open syscall is made.
open, followed by read/write, then close, is the normal assumption.
count is decremented by the release method.
open should do something like this:
    . check for device-specific errors ... is the device ready?
    . init the device if it is opened for the 1st time
    . id the minor number and update the f_op pointer with another pointer
    . allocate and fill any data structure to be put in filp->private_data
      (so that read can use it ...)
open prototype:
    int (*open) (struct inode *inode, struct file *filp);

we look at inode->i_cdev to get the cdev info.
however we really want the scull_dev structure, not the cdev structure.
use:
    container_of(ptr, container_type, container_field);
    container_field lives within container_type.
    macro - returns a ptr to what we want: see example code on p. 58
OR the older way:
    look at the minor device field from the inode structure.
    register_chrdev forces this assumption. use the iminor() macro.

release method

reverse of open, aka device_close
    . deallocate anything open allocated in filp->private_data
    . shut down the device on last close
scull has no hw to shut down, and its devices are persistent.

note: not all close calls invoke the release method; e.g.,
    fork() == 0
        child code shares the file pointer with its parent
    if the child exits, the parent still has the file open.
the kernel keeps a usage count and only calls release when the file
reference count hits 0, i.e., on the last close.
the flush method is called every time any app calls close()

some pre-2.6 info from the 2nd edition on open:

e.g., /dev/st0 might be a SCSI tape device that will rewind when done,
and /dev/nst0 will NOT rewind.

scull driver and minor device
    upper nibble: type/personality of device
    lower nibble: individual device instance
scull0 differs from scullpipe0 in the upper nibble.
scull1 differs from scull0 in the lower nibble.

    #define TYPE(dev) (MINOR(dev) >> 4)   /* get high nibble */
    #define NUM(dev)  (MINOR(dev) & 0xf)  /* get low nibble */

p. 70
for each device type, scull has a specific file_ops structure, which is placed
in filp->f_op at open time. Thus the driver can really be multiple drivers!
there is an array of "types" ...

p. 70 code for scull_open:
    kernel gives us inode, file pointer
    we determine minor info
    check for private_data; if none, fill in f_op and call the f_op open
        routine to do the work
    init dev variable to point to the private device structure
        (so the driver can communicate with its other methods)
    increment use count

note: Scull_Dev (more below) is the data structure that holds the memory.
scull_nr_devs and scull_devices[] are the number of avail devices and the
array of pointers to Scull_Dev.

scull0-3 devices are persistent (don't go away across a close)
    do not keep a per-open count, just the module reference/use count.
scull_trim(dev) - throws memory away, and reinits the device.
down_interruptible and up operate on Linux internal semaphores.
    there is one declared in the Scull_Dev structure.
    they serialize access to scull_trim (SMP remember ...)

end pre-2.6/2nd edition

scull's memory usage

how/why does scull do memory allocation?
scull device is a memory region. the more you write, the bigger it gets.
trimming is done by overwriting the device with a smaller file.
therefore:
    # cp /dev/zero /dev/scull0
will eat up all RAM ...
dd can be used to move data into scull.

scull uses two core kernel functions: kmalloc/kfree. use GFP_KERNEL with kmalloc.

each device is a linked list of pointers, each of which points to a Scull_Dev struct.
use an array of 1000 pointers to areas of 4000 bytes.
each memory area is a "quantum".
the array is a quantum set (a bunch of quantums).
See Figure 3.1, p. 61.
each Scull_Dev points to the next Scull_Dev and to an array of quantums ...

scull_dev on p. 56:
    data ptr - quantum set
    Scull_dev *next - next device
    quantum/qset - sizes of those items
    size - amount of data stored
    handle - used with devfs
    sem - for kernel mutex
scull_qset on p. 62.

p. 62 scull_trim
    input is a scull_dev
    this function deallocates memory
    note: kfree gets rid of each individual quantum, then the block of ptrs
    to quantums, and then the array itself. a 2-level memory hierarchy.

A Brief Intro to Race Conditions

2 processes, A and B, both have the same scull device open for writing.
Both try "at the same time" to append data to the device.
they might just stomp on each other's storage.

a race condition ... means there are at least 2 threads, and no synchronization ...
no way to say who gets there first (there is usually shared memory involved).

consider:
    single CPU
        top half (TH) vs bottom half (BH) can have race conditions
        TH vs TH not possible, because non-preemptive in kernel mode.
    SMP
        CPU 1 and CPU 2 by definition can have race conditions for shared memory.

we need either synchronization primitives or an architecture that enforces
synchronization by its design.

linux has kernel semaphores
    struct semaphore; scull has one per device, hence stored in the Scull_Dev structure
    note sema_init, p. 110
    linux's name for P: down(&sem) or down_interruptible(&sem)
        1st choice is always down_interruptible, to allow signals to work ...
    to release (V): up(&sem)
to mess with scull data you have to be synchronized thru the semaphore.

read/write methods

read/write are inherently similar - a mere matter of which way data is copied:
    user space <---> kernel space
master uzen sez: "from kernel's POV: read looks like a write."

prototypes:
    ssize_t read(struct file *filp, char __user *buff, size_t count, loff_t *offp)
    ssize_t write(struct file *filp, const char __user *buff, size_t count, loff_t *offp)

    filp - file pointer
    count - requested count
    buff - points to the user buffer for data
    offp - file position the user is accessing
    ssize_t - signed size type

buff is __user. why?
we must xfer data between kernel space and user space.
The kernel and the user process are in different virtual address spaces
(and all user proc addresses "are the same"...)
memcpy/bcopy are not used and should be assumed to be suicidal unless you know better.
of course you may not have an mmu.
note: user space memory could be *paged out*.
    go ahead, use bcopy: oops ... you killed the user proc.
user space pointers may be wrong.

functions defined in <asm/uaccess.h> are used; see p. 64.
e.g., we could use:
    read: copy_to_user
    or
    write: copy_from_user
consequences of calling these:
    note: you may block.
    these functions also check the validity of user space addresses
    (which may be invalid). the driver method returns -EFAULT if the copy fails.

in general, post i/o you update *offp to represent how far you have moved in the file.
so offp can be both an input (start here ...) and an output (... we got to here).
Remember lseek returns this info.
read/write implicitly advance the file offset. the driver must take care of that:
we need to update the file offset ptr.
note that the pread/pwrite system calls exist and do not implicitly change
the file offset. you pass in an offset.

See Figure 3-2, read arguments.

return value is negative for an error, >= 0 for success. 0 means EOF of course.
Note that kernel functions can return a negative error number, but user space
sys. calls see -1, with the error put in the global int errno.

read method

partial reads may occur. app asks for count = N. driver returns < N.
library code or the app decides what to do...
partial reads are popular in char device drivers, not likely in a block device driver.

note the semantics of the return value for read (compared with count).
5 sacred cases for a char driver:
    . we got all the bytes
    . read < count
    . 0 - EOF
    . an error
    . there isn't any data you fool, shall we block til there is?

errors are in <linux/errno.h>. Errors look like: -EINTR or -EFAULT.
It is always possible that a read system call might block waiting for data
(TCP read, pipe read, etc.).

scull_read only deals with one quantum at a time.
    the app must loop ... to read all the data (this is tacky)
if the current read position > stored data, return 0.
read code, p. 67

write method

write can xfer less data than was requested. acc. to its rules:
    . if value == count, fine
    . if value is positive and smaller than count ... the program must retry
      with the rest of the data.
      (my opinion: this is a bug - don't do it, unless you are writing the app program).
    . if value is 0, the program must retry
    . negative is an error, of course.
code deals with a single quantum at a time

readv/writev

read vector/write vector
more or less "scatter/gather" io, or put another way, a list of buffers
handed to/from the kernel for i/o
if the driver does not have readv/writev entry points, the kernel
infrastructure will call the read/write methods multiple times.
more efficient to have them.
see vector ops, p. 69
struct iovec is key. p. 69-70
note this semantic: if given a writev, you write the data in some "contiguous" fashion.
E.g., with a UDP datagram, you might have ethernet/ip/udp/data in 4 vectors.

playing with new devices

cp, dd, and i/o redirection can be used for testing.
the free command can be used to see the amount of free memory
printk can be put into the driver to observe how it reacts
use strace on cat/cp/dd/ls -l > /dev/scull0 to watch the system calls take effect

you are now dangerous.