chap. 5: enhanced char driver operations

two possibilities for misc. device control:

1. ioctl method: misc. operations, hence the name (i/o control).
2. capturing data written to /dev/something
   ... although that gets more complex if ESC sequences are used,
   and mixing ESC sequences with data can be tricky.

man page definition (as called from a user program):

    ioctl(int fd, int cmd, ...)

the "..." means one extra argument, usually a char * ptr.

Note: the internal calling setup for char devices and net devices is
different (but the function is roughly similar):

    ioctl(... int cmd, char * (misc data structure ptr))

Both the app. program and the kernel must know and accurately use the
same structure for the 3rd argument, because there is no type checking
on the extra arg. There is usually a header file shared by the app
program and the kernel driver.

----------------------------------------------
E.g., sample character device: linux/drivers/char/wdt.c

static int wdt_ioctl(struct inode *inode, struct file *file,
        unsigned int cmd, unsigned long arg)
{
        int new_margin;
        static struct watchdog_info ident=
        {
                WDIOF_OVERHEAT|WDIOF_POWERUNDER|WDIOF_POWEROVER
                        |WDIOF_EXTERN1|WDIOF_EXTERN2|WDIOF_FANFAULT
                        |WDIOF_SETTIMEOUT,
                1,
                "WDT500/501"
        };

        ident.options&=WDT_OPTION_MASK; /* Mask down to the card we have */
        switch(cmd)
        {
                default:
                        return -ENOTTY;
                case WDIOC_GETSUPPORT:
                        return copy_to_user((struct watchdog_info *)arg, &ident, sizeof(ident))?-EFAULT:0;
                case WDIOC_GETSTATUS:
                        return put_user(wdt_status(),(int *)arg);
                case WDIOC_GETBOOTSTATUS:
----------------------------------------------

Choosing the ioctl commands:

command numbers need to be unique across the system.
We don't want a command meant for deviceY to end up at deviceX!

cmd codes are split into bit fields:

version 1: 16 bits total:
    8 bits major as MSB
    8 bits for command
    (device, command)
  still exists due to binary compatibility.
check: include/asm/ioctl.h and Documentation/ioctl-number.txt

Current idea roughly:
    (magic number, ordinal number, direction of xfer, size of argument)

type: magic number. pick one after looking at ioctl-number.txt and
    use it throughout the driver. 8 bit field.
number: ordinal (per command), 8 bits.
direction: set if the command involves data xfer, to or from the kernel.
    _IOC_NONE - no data xfer
    _IOC_READ - from kernel
    _IOC_WRITE - to kernel
    _IOC_READ | _IOC_WRITE - 2-way xfer
size: size of user data transferred.
    size field is architecture dependent, ranges from 8-14 bits.

linux/ioctl.h pulls in the appropriate asm/ioctl.h ...

See SCULL ioctl defs pp. 131-132

return value:
-------------

basic idea of ioctl

    switch(cmd) {
    case A:
    case B:
    case Z:
        break;
    default:    /* ??? */
        return(-ENOTTY); or return(-EINVAL);
    }

can directly return a positive integer; a negative value is
assumed to be an errno.

posix standard is wrong ... says ENOTTY even for net device?
return -EINVAL if in doubt.

predefined commands
-------------------

some ioctls are for the kernel at large, some for char devices,
some for net devices.
kernel ioctls are dealt with before file/char devices.

3 groups of predefined commands:

1. aimed at any file: regular, FIFO, whatever. magic number T
2. issued only to regular files
3. specific to a filesystem type
   e.g., implement append-only flag, or immutable flag on files

for any file we have:

1. FIOCLEX - close-on-exec. if exec occurs, this fd is closed
   for the new code overlay.
2. FIONCLEX - clear the above flag
3. FIOASYNC - not used
4. FIONBIO - set or clear the nonblocking i/o flag (O_NONBLOCK).

fcntl syscall is similar to ioctl, but for files.

Using the ioctl argument
------------------------

how to use the final argument. Usually it is a pointer
... to a structure in user space.

E.g., we want to pass this structure into the kernel.

    struct howto {
        int command;
        int param1;
        int param2;
    } h;

We must pass a pointer to h in user space into the ioctl command. ?!
The kernel must make sure that the pointer to h is sane before using it,
so that the kernel does not commit suicide.
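A minimal sketch of the kind of shared header described above, usable by both the app program and the driver. The magic letter 'k', the HOWTO_* command names, and the helper howto_check_cmd() are hypothetical, chosen only for illustration; a real driver would pick its magic number after consulting ioctl-number.txt. The _IO/_IOR/_IOW and _IOC_* macros are the ones provided by linux/ioctl.h.

```c
#include <errno.h>
#include <linux/ioctl.h>   /* _IO, _IOR, _IOW, _IOC_TYPE, _IOC_NR, ... */

/* Structure shared by the app program and the driver. */
struct howto {
    int command;
    int param1;
    int param2;
};

/* Hypothetical magic number; check ioctl-number.txt before choosing one. */
#define HOWTO_IOC_MAGIC  'k'

/* Each command encodes (magic, ordinal, direction, size of argument). */
#define HOWTO_IOCRESET   _IO(HOWTO_IOC_MAGIC, 0)                 /* no data xfer */
#define HOWTO_IOCSET     _IOW(HOWTO_IOC_MAGIC, 1, struct howto)  /* to kernel    */
#define HOWTO_IOCGET     _IOR(HOWTO_IOC_MAGIC, 2, struct howto)  /* from kernel  */
#define HOWTO_IOC_MAXNR  2

/*
 * Reject commands that were not built for this driver: wrong magic
 * number or an ordinal past the last one defined above.
 * Returns 0 if the command looks like ours, -ENOTTY otherwise.
 */
static int howto_check_cmd(unsigned int cmd)
{
    if (_IOC_TYPE(cmd) != HOWTO_IOC_MAGIC)
        return -ENOTTY;
    if (_IOC_NR(cmd) > HOWTO_IOC_MAXNR)
        return -ENOTTY;
    return 0;
}
```

An app program would then call something like ioctl(fd, HOWTO_IOCSET, &h), and the driver would run a check of this kind before entering its switch(cmd).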
We might use

    access_ok(int type, const void *addr, unsigned long size);

to check legality of access

    type: VERIFY_READ, or VERIFY_WRITE
    addr - user space addr
    size - size of struct in bytes

returns: 1 if ok, 0 for failure. on failure the driver returns -EFAULT.
see code for scull, p. 135.

Two sets of functions to use:

general byte sizes:
    copy_from_user
    copy_to_user

or fixed sizes:
    put_user(datum, ptr) - size transferred depends on the type ptr
        points to. calls access_ok internally.
    __put_user - does NOT call access_ok internally.
    get_user
    __get_user

capabilities and restricted operations
--------------------------------------

linux has capabilities for access control ... we'll skip this,
except for

    /* must be root to do this */
    if (!capable(CAP_SYS_ADMIN))
        return -EPERM;

The implementation of the ioctl commands
----------------------------------------

switch(cmd) ... etc ...

see p. 139 for user program calling examples to go with this code.

device control without ioctl
----------------------------

As an alternative to ioctl, one might simply use 1 or more
character devices for control. Or write to a device with "commands".

Traditional serial devices have escape sequences embedded in the data.
ESC might be 255 ... as ASCII is really only <= 127 in the byte.
However escape sequences can be difficult in some cases.
This is why a grep of a binary file might kill your xterm.

Simple example: walking robot has the following commands:

/dev/robot
forward, backwards, leftturn, rightturn

    1 - move forward
    2 - move backwards
    3 - turn left
    4 - turn right

    echo "1" > /dev/robot
    etc ...

the write method of the device simply issues the commands to the robot
with a switch statement. Here there would be no problem with
"data" vs "control".

blocking i/o
------------

if we must read and there is no data, then what?
default answer: go to sleep waiting for data.
PLUS: there must be an interrupt/wakeup, else we sleep forever.
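Seen from user space, the default blocking behaviour looks like the sketch below. It is an illustration, not driver code: an ordinary pipe stands in for the device, and the function name blocking_read_demo() is made up. The parent sleeps inside read() until the child, playing the role of the interrupt/wakeup side, writes a byte.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * The parent blocks in read() on an empty pipe; the child writes one
 * byte a little later, which wakes the parent up.
 * Returns the byte that was read, or -1 on error.
 */
static int blocking_read_demo(void)
{
    int fds[2];
    pid_t pid;
    char c = 0;
    ssize_t n;

    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: the "wakeup" side. Wait a bit, then supply data. */
        struct timespec delay = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };
        char out = '1';

        close(fds[0]);
        nanosleep(&delay, NULL);
        if (write(fds[1], &out, 1) != 1)
            _exit(1);
        _exit(0);
    }

    /* Parent: no data yet, so read() puts us to sleep until the write. */
    close(fds[1]);
    n = read(fds[0], &c, 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);

    return n == 1 ? (unsigned char)c : -1;
}
```

If the child never wrote (and never closed its end), the parent would sleep forever, which is the point of the "PLUS" remark above.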
going to sleep and awakening
----------------------------

there are several wait queues for processes to wait on an event
(like "data is ready", or "I need a buffer")

wait queue: queue of processes (pointers to task struct) that are waiting

    wait_queue_head_t my_queue;
    init_waitqueue_head(&my_queue);  /* must init wait queue somewhere before using */

E.g., somebody gets woken up because they want a kernel io buffer ...
iobuf.c:

    wake_up(&kiobuf->wait_queue);

note: wakeups historically (and linux is no different) MAY wake up
a number of processes. Each one must first check whether it can get
the resource it wants, and if it cannot, go back to sleep.

rough idea:

    thread that wants the resource:
        loop
            check on resource
            sleep until wakeup

    somewhere else, TH or BH:
        wakeup ...

sleep oriented primitives: see p. 142

    sleep_on() - signal can't wake it up
    interruptible_sleep_on()
    sleep_on_timeout() - timeout
    interruptible_sleep_on_timeout()
    wait_event() - event checking loop
    wait_event_interruptible()

driver write may very well use: interruptible_sleep_on()

wakeups:

    wake_up - wakes the sleepers; may cause an immediate reschedule
    wake_up_interruptible
    wake_up_sync - only makes the sleepers runnable; does not
        reschedule the CPU
    wake_up_interruptible_sync

likely to use wake_up_interruptible

rant: linux has features ...

see code p. 143/144

deeper look at wait queues: see code, p. 144 bottom

it is possible to have a TASK_EXCLUSIVE flag, meaning wake up only
this process, not a herd of processes

writing reentrant code
----------------------

bottom line: You always want to ask if the data structure being used is

1. per task ... and/or per thread
2. and in regards to #1, is it "SMP SAFE" ...
   not corruptible due to races between CPUs

what you can use:

1. you can use the stack when it makes sense.
2. you can use global variables if they have handles and are
   somehow per thread

state info can also be stored in "private areas",
e.g., a network device driver has a private structure stored in the
net_device structure ... one for each instance of a NIC card of its type.
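The difference between shared global state and per-instance private state can be sketched in plain C. This is a user-space illustration with invented names (struct fake_dev, struct nic_priv, nic_rx); it only mirrors the idea of a private structure hanging off each net_device and makes no claim about the real kernel layout.

```c
#include <stddef.h>

/* Per-instance private state: one copy for each NIC (hypothetical). */
struct nic_priv {
    unsigned long rx_packets;
};

/* Stand-in for a device structure that carries a "private area". */
struct fake_dev {
    const char *name;
    void *priv;
};

/*
 * Touches only the state reached through its handle, so calls for
 * different devices cannot disturb each other. Two CPUs working on
 * the SAME device would still need a lock around the increment.
 */
static unsigned long nic_rx(struct fake_dev *dev)
{
    struct nic_priv *p = dev->priv;

    return ++p->rx_packets;
}
```

Had rx_packets been a single global counter instead, every instance of the card would have shared (and corrupted) the same number.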
blocking/nonblocking operations
-------------------------------

if a process wants to read from a device, you may want to block the
process. Or the process may have asked never to block, with the
O_NONBLOCK flag, which can be passed in via open(2).

consider: we want to read a device, but no data yet.
2 things you can do:

    block and wait for interrupt/wakeup
    return with NONBLOCK signified somehow (-EAGAIN is one possibility)

process wants to write, but no buffer free for writing:

    block until buffer appears

in some sense data devices always have input buffers and output buffers,
and there are always limitations placed on them. buffers may be managed
by the device directly or be a system resource (belong to a set of devices)

a device may have special semantics for O_NONBLOCK:
e.g., a tape device might wait until it has media, but skip that wait
if NONBLOCK is set.

scullpipe implementation:
    4 /dev/scullpipe devices

in scullpipe, basically we don't have interrupts for wakeups;
we have another process.
e.g.,
    if we have a reader proc, writer can wake it up
    if we have a writer proc, reader can wake it up

device structure has 2 wait queues, and a limited buffer

1. read queue
2. write queue

look at code pp. 150-151
write code, p. 152

poll and select
---------------

apps that use nonblocking i/o use poll and/or select

idea: app can determine if it can read without blocking, which is
useful if you have two FDs to read from

select from BSD unix
poll from System V unix

device driver must provide some support

poll method in driver has this prototype:

    unsigned int (*poll) (struct file *, poll_table *);

called whenever app does poll or select

must do two things:

1. call poll_wait on each wait queue that could indicate a change
   in the poll status.
   poll_table is declared in <linux/poll.h> and is involved in the
   wait queue mechanism. the driver must add to the poll table any
   event queues it knows about that could be used for waking up
   a process.
2. return a bit mask describing operations that could be performed
   without blocking,
   e.g., if device has data to read, then indicate that ...
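From the application side, the two ideas above (O_NONBLOCK and poll) can be tried out on an ordinary pipe. This is a user-space sketch, not the driver's poll method; the helper names ready_events(), set_nonblocking() and try_read_byte() are invented for the example, and the return convention of try_read_byte() is an assumption chosen to mirror a driver's read method.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Ask "could I do this without blocking?" with a zero timeout.
 * Returns the revents bit mask, 0 if nothing is ready, -1 on error.
 */
static int ready_events(int fd, short events)
{
    struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };
    int n = poll(&pfd, 1, 0);

    if (n < 0)
        return -1;
    return n == 0 ? 0 : pfd.revents;
}

/* Set O_NONBLOCK on an already open descriptor. Returns 0 on success. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Try to read one byte.
 * Returns 1 if a byte was read, 0 at end of file, or a negative
 * errno value (-EAGAIN when there is no data and O_NONBLOCK is set).
 */
static int try_read_byte(int fd, char *out)
{
    ssize_t n = read(fd, out, 1);

    if (n < 0)
        return -errno;
    return (int)n;
}
```

With an empty pipe and O_NONBLOCK set on the read end, ready_events(fd, POLLIN) reports nothing ready and try_read_byte() fails with -EAGAIN; once a byte has been written, POLLIN is set in the mask and the read succeeds.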
mostly simple to do ... get code from other drivers

see scull poll method, p. 156

interaction with read/write (poll)
----------------------------------

read:
    if there is ANY data, return it.
    if there is no data, and O_NONBLOCK is set, return -EAGAIN
    if there is no data, and O_NONBLOCK is NOT set, block
    if EOF, return 0. poll returns POLLHUP.

write:
    if there is space, write should return without delay, even if
    there is less space than was asked for. if we waited here for
    adequate space, then using poll to ask "will you block" wouldn't
    work...
    if output buffer is full, write blocks
    if O_NONBLOCK is set, and output buffer is full, return -EAGAIN
    device may not be able to make partial writes (block-oriented)

flushing pending output
-----------------------

fsync method MAY exist so that apps can ask whether data has actually
been written out, as opposed to buffered internally for efficiency
reasons

async notification
------------------

process may want a signal to inform it as fast as possible
that data is ready to read

1. use fcntl system call with the F_SETOWN command, so that the
   kernel's internal file structure records the process id
2. use fcntl system call, and set FASYNC flag
3. set up a signal handler for SIGIO

    signal(SIGIO, &input_function);

driver details on p. 162

Seeking a device
----------------

llseek method ...

file table has offset, and lseek updates it.
implicit seek vs explicit seek ... this is explicit seek.

read/write must update that offset implicitly

it is possible that driver might have to do some work,
so llseek method exists

can't seek on ttys!
usually nothing to do on block devices

if nothing to do on your device, put in code on p. 164

note: linux has pread/pwrite system calls, which allow the caller to
specify the offset in addition to fd, buf, count. These are a form of
explicit seek call.

Access Control on a Device File
-------------------------------

driver may want to make access control checks of various sorts
in open routine

e.g., may have a device that can only be used by one user at a time

e.g., single-open device
    only one process at a time
    scullsingle code is an example

see code p. 165.
bottom line: if count > 0 return -EBUSY

another digression on race conditions
-------------------------------------

consider scull_count variable in previous open.
two actions are done:

1. value of the variable is tested and open refused if != 0
2. value is incremented if device is available

These tests are safe on a single cpu ... otherwise this is a race
condition ...

is this a race condition:

    if (++i > 3) {
    }
    if (i == 3) {
    }

SMP here could result in a race condition...

semaphores could fix it, but they are relatively heavyweight.
(a semaphore may block the process)

can use spinlock ... simply lock the other side out

spinlock means: CPU N spins, testing the variable over and over
until it is free

when to use: when you think the latency is short ...
remember: you are potentially turning a cpu "off"

spinlocks are "off" when there is no SMP

you do not use spinlocks if you might sleep (watch out for read/write ...)

type spinlock_t

must be initialized:
    spin_lock_init(spinlock_t *lock);
obtain lock with
    spin_lock(spinlock_t *lock);
free lock with
    spin_unlock(spinlock_t *lock);

restricting access to a single *user* at a time
-----------------------------------------------

need:
1. an open count and
2. the uid of the owner of the device

see code p. 167

blocking open as an alternative to EBUSY
----------------------------------------

block until free

implement a blocking open

see code p. 168-169

open:
    unlock spinlock and do: interruptible_sleep_on()
    note: wait_q is in the scull device ... not global.

release: must wake up sleepers
    wake_up_interruptible();
    ...

cloning the device on open
--------------------------

we will ignore this!
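Returning to the open-count race discussed above: the test and the increment have to happen as one step. The following user-space sketch uses a C11 atomic_flag as a stand-in for the kernel's spin_lock()/spin_unlock(); the names demo_lock, demo_count, demo_open and demo_release are invented for illustration, and real driver code would use spinlock_t as listed earlier.

```c
#include <errno.h>
#include <stdatomic.h>

static atomic_flag demo_lock = ATOMIC_FLAG_INIT;  /* stand-in for spinlock_t */
static int demo_count;                            /* open count, guarded by demo_lock */

/* Spin, testing the flag over and over, until we are the one who set it. */
static void demo_spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&demo_lock, memory_order_acquire))
        ;  /* busy-wait: only acceptable when the hold time is short */
}

static void demo_spin_unlock(void)
{
    atomic_flag_clear_explicit(&demo_lock, memory_order_release);
}

/* Single-open policy: the test and the increment happen under the lock. */
static int demo_open(void)
{
    int ret = 0;

    demo_spin_lock();
    if (demo_count)
        ret = -EBUSY;   /* somebody already has the device open */
    else
        demo_count++;
    demo_spin_unlock();

    return ret;
}

/* Give the device back so that the next open can succeed. */
static void demo_release(void)
{
    demo_spin_lock();
    demo_count--;
    demo_spin_unlock();
}
```

Nothing inside the locked region can sleep, which matches the rule above about not holding a spinlock across anything that might block.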