Introduction chapter notes from the ORA linux kernel internals book

history of unix rant: no details here, but ...
there are other threads: freebsd/linux in various forms.
what do they have in common?
	FSF software for one thing
	X server/client for another
	See http://www.gnu.org
open source ... what is open source SW?
how many variations on the linux theme? debian, RH, mandrake.
can you offer up any differences?
usoft vs open-source: does closed-source prevent hackers from finding
security bugs, OR does open-source help coders remove security bugs?
the so-called many-eyes theorem.

Linux - Linus Torvalds, 1991.
started with AST's (Andrew Tanenbaum's) minix, but he felt it was
inadequate. wanted an open src os on x86; others have ported it to other
architectures. Linux is copy-lefted, under the FSF GNU license.
assignment: track it down and read it. how does "copyleft" differ from
copyright? how is it different from the BSD src license?
the kernel is at www.kernel.org. linux distributions take it and other
FSF code and add value; for example the RH installer, debian apt-get.
peanut gallery time: why doesn't debian use the RH installer? why
doesn't RH use apt-get?

linux attributes include:
1. nonpreemptive kernel.
	?: can we have multiple threads in kernel code on one cpu running
	"at the same time"? assume no to start with. 2.4 is nonpreemptive
	in kernel mode, although there is a great deal of locking for SMP
	(but that's > 1 cpu). 2.5 removes this restriction. This means
	that chunks of kernel code assume they have a cpu to themselves.
2. multiprocessor support: SMP. the question here is always how much
	serialization the kernel imposes: can one kernel thread lock out
	all the other kernel threads on multiple CPUs?
3. file system: has a virtual file system a la the notions Sun
	pioneered. Can't escape the unix heritage!
	ext2 - rough equivalent of the old BSD UFS (in other words not a
		bad file system). See the Ts'o et al. paper on it.
	ext3 - journaling file system, crash resistant.
	reiserfs - can deal well with small files.
4. STREAMS - came out of Bell Labs; the notion of dividing device
	drivers into separate stack layers and piling them on top of each
	other. linux doesn't have this. contrary to the book, which is
	being polite, it is best NOT to have it.

many years of research in Distributed O.S. led to the client-server
model as the architectural goal of an O.S. does linux care? no. it is
roughly monolithic acc. to AST (monolithic or distributed being the 2
major categories of architectural flavor for an OS), but linux does have
a lot of loadable drivers instead ... more than FreeBSD in fact. This
helps somewhat with o.s. complexity problems.

linux advantages include:
1. free. so just what is RH selling anyway?
2. customizable.
3. runs on cheap/old CPUs.
4. powerful.
5. high standard of src code quality.
6. the kernel CAN be small and compact, even though a given distribution
   may not be.
7. high degree of compatibility: can mount various usoft file systems
   and other os file systems; tcp/ip and ethernet etc.; can execute some
   binaries compiled for other systems. (FreeBSD ironically can execute
   x86/linux binaries, which helps the smaller open-source o.s.
   survive ...)
8. well-supported:
   1. usoft
   2. linux
   solaris/freebsd have a harder time of it.

hw dependency: how do we hide hw dependencies in the kernel source base?
in /usr/src/linux we have arch/include directories. arch, e.g., has
sub-directories for hw-specific platforms including:
	alpha/arm/cris/i386/ia64/m68k/mips/mips64/parisc/ppc/sparc ... etc

linux versions
2 branches of kernel: stable vs developmental. a version has 3 numbers:
2.4.18 means: 2.4 - version, 18 - release. so for example, 2.4.x is
stable, 2.5.x is developmental. 2.4 replaced the 2.2 series; 2.3 turned
into 2.4 at some point. 2.5 work started in nov 2001. the stable branch
is driven by features and bug fixes; kernel algorithms and data
structures are in theory left alone.

basic os system concepts
os == kernel: the basic set of underlying capabilities that offers a
system-call-based Virtual Machine to application processes.
2 objectives:
1. interact with the hw and shield apps from knowledge of the hw.
   implements various abstractions like file, or process.
2. provide an execution environment to apps, "user programs", based on
   the syscall API.
modern os do not allow an app process to interact directly with the hw;
e.g., you cannot reprogram the ethernet controller ... this is done with
basic hw privilege levels (rings in Intel speak): you are either in
kernel mode or user mode. We shall call these "modes".
examples: In kernel mode, we can execute a halt instruction. In user
mode we cannot. In kernel mode we can program a FD controller directly.
In kernel mode we may be able to write to physical memory somehow; in
user mode we operate in a Virtual Memory space. note that in real life,
kernel mode is probably virtual too after the MMU is turned on at boot.

multiuser systems
two ideas here:
a multi-user system allows > 1 user at the same time. it often has
root/supervisor (with some kernel mode privileges) and users 1..N. call
this multi-user.
we may have one CPU, but we can have "virtual CPUs"; else how do we have
multiple users at the same time? a process is an abstraction of CPU
state, and has a process context, which must be saved/restored by the
scheduler. call this multi-process.
multi-user os features:
. must authenticate user ids.
. protect one user process from another user process.
. possibly account for user "work" and user units of storage, e.g., file
  system quotas.
UNIX always enforces hw privilege, for kernel vs user mode. Note: a
kernel process switch means user mode becomes kernel mode; i.e., we
enter the kernel.

users/groups
a user has private space on the machine; e.g., a home directory, and a
"shell" process that starts at $HOME when you login. the o.s. must
enforce this idea.
user id: User ID, or UID. we have a user handle, and a password
associated with it, traditionally in /etc/passwd.
	jrb: ... some password ...
to share files with other users, we have the concept of a group. GID: a
set of users with shared access rights.
UNIX always has root/supervisor. the system admin must login as root to
do certain tasks (e.g., reboot, install the os, upgrade programs).
root CAN do
	# rm -fr /
for better or worse ...
the su program allows a user to become root, or root to become a user
(depending on permissions in the 1st case).

processes
a process is a virtual cpu and the fundamental abstraction. a process
has context: at heart this is the set of registers for the cpu, plus the
MMU register setup. larger notion: kernel variables (e.g., the current
working directory) for a given process. the scheduler may schedule
another process to run if the currently running process uses up its cpu
time quantum.
multiprogramming - concurrent processes, 1 cpu (or 1..n cpus). if we
only have 1 cpu, we can only have 1 process actually running at a time,
although others may appear to be running. processes on UNIX are
typically preemptible; i.e., the scheduler can kick them out.
top-half/bottom-half rant here ... the kernel is a shared library ...
and is non-preemptive in kernel mode; i.e., a user-mode thread switches
into the kernel and runs in the kernel TH until it gives up the CPU
voluntarily. call this a context switch or coroutine function:
	process A in TH calls switch()
	switch()
		saves my context
		scheduler chooses a better process control block
		restores its context
	process B returns from switch()
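The save/restore flow above can be illustrated in user space with the
POSIX ucontext(3) interfaces (getcontext/makecontext/swapcontext). This
is only an analogue, a minimal coroutine sketch; the kernel's real
switch() is per-architecture, hand-tuned code:

#include <stdio.h>
#include <ucontext.h>

static ucontext_t ctx_main, ctx_a;

/* runs on its own stack; each swapcontext saves one register context
 * and restores another, just like the switch() sketch above */
static void task_a(void)
{
    printf("task A: first run\n");
    swapcontext(&ctx_a, &ctx_main);   /* save A, resume main */
    printf("task A: resumed\n");
}

int main(void)
{
    static char stack_a[64 * 1024];

    getcontext(&ctx_a);               /* initialize the context */
    ctx_a.uc_stack.ss_sp = stack_a;
    ctx_a.uc_stack.ss_size = sizeof stack_a;
    ctx_a.uc_link = &ctx_main;        /* resumed when task_a returns */
    makecontext(&ctx_a, task_a, 0);

    swapcontext(&ctx_main, &ctx_a);   /* save main, run A */
    printf("main: back from A\n");
    swapcontext(&ctx_main, &ctx_a);   /* resume A where it left off */
    printf("main: A finished\n");
    return 0;
}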
kernel architecture
unix is monolithic in the AST sense. the microkernel idea is associated
with D.O.S. (distributed o.s.) architectures: the kernel might consist
of just
	message passing primitives (send/recv)
	a scheduler
	intr handlers
and the rest of the os (file/memory/process/device) is servers ... which
send messages to implement system calls.
a monolithic architecture does not have "servers" and a microkernel.
Instead system calls TRAP (software intr), with a user->kernel mode
switch, into the TH of the kernel.
linux tries to get the distributed-o.s. benefits via modularity and
loadable drivers. the loadable driver idea is based on runtime dynamic
linking. module advantages include:
1. modularized approach!
2. platform independence
3. frugal main memory use
4. no performance penalty due to message passing; a loadable module,
   once loaded, is in the kernel.

file system overview
files:
1. a file is conceptually an array of bytes, from 0..max-1. the kernel
   does not interpret it (we do not have "database" files with some
   concept of a variable-length or fixed-length record with keys).
2. files live in a directory name space that starts with root. We have
   the root directory at the start of the tree (/). We may have the same
   name in different directories.
3. each process has a CWD. pathnames are relative to the CWD, or
   absolute (not relative to the CWD).
4. a process uses a pathname to id a file.
5. . and .. may be used. . means here, and .. means the directory up
   one, the parent directory.

hard/soft links
traditionally links are just directory entries. due to inodes it is
possible for a file to have two "names" (links) in different
directories. This is fundamentally how the file system is glued
together, as a directory is a file of 2-tuples of (name, inode-number).
Thus "." and /bob may be the same file but with two names ("." and bob)
and the same inode. It is also possible that vi, view, ex are hard links
(same inode, different names, on the same partition).
	# cd /tmp
	# ln a b	(b is a hard link of a: an alias, but the same file)
	# ls a b
	a b
	# ls -i a b	(print inodes)
users cannot create hard links for directories, as this could make a
graph with cycles. hard links can only be made within the same file
system, as INODES are unique per file system, but not across file
systems.
a soft link is a link that can cross partitions. a soft link is a file
that actually contains the PATH to the target; e.g.,
(bob, /usr/mumble/x/y/z). Think of it as a file pointer.
	# df
	Filesystem   1K-blocks     Used    Avail Capacity  Mounted on
	/dev/ad0s2a      99183    60686    30563    67%    /
	/dev/ad0s2f    1703868   826975   740584    53%    /usr
	/dev/ad0s2e      19815    10430     7800    57%    /var
	/dev/ad0s1     3836084  2876036   960048    75%    /dos
	procfs               4        4        0   100%    /proc
	# cd /tmp
	# df .
	Filesystem   1K-blocks     Used    Avail Capacity  Mounted on
	/dev/ad0s2a      99183    60686    30563    67%    /
	# ln -s /usr/src x
	# ls -l x
	lrwxr-xr-x  1 root  wheel  8 Jul 20 10:59 x -> /usr/src

file types
ordinary files:
	regular file
	directory
	symlink
devices:
	block-oriented device file (found in /dev)
	char device file (found in /dev)
ipc/sockets:
	pipes and named pipes (aka FIFOs)
	sockets (of the unix socket type)

file descriptor and inode
a file has a name in a directory, and an inode. a file does not contain
EOF characters. all info for a file other than its name is in the inode:
	file type
	number of hard links
	file length in bytes
	device id (which device holds the file)
	inode number
	user id
	group id
	timestamps: last change, last access, last modify
	access rights and file mode

access rights and file mode
users of a file fall into 3 classes:
. the user who owns the file
. a user who is in the group, but is not the owner
. everyone else
three types of access rights: read/write/execute, for each of the
previous 3 classes.
	-rw-r--r--  1 root  wheel  900 Jun  8 07:59 /etc/passwd
the -rw-r--r-- field shows the user/group/others rights; root is the
uid, wheel the gid.
3 other kinds of rights:
suid - typically used to borrow root permission for the execution time
	of a program: the process gets the uid of the file owner.
sgid - a process executing a file normally has the GID of the parent
	process (usually the shell), but in this case it inherits the GID
	from the file.
sticky - this flag is outdated in terms of memory mgmt, but may have
	other functions on particular kinds of unix.
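Most of the inode fields above are visible from user space via stat(2).
A minimal sketch (/etc/passwd is just a convenient file that always
exists; lstat(2) is the variant that reports on a symlink itself rather
than its target):

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;

    if (stat("/etc/passwd", &st) < 0) {
        perror("stat");
        return 1;
    }
    printf("inode:      %lu\n", (unsigned long)st.st_ino);
    printf("hard links: %lu\n", (unsigned long)st.st_nlink);
    printf("size:       %ld bytes\n", (long)st.st_size);
    printf("uid/gid:    %d/%d\n", (int)st.st_uid, (int)st.st_gid);
    /* the low 12 bits of st_mode are the rwxrwxrwx bits plus the
       suid/sgid/sticky bits discussed above */
    printf("mode:       %o\n", st.st_mode & 07777);
    return 0;
}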
file-handling system calls
data is actually stored in blocks in a partition on a hard drive, BUT
the os must make it look like an array of bytes. we open a file for
creation/read/read-write/append via the open(2) system call:
	fd = open(path, flag, mode)
	path
	flag - how to open
	mode - access rights of a newly created file
fd is a handle, stored in the process control block, that gives the
process access to the internal os descriptors:
	open file table - the file pointer (offset) is stored here: where
		to do i/o next. this is an integer.
	inode table
note: the creat(2) system call is not used much any more, and is
subsumed by the expanded version of open(2).

accessing an opened file
	newoffset = lseek(fd, offset, whence)
lseek can be used to position the file pointer, followed by read/write.
Thus two ways to do i/o:
	read/write - which always implicitly advance the open file table
		offset by the returned amount;
	lseek to a certain spot, then read/write.
	nread = read(fd, buf, count)
	nwrite = write(fd, buf, count)
	close(fd)
2 questions (see the sketches below):
1. how does cat work? e.g., # cat /etc/passwd
2. how can you store a C structure in a file and use lseek to retrieve
   it? Let's call this an indexed file scheme, and say we want to store
   100 fixed-size records in a file and then retrieve record 33.
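A sketch of question 1, under the assumption that cat is at heart just a
read/write loop (the real cat also handles multiple files and options):

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/* minimal cat: copy one named file (or stdin if no arg) to stdout */
int main(int argc, char *argv[])
{
    char buf[4096];
    ssize_t n;
    int fd = 0;                             /* default: stdin */

    if (argc > 1 && (fd = open(argv[1], O_RDONLY)) < 0) {
        perror(argv[1]);
        return 1;
    }
    /* read(2) advances the open-file-table offset implicitly;
       n == 0 means EOF */
    while ((n = read(fd, buf, sizeof buf)) > 0)
        if (write(1, buf, (size_t)n) != n)
            return 1;
    close(fd);
    return 0;
}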
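And a sketch of question 2: with fixed-size records, record k lives at
byte offset k * sizeof(record), so lseek + read fetches record 33
directly. The struct layout and the file name recs.dat are made up for
illustration:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/* hypothetical fixed-size record; any struct of known size works */
struct rec {
    int  id;
    char name[28];
};

int main(void)
{
    struct rec r;
    int i, fd = open("recs.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);

    if (fd < 0) { perror("open"); return 1; }

    /* store 100 fixed-size records back to back */
    for (i = 0; i < 100; i++) {
        r.id = i;
        snprintf(r.name, sizeof r.name, "record-%d", i);
        write(fd, &r, sizeof r);
    }

    /* record 33 lives at byte offset 33 * sizeof(struct rec) */
    lseek(fd, 33L * sizeof(struct rec), SEEK_SET);
    read(fd, &r, sizeof r);
    printf("id=%d name=%s\n", r.id, r.name);

    close(fd);
    return 0;
}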
rename/delete of a file
	res = rename(oldpath, newpath)
	res = unlink(pathname)
in point of fact, unlink internally is how a file is deleted: removing
the last link removes the file.

Overview of unix kernels
---------------------------
process/kernel model
two modes: user mode/kernel mode. in user mode a program cannot directly
access kernel code or data. when it switches to kernel mode via a system
call, it can.
the kernel is not a process ... it is actually a shared text library:
the same thread of control runs in user mode and kernel mode. this is
the TOP HALF. there are usually > 1 kernel threads that are started at
boot (or possibly via a loadable module) and that always execute in the
kernel. note that sleeping (going from RUN state to SLEEP state) only
occurs in the kernel. typically a process voluntarily gives up the CPU
in the TH, and the scheduler will determine another process to wake up
in the TH.
the Bottom Half (note: linux has another, narrower meaning for this term
too) is interrupt driven. clock/network device/char device/file device
interrupts come in ... one form of interrupt is the machine/cpu
exception; e.g., a program has made a fatal bus error. kernel code is
executed and hence we are in kernel mode. however: the bottom half
usually has higher machine hw priority than user mode/the TH of the
kernel. device drivers (and the clock!) can't wait.

process implementation:
a process has context, and this state must somehow be stored for one
process to replace another. in some small/RT os, this might simply be
the CPU registers and little else. in unix/linux a process has a general
set of variables plus context including the MMU setup. we save this in a
process control block, which includes:
1. pc plus cpu registers
2. floating point state
3. process control registers
4. mmu state
note that the register state alone might define a thread, where a
process can have multiple threads, and a process means 1 thread plus the
MMU address setup. to resume process A, we have to get the info above
for proc A, put it back in the real hw, and start it up ...

reentrant kernels
reentrant - narrowly means a chunk of code that can be executed by > 1
thread: we do not have data that is somehow tied to a single thread,
stomped on by other threads, resulting in disaster.
	int x;
	foo(int y)
	{
		x = y;
	}
is not liable to be reentrant. why? how can we fix it? (2 ways:
1. make x local on the stack. 2. make x a global array of structures
with some mechanism for allocation and handle creation.)
another approach: we can have nonreentrant functions and serialize their
access to a shared resource via various mechanisms (see the sketch
below) ...
	foo(int y)
	{
		while x is busy, wait ...
		x = y;
		free x
	}
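The two fixes and the serialization approach, sketched with a pthread
mutex standing in for "while x is busy, wait" (an assumption: the kernel
uses its own locking primitives, not pthreads):

#include <pthread.h>

/* fix 1: x lives on the caller's stack, so each thread gets its own */
int foo_stack(int y)
{
    int x = y;                      /* private per invocation: reentrant */
    return x;
}

/* fix 2 variant: the caller supplies the storage (a handle), so the
   function itself holds no shared state */
void foo_handle(int *x, int y)
{
    *x = y;
}

/* serialization approach: keep the shared x, but guard it with a lock */
static int x;
static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

void foo_locked(int y)
{
    pthread_mutex_lock(&x_lock);    /* "while x is busy, wait" */
    x = y;
    pthread_mutex_unlock(&x_lock);  /* "free x" */
}

int main(void)
{
    int h;
    foo_stack(1);
    foo_handle(&h, 2);
    foo_locked(3);
    return 0;
}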
kernel control path: a set of instructions executed by the kernel via:
1. a system call
2. a device driver interrupt
some scenarios:
scenario 1:
	process A is in user mode, executes a system call
	traps into the TH of the kernel
	kernel decides that we cannot execute the system call now
	calls the scheduler
	scheduler resumes process B
	later ... process A resumes, executes the system call code
	returns to user mode
scenario 2:
	process A is in user mode
	accesses a page not in memory
	page fault interrupt occurs
	kernel allocates a page, perhaps loads it from the file system
	returns; A continues execution
scenario 3:
	process A is in kernel mode in the TH exec'ing a system call
	blocks in TCP code on read, as the remote buffer is not here yet
	later ... network device driver interrupt
	remote buffer enqueued on the TCP input/socket queue
	TH is made ready to run, and reads the buffer back out to the
		user buffer
	returns to user mode; the FTP process does something with the
		buffer

process addr space
------------------
user processes run in separate virtual/private address spaces; each has
a private stack/data/code area. when in kernel mode, a process uses
kernel data/code and another, kernel stack. reentrancy is helped here by
the fact that every kernel process has a separate stack.
if separate users on a multi-user machine load emacs ... they share the
executable pages, and do not share the data pages.
processes can be programmed to share data memory via S5 shared memory
techniques and the mmap() system call. mmap maps a "file" into memory so
that it is part of the address space. traditionally the X server has
used this so that the server does NOT have to use the kernel and system
calls in order to twiddle display bits.
note that mmap(2) and open/close/read/write are very different ways to
access a file. the question with mmap is: how/when do you write memory
BACK to the file? (see the sketch below)
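A minimal mmap sketch: map an existing file (data.txt here is an
arbitrary name), modify a byte through memory, and answer the write-back
question explicitly with msync(2); munmap/exit would also flush dirty
pages back eventually:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    int fd = open("data.txt", O_RDWR);  /* any existing non-empty file */
    char *p;

    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;

    /* the file now appears as ordinary memory: no read/write calls */
    p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 'X';                         /* modify the file via memory */

    /* explicitly push dirty pages back to the file */
    msync(p, st.st_size, MS_SYNC);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}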
synchronization and critical regions:
reentrancy in terms of multiple threads in the kernel AND multiple
threads on multiple CPUs means we need locking ... critical regions,
synchronization, etc. Often due to this problem:
	global cpu structure[N]
	process A wants to lock structure J
	process B wants to lock structure J
	maybe a file they both share ...
race conditions can lead to a crashed system, and worse, a crashed hard
disk.
CS theory: we need atomic operations for access to a shared variable.
however the shared variable might be "the cpu", or all of memory, or the
entire danged system ... in general, as a matter of design, the smaller
the variable, the better off you are. BUT consider:
	kernel path A:
		lock the kernel, and in fact lock all the CPUs
		do stuff ...
		unlock the kernel
	etc...
this may occur and is far from ideal. why? in general we want the
smallest amount of lock-out in terms of time, in order to enhance
parallelism (true) or mere concurrency.
critical region:
	lock here
	... <---------------------------- critical region
	unlock here
note we have synch. problems between
	TH and TH (two variations here: single CPU, and multiple CPUs)
	TH and BH
often TH/BH synchronization on a single CPU is dealt with by simply
allowing the BH to interrupt at a higher priority. the TH can BLOCK the
BH by raising the CPU hw priority for a bit:
	th system call code:
		set processor level high
		do stuff
		set processor level back low so intrs can occur
most unix kernels are nonpreemptive in the TH: a TH must block (in sw,
call the switch() function) to let another TH thread run. this does not
cover the case of SMP though:
	cpu1 ----> both can fight over global variable X
	cpu2
interrupt disabling in the TH blocks the BH. on an SMP system, this
doesn't work because there is > 1 TH, and TH1 must block TH2; i.e., the
mechanism wasn't designed for that case.

semaphores
a semaphore is a shared-memory shared counting int. we have:
	an int var
	a list of waiting processes
	down(), up(), which are atomic
down blocks if the new value is < 0. up increments the semaphore and, if
the new value is >= 0, reactivates the 1st process in the list.
a data structure needs its own semaphore, typically set to 1. 1 means
ONE RESOURCE IS AVAILABLE.
	processes A/B/C:
		down() on structure Z
		do stuff to Z
		up()
(see the sketch below)

spin locks
a semaphore may take more time than a spin lock, as it may involve a
scheduling operation. a spin lock is a busy wait: a shared variable with
no sleep Q. spin locks are useless in a single-cpu environment, but
useful in a multi-CPU environment. (see the sketch below)
what happens with a traditional TH system call if getpid were coded like
the following?
	getpid()
	{
		while (1)
			;
		rc = p->pid;
		return rc;
	}
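A user-space analogue of down()/up(), using POSIX semaphores (sem_wait
blocks if the count would drop below 0; sem_post wakes a waiter). Kernel
semaphores differ in detail, so treat this as a sketch of the idea:

#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t z_sem;         /* guards "structure Z"; initialized to 1 */
static int   z;             /* the shared structure, here just an int */

static void *worker(void *arg)
{
    sem_wait(&z_sem);       /* down(): blocks if Z is busy */
    z++;                    /* do stuff to Z */
    sem_post(&z_sem);       /* up(): wake the first waiter, if any */
    return NULL;
}

int main(void)
{
    pthread_t t[3];
    int i;

    sem_init(&z_sem, 0, 1); /* 1 means ONE RESOURCE IS AVAILABLE */
    for (i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    printf("z = %d\n", z);  /* always 3: accesses were serialized */
    return 0;
}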
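And a minimal spin lock using C11 atomics: acquire is an atomic
test-and-set in a busy loop, release clears the flag; there is no sleep
queue and no scheduler involvement. (This also answers the getpid
question above: in a nonpreemptive TH the while(1) never gives up the
CPU, so that CPU is wedged in kernel mode, and on a uniprocessor the
whole system hangs.)

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

/* busy-wait until we atomically change the flag from clear to set */
static void spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* spin: no sleep queue, no scheduler involvement */
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

int main(void)
{
    spin_lock();
    /* critical region: on SMP, another CPU would spin here */
    spin_unlock();
    return 0;
}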
avoiding deadlock
process A holds resource B and is waiting for resource A before it
proceeds; process B holds resource A and is waiting for resource B
before it proceeds. deadlock may be COMPLEX and multi-process, and
include multiple resources. processes need a mechanism for serialization
that does not lead to deadlock ... typically simplicity and coding
conventions are one way to avoid this problem, although analytical
techniques do exist. users of synchronization need to be careful, and
KISS is a good design principle.

-----------------------
signals and IPC
signals are a mechanism for notifying processes of events. They can be
viewed as a simple form of IPC. They may be used
	kernel to process
	process to process
they can be viewed as a primitive form of message passing (message class
N, message is X); however there is no message body, just the existence
of the event. they are represented by small integers; 0 is not used.
Signals may be asynchronous, often kernel to process; e.g., SIGINT sent
to the foreground process via Cntrl-C. POSIX defines 20 or so signals,
two of which are user-configurable (used process to process).
a process may
	ignore some signals (but not SIGKILL/SIGSTOP)
	async exec a function/signal handler (view the signal as a sw
		interrupt at the app layer)
the kernel may, if no signal handler exists,
	terminate the process
	write the execution context into a core/dump file and terminate
		the process
	ignore the signal
	suspend the process (job control)
	resume process execution after stopping it
ATT S5 introduced ipc primitives too:
1. semaphore system
2. message queues
3. shared memory
all use an integer key for process rendezvous.

process management
-------------------
fork() exec() exit() wait()
the traditional shell does this to execute a program, e.g. % ls:
	fork() - create a new process, an exact copy of the shell
	exec() - overlay text/data/stack with the new image (say ls)
	exit(0) - the process terminates. SIGCHLD is sent to the parent.
	wait() - the parent process waits for synchronous completion of
		the child
(see the sketch below)
traditionally it was felt that fork() was expensive, and various
light-weight solutions evolved over time ...
vfork() - don't actually make a copy; just assume exec will happen soon.
	leads to shared text/data with the parent.
copy-on-write ... an MMU capability. means at fork we can copy the
	segment descriptions, but we don't actually clone a text or data
	page until the exec'd program tries to write to it. at that
	point, the MMU sneaks in and replaces it.
hw assignment: read the linux clone(2) man page.
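A minimal sketch of the shell's fork/exec/wait sequence for a single
command (execvp is used so the $PATH search comes for free; no pipes or
job control):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    char *argv[] = { "ls", NULL };   /* the command to run */
    int status;
    pid_t pid = fork();              /* child is an exact copy of us */

    if (pid < 0) { perror("fork"); exit(1); }
    if (pid == 0) {
        execvp(argv[0], argv);       /* overlay child with the ls image */
        perror("execvp");            /* only reached if exec failed */
        _exit(127);
    }
    /* parent: block until the child exits (SIGCHLD also arrives) */
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("child exit status: %d\n", WEXITSTATUS(status));
    return 0;
}

Until the parent calls waitpid, an exited child sits as a zombie, which
is exactly the next topic.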
zombie processes
-----------------
when a child exits ... its process control block is NOT released, as it
holds the exit status: exit(1) ... the 1 needs to be returned to the
parent via the wait system call. other info is possible too ...
a zombie is basically a process whose parent has not yet waited on it:
the code is gone, but the proc[] entry remains.
over time, wait evolved ... waitpid() is the ultimate result. waitpid()
can wait for a specific child, or poll for any dead child.
what happens if the parent terminates before the child, due to error or
on purpose? init - started at boot - has a number of tasks. one is to
adopt orphaned processes and harvest them when they die. it also starts
the boottime shell scripts and eventually the terminal shells. init is
the grandmother of all processes ...

process groups and login sessions
-----------------------------------
	% ls | sort | more
... a pipe takes stdout and magically turns it into stdin for the next
process. the above is a pipe-line. We can say that the above is also a
"job" ... 3 processes glued together to do a new thing. these processes
have a process group and a process group leader. why? because we want to
be able to send a SIGINT or SIGKILL to the group as a whole, not to each
individual process.
we also have login sessions: processes grouped according to the startup
shell. processes in a process group must be in the same login session.
one login session may have several process groups. one process group is
in the foreground (has terminal access); others may be in the background
(cut off from terminal access).
	% bg
	% fg
are used to control this.

memory management
-----------------
a complex subject: virtual memory. the rough idea: more "main" memory
appears to exist than actually exists in terms of ram. in modern systems
this is done by paging fixed-size chunks (say 4k) of programs in and
out. hence:
> we can run apps that are bigger than main memory.
> we can run > 1 app whose space in toto exceeds main memory.
> processes can share text (code), e.g., in dynamic/loadable libraries.
> programs must be relocatable ... loadable in a virtual space anywhere
  in real memory (they must not be coded to have hard-wired physical
  addresses).
> consequently programs can be machine-independent and portable.
the MMU gives us the virtual address space model. the kernel/MMU make it
seem to each process that it has its own virtual space. RAM is
partitioned into 4/8k pages, and page tables are associated with the
mapping of process text/data/stack into RAM.

random access memory usage
--------------------------
the kernel lives in part of it: kernel text and static data. the rest of
it:
	kernel buffer space, e.g., i/o buffers
	various caches (paging)
	process requests for memory: text/data/stack/memory-mapped
		files/shared memory
fragmentation of memory is a problem: often the kernel needs contiguous
space, not just random chunks of space.

kernel memory allocator: KMA
-----------------------------
the KMA is the linux sub-system that tries to allocate memory. it must
be fast, minimize fragmentation, and cooperate with the other parts of
memory management. the linux version uses a slab allocator on top of a
buddy system.

process virtual address space handling
--------------------------------------
the kernel stores a list of process VA memory area descriptors, and at
exec time puts together va areas for:
.executable code
.initialized data
.uninitialized data
.initial user mode stack
.needed shared libs
.heap space (malloc)
demand paging is used. a program may call malloc(), which actually calls
the brk() system call. the kernel may automagically grow the stack.

swapping/caching
----------------
to extend virtual memory, we need swap space on a hard disk. page frames
are swapped to/from it; we can view it as an array of page frames.
recently used pages also form a cache of pages, and are often faster to
use as they may still be in memory. blocks read from disk and written to
may be cached until a periodic sync is done ... this can be a problem
with the ext2 file system.

device drivers
--------------
the kernel talks to devices with device drivers, which are often but not
always loadable modules. the code is encapsulated in a module. a device
driver has a kernel interface that is usually fixed ... and it must
conform to it.
types of device drivers:
	char-based ... read/write characters
	block-based ... read/write blocks
	network-based ... read/write packets
the hw interface may vary ... much of the complexity of a device driver
is usually in the hw interface mechanisms.
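Since drivers are often loadable modules, here is a minimal module
skeleton, hedged: it follows the module_init/module_exit style, needs a
kernel build environment rather than a normal compiler, and the 2.4-era
API these notes describe differs in detail from later kernels:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

/* called when the module is loaded (insmod) */
static int __init hello_init(void)
{
    printk(KERN_INFO "hello: loaded into the kernel\n");
    return 0;               /* nonzero would abort the load */
}

/* called when the module is unloaded (rmmod) */
static void __exit hello_exit(void)
{
    printk(KERN_INFO "hello: unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);
MODULE_LICENSE("GPL");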