Introduction chapter notes from the ORA linux kernel internals book

history of unix rant: no details here, but ...
there are other threads: freebsd/linux in various forms.
what do they have in common?
	FSF software for one thing
	X server/client for another
	See http://www.gnu.org
open source ... what is open source SW?
how many variations on the linux theme? debian, RH, mandrake.
can you offer up any differences?
usoft vs open-source: does closed-source prevent hackers from finding
security bugs, OR does open-source help coders remove security bugs?
the so-called many-eyes theorem.

Linux - Linus Torvalds, 1991.
started with AST's (Andrew Tanenbaum's) minix, but he felt it was
inadequate. wanted an open src os on x86; others have ported it to other
architectures. Linux is copy-lefted, under the FSF GNU license.
assignment: track it down and read it. how does "copyleft" differ from
copyright? how is it different from the BSD src license?
the kernel is at www.kernel.org. linux distributions take it and other
FSF code and add value; for example the RH installer, debian apt-get.
peanut gallery time: why doesn't debian use the RH installer? why
doesn't RH use apt-get?

linux attributes include:
1. nonpreemptive kernel.
	?: can we have multiple threads in kernel code on one cpu running
	"at the same time"? assume no to start with. 2.4 is nonpreemptive
	in kernel mode, although there is a great deal of locking for SMP
	(but that's > 1 cpu). 2.5 removes this restriction. This means
	that chunks of kernel code assume they have a cpu to themselves.
2. multiprocessor support: SMP. the question here is always how much
	serialization the kernel imposes: can one kernel thread lock out
	all the other kernel threads on multiple CPUs?
3. file system: has a virtual file system a la the notions Sun
	pioneered. Can't escape the unix heritage!
	ext2 - rough equivalent of the old BSD UFS (in other words not a
		bad file system). See the Ts'o et al. paper on it.
	ext3 - journaling file system, crash resistant.
	reiserfs - can deal well with small files.
4. STREAMS - came out of Bell Labs; the notion of dividing device
	drivers into separate stack layers and piling them on top of each
	other. linux doesn't have this. contrary to the book, which is
	being polite, it is best NOT to have it.

many years of research in Distributed O.S. led to the client-server
model as the architectural goal of an O.S. does linux care? no. it is
roughly monolithic acc. to AST (monolithic or distributed being the 2
major categories of architectural flavor for an OS), but linux does have
a lot of loadable drivers instead ... more than FreeBSD in fact. This
helps somewhat with o.s. complexity problems.

linux advantages include:
1. free. so just what is RH selling anyway?
2. customizable.
3. runs on cheap/old CPUs.
4. powerful.
5. high standard of src code quality.
6. the kernel CAN be small and compact, even though a given distribution
   may not be.
7. high degree of compatibility: can mount various usoft file systems
   and other os file systems; tcp/ip and ethernet etc.; can execute some
   binaries compiled for other systems. (FreeBSD ironically can execute
   x86/linux binaries, which helps the smaller open-source o.s.
   survive ...)
8. well-supported:
   1. usoft
   2. linux
   solaris/freebsd have a harder time of it.

hw dependency: how do we hide hw dependencies in the kernel source base?
in /usr/src/linux we have arch/include directories. arch, e.g., has
sub-directories for hw-specific platforms including:
	alpha/arm/cris/i386/ia64/m68k/mips/mips64/parisc/ppc/sparc ... etc

linux versions
2 branches of kernel: stable vs developmental. a version has 3 numbers:
2.4.18 means: 2.4 - version, 18 - release. so for example, 2.4.x is
stable, 2.5.x is developmental. 2.4 replaced the 2.2 series; 2.3 turned
into 2.4 at some point. 2.5 work started in nov 2001. the stable branch
is driven by features and bug fixes; kernel algorithms and data
structures are in theory left alone.

basic os system concepts
os == kernel: the basic set of underlying capabilities that offers a
system-call-based Virtual Machine to application processes.
2 objectives:
1. interact with the hw and shield apps from knowledge of the hw.
   implements various abstractions like file, or process.
2. provide an execution environment to apps, "user programs", based on
   the syscall API.
modern os do not allow an app process to interact directly with the hw;
e.g., you cannot reprogram the ethernet controller ... this is done with
basic hw privilege levels (rings in Intel speak): you are either in
kernel mode or user mode. We shall call these "modes".
examples: In kernel mode, we can execute a halt instruction. In user
mode we cannot. In kernel mode we can program a FD controller directly.
In kernel mode we may be able to write to physical memory somehow; in
user mode we operate in a Virtual Memory space. note that in real life,
kernel mode is probably virtual too after the MMU is turned on at boot.

multiuser systems
two ideas here:
a multi-user system allows > 1 user at the same time. it often has
root/supervisor (with some kernel mode privileges) and users 1..N. call
this multi-user.
we may have one CPU, but we can have "virtual CPUs"; else how do we have
multiple users at the same time? a process is an abstraction of CPU
state, and has a process context, which must be saved/restored by the
scheduler. call this multi-process.
multi-user os features:
. must authenticate user ids.
. protect one user process from another user process.
. possibly account for user "work" and user units of storage, e.g., file
  system quotas.
UNIX always enforces hw privilege, for kernel vs user mode. Note: a
kernel process switch means user mode becomes kernel mode; i.e., we
enter the kernel.

users/groups
a user has private space on the machine; e.g., a home directory, and a
"shell" process that starts at $HOME when you login. the o.s. must
enforce this idea.
user id: User ID, or UID. we have a user handle, and a password
associated with it, traditionally in /etc/passwd.
	jrb: ... some password ...
to share files with other users, we have the concept of a group. GID: a
set of users with shared access rights.
UNIX always has root/supervisor. the system admin must login as root to
do certain tasks (e.g., reboot, install the os, upgrade programs).
root CAN do
	# rm -fr /
for better or worse ...
the su program allows a user to become root, or root to become a user
(depending on permissions in the 1st case).

processes
a process is a virtual cpu and the fundamental abstraction. a process
has context: at heart this is the set of registers for the cpu, plus the
MMU register setup. larger notion: kernel variables (e.g., the current
working directory) for a given process. the scheduler may schedule
another process to run if the currently running process uses up its cpu
time quantum.
multiprogramming - concurrent processes, 1 cpu (or 1..n cpus). if we
only have 1 cpu, we can only have 1 process actually running at a time,
although others may appear to be running. processes on UNIX are
typically preemptible; i.e., the scheduler can kick them out.
top-half/bottom-half rant here ... the kernel is a shared library ...
and is non-preemptive in kernel mode; i.e., a user-mode thread switches
into the kernel and runs in the kernel TH until it gives up the CPU
voluntarily. call this a context switch or coroutine function:
	process A in TH calls switch()
	switch()
		saves my context
		scheduler chooses a better process control block
		restores its context
	process B returns from switch()
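The save/restore flow above can be illustrated in user space with the
POSIX ucontext(3) interfaces (getcontext/makecontext/swapcontext). This
is only an analogue, a minimal coroutine sketch; the kernel's real
switch() is per-architecture, hand-tuned code:

#include <stdio.h>
#include <ucontext.h>

static ucontext_t ctx_main, ctx_a;

/* runs on its own stack; each swapcontext saves one register context
 * and restores another, just like the switch() sketch above */
static void task_a(void)
{
    printf("task A: first run\n");
    swapcontext(&ctx_a, &ctx_main);   /* save A, resume main */
    printf("task A: resumed\n");
}

int main(void)
{
    static char stack_a[64 * 1024];

    getcontext(&ctx_a);               /* initialize the context */
    ctx_a.uc_stack.ss_sp = stack_a;
    ctx_a.uc_stack.ss_size = sizeof stack_a;
    ctx_a.uc_link = &ctx_main;        /* resumed when task_a returns */
    makecontext(&ctx_a, task_a, 0);

    swapcontext(&ctx_main, &ctx_a);   /* save main, run A */
    printf("main: back from A\n");
    swapcontext(&ctx_main, &ctx_a);   /* resume A where it left off */
    printf("main: A finished\n");
    return 0;
}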
kernel architecture
unix is monolithic in the AST sense. the microkernel idea is associated
with D.O.S. (distributed o.s.) architectures: the kernel might consist
of just
	message passing primitives (send/recv)
	a scheduler
	intr handlers
and the rest of the os (file/memory/process/device) is servers ... which
send messages to implement system calls.
a monolithic architecture does not have "servers" and a microkernel.
Instead system calls TRAP (software intr), with a user->kernel mode
switch, into the TH of the kernel.
linux tries to get the distributed-o.s. benefits via modularity and
loadable drivers. the loadable driver idea is based on runtime dynamic
linking. module advantages include:
1. modularized approach!
2. platform independence
3. frugal main memory use
4. no performance penalty due to message passing; a loadable module,
   once loaded, is in the kernel.

file system overview
files:
1. a file is conceptually an array of bytes, from 0..max-1. the kernel
   does not interpret it (we do not have "database" files with some
   concept of a variable-length or fixed-length record with keys).
2. files live in a directory name space that starts with root. We have
   the root directory at the start of the tree (/). We may have the same
   name in different directories.
3. each process has a CWD. pathnames are relative to the CWD, or
   absolute (not relative to the CWD).
4. a process uses a pathname to id a file.
5. . and .. may be used. . means here, and .. means the directory up
   one, the parent directory.

hard/soft links
traditionally links are just directory entries. due to inodes it is
possible for a file to have two "names" (links) in different
directories. This is fundamentally how the file system is glued
together, as a directory is a file of 2-tuples of (name, inode-number).
Thus "." and /bob may be the same file but with two names ("." and bob)
and the same inode. It is also possible that vi, view, ex are hard links
(same inode, different names, on the same partition).
	# cd /tmp
	# ln a b	(b is a hard link of a: an alias, but the same file)
	# ls a b
	a b
	# ls -i a b	(print inodes)
users cannot create hard links for directories, as this could make a
graph with cycles. hard links can only be made within the same file
system, as INODES are unique per file system, but not across file
systems.
a soft link is a link that can cross partitions. a soft link is a file
that actually contains the PATH to the target; e.g.,
(bob, /usr/mumble/x/y/z). Think of it as a file pointer.
	# df
	Filesystem   1K-blocks     Used    Avail Capacity  Mounted on
	/dev/ad0s2a      99183    60686    30563    67%    /
	/dev/ad0s2f    1703868   826975   740584    53%    /usr
	/dev/ad0s2e      19815    10430     7800    57%    /var
	/dev/ad0s1     3836084  2876036   960048    75%    /dos
	procfs               4        4        0   100%    /proc
	# cd /tmp
	# df .
	Filesystem   1K-blocks     Used    Avail Capacity  Mounted on
	/dev/ad0s2a      99183    60686    30563    67%    /
	# ln -s /usr/src x
	# ls -l x
	lrwxr-xr-x  1 root  wheel  8 Jul 20 10:59 x -> /usr/src

file types
ordinary files:
	regular file
	directory
	symlink
devices:
	block-oriented device file (found in /dev)
	char device file (found in /dev)
ipc/sockets:
	pipes and named pipes (aka FIFOs)
	sockets (of the unix socket type)

file descriptor and inode
a file has a name in a directory, and an inode. a file does not contain
EOF characters. all info for a file other than its name is in the inode:
	file type
	number of hard links
	file length in bytes
	device id (which device holds the file)
	inode number
	user id
	group id
	timestamps: last change, last access, last modify
	access rights and file mode

access rights and file mode
users of a file fall into 3 classes:
. the user who owns the file
. a user who is in the group, but is not the owner
. everyone else
three types of access rights: read/write/execute, for each of the
previous 3 classes.
	-rw-r--r--  1 root  wheel  900 Jun  8 07:59 /etc/passwd
the -rw-r--r-- field shows the user/group/others rights; root is the
uid, wheel the gid.
3 other kinds of rights:
suid - typically used to borrow root permission for the execution time
	of a program: the process gets the uid of the file owner.
sgid - a process executing a file normally has the GID of the parent
	process (usually the shell), but in this case it inherits the GID
	from the file.
sticky - this flag is outdated in terms of memory mgmt, but may have
	other functions on particular kinds of unix.
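Most of the inode fields above are visible from user space via stat(2).
A minimal sketch (/etc/passwd is just a convenient file that always
exists; lstat(2) is the variant that reports on a symlink itself rather
than its target):

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;

    if (stat("/etc/passwd", &st) < 0) {
        perror("stat");
        return 1;
    }
    printf("inode:      %lu\n", (unsigned long)st.st_ino);
    printf("hard links: %lu\n", (unsigned long)st.st_nlink);
    printf("size:       %ld bytes\n", (long)st.st_size);
    printf("uid/gid:    %d/%d\n", (int)st.st_uid, (int)st.st_gid);
    /* the low 12 bits of st_mode are the rwxrwxrwx bits plus the
       suid/sgid/sticky bits discussed above */
    printf("mode:       %o\n", st.st_mode & 07777);
    return 0;
}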
file-handling system calls
data is actually stored in blocks in a partition on a hard drive, BUT
the os must make it look like an array of bytes. we open a file for
creation/read/read-write/append via the open(2) system call:
	fd = open(path, flag, mode)
	path
	flag - how to open
	mode - access rights of a newly created file
fd is a handle, stored in the process control block, that gives the
process access to the internal os descriptors:
	open file table - the file pointer (offset) is stored here: where
		to do i/o next. this is an integer.
	inode table
note: the creat(2) system call is not used much any more, and is
subsumed by the expanded version of open(2).

accessing an opened file
	newoffset = lseek(fd, offset, whence)
lseek can be used to position the file pointer, followed by read/write.
Thus two ways to do i/o:
	read/write - which always implicitly advance the open file table
		offset by the returned amount;
	lseek to a certain spot, then read/write.
	nread = read(fd, buf, count)
	nwrite = write(fd, buf, count)
	close(fd)
2 questions (see the sketches below):
1. how does cat work? e.g., # cat /etc/passwd
2. how can you store a C structure in a file and use lseek to retrieve
   it? Let's call this an indexed file scheme, and say we want to store
   100 fixed-size records in a file and then retrieve record 33.
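A sketch of question 1, under the assumption that cat is at heart just a
read/write loop (the real cat also handles multiple files and options):

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/* minimal cat: copy one named file (or stdin if no arg) to stdout */
int main(int argc, char *argv[])
{
    char buf[4096];
    ssize_t n;
    int fd = 0;                             /* default: stdin */

    if (argc > 1 && (fd = open(argv[1], O_RDONLY)) < 0) {
        perror(argv[1]);
        return 1;
    }
    /* read(2) advances the open-file-table offset implicitly;
       n == 0 means EOF */
    while ((n = read(fd, buf, sizeof buf)) > 0)
        if (write(1, buf, (size_t)n) != n)
            return 1;
    close(fd);
    return 0;
}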
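And a sketch of question 2: with fixed-size records, record k lives at
byte offset k * sizeof(record), so lseek + read fetches record 33
directly. The struct layout and the file name recs.dat are made up for
illustration:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/* hypothetical fixed-size record; any struct of known size works */
struct rec {
    int  id;
    char name[28];
};

int main(void)
{
    struct rec r;
    int i, fd = open("recs.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);

    if (fd < 0) { perror("open"); return 1; }

    /* store 100 fixed-size records back to back */
    for (i = 0; i < 100; i++) {
        r.id = i;
        snprintf(r.name, sizeof r.name, "record-%d", i);
        write(fd, &r, sizeof r);
    }

    /* record 33 lives at byte offset 33 * sizeof(struct rec) */
    lseek(fd, 33L * sizeof(struct rec), SEEK_SET);
    read(fd, &r, sizeof r);
    printf("id=%d name=%s\n", r.id, r.name);

    close(fd);
    return 0;
}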
rename/delete of a file
	res = rename(oldpath, newpath)
	res = unlink(pathname)
in point of fact, unlink internally is how a file is deleted: removing
the last link removes the file.

Overview of unix kernels
---------------------------
process/kernel model
two modes: user mode/kernel mode. in user mode a program cannot directly
access kernel code or data. when it switches to kernel mode via a system
call, it can.
the kernel is not a process ... it is actually a shared text library:
the same thread of control runs in user mode and kernel mode. this is
the TOP HALF. there are usually > 1 kernel threads that are started at
boot (or possibly via a loadable module) and that always execute in the
kernel. note that sleeping (going from RUN state to SLEEP state) only
occurs in the kernel. typically a process voluntarily gives up the CPU
in the TH, and the scheduler will determine another process to wake up
in the TH.
the Bottom Half (note: linux has another, narrower meaning for this term
too) is interrupt driven. clock/network device/char device/file device
interrupts come in ... one form of interrupt is the machine/cpu
exception; e.g., a program has made a fatal bus error. kernel code is
executed and hence we are in kernel mode. however: the bottom half
usually has higher machine hw priority than user mode/the TH of the
kernel. device drivers (and the clock!) can't wait.

process implementation:
a process has context, and this state must somehow be stored for one
process to replace another. in some small/RT os, this might simply be
the CPU registers and little else. in unix/linux a process has a general
set of variables plus context including the MMU setup. we save this in a
process control block, which includes:
1. pc plus cpu registers
2. floating point state
3. process control registers
4. mmu state
note that the register state alone might define a thread, where a
process can have multiple threads, and a process means 1 thread plus the
MMU address setup. to resume process A, we have to get the info above
for proc A, put it back in the real hw, and start it up ...

reentrant kernels
reentrant - narrowly means a chunk of code that can be executed by > 1
thread: we do not have data that is somehow tied to a single thread,
stomped on by other threads, resulting in disaster.
	int x;
	foo(int y)
	{
		x = y;
	}
is not liable to be reentrant. why? how can we fix it? (2 ways:
1. make x local on the stack. 2. make x a global array of structures
with some mechanism for allocation and handle creation.)
another approach: we can have nonreentrant functions and serialize their
access to a shared resource via various mechanisms (see the sketch
below) ...
	foo(int y)
	{
		while x is busy, wait ...
		x = y;
		free x
	}
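The two fixes and the serialization approach, sketched with a pthread
mutex standing in for "while x is busy, wait" (an assumption: the kernel
uses its own locking primitives, not pthreads):

#include <pthread.h>

/* fix 1: x lives on the caller's stack, so each thread gets its own */
int foo_stack(int y)
{
    int x = y;                      /* private per invocation: reentrant */
    return x;
}

/* fix 2 variant: the caller supplies the storage (a handle), so the
   function itself holds no shared state */
void foo_handle(int *x, int y)
{
    *x = y;
}

/* serialization approach: keep the shared x, but guard it with a lock */
static int x;
static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

void foo_locked(int y)
{
    pthread_mutex_lock(&x_lock);    /* "while x is busy, wait" */
    x = y;
    pthread_mutex_unlock(&x_lock);  /* "free x" */
}

int main(void)
{
    int h;
    foo_stack(1);
    foo_handle(&h, 2);
    foo_locked(3);
    return 0;
}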
kernel control path: a set of instructions executed by the kernel via:
1. a system call
2. a device driver interrupt
some scenarios:
scenario 1:
	process A is in user mode, executes a system call
	traps into the TH of the kernel
	kernel decides that we cannot execute the system call now
	calls the scheduler
	scheduler resumes process B
	later ... process A resumes, executes the system call code
	returns to user mode
scenario 2:
	process A is in user mode
	accesses a page not in memory
	page fault interrupt occurs
	kernel allocates a page, perhaps loads it from the file system
	returns; A continues execution
scenario 3:
	process A is in kernel mode in the TH exec'ing a system call
	blocks in TCP code on read, as the remote buffer is not here yet
	later ... network device driver interrupt
	remote buffer enqueued on the TCP input/socket queue
	TH is made ready to run, and reads the buffer back out to the
		user buffer
	returns to user mode; the FTP process does something with the
		buffer

process addr space
------------------
user processes run in separate virtual/private address spaces; each has
a private stack/data/code area. when in kernel mode, a process uses
kernel data/code and another, kernel stack. reentrancy is helped here by
the fact that every kernel process has a separate stack.
if separate users on a multi-user machine load emacs ... they share the
executable pages, and do not share the data pages.
processes can be programmed to share data memory via S5 shared memory
techniques and the mmap() system call. mmap maps a "file" into memory so
that it is part of the address space. traditionally the X server has
used this so that the server does NOT have to use the kernel and system
calls in order to twiddle display bits.
note that mmap(2) and open/close/read/write are very different ways to
access a file. the question with mmap is: how/when do you write memory
BACK to the file? (see the sketch below)
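A minimal mmap sketch: map an existing file (data.txt here is an
arbitrary name), modify a byte through memory, and answer the write-back
question explicitly with msync(2); munmap/exit would also flush dirty
pages back eventually:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    int fd = open("data.txt", O_RDWR);  /* any existing non-empty file */
    char *p;

    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;

    /* the file now appears as ordinary memory: no read/write calls */
    p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 'X';                         /* modify the file via memory */

    /* explicitly push dirty pages back to the file */
    msync(p, st.st_size, MS_SYNC);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}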
synchronization and critical regions:
reentrancy in terms of multiple threads in the kernel AND multiple
threads on multiple CPUs means we need locking ... critical regions,
synchronization, etc. Often due to this problem:
	global cpu structure[N]
	process A wants to lock structure J
	process B wants to lock structure J
	maybe a file they both share ...
race conditions can lead to a crashed system, and worse, a crashed hard
disk.
CS theory: we need atomic operations for access to a shared variable.
however the shared variable might be "the cpu", or all of memory, or the
entire danged system ... in general, as a matter of design, the smaller
the variable, the better off you are. BUT consider:
	kernel path A:
		lock the kernel, and in fact lock all the CPUs
		do stuff ...
		unlock the kernel
	etc...
this may occur and is far from ideal. why? in general we want the
smallest amount of lock-out in terms of time, in order to enhance
parallelism (true) or mere concurrency.
critical region:
	lock here
	... <---------------------------- critical region
	unlock here
note we have synch. problems between
	TH and TH (two variations here: single CPU, and multiple CPUs)
	TH and BH
often TH/BH synchronization on a single CPU is dealt with by simply
allowing the BH to interrupt at a higher priority. the TH can BLOCK the
BH by raising the CPU hw priority for a bit:
	th system call code:
		set processor level high
		do stuff
		set processor level back low so intrs can occur
most unix kernels are nonpreemptive in the TH: a TH must block (in sw,
call the switch() function) to let another TH thread run. this does not
cover the case of SMP though:
	cpu1 ----> both can fight over global variable X
	cpu2
interrupt disabling in the TH blocks the BH. on an SMP system, this
doesn't work because there is > 1 TH, and TH1 must block TH2; i.e., the
mechanism wasn't designed for that case.

semaphores
a semaphore is a shared-memory shared counting int. we have:
	an int var
	a list of waiting processes
	down(), up(), which are atomic
down blocks if the new value is < 0. up increments the semaphore and, if
the new value is >= 0, reactivates the 1st process in the list.
a data structure needs its own semaphore, typically set to 1. 1 means
ONE RESOURCE IS AVAILABLE.
	processes A/B/C:
		down() on structure Z
		do stuff to Z
		up()
(see the sketch below)

spin locks
a semaphore may take more time than a spin lock, as it may involve a
scheduling operation. a spin lock is a busy wait: a shared variable with
no sleep Q. spin locks are useless in a single-cpu environment, but
useful in a multi-CPU environment. (see the sketch below)
what happens with a traditional TH system call if getpid were coded like
the following?
	getpid()
	{
		while (1)
			;
		rc = p->pid;
		return rc;
	}
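A user-space analogue of down()/up(), using POSIX semaphores (sem_wait
blocks if the count would drop below 0; sem_post wakes a waiter). Kernel
semaphores differ in detail, so treat this as a sketch of the idea:

#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t z_sem;         /* guards "structure Z"; initialized to 1 */
static int   z;             /* the shared structure, here just an int */

static void *worker(void *arg)
{
    sem_wait(&z_sem);       /* down(): blocks if Z is busy */
    z++;                    /* do stuff to Z */
    sem_post(&z_sem);       /* up(): wake the first waiter, if any */
    return NULL;
}

int main(void)
{
    pthread_t t[3];
    int i;

    sem_init(&z_sem, 0, 1); /* 1 means ONE RESOURCE IS AVAILABLE */
    for (i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    printf("z = %d\n", z);  /* always 3: accesses were serialized */
    return 0;
}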
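And a minimal spin lock using C11 atomics: acquire is an atomic
test-and-set in a busy loop, release clears the flag; there is no sleep
queue and no scheduler involvement. (This also answers the getpid
question above: in a nonpreemptive TH the while(1) never gives up the
CPU, so that CPU is wedged in kernel mode, and on a uniprocessor the
whole system hangs.)

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

/* busy-wait until we atomically change the flag from clear to set */
static void spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* spin: no sleep queue, no scheduler involvement */
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

int main(void)
{
    spin_lock();
    /* critical region: on SMP, another CPU would spin here */
    spin_unlock();
    return 0;
}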
avoiding deadlock
process A holds resource B and is waiting for resource A before it
proceeds; process B holds resource A and is waiting for resource B
before it proceeds. deadlock may be COMPLEX and multi-process, and
include multiple resources. processes need a mechanism for serialization
that does not lead to deadlock ... typically simplicity and coding
conventions are one way to avoid this problem, although analytical
techniques do exist. users of synchronization need to be careful, and
KISS is a good design principle.

-----------------------
signals and IPC
signals are a mechanism for notifying processes of events. They can be
viewed as a simple form of IPC. They may be used
	kernel to process
	process to process
they can be viewed as a primitive form of message passing (message class
N, message is X); however there is no message body, just the existence
of the event. they are represented by small integers; 0 is not used.
Signals may be asynchronous, often kernel to process; e.g., SIGINT sent
to the foreground process via Cntrl-C. POSIX defines 20 or so signals,
two of which are user-configurable (used process to process).
a process may
	ignore some signals (but not SIGKILL/SIGSTOP)
	async exec a function/signal handler (view the signal as a sw
		interrupt at the app layer)
the kernel may, if no signal handler exists,
	terminate the process
	write the execution context into a core/dump file and terminate
		the process
	ignore the signal
	suspend the process (job control)
	resume process execution after stopping it
ATT S5 introduced ipc primitives too:
1. semaphore system
2. message queues
3. shared memory
all use an integer key for process rendezvous.

process management
-------------------
fork() exec() exit() wait()
the traditional shell does this to execute a program, e.g. % ls:
	fork() - create a new process, an exact copy of the shell
	exec() - overlay text/data/stack with the new image (say ls)
	exit(0) - the process terminates. SIGCHLD is sent to the parent.
	wait() - the parent process waits for synchronous completion of
		the child
(see the sketch below)
traditionally it was felt that fork() was expensive, and various
light-weight solutions evolved over time ...
vfork() - don't actually make a copy; just assume exec will happen soon.
	leads to shared text/data with the parent.
copy-on-write ... an MMU capability. means at fork we can copy the
	segment descriptions, but we don't actually clone a text or data
	page until the exec'd program tries to write to it. at that
	point, the MMU sneaks in and replaces it.
hw assignment: read the linux clone(2) man page.
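A minimal sketch of the shell's fork/exec/wait sequence for a single
command (execvp is used so the $PATH search comes for free; no pipes or
job control):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    char *argv[] = { "ls", NULL };   /* the command to run */
    int status;
    pid_t pid = fork();              /* child is an exact copy of us */

    if (pid < 0) { perror("fork"); exit(1); }
    if (pid == 0) {
        execvp(argv[0], argv);       /* overlay child with the ls image */
        perror("execvp");            /* only reached if exec failed */
        _exit(127);
    }
    /* parent: block until the child exits (SIGCHLD also arrives) */
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("child exit status: %d\n", WEXITSTATUS(status));
    return 0;
}

Until the parent calls waitpid, an exited child sits as a zombie, which
is exactly the next topic.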
zombie processes
-----------------
when a child exits ... its process control block is NOT released, as it
holds the exit status: exit(1) ... the 1 needs to be returned to the
parent via the wait system call. other info is possible too ...
a zombie is basically a process whose parent has not yet waited on it:
the code is gone, but the proc[] entry remains.
over time, wait evolved ... waitpid() is the ultimate result. waitpid()
can wait for a specific child, or poll for any dead child.
what happens if the parent terminates before the child, due to error or
on purpose? init - started at boot - has a number of tasks. one is to
adopt orphaned processes and harvest them when they die. it also starts
the boottime shell scripts and eventually the terminal shells. init is
the grandmother of all processes ...

process groups and login sessions
-----------------------------------
	% ls | sort | more
... a pipe takes stdout and magically turns it into stdin for the next
process. the above is a pipe-line. We can say that the above is also a
"job" ... 3 processes glued together to do a new thing. these processes
have a process group and a process group leader. why? because we want to
be able to send a SIGINT or SIGKILL to the group as a whole, not to each
individual process.
we also have login sessions: processes grouped according to the startup
shell. processes in a process group must be in the same login session.
one login session may have several process groups. one process group is
in the foreground (has terminal access); others may be in the background
(cut off from terminal access).
	% bg
	% fg
are used to control this.

memory management
-----------------
a complex subject: virtual memory. the rough idea: more "main" memory
appears to exist than actually exists in terms of ram. in modern systems
this is done by paging fixed-size chunks (say 4k) of programs in and
out. hence:
> we can run apps that are bigger than main memory.
> we can run > 1 app whose space in toto exceeds main memory.
> processes can share text (code), e.g., in dynamic/loadable libraries.
> programs must be relocatable ... loadable in a virtual space anywhere
  in real memory (they must not be coded to have hard-wired physical
  addresses).
> consequently programs can be machine-independent and portable.
the MMU gives us the virtual address space model. the kernel/MMU make it
seem to each process that it has its own virtual space. RAM is
partitioned into 4/8k pages, and page tables are associated with the
mapping of process text/data/stack into RAM.

random access memory usage
--------------------------
the kernel lives in part of it: kernel text and static data. the rest of
it:
	kernel buffer space, e.g., i/o buffers
	various caches (paging)
	process requests for memory: text/data/stack/memory-mapped
		files/shared memory
fragmentation of memory is a problem: often the kernel needs contiguous
space, not just random chunks of space.

kernel memory allocator: KMA
-----------------------------
the KMA is the linux sub-system that tries to allocate memory. it must
be fast, minimize fragmentation, and cooperate with the other parts of
memory management. the linux version uses a slab allocator on top of a
buddy system.

process virtual address space handling
--------------------------------------
the kernel stores a list of process VA memory area descriptors, and at
exec time puts together va areas for:
.executable code
.initialized data
.uninitialized data
.initial user mode stack
.needed shared libs
.heap space (malloc)
demand paging is used. a program may call malloc(), which actually calls
the brk() system call. the kernel may automagically grow the stack.

swapping/caching
----------------
to extend virtual memory, we need swap space on a hard disk. page frames
are swapped to/from it; we can view it as an array of page frames.
recently used pages also form a cache of pages, and are often faster to
use as they may still be in memory. blocks read from disk and written to
may be cached until a periodic sync is done ... this can be a problem
with the ext2 file system.

device drivers
--------------
the kernel talks to devices with device drivers, which are often but not
always loadable modules. the code is encapsulated in a module. a device
driver has a kernel interface that is usually fixed ... and it must
conform to it.
types of device drivers:
	char-based ... read/write characters
	block-based ... read/write blocks
	network-based ... read/write packets
the hw interface may vary ... much of the complexity of a device driver
is usually in the hw interface mechanisms.
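Since drivers are often loadable modules, here is a minimal module
skeleton, hedged: it follows the module_init/module_exit style, needs a
kernel build environment rather than a normal compiler, and the 2.4-era
API these notes describe differs in detail from later kernels:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

/* called when the module is loaded (insmod) */
static int __init hello_init(void)
{
    printk(KERN_INFO "hello: loaded into the kernel\n");
    return 0;               /* nonzero would abort the load */
}

/* called when the module is unloaded (rmmod) */
static void __exit hello_exit(void)
{
    printk(KERN_INFO "hello: unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);
MODULE_LICENSE("GPL");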