Linux kernel internals book: chapter 2: memory addressing
----------------------------------------------------------
outline: IA segmentation overview and how linux uses it; paging and other hw,
and how linux uses paging. 80x86 (Intel Architecture) oriented.

memory mgmt in the book:
  chapter 2: memory addressing
  chapter 7: how the kernel allocates memory to itself
  chapter 8: how "linear addresses" are assigned to processes
  note: "linear address" is Intel-speak for virtual address.

3 kinds of addresses:
  1. logical address (segmented addressing)
     a segmented address is the pair: which segment + which offset within it.
     we load a segmentation register, and instructions are relative to that
     segment (e.g., one segment for code, one for data). this allows code to
     be relocated in memory.
  2. linear address (virtual address)
     a 32-bit address, can address 4G (2**32). in hex: 0x00000000 .. 0xffffffff.
     a flat linear address space can be viewed as one VERY big segment.
  3. physical address
     32-bit integers, used to address the ram chips themselves.
     why would it be bad to encode physical addresses in a program?

traditionally we think of 2 translation steps:
  logical (program) address -> segmentation unit -> linear (virtual) address
  linear address -> paging hardware -> physical address

multiprocessor system: more than 1 cpu accesses all ram the same way, so a
memory arbiter is needed between the cpus and ram, because ram must be
accessed serially; the arbiter grants access. even a single-cpu system needs
one, because DMA controllers transfer data in parallel with the cpu.

segmentation
------------
the 386 provides two modes:
  1. real mode: used by pre-VM systems, still used at boot (the cpu
     initializes to that state; setup code then configures the mmu and
     switches to protected mode)
  2. protected mode

logical address = segment part + offset part
  segment selector: 16 bits
  offset: 32 bits

6 segment registers: cs, ss, ds, es, fs, gs. they only hold a segment selector.
  cs - code segment, points to code. also contains the current privilege
       level: 0 = highest privilege (kernel mode), 3 = lowest (user mode)
  ss - stack segment, points to the stack
  ds - data segment, points to data
  the other 3 are general purpose

segment descriptors are stored in either:
  GDT - global descriptor table; the gdtr register points to it
  LDT - local descriptor table; the ldtr register points to it
  one global GDT; one LDT per process

segment descriptor:
  32 bits of base: linear address of the 1st byte of the segment
  20 bits of limit: segment size, in bytes or in 4k units
    (20 + 12 == 32, therefore segment size ranges from 4k up to 4G)
  G granularity flag: if 0, the limit is counted in bytes, else in 4k units
  S system flag: if cleared, the segment is a system segment (holds critical
    data structures); otherwise it is an ordinary code or data segment
  4-bit type field, distinguishing among others:
    code segment descriptor
    data segment descriptor
    task state segment descriptor (the TSS is a save area for registers;
      appears in the GDT only)
    ldt descriptor (points to an LDT table; GDT only)

segment register (segment selector) -> table of segment descriptors ->
segment/memory mapping. the descriptor is cached in a non-programmable
register: each time a segment register is loaded, its segment descriptor is
also loaded -- this provides faster access and lets us avoid touching the GDT
or LDT on every memory reference.

segment selector:
  13-bit segment index, selects a segment descriptor
  1 bit to indicate GDT or LDT
  2-bit privilege level field (never mind)
  the index maps to a descriptor by simple arithmetic: e.g., if the index is
  2 and the GDT is used, the descriptor lives at the GDT base address plus
  2 * 8. (a small C sketch of this lookup appears at the end of this
  segmentation section.)
  the 1st entry of the GDT is always null, so a memory access made through a
  segment register accidentally left at 0 causes an exception.

refer to figure 2-4: this is segmentation: how we turn a logical address into
a virtual address. note: segment/page/offset theory is always important, no
matter what the architecture is.
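to make the selector -> descriptor -> linear address arithmetic concrete,
here is a minimal C sketch of what the segmentation unit does. it is
illustrative only: the struct layout and the function name are made up for
the example, the descriptors are shown already decoded into a flat array, and
the LDT, privilege checks, and the cached non-programmable registers are
ignored.

    /* sketch only: resolve a logical address (selector:offset) to a linear
     * address, assuming a flat array of already-decoded descriptors. */
    #include <stdint.h>

    struct seg_descriptor {
        uint32_t base;            /* linear address of the 1st byte of the segment */
        uint32_t limit;           /* 20-bit limit field                            */
        int      granularity_4k;  /* G flag: is the limit counted in 4k units?     */
    };

    /* the upper 13 bits of the selector are the descriptor index; each
     * descriptor is 8 bytes, so hardware fetches gdt_base + index * 8. */
    uint32_t logical_to_linear(const struct seg_descriptor *gdt,
                               uint16_t selector, uint32_t offset)
    {
        const struct seg_descriptor *d = &gdt[selector >> 3];
        uint64_t size = d->granularity_4k ? ((uint64_t)d->limit + 1) << 12
                                          :  (uint64_t)d->limit + 1;

        if (offset >= size)
            return 0;             /* real hardware would raise a protection fault */
        return d->base + offset;  /* the linear (virtual) address */
    }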
segmentation in linux
---------------------
theory: segmentation could allow 1 function per segment. this is a totally
wacky theory. history of unix: segments were used for text, data, and stack,
with a possible o.s. segment for the per-process kernel stack; all the
functions go in text (early unix on the pdp-11).

if you have enough address bits, segmentation can give you large segments,
but swapping LARGE segments in/out is not *efficient*, thus swapping
fixed-size pages is favored. however, *segmentation* might still be used for
privilege/access protection, e.g., why not make the stack segment
non-executable to help prevent buffer-overflow attacks?

all processes use the same logical addresses. the kernel uses the gdt ... in
fact the gdt has enough segments that we could probably just use it and get
by. the kernel does not use LDTs, although a user process can install one
with modify_ldt().

segments used by linux:
  kernel code segment - can be read and executed, covers all memory
  kernel data segment - can be read and written, covers all memory
  user code segment
  user data segment
  a task state segment for each cpu (basically an array of saved registers)
  a default ldt table
  four segments of the GDT for APM support

paging in hw
------------
functions of the paging unit:
  translate a linear address into a physical address
  check access permissions

pages are contiguous fixed-size chunks (segments are variable sized). 4k is
the usual size ... easy to write to/from primary and secondary storage. ram
and swap space are both partitioned into page frames: frame 0, frame 1,
frame 2, ... frame n, each a 4k chunk.

the 80x86 paging architecture splits a linear address into a 3-tuple:
  directory + table + offset
  10 bits   + 10 bits + 12 bits (4k page) = 32 bits
there are 2 address-translation steps:
  1. via the page directory, which is a table of page tables
  2. via a page table, whose entries point to page frames
  the offset is then used within the selected page frame itself.
(a small C sketch of this split follows below.)

why two levels? because page tables live in process context, and if we set up
a process a priori with a PTE (page table entry) for every possible page, the
memory overhead would be very costly: with a 12-bit offset (4k page) we would
need 2**20 entries to cover the whole 2**32 address space, per process. with
10 bits per level, each directory/table has only 1k entries, and page tables
are allocated only for the regions a process actually uses.
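to see the 10/10/12 split concretely, here is a minimal C sketch, assuming
plain 32-bit (non-PAE) 2-level paging with 4k pages; the variable names and
the example address are arbitrary, not kernel macros.

    /* decompose a 32-bit linear address the way 80x86 2-level paging does */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t linear = 0x0804a1c4;              /* arbitrary example address   */
        uint32_t dir    = (linear >> 22) & 0x3ff;  /* index into page directory   */
        uint32_t table  = (linear >> 12) & 0x3ff;  /* index into the page table   */
        uint32_t offset =  linear        & 0xfff;  /* byte offset inside the page */

        /* the directory entry selects one of 1k page tables; the table entry
         * selects one of 1k page frames; the offset is used unchanged. */
        printf("dir=%u table=%u offset=0x%03x\n",
               (unsigned)dir, (unsigned)table, (unsigned)offset);
        return 0;
    }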
hardware protection in paging is coarser than in segmentation: a page is
either read/write or read-only.

3-level paging: linux adopted a deeper hierarchy because it wants to work on
64-bit architectures, where there is a LOT of address space and a 2-level
hierarchy is insufficient. e.g., choose a 16k page, therefore 14 bits of
offset, leaving 50 bits for the rest of the address; 25 bits in each of 2
table levels means 2**25 possible entries, about 32 million, per table. too
big. the alpha does it as follows: page frames are 2**13 = 8k; only the least
significant 43 bits of the address are used, leaving 30 bits above the
offset; 3 levels of 10 bits apiece, so 1k entries per table.

physical address extension paging mechanism (PAE): the 386 through the
Pentium have 32-bit addresses, but linear-address limitations mean the kernel
can only directly handle about 1G, even though 4G is possible. the Pentium
Pro and later can translate 32-bit linear addresses into 36-bit physical
addresses == 64G of ram; a new level of the page-table hierarchy was
introduced for this. the problem is that the programmable (linear) address
space is still only 2**32 per process.

hardware cache
--------------
cpu registers are always faster than memory, therefore we need to cache
chunks of instructions/data. the principle of locality is important here:
code may sit in a loop, and branches may be costly. Intel introduced the
cache line, the set of bytes transferred as a unit between DRAM and the fast
cache memory. a fully associative cache means a line from DRAM can be placed
anywhere in the cache; in an N-way set-associative cache a line can only go
in certain places.

reading is less tricky than writing. two write policies:
  write-through: writes go to the cache and back to ram immediately
  write-back: writes update only the cache line; ram is updated later, on an
  interesting cache event, for example a cache miss or a flush
on a multi-cpu machine there is one cache per cpu, so cache snooping is
needed: if one cpu modifies its cache, the other cpus' caches may need to be
updated too. in linux, caching is enabled for all page frames, and the
write-back strategy is always used.

translation lookaside buffers (TLBs) exist to speed up linear-address
translation: it would be very inefficient to walk the page tables in ram on
every access, so the TLB is basically a cache of recent linear-to-physical
translations. in a multi-cpu system, each cpu has its own TLB.

paging in linux
----------------
linux uses a 4-level hierarchy so that it can be portable to 64-bit
architectures. the hw cr3 register points to the page global directory; an
address is then resolved through global directory, upper directory, middle
directory, page table, and offset. (a small C sketch of such a walk closes
these notes.)

it is desirable to:
  1. assign a different physical address space to each process, thereby
     minimizing the possibility of addressing errors
  2. distinguish pages (data) from page frames (physical addresses), so the
     same page can be loaded into frame X, Y, or Z as needed.

each process has its own page global directory and its own set of page
tables. at a process switch, linux saves cr3 in the process descriptor and
loads the new process's cr3. on a pentium there is only a 2-level hw
hierarchy, so linux folds the extra levels away (gives their index fields 0
bits); the middle directory does come into use with the PAE mechanism,
however.
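as a rough illustration of what a 4-level walk looks like in software, here
is a hedged C sketch. it is not the kernel's pgd/pud/pmd/pte API: the
9-bits-per-level / 4k-page split is assumed (the x86-64 layout), each table
is modelled as a plain array of pointers, and a NULL entry stands in for a
"not present" entry that would normally trigger a page fault.

    /* sketch: walk a 4-level page-table tree (global -> upper -> middle ->
     * page table) and return frame base + offset, or 0 if unmapped. */
    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SHIFT 12                 /* 4k pages               */
    #define PTRS       512                /* 2**9 entries per level */

    static size_t idx(uint64_t va, int level)  /* level 3 = global .. 0 = page table */
    {
        return (va >> (PAGE_SHIFT + 9 * level)) & (PTRS - 1);
    }

    uint64_t translate(void **global_dir, uint64_t va)
    {
        void **upper   = global_dir ? global_dir[idx(va, 3)] : NULL;
        void **middle  = upper      ? upper[idx(va, 2)]      : NULL;
        void **table   = middle     ? middle[idx(va, 1)]     : NULL;
        uint64_t frame = table ? (uintptr_t)table[idx(va, 0)] : 0;

        if (!frame)
            return 0;                                    /* unmapped: page fault */
        return frame + (va & ((1u << PAGE_SHIFT) - 1));  /* frame base + offset  */
    }

at a process switch the kernel in effect swaps which top-level table a walk
like this starts from, which is what reloading cr3 does in hardware.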