Chapter 4 - Debugging Techniques

kernel code debugging can be tricky

consider:
	how do we debug top-half code from a user process?
		mode switch ... gets in the way.
	interrupt-side is an even worse question: 
		plus it's async.
	top-half vs bottom-half interactions events/races
	races from interrupts/SMP (> 1 cpu)

	how exactly do you think one user process can
		debug another user process anyway?
		(man ptrace).  

approaches then are:
1. debugging by printing
2. debugging by querying (ask the kernel for info)
3. debugging by watching (e.g., watch behavior of user process
4. debugging by READING THE ERROR MESSAGES (by paying attention)
		oops - null pointer reference
5. kernel debuggers and other tools

Various kernel config options exist that may be of help.

See pp 73-75.  note CONFIG_DEBUG_INFO for example.

--------------------------------------
debugging by printing

printk

	loglevels/priorities included with messages

macros:

from highest/worst to lowest/info

KERN_EMERG		high priority		low number
KERN_ALERT
KERN_CRIT
KERN_ERR
KERN_WARNING
KERN_NOTICE
KERN_INFO
KERN_DEBUG		lo priority		hi number

<linux/kernel.h>

#define KERN_EMERG      "<0>"   /* system is unusable                   */
#define KERN_ALERT      "<1>"   /* action must be taken immediately     */
#define KERN_CRIT       "<2>"   /* critical conditions                  */
#define KERN_ERR        "<3>"   /* error conditions                     */
#define KERN_WARNING    "<4>"   /* warning conditions                   */
#define KERN_NOTICE     "<5>"   /* normal but significant condition     */
#define KERN_INFO       "<6>"   /* informational                        */
#define KERN_DEBUG      "<7>"   /* debug-level messages                 */

   printk(KERN_EMERG "System halted.\n");

no priority means a default is used.  This is not a good idea.
default is: DEFAULT_MESSAGE_LOGLEVEL, which is probably KERN_WARNING.

int in kernel
console_loglevel value determines whether messages go to
	the virtual console or not.  

note logging system consists of both klogd/syslogd on linux ...
	syslogd config may mean messages go to another computer entirely.
	messages are probably stored in /var/log/messages 
	this is unrelated to console_loglevel

console_loglevel may be adjusted by using klogd -n <value>

or by setting kernel.printk value (4 values) or

[kernel]# sysctl -a | grep print
kernel.printk = 1       4       1       7

setting /proc value
[kernel]# cat /proc/sys/kernel/printk

1       4       1       7

all kernel messages will go to console
echo 8 > /proc/sys/kernel/printk 

[root@zymurgy kernel]# sysctl -a |grep printk
kernel.printk = 8       4       1       7

first is console loglevel.  2nd is default level for
		messages.

-------------------------------------------------
How messages get logged

	printk writes messages into circular queue in kernel.
		at some point they will wrap.

	therefore this means what about bottom-half printk writes?

	wakes up any processes blocked on reading that ...

		sleeping on syslog(2) or reading /proc/kmsg

	klogd likely reads /proc/kmsg

	dmesg is use app that simply dumps the queue.

	# dmesg

	# less /var/log/messages to see what klogd did with
		the messages read from the kernel (or written
		to the console or the network)

	btw: printk can be used in a interrupt handler
	----------------------------------------------

	klogd reads (and consumes) kernel messages from the buffer
		hands them off to syslogd which checks
		/etc/syslog.conf to decide what to do about the situation ...
		may write locally to syslog
		may use UDP syslog port/protocol to ship messages off system.
		UDP port 514  (TCP port 514 is old shell)

	it may be worthwhile to make the buffer bigger for debug.

turning the messages on/off

	1. it is a good idea to take a moment and think about the design
	of your printk/debug architecture.

	2. it is a good idea to NOT REMOVE IT when "done". as there may
	be bugs later.

	two approaches, macro or C code branch test

	macro:

	see p. 80

	#undef PDEBUG
	#ifdef SCULL_DEBUG
	#ifdef __KERNEL__	# in case we are in the kernel
	#define PDEBUG(fmt, args...) printk(KERN_DEBUG "scull: " fmt, ## args)
	#else # user space 
	#define PDEBUG(fmt, args...) fprintf(stderr, fmt, ## args)
	#endif
	#else
	#define PDEBUG(fmt, args...) /* it's off jim */
	#endif

so if we compile with -DSCULL_DEBUG we turn PDEBUG on in our
	code

See Makefile code, p. 102

	note: we need a gcc compiler with extensions to support
		macros with variable arguments.  GNU make
		is also needed.

	code/branch

	int debug = 0;  /* or 1 */
	could be input parameter

	#ifdef DEBUG
	if (debug) {
		print ...
	}
	#endif

what are the tradeoffs between these two approaches?

Rate Limiting:

	too many debug messages may overflow console/buffer, etc.

	possible problem scenario:

	you have a fatal error message:  my network card is KERFLUUEY!

	you receive 100000 interrupts because you are being attacked ...

	(formally the problem of livelock is being exacerbated because
	every interrupt causes a print message and klogd/syslogd to try
	to log a message to a file).

	1. you can set a flag in the driver to say:  I have complained about
	this, and I don't need to do it twice (or 5 times).

	2. kernel has printk_ratelimit(void);

	if (printk_ratelimit()) {
		printk(KERN_NOTICE "The printer is still on fire\n");
	}

	if a threshold (settable with sysctl) is exceeded, messages are dropped.

	or

	/proc/sys/kernel/printk_ratelimit - # of secs to wait before reenabling msgs
	/proc/sys/kernel/printk_ratelimit_burst - # of msgs accepted before rate limiting.

Printing device numbers

	<linux/kdev_t.h> has utility macros for major/minor #.

	see p. 82

	int print_dev_t takes dev 2-tuple and encodes in string buffer.
		returns count of chars in buffer.

	char *format_dev_t takes dev 2-tuple and encodes in string buffer.
		returns modified buffer

	buffers should be provided, and should be at least 20 bytes long

Debugging by querying

	EVERY LINE PRINTED CAUSES FILE I/O (or network i/o ...)

	printk can slow down the system so much that race bugs
		are masked.

	device driver writer knows this rule:

	it ain't debugged 'til the printfs are *off* ...

	so one might query the system (pull rather than push) as opposed
	to having the system blab on every event

	"just the facts, ma'am".

	three ideas here:
 
	1. put a "dynamic" file in /proc somewhere; that is,
	you write out the values of variables.   

	e.g.,
	cd /proc/net/arp

	ps/top/uptime and many other utilities read /proc
	for kernel state

	cat arp ... you get the arp table ... (arp(8) uses this)

	driver can create read-only proc file ... function
	is called in driver when /proc/mycode is read.

	see create_proc_read_entry, p. 105 for registration
		of a function to be called when file is read

	see read_proc entry point p. 104. for the function
		itself

	2. use ioctls (to do about anything ... this is a device
	driver entry point, and you can put any kind of state machine
	you want in with ioctls for moving data to/from the kernel.

	3. use sysfs, see chap. 14.

using the /proc filesystem

	/proc exists for kernel to export info to world.

	may import in some cases (read/write, but assume readonly)

	entries may be dynamically inserted by modules, removed at rmmod time

	new recommended way is believed to be sysfs, as /proc is thought to be a mess.

	too late ...

	<linux/proc_fs.h>

	overview:

	1. you must register a proc file
	2. you must delete your proc files when done
	3. you must pass a callback function in at registration that will process
		i/o

	when a read occurs from proc device, kernel allocates a PAGE of memory,
	and calls your registered:

	read_proc function,

	see p. 84

	function returns directly: # of bytes output.

	*start is NULL to start with,  you advance it if and only if you have a logical
		offset.

	see code on p. 85

creating your own /proc file

	see code p. 86, create_proc_read_entry registration function

	note deregistration routine as well.

	proc problems include:

		. no reference count, therefore /proc file can be removed when in use.

		. no attempt to distinquish two instances ...

seq_file interface

	implementation of large file dumps is harder.

	seq_file assumes you are stepping thru a large file by a sequence of items

	you must create "iterator" object, see p. 87 for details.

	need 4 iterator methods:

	1. start
	2. next
	3. stop
	4. show
	
	how a sequence is implemented is up to the module.  bytes/structures whatever.
	book sez: may be viewed as a "cursor" on a byte array.

	use this if you have more complex data structures or more output

the ioctl method

	can easily read/write int variables or structures too and from the kernel.

	app program must be written and kept in sync with the kernel

	trade-off:

	ioctl is less obvious than /proc

debugging by watching. p. 108

	basic idea: track behavior of user-level apps; i.e., 
		watch them.

	app-level printfs/debug/syscall-call tracing all possible methodologies

	strace command.  prints out info about system calls.
	Note you may need to install it.

	you may see errors on calling or on return values.

	try
	
	# strace ls

	find the open where "." is opened (note you may have any amount
	of strange load module init syscalls)

	# strace cat

	strace gets info from the kernel and therefore a prog. need not be
	compiled for debug.

	bottom line: strace useful for runtime syscall errors.  perror(3)
	may not be used or not verbose enough.

debugging system faults

	driver errors can result in system faults

	not likely to be a panic (fatal error)

	oops message possible.

	e.g., null pointer dereference will cause an oops
	message.

	or "stackicide"

	char *f;

	f = 0;
	*f = 1;

	page fault generated and oops dump occurs

	eip is likely most useful info.  

	and note: info in /var/log/messages

	see code p. 94 for examples of faults

	run the faulty module

	problem: if we are given a hex address,
	how to map hex address to C-level?

	1. 
	note: gdb can disassemble/dump instructions

	e.g., gdb can do
	(gdb) x/20i  (print as instructions ...)

	OR

	(gdb) x/20i fault_write

This GDB was configured as "i386-redhat-linux"...(no debugging symbols found)...
(gdb) x/20i
0x0 <fault_write>:      push   %ebp
0x1 <fault_write+1>:    mov    %esp,%ebp
0x3 <fault_write+3>:    mov    $0x0,%eax
	*********************************************** previous was the 
		instruction
0x8 <fault_write+8>:    pop    %ebp
0x9 <fault_write+9>:    ret    
0xa <fault_write+10>:   mov    %esi,%esi

or use the compiler for that matter

	cc -S foo.c gives foo.s
		
note: use the same make incantation else you don't get the same code
	of course.

next:
	stackicide example on p. 95

	note how oops message is different.
	note stack dump: EIP have ffffffff in them.  how
	did that get there? 

kernel must be built with CONFIG_KALLSYMS option to get
	function backtrace

system hangs:

	the system may experience a complete hang; aka "silent death".

	silent death is no fun at all.

	what happens if your module does this:

	while(1)
		;

	adding printk is perhaps the best option
		but WRITE TO THE CONSOLE  (you don't want buffering here)
		change the console_loglevel value

	though if in industry you might have to use a logic analyzer

	you can use schedule call or if lucky enough to be on SMP
		system, not totally toast   

	hint: system may hang keyboard or somesuch, but you can
		usually tell if the kernel is still scheduling
		or still working at the interrupt level.

		1. put up an xclock for the former
		2. ping it for the latter

	there can be a magic sysrequest (Alt-somekey) capability
		compiled into the kernel.  You must do this yourself
		(make it be in the kernel).

		see p. 97 for more info.

	sometimes a good idea to NOT have the X server enabled.

	file systems: journaling file systems really a good idea ...
		ext3/reiserfs > ext2

	NFS remote mount also good because your file system syncronization
	is in the remote NFS server, not your kernel

Debuggers/etc:
	
	possible flavors:;

	1.  use gdb to look at kernel.  can't set breakpoints, etc.
	can read variables (need to know some tricks).   You
	use the core dump feature, but it isn't a core dump.
	It's a magic kernel /proc file that simulates a core dump.

	2. kdb kernel debugger.  low-level assembly monitor.

	3. "kgdb" full kernel debug with remote system.  must
	use serial connection or TCP connection.  gdb "attach"
	function is used. (remote kernel is similar to attaching
	to a process that was not spawned by the debugger).

	4. odds/ends

gdb "core" debug option

	compile the kernel for debug.
	CONFIG_DEBUG_INFO

	# gdb /usr/src/linux/vmlinux /proc/kcore

	vmlinux is not the compressed boot version.  It
	is needed for symbol table info.

	gdb symbol_table  magic_memory_file

	magic_memory_file may be live kernel or dead core
	in the real world.


	/usr/src/linux/vmlinux is uncompressed kernel

	/proc/kcore - "core file", but in this case the kernel
		is live.

	kernel should be compiled with gdb and not stripped of symbols

	gdb does not have the normal abilities if a single computer
		is being used.
		cannot set breakpoints

	can disassemble

	you can use objdump foo.o to do this as well

	(gdb) p jiffies  etc. should work BUT.  can print structures,
	follow pointers.  

	due to the fact this is for core debug, gdb caches data, therefore
	live kernel variables may be cached (and wrong).

	(gdb) core-file /proc/kcore to flush gdb cache

	to load a loadable module into the debugger,
	see pp. 100-101

	1. debugger must be taught where in kernel address space
	module's text/data have been loaded.  need base addresses.

	/sys/module under /sysfs

	load scull -> /sys/module/scull/sections contains this info

	2. load gdb incantation as shown on p. 101
	
kdb kernel debugger

	on BSD you can use this, but on linux may not be available.

	machine-level debugger

	can get as patch from oss.sgi.com

	magic key (pause/break) gets you into the debugger

	do not leave this debugger/key in for "customer" systems

kgdb patch

	debug via 2 computers over serial line (this is the full meal
		deal)

	or over TCP for delux version

	full gdb debug breakpoints/single-step etc.

crash dump analyzers

	traditional unix kernel tool.

	kernel patches required.

	idea: kernel crash(oops) occurs, write memory/registers info
		to disk space (swap ...) you specify in advance

	at boot, you recover info and examine it.

user-mode linux

	runs on virtual linux machine
		process on top of that

	easy to use gdb

	has limited access to hardware (and how could you debug
		race conditions?)
 
dynamic probes

	debug tool allows "probe" (code in a specialized language)
	to be put anywhere in kernel.

	arbitrary code executed when you hit a certain execution point

	http://oss.software.ibm.com