Go to the previous, next section.

Database Design

@everyfooting Author: rootd // Editor: ensley // Texinfo: rootd @| @| 3 December 1994

The speed of the searcher is dependent on the format of the database. In addition, since our users can search for arbitrary substrings, we can't use hash tables effectively.

Remember that users can only search for filenames. Users cannot search for the other information on the ls-lR data line (such as the owner of a file). Note that every directory is also a file, so the file /usr/local/bin/X11 will not be scored as a "hit" if someone searches for "local" but the fact that there is a /usr/local/bin/X11 file implies that elsewhere in the filesystem there is a /usr/local file (which is actually a directory) which will be scored as a "hit".

Our goal is to make a small file that can be searched through easily, and have the output from that search allow us to quickly find all the information we need for our printout.

Information Contained in the Database

Here is an example of an ls -lR on a unix system:

total 160
drwxr-xr-x  3 rootd         512 Nov 22 16:27 blorf
-rwx--r-x  1 rootd       41342 Oct 23 19:34 texi2html

blorf:
total 1
-rw-r--r--  1 rootd           0 Nov 22 16:27 fudan
drwxr-xr-x  2 rootd         512 Nov 22 16:27 testfile

blorf/testfile:
total 0

The "total xxx" lines, as well as the blank lines, are useless to us. We need the following information so our search engine can determine if we have a "hit":

Filename

And we need the following additional information for our printout:

Ftp site's primary ip address
Ftp site's domain name
Permissions
Link count
Owner name
Size
Modification time and date

Directories

The INSTALLDIR/database-ng directory is where we keep our database. Inside we have four directories: data, direc, index, and nocaseindex. The index contains what we need for our case-sensitive searches: the filenames from the ls -lR listings. The nocaseindex contains the same filenames with all the upper case characters converted to lower-case. The data directories contains all the other information we need for our printout, except the directory. Instead it has a "directory number" which corresponds to a line number in a file in the direc directory.

File Names

Each directory contains one file for each ftp site. The filename is of the format:

domainname-ip address

For example, the data directory in our prototype server has the following files:

ftp.cs.pdx.edu-131.252.21.12
wuarchive.wustl.edu-128.252.135.4
ftp.ee.pdx.edu-131.252.20.155

Remember that there is one file for each ftp site in each of the directories, so our other three directories (direc, index, and nocaseindex) all have the exact same filenames.

The Index File

The index file contains the filenames from the ls -lR of the indexed ftp site. Here are the first five lines from the file INSTALLDIR/database-ng/index/ftp.cs.pdx.edu-131.252.21.12 :

README
README-Stats
bin
etc
lost+found

Lines are separated by newlines. The last line is terminated with a newline.

This file has all the information that our searcher needs to search for case sensitive substring searches, exact searches, and regular expression searches. Typically, this file is about 25% of the size of the original ls -lR file.

The Nocaseindex File

This index file is identical to the above file, except that all upper case characters are converted to lower case. This makes our case insensitive search engine more efficient (because it doesn't need to test for multiple cases at search time).

Here is are the first five lines from the nocaseindex file for our example:

readme
readme-stats
bin
etc
lost+found

Note that the line numbers correspond between these files. This is very important because our search engine will use the line number (starting the count at 0) to index the corresponding file in the data directory.

Data file

The data file consists of all the data we need to print out about each file, except the directory. Instead of a directory, we have a directory number which gives the line number of the directory in the corresponding INSTALLDIR/database-ng/direc file.

Here are the corresponding five lines of our data file for the above example:

-rw-r--r--    1         0         1       402  Mar    7  1994    0
-rw-r--r--    1         0         1     21285  Oct   15 07:51    0
d--x--x--x    2         0         1       512  Feb   16  1994    0
d--x--x--x    2         0         1       512  Feb   16  1994    0
drwxr-xr-x    2         0         0      8192  Feb   16  1994    0

The columns(and their sizes, in characters) are: permissions(10), link count(5), owner(10), group(10), size(10), month(5), day(5), year/time(6), and directory number(5). Including the newline at the end, one line is exactly 67 bytes long.

Since the searcher counts line numbers as it searches through the index, and since the line numbers correspond between the index and the data, the searcher can use lseek() to go directly to the correct line number after a successful search.

Direc File

Since directory names have widely varying sizes, we cannot include them in the above file without eliminating the fixed line size. As a result, we keep the directory names in a separate file (one per line) and have the directory number index the line number.

Here are the first five lines from the corresponding file in the direc directory:

/
lost+found
pub
pub/NeXT
pub/NeXT/sounds
pub/aber

Since the directory number was "0" for all the files in our example, they correspond to the directory in line number 0 of our direc file, so all those files are in the root directory. If we had a file in directory number 2, then it would be in the pub directory.

In our prototype, the directory file is typically 3% of the size of our original ls -lR file.

Go to the previous, next section.