Go to the previous, next section.
@everyfooting Author: rootd // Editor: ensley // Texinfo: rootd @| @| 3 December 1994
The speed of the searcher is dependent on the format of the database. In addition, since our users can search for arbitrary substrings, we can't use hash tables effectively.
Remember that users can only search for filenames. Users cannot search for the other information on the ls-lR data line (such as the owner of a file). Note that every directory is also a file, so the file /usr/local/bin/X11 will not be scored as a "hit" if someone searches for "local" but the fact that there is a /usr/local/bin/X11 file implies that elsewhere in the filesystem there is a /usr/local file (which is actually a directory) which will be scored as a "hit".
Our goal is to make a small file that can be searched through easily, and have the output from that search allow us to quickly find all the information we need for our printout.
Here is an example of an ls -lR on a unix system:
total 160 drwxr-xr-x 3 rootd 512 Nov 22 16:27 blorf -rwx--r-x 1 rootd 41342 Oct 23 19:34 texi2html blorf: total 1 -rw-r--r-- 1 rootd 0 Nov 22 16:27 fudan drwxr-xr-x 2 rootd 512 Nov 22 16:27 testfile blorf/testfile: total 0
The "total xxx" lines, as well as the blank lines, are useless to us. We need the following information so our search engine can determine if we have a "hit":
And we need the following additional information for our printout:
The INSTALLDIR/database-ng directory is where we keep our database. Inside we have four directories: data, direc, index, and nocaseindex. The index contains what we need for our case-sensitive searches: the filenames from the ls -lR listings. The nocaseindex contains the same filenames with all the upper case characters converted to lower-case. The data directories contains all the other information we need for our printout, except the directory. Instead it has a "directory number" which corresponds to a line number in a file in the direc directory.
Each directory contains one file for each ftp site. The filename is of the format:
domainname-ip address
For example, the data directory in our prototype server has the following files:
ftp.cs.pdx.edu-131.252.21.12 wuarchive.wustl.edu-128.252.135.4 ftp.ee.pdx.edu-131.252.20.155
Remember that there is one file for each ftp site in each of the directories, so our other three directories (direc, index, and nocaseindex) all have the exact same filenames.
The index file contains the filenames from the ls -lR of the indexed ftp site. Here are the first five lines from the file INSTALLDIR/database-ng/index/ftp.cs.pdx.edu-131.252.21.12 :
README README-Stats bin etc lost+found
Lines are separated by newlines. The last line is terminated with a newline.
This file has all the information that our searcher needs to search for case sensitive substring searches, exact searches, and regular expression searches. Typically, this file is about 25% of the size of the original ls -lR file.
This index file is identical to the above file, except that all upper case characters are converted to lower case. This makes our case insensitive search engine more efficient (because it doesn't need to test for multiple cases at search time).
Here is are the first five lines from the nocaseindex file for our example:
readme readme-stats bin etc lost+found
Note that the line numbers correspond between these files. This is very important because our search engine will use the line number (starting the count at 0) to index the corresponding file in the data directory.
The data file consists of all the data we need to print out about each file, except the directory. Instead of a directory, we have a directory number which gives the line number of the directory in the corresponding INSTALLDIR/database-ng/direc file.
Here are the corresponding five lines of our data file for the above example:
-rw-r--r-- 1 0 1 402 Mar 7 1994 0 -rw-r--r-- 1 0 1 21285 Oct 15 07:51 0 d--x--x--x 2 0 1 512 Feb 16 1994 0 d--x--x--x 2 0 1 512 Feb 16 1994 0 drwxr-xr-x 2 0 0 8192 Feb 16 1994 0
The columns(and their sizes, in characters) are: permissions(10), link count(5), owner(10), group(10), size(10), month(5), day(5), year/time(6), and directory number(5). Including the newline at the end, one line is exactly 67 bytes long.
Since the searcher counts line numbers as it searches through the index, and since the line numbers correspond between the index and the data, the searcher can use lseek() to go directly to the correct line number after a successful search.
Since directory names have widely varying sizes, we cannot include them in the above file without eliminating the fixed line size. As a result, we keep the directory names in a separate file (one per line) and have the directory number index the line number.
Here are the first five lines from the corresponding file in the direc directory:
/ lost+found pub pub/NeXT pub/NeXT/sounds pub/aber
Since the directory number was "0" for all the files in our example, they correspond to the directory in line number 0 of our direc file, so all those files are in the root directory. If we had a file in directory number 2, then it would be in the pub directory.
In our prototype, the directory file is typically 3% of the size of our original ls -lR file.
Go to the previous, next section.