This module has one purpose: to search the database. The searcher is spawned by the queuer during archie server initialization. Since it is spawned with popen(), the searcher simply reads from standard input. In addition, the searcher can get the queuer's PID with the getppid() system call, enabling it to send the queuer signals.
In our current design, our searcher program spawns another search program specific to the type of search, so we have exact, subcase, sub, and regex programs. Our searcher program uses the C shell to set up the command-line parameters for the specific search module, making the specific search module's job as simple as possible.
Although spawning processes is somewhat computationally intensive, it only takes a fraction of a second on a SPARC, which is insignificant compared to our goal of 10 seconds per search. We chose to spawn a separate process because it allows us to have a specialized search program for each type of search. Note that our other modules have no idea how our searcher searches for data, so we can change this module's internal operation without fear of damaging the rest of the archie server. This is critically important because the searcher module is the limiting factor on the speed of archie searches. We expect to gradually optimize the search algorithm as time progresses.
Each time the queuer requests a search, it sends the following information:
Here is part of the INSTALLDIR/src/include/queuer-searcher.h file, which shows how the search types are defined:
/* Search Types */
#define EXACT     2
#define SUBCASE   3
#define SUBSTRING 4
#define REGEXP    5
#define WHATIS    6
The information is passed as ASCII text separated by newlines, so if we stored this information in a file, a typical search might look like this:
3
20
README
123.3
This indicates a case-sensitive substring search for the string README, with a maximum number of hits of 20 and a unique search identifier of 123.3.
There is one command-line option for the searcher: -d. That is the debug flag, used during testing of the searcher module (since the searcher reads from standard input, we can perform some tests by redirecting "search" commands from a file). The debug option suppresses the signal sent to the searcher's parent (usually the queuer), so that when we test the searcher by redirecting input from the command line, we won't send signals to our shell (causing it to terminate).
When a search is complete, the searcher creates a file in the INSTALLDIR/soutput directory with the unique search identifier as the name. The file contains the output of the search. The following box shows an example output:
Host hilbert.maths.utas.edu.au

    Location: /pub/linux/slackware/contents
      FILE -r--r--r-- 1672 Sep 15 02:28 gcc
    Location: /pub/linux/slackware/contents/scripts
      FILE -r--r--r-- 146 Sep 15 08:31 gcc

Host ftp.apple.com

    Location: /dts/mac/tools
      DIRECTORY drwxr-xr-x 512 Nov 21 1993 gcc
After creating and closing this file, the searcher will send a SIGUSR1 signal to its parent, indicating that the search is complete.
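The completion signal can be sketched in a few lines of C. Here searcher_notify() is a hypothetical helper name (not from the prototype), and the debug test mirrors the -d flag described above:

```c
#include <signal.h>
#include <unistd.h>

/* Hypothetical helper: tell the parent (normally the queuer) that the
   search output file is complete.  When the debug flag (-d) is set we
   suppress the signal, so redirected-input tests don't kill the shell. */
int searcher_notify(int debug)
{
    if (debug)
        return 0;                     /* -d: stay quiet during tests */
    return kill(getppid(), SIGUSR1);  /* 0 on success, -1 on error */
}
```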
Since our search requests come from standard input, we can use stdio to help parse the input. Here is the input parser code from our prototype:
retval = scanf("%s", chartype);
if (retval < 1)
    exit(0);
retval = scanf("%s", onehits);
if (retval < 1)
    exit(0);
retval = scanf("%s\n", onestring);
if (retval < 1)
    exit(0);
retval = scanf("%s", oneid);
if (retval < 1)
    exit(0);
onetype = atoi(chartype);
This reads in the four fields of search information; since %s is delimited by whitespace, the parser handles them no matter how they are divided among lines, without blocking while waiting for additional lines.
Our prototype searcher actually spawns a separate search program, one for each type of search. This is somewhat inefficient, but not horribly so; thanks to our modularization we can first implement it so it works correctly, and later we can improve it.
What we do is exec the real searcher program with the following command line arguments:
With the output redirected to INSTALLDIR/soutput/IDNAME
Where IDNAME is the unique id of the search. Generating the name of each index file is easy with the C shell: we form a command string whose third argument is INSTALLDIR/database-ng/INDEXDIR/*, and we have a function which uses the C shell to expand that into every file in the directory.
This function is used by the searcher to spawn the appropriate actual search program. It is based heavily on the system() source code given in Richard Stevens' Advanced Programming in the UNIX Environment book.
Here is the source code from our prototype:
int csystem(const char *cmdstring)
{
    int pid;
    int status;

    if (cmdstring == NULL)
        return (1);

    if ((pid = fork()) < 0) {
        status = -1;
    } else if (pid == 0) {
        execl("/bin/csh", "csh", "-c", cmdstring, (char *) 0);
        _exit(127);
    } else {
        while (waitpid(pid, &status, 0) < 0)
            if (errno != EINTR) {
                status = -1;
                break;
            }
    }
    return (status);
}
We need to generate a string to pass to our csystem() function. If INSTALLDIR were /disk/camenbert/archie/server, our search an exact search, our unique identifier 123.3, and our number of hits 20, then the command looks like this:
exact README 20 /disk/camenbert/archie/server/database-ng/index/* > /disk/camenbert/archie/server/soutput/123.3
Note that since this is an exact search, we use the files in the index directory to search through (as opposed to the nocaseindex directory).
Here is some of the code from our prototype that partially produced the command; it shows how we concatenate strings to build the command and then call the csystem() function.
Just above these lines we can use a switch statement to put the appropriate program name into the command (we use the names "exact", "subcase", "sub", and "regex").
(void) strcat(command, onestring);
(void) strcat(command, " ");
(void) strcat(command, onehits);
(void) strcat(command, " ");
(void) strcat(command, "/disk/camenbert/archie/server");
(void) strcat(command, "/database-ng/index/* > ");
(void) strcat(command, "/disk/camenbert/archie/server");
(void) strcat(command, "/soutput/");
(void) strcat(command, oneid);

if ((status = csystem(command)) < 0)
    err_sys("csystem() error");
pr_exit(status);
Note that our prototype always put "index" into the command (as opposed to choosing between index and nocaseindex), and had /disk/camenbert/archie/server hard-coded.
Now our searcher subprograms have been called with standard output going to the destination output file, and with the following command line parameters:
The searcher programs each follow the following algorithm:
Then the following algorithm is followed:
repeat
    pop the next command line argument (a filename to search)
    open the file
    repeat
        read one line from the index file
        search for the search-string in the line
        if you find a hit:
            obtain the file information from the rest of the database
            print out the file information in the correct format
            decrement the hits integer
            if the hits integer is zero, exit
    until EOF
    close the file
until all command line arguments are expended (all files are searched)
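The inner loop above can be sketched in C. Here search_index_file() and line_matches() are hypothetical names, and printing the full file information is elided to a comment, since that step depends on the rest of the database format:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: does one index line contain the search string? */
int line_matches(const char *line, const char *needle)
{
    return strstr(line, needle) != NULL;
}

/* Search one index file, decrementing *hits on each match.
   Returns -1 if the file cannot be opened, 0 otherwise. */
int search_index_file(const char *path, const char *needle, int *hits)
{
    char line[1024];
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    while (*hits > 0 && fgets(line, sizeof line, fp) != NULL) {
        if (line_matches(line, needle)) {
            /* real code obtains and prints the full file information */
            fputs(line, stdout);
            (*hits)--;
        }
    }
    fclose(fp);
    return 0;
}
```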
Note that the "parent" searcher program waits for its child to exit, then sends SIGUSR1 to its parent (the queuer), and then the parent searcher program reads the next search request from standard input.
Now for some information on the algorithms used by each specific module to determine if there is a hit.
This program just reads each input file one line at a time, and does a strncmp() between the search string and the line (by using strncmp instead of strcmp, we guarantee that we won't have to worry about comparing the terminating newline to the terminating null character).
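The comparison can be sketched as follows. The end-of-line check is our addition, not part of the described prototype: strncmp() alone, over the search string's length, never touches the trailing newline, but it would also accept longer names that merely begin with the search string.

```c
#include <string.h>

/* Sketch of the exact comparison: strncmp() over the search string's
   length avoids comparing the line's trailing newline against the
   search string's terminating null.  The end-of-line check (our
   addition) rejects names that only start with the search string. */
int exact_match(const char *line, const char *search)
{
    size_t n = strlen(search);

    return strncmp(line, search, n) == 0 &&
           (line[n] == '\n' || line[n] == '\0');
}
```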
The exact searcher could be incredibly fast by setting up a dedicated sorted index file, but since most searches are substring searches, and exact searches are already fast, this is unnecessary.
This searcher uses the mmap() system call to map the file into memory. The program then loops through the entire file, doing the following at each character:
if the current character is a newline
    increment the line count
    increment the character pointer
    set curlinestart to the character pointer
else if strncmp(search string, current pointer) reports a match
    we have a hit, so find and print out the data, then go to the next line
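That loop, applied to a buffer as mmap() would return it, can be sketched in C. Here scan_index() is a hypothetical name, and finding and printing the file data is reduced to counting hits:

```c
#include <string.h>

/* Scan an in-memory index (as returned by mmap()): count lines, and
   report each position where the search string begins.  The real
   searcher would use the line number to locate the file information. */
long scan_index(const char *buf, size_t len, const char *search)
{
    size_t slen = strlen(search);
    long line = 1, hitcount = 0;
    const char *p = buf, *end = buf + len;

    while (p < end) {
        if (*p == '\n') {
            line++;
            p++;
        } else if ((size_t)(end - p) >= slen &&
                   strncmp(search, p, slen) == 0) {
            hitcount++;            /* real code prints the file data here */
            while (p < end && *p != '\n')
                p++;               /* skip to the next line */
        } else {
            p++;
        }
    }
    return hitcount;
}
```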
Since we have separate index directories for case sensitive and case insensitive searches, if we convert our upper-case search string characters to lower case, and then use the nocaseindex file instead of the index file during our search, we can use the exact same algorithm for the case insensitive searcher as for our case sensitive one. This is one of the largest advantages of our database design.
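The case folding amounts to one pass over the search string. folded() is a hypothetical demonstration helper (the prototype would fold in place); the folded string is then run through the unchanged case-sensitive scanner against the nocaseindex files:

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical demo helper: return a lower-cased copy of the search
   string (in a static buffer), ready for use against the nocaseindex
   files with the unchanged case-sensitive search code. */
const char *folded(const char *s)
{
    static char buf[256];
    size_t i;

    for (i = 0; s[i] != '\0' && i < sizeof buf - 1; i++)
        buf[i] = (char) tolower((unsigned char) s[i]);
    buf[i] = '\0';
    return buf;
}
```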
This is to be determined, although rather than reinventing the wheel we plan to grab source code from GNU's egrep facility for this purpose.
In order to search for a particular string, we need to perform the following steps:
Our goal is to be able to perform a substring search in under 10 seconds.
The time required to perform steps 1, 2, 4, and 5 is insignificant (and constant relative to the size of the database, which is increasing). Step 3, reading and searching each database file, is our rate-limiting step.
mmap() is about the fastest way to read in a file on a UNIX system. It works at the speed of virtual memory (in effect, the file becomes virtual memory, and the char * the mmap() call returns simply addresses that memory).
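Mapping one index file takes only a few calls; map_index_file() is a hypothetical name for this sketch:

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

/* Hypothetical helper: map an index file read-only into memory.
   On success *lenp holds the file size and the returned pointer is
   ready for a character-by-character scan; returns NULL on failure. */
char *map_index_file(const char *path, size_t *lenp)
{
    struct stat st;
    char *base;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    base = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                     /* the mapping outlives the descriptor */
    if (base == MAP_FAILED)
        return NULL;
    *lenp = (size_t) st.st_size;
    return base;
}
```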
We minimize the size of the file to be read by separating the database into sections we have to search and sections we only look at if we find a hit. As a result, instead of searching an entire ls -lR of an anonymous ftp site, we only search the filenames. This reduces the size of the files we need to open by 70%.
The inefficient part of this system is the number of files that we map into memory with the mmap() system call. If we were to create one large file with all of the filenames which we need to search, it would be faster.
Unfortunately, designing the system to have one large database file is difficult for several reasons:
Creating one large database file is feasible, however, and may be attempted in a later version of the free-archie server.
We search each file by iterating through the file and doing a strncmp at each character. The speed of the process is based on the:
We have already taken steps to optimize our performance. By listing only the filenames in the files through which we search, we decreased the size of the files by 70%. By having separate index files for case-sensitive and case-insensitive searching, we made certain that we only needed one comparison against the search-string at each character. Unfortunately, in our current system we have many separate files which we map into memory and search. Under some circumstances we look at a particular character multiple times.
There are two approaches which might further increase the speed of this type of search:
Unfortunately, creating an ordered index will not completely work. The problem is that we need to search for arbitrary substrings--not just strings starting with the first character. As a result, we cannot just order our strings based on the first character. In order to gain the full benefit of an ordered index, we need to put multiple entries into the index for each filename: one entry ordered by the first character of the filename, one entry ordered by the second character of the filename, etc... This would increase the size of our database-index by an order of magnitude.
There is hope, however, for the ordered index approach. If we order the index by the first character in the filename, we can eliminate duplicate names. Since a large number of files are duplicated on anonymous ftp sites, this would decrease the size of our database significantly. Using an ordered index in one gigantic database file would produce an even greater savings--since a high percentage of filenames are duplicated on multiple ftp sites.
The deterministic finite automaton

We currently test each character multiple times. We test to see if it is a newline, then we test to see if it matches the first character of our search-string (in the strncmp function). If the first character matches, then we go to the next character in the index file and see if it matches the second character of the search-string. Later, we test the next character again, to see if it matches the first character of our search-string.
This is very inefficient. If we have our searcher generate a deterministic finite automaton for each search-string (including the code to handle newlines), our search speed will improve. Code which does this is included in some publicly available grep programs. Some customization would be necessary to include the functionality to count newlines.
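As a sketch of the automaton idea, here is a single-pass matcher using the simpler KMP-style failure function rather than egrep's full DFA construction; dfa_count() is a hypothetical name, and resetting the state at newlines stands in for the newline-handling code mentioned above:

```c
#include <string.h>

/* Sketch of the automaton idea: every character of the index is
   examined exactly once, with the newline test folded into the same
   pass.  Matches may not span lines (each line is one filename), so
   the match state resets at each newline.  Returns the hit count,
   or -1 for an empty or oversized pattern. */
long dfa_count(const char *buf, size_t len, const char *pat)
{
    size_t fail[256];                   /* assume pattern shorter than 256 */
    size_t m = strlen(pat), i, k;
    long lines = 1, hitcount = 0;

    if (m == 0 || m >= sizeof fail / sizeof fail[0])
        return -1;

    fail[0] = 0;                        /* build the failure function */
    for (i = 1; i < m; i++) {
        k = fail[i - 1];
        while (k > 0 && pat[i] != pat[k])
            k = fail[k - 1];
        if (pat[i] == pat[k])
            k++;
        fail[i] = k;
    }

    for (k = 0, i = 0; i < len; i++) {  /* single pass over the index */
        if (buf[i] == '\n') {
            lines++;                    /* real code uses the line count
                                           to locate the file data */
            k = 0;
            continue;
        }
        while (k > 0 && buf[i] != pat[k])
            k = fail[k - 1];
        if (buf[i] == pat[k])
            k++;
        if (k == m) {                   /* hit: count it, keep scanning */
            hitcount++;
            k = fail[k - 1];
        }
    }
    return hitcount;
}
```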