This module has one purpose: to search the database. The searcher is spawned by the queuer during archie server initialization. Since it is spawned with popen(), the searcher simply reads from standard input. In addition, the searcher can get the queuer's PID with the getppid() system call, enabling it to send the queuer signals.
In our current design, our searcher program spawns another search program specific to the type of search, so we have exact, subcase, sub, and regex programs. Our searcher program uses the C shell to set up the command-line parameters for the specific search module, making the specific search module's job as simple as possible.
Although spawning processes is somewhat computationally intensive, it only takes a fraction of a second on a SPARC, which is insignificant compared to our goal of 10 seconds per search. We chose to spawn a separate process because it allows us to have a specialized search program for each type of search. Note that our other modules have no idea how our searcher searches for data, so we can change this module's internal operation without fear of damaging the rest of the archie server. This is critically important because the searcher module is the limiting factor on the speed of archie searches. We expect to gradually optimize the search algorithm as time progresses.
Each time the queuer requests a search, it sends the following information:
Here is part of the INSTALLDIR/src/include/queuer-searcher.h file, which shows how the search types are defined:
/* Search Types */
#define EXACT     2
#define SUBCASE   3
#define SUBSTRING 4
#define REGEXP    5
#define WHATIS    6
The information is passed as ASCII text separated by newlines, so if we stored this information in a file, a typical search might look like this:
3
20
README
123.3
This indicates a case-sensitive substring search for the string README, with a maximum number of hits of 20 and a unique search identifier of 123.3.
There is one command-line option for the searcher: -d. That is the debug flag, used during testing of the searcher module (since the searcher reads from standard input, we can perform some tests by redirecting "search" commands from a file). The debug option suppresses the signal sent to the searcher's parent (usually the queuer), so that when we test the searcher by redirecting input from the command line, we won't send signals to our shell (causing it to terminate).
When a search is complete, the searcher creates a file in the INSTALLDIR/soutput directory with the unique search identifier as the name. The file contains the output of the search. The following box shows an example output:
Host hilbert.maths.utas.edu.au

    Location: /pub/linux/slackware/contents
      FILE -r--r--r-- 1672 Sep 15 02:28 gcc
    Location: /pub/linux/slackware/contents/scripts
      FILE -r--r--r-- 146 Sep 15 08:31 gcc

Host ftp.apple.com

    Location: /dts/mac/tools
      DIRECTORY drwxr-xr-x 512 Nov 21 1993 gcc
After creating and closing this file, the searcher will send a SIGUSR1 signal to its parent, indicating that the search is complete.
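The completion signal can be sketched in a few lines of C. Here searcher_notify() is a hypothetical helper name (not from the prototype), and the debug test mirrors the -d flag described above:

```c
#include <signal.h>
#include <unistd.h>

/* Hypothetical helper: tell the parent (normally the queuer) that the
   search output file is complete.  When the debug flag (-d) is set we
   suppress the signal, so redirected-input tests don't kill the shell. */
int searcher_notify(int debug)
{
    if (debug)
        return 0;                     /* -d: stay quiet during tests */
    return kill(getppid(), SIGUSR1);  /* 0 on success, -1 on error */
}
```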
Since our search requests come from standard input, we can use stdio to help parse the input. Here is the input parser code from our prototype:
retval = scanf("%s", chartype);
if (retval < 1)
    exit(0);
retval = scanf("%s", onehits);
if (retval < 1)
    exit(0);
retval = scanf("%s\n", onestring);
if (retval < 1)
    exit(0);
retval = scanf("%s", oneid);
if (retval < 1)
    exit(0);
onetype = atoi(chartype);
This reads in the four fields of search information; since %s is delimited by whitespace, the parser handles them no matter how they are divided among lines, without blocking while waiting for additional lines.
Our prototype searcher actually spawns a separate search program, one for each type of search. This is somewhat inefficient, but not horribly so; thanks to our modularization we can first implement it so it works correctly, and later we can improve it.
What we do is exec the real searcher program with the following command line arguments:
With the output redirected to INSTALLDIR/soutput/IDNAME
Where IDNAME is the unique id of the search. Generating the name of each index file is easy with the C shell: we form a command string whose third argument is INSTALLDIR/database-ng/INDEXDIR/*, and we have a function which uses the C shell to expand that into every file in the directory.
This function is used by the searcher to spawn the appropriate actual search program. It is based heavily on the system() source code given in Richard Stevens' Advanced Programming in the UNIX Environment book.
Here is the source code from our prototype:
int csystem(const char *cmdstring)
{
    int pid;
    int status;

    if (cmdstring == NULL)
        return (1);

    if ((pid = fork()) < 0) {
        status = -1;
    } else if (pid == 0) {
        execl("/bin/csh", "csh", "-c", cmdstring, (char *) 0);
        _exit(127);
    } else {
        while (waitpid(pid, &status, 0) < 0)
            if (errno != EINTR) {
                status = -1;
                break;
            }
    }
    return (status);
}
We need to generate a string to pass to our csystem() function. If INSTALLDIR were /disk/camenbert/archie/server, our search an exact search, our unique identifier 123.3, and our number of hits 20, then the command looks like this:
exact README 20 /disk/camenbert/archie/server/database-ng/index/* > /disk/camenbert/archie/server/soutput/123.3
Note that since this is an exact search, we use the files in the index directory to search through (as opposed to the nocaseindex directory).
Here is some of the code from our prototype that partially produced the command; it shows how we concatenate strings to build the command and then call the csystem() function.
Just above these lines we can use a switch statement to put the appropriate program name into the command (we use the names "exact", "subcase", "sub", and "regex").
(void) strcat(command, onestring);
(void) strcat(command, " ");
(void) strcat(command, onehits);
(void) strcat(command, " ");
(void) strcat(command, "/disk/camenbert/archie/server");
(void) strcat(command, "/database-ng/index/* > ");
(void) strcat(command, "/disk/camenbert/archie/server");
(void) strcat(command, "/soutput/");
(void) strcat(command, oneid);

if ((status = csystem(command)) < 0)
    err_sys("csystem() error");
pr_exit(status);
Note that our prototype always put "index" into the command (as opposed to choosing between index and nocaseindex), and had /disk/camenbert/archie/server hard-coded.
Now our searcher subprograms have been called with standard output going to the destination output file, and with the following command line parameters:
The searcher programs each follow the following algorithm:
Then the following algorithm is followed:
repeat
    pop the next command line argument (a filename to search)
    open the file
    repeat
        read one line from the index file
        search for the search-string in the line
        if you find a hit:
            obtain the file information from the rest of the database
            print out the file information in the correct format
            decrement the hits integer
            if the hits integer is zero, exit
    until EOF
    close the file
until all command line arguments are expended (all files are searched)
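The inner loop above can be sketched in C. Here search_index_file() and line_matches() are hypothetical names, and printing the full file information is elided to a comment, since that step depends on the rest of the database format:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: does one index line contain the search string? */
int line_matches(const char *line, const char *needle)
{
    return strstr(line, needle) != NULL;
}

/* Search one index file, decrementing *hits on each match.
   Returns -1 if the file cannot be opened, 0 otherwise. */
int search_index_file(const char *path, const char *needle, int *hits)
{
    char line[1024];
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    while (*hits > 0 && fgets(line, sizeof line, fp) != NULL) {
        if (line_matches(line, needle)) {
            /* real code obtains and prints the full file information */
            fputs(line, stdout);
            (*hits)--;
        }
    }
    fclose(fp);
    return 0;
}
```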
Note that the "parent" searcher program waits for its child to exit, then sends SIGUSR1 to its parent (the queuer), and then the parent searcher program reads the next search request from standard input.
Now for some information on the algorithms used by each specific module to determine if there is a hit.
This program just reads each input file one line at a time, and does a strncmp() between the search string and the line (by using strncmp instead of strcmp, we guarantee that we won't have to worry about comparing the terminating newline to the terminating null character).
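The comparison can be sketched as follows. The end-of-line check is our addition, not part of the described prototype: strncmp() alone, over the search string's length, never touches the trailing newline, but it would also accept longer names that merely begin with the search string.

```c
#include <string.h>

/* Sketch of the exact comparison: strncmp() over the search string's
   length avoids comparing the line's trailing newline against the
   search string's terminating null.  The end-of-line check (our
   addition) rejects names that only start with the search string. */
int exact_match(const char *line, const char *search)
{
    size_t n = strlen(search);

    return strncmp(line, search, n) == 0 &&
           (line[n] == '\n' || line[n] == '\0');
}
```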
The exact searcher could be incredibly fast by setting up a dedicated sorted index file, but since most searches are substring searches, and exact searches are already fast, this is unnecessary.
This searcher uses the mmap() system call to map the file into memory. The program then loops through the entire file, doing the following at each character:
if the current character is a newline
    increment the line count
    increment the character pointer
    set curlinestart to the character pointer
else if strncmp(search string, current pointer) reports a match
    we have a hit, so find and print out the data, then go to the next line
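That loop, applied to a buffer as mmap() would return it, can be sketched in C. Here scan_index() is a hypothetical name, and finding and printing the file data is reduced to counting hits:

```c
#include <string.h>

/* Scan an in-memory index (as returned by mmap()): count lines, and
   report each position where the search string begins.  The real
   searcher would use the line number to locate the file information. */
long scan_index(const char *buf, size_t len, const char *search)
{
    size_t slen = strlen(search);
    long line = 1, hitcount = 0;
    const char *p = buf, *end = buf + len;

    while (p < end) {
        if (*p == '\n') {
            line++;
            p++;
        } else if ((size_t)(end - p) >= slen &&
                   strncmp(search, p, slen) == 0) {
            hitcount++;            /* real code prints the file data here */
            while (p < end && *p != '\n')
                p++;               /* skip to the next line */
        } else {
            p++;
        }
    }
    return hitcount;
}
```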
Since we have separate index directories for case sensitive and case insensitive searches, if we convert our upper-case search string characters to lower case, and then use the nocaseindex file instead of the index file during our search, we can use the exact same algorithm for the case insensitive searcher as for our case sensitive one. This is one of the largest advantages of our database design.
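The case folding amounts to one pass over the search string. folded() is a hypothetical demonstration helper (the prototype would fold in place); the folded string is then run through the unchanged case-sensitive scanner against the nocaseindex files:

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical demo helper: return a lower-cased copy of the search
   string (in a static buffer), ready for use against the nocaseindex
   files with the unchanged case-sensitive search code. */
const char *folded(const char *s)
{
    static char buf[256];
    size_t i;

    for (i = 0; s[i] != '\0' && i < sizeof buf - 1; i++)
        buf[i] = (char) tolower((unsigned char) s[i]);
    buf[i] = '\0';
    return buf;
}
```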
This is to be determined, although rather than reinventing the wheel we plan to grab source code from GNU's egrep facility for this purpose.
In order to search for a particular string, we need to perform the following steps:
Our goal is to be able to perform a substring search in under 10 seconds.
The time required to perform steps 1, 2, 4, and 5 is insignificant (and constant relative to the size of the database, which is increasing). Step 3, reading and searching each database file, is our rate-limiting step.
mmap() is about the fastest way to read in a file on a UNIX system. It works at the speed of virtual memory (in effect, the file becomes virtual memory, and the char * the mmap() call returns simply addresses that memory).
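Mapping one index file takes only a few calls; map_index_file() is a hypothetical name for this sketch:

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

/* Hypothetical helper: map an index file read-only into memory.
   On success *lenp holds the file size and the returned pointer is
   ready for a character-by-character scan; returns NULL on failure. */
char *map_index_file(const char *path, size_t *lenp)
{
    struct stat st;
    char *base;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    base = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                     /* the mapping outlives the descriptor */
    if (base == MAP_FAILED)
        return NULL;
    *lenp = (size_t) st.st_size;
    return base;
}
```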
We minimize the size of the file to be read by separating the database into sections we have to search and sections we only look at if we find a hit. As a result, instead of searching an entire ls -lR of an anonymous ftp site, we only search the filenames. This reduces the size of the files we need to open by 70%.
The inefficient part of this system is the number of files that we map into memory with the mmap() system call. If we were to create one large file with all of the filenames which we need to search, it would be faster.
Unfortunately, designing the system to have one large database file is difficult for several reasons:
Creating one large database file is feasible, however, and may be attempted in a later version of the free-archie server.
We search each file by iterating through the file and doing a strncmp at each character. The speed of the process is based on the:
We have already taken steps to optimize our performance. By listing only the filenames in the files through which we search, we decreased the size of the files by 70%. By having separate index files for case-sensitive and case-insensitive searching, we made certain that we only needed one comparison against the search-string at each character. Unfortunately, in our current system we have many separate files which we map into memory and search. Under some circumstances we look at a particular character multiple times.
There are two approaches which might further increase the speed of this type of search:
Unfortunately, creating an ordered index will not completely work. The problem is that we need to search for arbitrary substrings--not just strings starting with the first character. As a result, we cannot just order our strings based on the first character. In order to gain the full benefit of an ordered index, we need to put multiple entries into the index for each filename: one entry ordered by the first character of the filename, one entry ordered by the second character of the filename, etc... This would increase the size of our database-index by an order of magnitude.
There is hope, however, for the ordered index approach. If we order the index by the first character in the filename, we can eliminate duplicate names. Since a large number of files are duplicated on anonymous ftp sites, this would decrease the size of our database significantly. Using an ordered index in one gigantic database file would produce an even greater savings--since a high percentage of filenames are duplicated on multiple ftp sites.
The deterministic finite automaton

We currently test each character multiple times. We test to see if it is a newline, then we test to see if it matches the first character of our search-string (in the strncmp function). If the first character matches, then we go to the next character in the index file and see if it matches the second character of the search-string. Later, we test the next character again, to see if it matches the first character of our search-string.
This is very inefficient. If we have our searcher generate a deterministic finite automaton for each search-string (including the code to handle newlines), our search speed will improve. Code which does this is included in some publicly available grep programs. Some customization would be necessary to include the functionality to count newlines.
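As a sketch of the automaton idea, here is a single-pass matcher using the simpler KMP-style failure function rather than egrep's full DFA construction; dfa_count() is a hypothetical name, and resetting the state at newlines stands in for the newline-handling code mentioned above:

```c
#include <string.h>

/* Sketch of the automaton idea: every character of the index is
   examined exactly once, with the newline test folded into the same
   pass.  Matches may not span lines (each line is one filename), so
   the match state resets at each newline.  Returns the hit count,
   or -1 for an empty or oversized pattern. */
long dfa_count(const char *buf, size_t len, const char *pat)
{
    size_t fail[256];                   /* assume pattern shorter than 256 */
    size_t m = strlen(pat), i, k;
    long lines = 1, hitcount = 0;

    if (m == 0 || m >= sizeof fail / sizeof fail[0])
        return -1;

    fail[0] = 0;                        /* build the failure function */
    for (i = 1; i < m; i++) {
        k = fail[i - 1];
        while (k > 0 && pat[i] != pat[k])
            k = fail[k - 1];
        if (pat[i] == pat[k])
            k++;
        fail[i] = k;
    }

    for (k = 0, i = 0; i < len; i++) {  /* single pass over the index */
        if (buf[i] == '\n') {
            lines++;                    /* real code uses the line count
                                           to locate the file data */
            k = 0;
            continue;
        }
        while (k > 0 && buf[i] != pat[k])
            k = fail[k - 1];
        if (buf[i] == pat[k])
            k++;
        if (k == m) {                   /* hit: count it, keep scanning */
            hitcount++;
            k = fail[k - 1];
        }
    }
    return hitcount;
}
```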