Authors:

Gilles AUDOLY    -    Frank CELESTINI


ESSI 3rd year students, 1998-1999




PRESENTATION


Overview


What is Vagabond

Vagabond is a web search engine. A web search engine does its best to collect as many HTML documents as possible. The incoming documents are processed and stored in a database. Queries can then be issued at any time to find documents matching given search criteria. Vagabond's limits are mostly those of the host hardware.

http://www-local.essi.fr/~audoly/vagabond/search.html: Vagabond online, for local users (We are sorry for the others)
vagabond.tar.gz: Vagabond source and HTML archive to download


Vagabond's features

As a web search engine, Vagabond is not yet as efficient as the most popular ones (AltaVista, Yahoo, ...). Nevertheless, it is operational, and some of today's search engines lack capabilities that Vagabond offers. The main Vagabond features are listed below:

  • A recursive, link-following document collector robot that can be started from a given URL.
  • A web interface for:
      • searching documents
      • checking whether a document is already indexed
      • proposing a new site
  • Complex search requests:
      • word matching
      • exact sentence matching
      • optional (implicit), obligatory (+), or forbidden (-) words/sentences
      • word/sentence location (text, title, URL, ...)
  • Ordered results, so that the most official or explicitly declared sites related to the request appear first.
  • Web server load saving:
      • remote database
      • cached responses
  • A daemon that permanently:
      • keeps the database daemon alive
      • checks that the indexed documents still exist
      • grows the database by collecting the documents referenced in the indexed documents
      • collects the documents of proposed sites
      • controls the size of the cache folder
  • Commercial space activated by purchased words in a request.
  • Per-document bonuses to make documents appear higher in the search result list.
  • Network performance thanks to HTTP keep-alive support.
  • Disk space saving thanks to binaries shared between multiple services.
  • Database performance and reliability:
      • fast searching thanks to word indexes
      • date comparison to decide whether a document should be updated in the database or not
      • database table locking to prevent inconsistencies
      • signals intercepted by the daemon and the robot, so that the programs cannot be stopped in the middle of a complex database update
  • Web server time saving thanks to the use of a real database server (using Glimpse would have been unrealistic compared to a real database).




Vagabond's possible improvements

  • a dedicated document-container database, to reduce disk space requirements and database modification time
  • proxy support
  • a file indexing engine (to index your local documents)
  • configuration independent of the binaries (changing settings such as paths without recompiling)
  • a multitask strategy for the collecting robot, to improve performance


Installation

    Vagabond, even though hardware independent, is based upon three fixed technologies:
  • a Unix system (Solaris was used for this development), for remote commands and reliable, well designed features
  • the MySQL database
  • the EGCS (g++ compatible) C++ compiler with the STL (Standard Template Library) for strings and containers
    These three points can be seen as very restrictive, but since they are all freely available, they should not stop you from using Vagabond.
    The most difficult point is the Unix system. If you are using another system, you can try to install Linux or FreeBSD. Otherwise, you will have to completely change your strategy and deal with all the new problems you will meet.
    The last two points are very easy to solve: MySQL and EGCS are freely available on the web.


    Vagabond is a distributed system. The optimal setup uses 3 different hosts:

  • the host where your web server runs (web_host)
  • the host where the database runs (db_host)
  • the host where the vagabond's daemon runs (daemon_host)
    If you don't have enough hosts to dedicate to Vagabond, several roles can share the same host.


    To specify your settings, edit the Makefile in vagabond/prog/ and set the following variables:
    MYSQL_HOME=.../mysql
    WWW_PATH=.../www/Internet/vagabond
    DEFINES=-DROOT_PASSWD=\"root_password\" -DDB_HOST_NAME=\"db_host\" -DWWW_VAGABOND_PATH=\"$(WWW_PATH)\" -DMYSQL_HOME=\"$(MYSQL_HOME)\"
    stop_db:
      cd mysql && bin/mysqladmin -u root -proot_password shutdown

    WWW_PATH is the path to the root of Vagabond's web pages.
    You can edit whatever you find relevant in the Makefile to fit your system, but the placeholder values (root_password, db_host, and the paths) have to be changed.

    To enable the start_db and stop_db Makefile commands to work, you must create a symbolic link named mysql pointing to your MySQL installation folder:
    vagabond/prog> ln -s /u/dessi3/celestin/mysql mysql

    To enable Vagabond's daemon to remotely start the database daemon, you must have a properly configured ~/.rhosts file. For example, if your login is toto, your .rhosts file must contain this line:
    daemon_host toto


    To build Vagabond's binary files, type the following commands (this can take a long time, due to STL use, if your system is not very fast):
    vagabond/prog> touch .depend
    vagabond/prog> make depend
    vagabond/prog> make


    The next thing to do is to set the privileges.
    If this is the first time you use MySQL, you must first create the default privileges:
    @db_host:mysql> scripts/mysql_install_db

    Then, you must specify Vagabond's specific privileges:
    @db_host:vagabond/prog> nohup make start_db &
    @db_host:mysql> bin/mysql -u root mysql
    mysql> insert into user values('%','root','','Y','Y','Y','Y','Y','Y','Y','Y','Y','Y','Y','Y','Y','Y');
    mysql> insert into user values('localhost','www','','Y','N','N','N','N','N','N','N','N','N','N','N','N','N');
    mysql> insert into user values('db_host','www','','Y','N','N','N','N','N','N','N','N','N','N','N','N','N');
    mysql> insert into user values('%','www','','Y','N','N','N','N','N','N','N','N','N','N','N','N','N');
    mysql> flush privileges;
    mysql> \q
    @db_host:mysql> bin/mysqladmin -u root password root_password
    @db_host:mysql> bin/mysqladmin -u root -h db_host password root_password
    @other_host:mysql> bin/mysqladmin -u root -h db_host password root_password

    We strongly suggest that you read the MySQL manual for complete information and troubleshooting.

    Vagabond needs a cache folder where the CGI binary will be able to write temporary files. Consequently, you must create it and give write permission to the user running the web server, usually nobody in the group nobody.
    vagabond/prog/www> mkdir cache
    vagabond/prog/www> setfacl -s user::rwx,user:nobody:rwx,group::r-x,other:r-x,mask:rwx cache

    If you are not on a Solaris system, the only way is to give write permission to everyone:
    vagabond/prog/www> chmod 777 cache


    Your Vagabond setup should now be correctly installed and ready to work. The last step is to start Vagabond's daemon:
    > rsh daemon_host
    @daemon_host:> cd vagabond/prog
    @daemon_host:vagabond/prog> nohup bin/vagabond_daemon &
    @daemon_host:vagabond/prog> bin/add_commercials commercials.txt
    @daemon_host:vagabond/prog> exit

    If you do not want the daemon's default settings, you can override them on the command line. The easiest way to start Vagabond's daemon with your own settings is to write a script. Our script is bin/daemon_ok; you can modify it for your own settings.
    See the next chapter for more details on how to start the daemon or how to fill the document database.



    Programs

    Vagabond's programs are divided into two categories: the administrative programs and the CGI (web access) programs.


    Administrative programs

    vagabond/prog/bin/admin the real program used by the programs listed below (cannot be executed directly)
    vagabond/prog/bin/daemon_ok script for starting the daemon with personal settings
    vagabond/prog/bin/stop_daemon script for stopping the daemon (execute it on the daemon host)
    vagabond/prog/bin/vagabond_daemon the daemon
    vagabond/prog/bin/robot fills the database from a site
    vagabond/prog/bin/view views the database document list or information on a specified document
    vagabond/prog/bin/remove removes a document from the database
    vagabond/prog/bin/search searches for documents matching search criteria
    vagabond/prog/bin/bonus views a document's bonus
    vagabond/prog/bin/change_bonus sets a document's bonus
    vagabond/prog/bin/add_commercials adds commercial specifications
    vagabond/prog/bin/commercials lists the commercials attached to a specified word
    vagabond/prog/bin/remove_commercials removes commercials
    vagabond/prog/bin/status shows the database status

    web programs

    vagabond/prog/www/web.cgi the real program used by the CGIs listed below (cannot be executed directly)
    vagabond/prog/www/search.cgi searching CGI (used by www/search.html)
    vagabond/prog/www/addurl.cgi site proposing CGI (used by www/add_url.html)
    vagabond/prog/www/checkurl.cgi document URL presence checking CGI (used by www/check_url.html)

    bin/admin and www/web.cgi

    These two programs are the only two real binary files. To save disk space, all the administrative programs are simple symbolic links to bin/admin, and all CGI programs are symbolic links to www/web.cgi. Based on the name under which they are executed, they determine which service to run.
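    As an illustration, here is a minimal sketch of this dispatch mechanism. The service names and main functions below are simplified and hypothetical; the real dispatch is done in the admin and web sources.

    // Minimal sketch (hypothetical service names): dispatch on the executed program name.
    #include <iostream>
    #include <string>

    static int robot_main(int, char**) { std::cout << "robot service\n"; return 0; }
    static int view_main(int, char**)  { std::cout << "view service\n";  return 0; }

    int main(int argc, char** argv)
    {
        std::string name = argv[0];                    // e.g. "bin/robot", a link to bin/admin
        std::string::size_type slash = name.rfind('/');
        if (slash != std::string::npos)
            name = name.substr(slash + 1);             // keep only the basename
        if (name == "robot") return robot_main(argc, argv);
        if (name == "view")  return view_main(argc, argv);
        std::cerr << "unknown service: " << name << std::endl;
        return 1;
    }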


    bin/daemon_ok

    This program is a simple script showing how to give the daemon personal settings. It is our way of avoiding a dedicated configuration file, which is always impossible to find.


    bin/stop_daemon

    This program is another script. It looks for all the processes running under the name "vagabond_daemon", gets their PIDs, and kills them. The process listing is done with the "ps -Af" command on Solaris. If you are using SunOS 4 or Linux, you must replace this command with "ps -aux".


    bin/vagabond_daemon

    This program is Vagabond's daemon. A daemon should keep running even after you log off, so you must start it with nohup: "nohup bin/vagabond_daemon &". The other way is to use the detached option: "bin/vagabond_daemon -d". The daemon options are numerous. You can list them with the "-h" or "-help" option, which produces the following output:

    usage: vagabond_daemon options
    options are: (all periods are in seconds)
    -h -help : Print this help message
    -d -detached : Automatically detach the daemon process (default: false)
    -db_period N : Database lifeness checking process period (default: 60)
    -cache_period N : Cache folder cleaning process period (default: 3600)
    -cache_max N : Maximum number of file to keep in the cache folder when cleaning (default: 100)
    -check_period N : Database content checking and increasing process period (default: 60)
    -check_nb N : Number of documents to check and to use for database increase each period (default: 100)
    -growth_depth N : Recursion level when increasing the database (default: 1)
    -l -local : Don't go on other sites when increasing the database (default: false)
    -L -Local : Don't go in parent directories when increasing the database (default: false)
    -search_period N : Proposed sites collecting process period (default: 3600)
    -search_depth N : Proposed sites collecting search depth (default: 5)
    -confirmed : Don't collect unconfirmed proposed sites (default: false)
    -v -verbose : Display messages (default: false)
    The daemon is a set of 4 processes:
  • It ensures the database daemon is still alive (i.e. the database can be accessed).
    Periodically, it tries to connect to the database. If this is not possible, it restarts the database daemon through a remote shell to the database host.
  • It ensures that the documents referenced in the database still exist, and follows their links to grow the database.
    Periodically, it starts the robot from check_nb database document URLs. If the robot is unable to get a document, its reference is removed from the database. If a document is more recent than its database copy, the database is updated. The same operation is performed on the documents reached by following links (usually 1 level deep).
  • It collects proposed site documents.
    Periodically, the daemon starts a collector robot with search_depth (see the robot program description) on the proposed site URLs. The collecting is restricted to the proposed host and the proposed path. Sites are collected automatically, unless the daemon was started with the "-confirmed" option and the sites have not been manually validated.
    To validate a site, you must manually change the confirmed field of the proposals' rows you wish to validate.
    You must first connect to the database:
    mysql -u root --password=root_password -h db_host web
    Then, you can list the proposed sites:
    mysql> SELECT * from proposals;
    For example, to validate the ESSI site:
    mysql> UPDATE proposals SET confirmed=1 WHERE url='http://www.essi.fr/';
    You can even validate all the proposed sites:
    mysql> UPDATE proposals SET confirmed=1;
  • It controls the size of the cache folder content.
    Each time a search is performed, the web search service generates one or more cache files in "www/cache/". Thanks to them, the user can navigate through the results without using extra CPU time. Nevertheless, the number of files in this folder grows quickly. To prevent the folder and the disk from saturating, the daemon periodically cleans it, keeping only the cache_max most recent files.
    The periodic behaviour is defined by the xxx_period options. The daemon tries to respect the requested period, but only if the duration of one iteration is between 0 and half the period. Otherwise, the daemon sleeps for half the requested period (a minimal sketch of this scheduling loop is shown below).
    The "-l" option is very important. Combined with the "-confirmed" option, it allows you to index only a single site (this is the way the ESSI site is indexed).
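    A minimal sketch of this scheduling strategy (the function names are ours, not the daemon's):

    // Minimal sketch of the periodic scheduling described above (hypothetical names).
    #include <ctime>
    #include <unistd.h>

    static void run_one_iteration()
    {
        // e.g. check check_nb documents, collect proposed sites, or clean the cache
    }

    static void periodic_loop(unsigned period /* seconds */)
    {
        for (;;) {
            std::time_t start = std::time(0);
            run_one_iteration();
            unsigned elapsed = (unsigned)(std::time(0) - start);
            if (elapsed <= period / 2)
                sleep(period - elapsed);   // the iteration fits: respect the period
            else
                sleep(period / 2);         // too slow: just rest for half the period
        }
    }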


    bin/robot

    This program directly starts the collector robot. "bin/robot -h" produces the following result:
    usage: bin/robot [-h] [-help] [-l] [-local] [-L] [-Local] [-v] [-verbose] [-i] [-info] URL [recursion]
    -h -help : Print this help message
    -l -local : Don't go on other sites
    -L -Local : Don't go in parent directories
    -v -verbose : Print position informations while searching
    -i -info : Print html document informations each time a document is added
    A collector robot first gets the base URL. If the document is retrieved, it is processed and added to the database (date comparison helps the database decide whether the document has changed). If the recursion is not zero (zero is the default value), the robot is started on each URL referenced in the document, with the recursion value decreased by one.


    bin/view

    Views the database document list, or information on a specific document if a URL argument is given.
    usage: bin/view [URL]


    bin/remove

    Removes from the database the document referenced by the URL specified as argument.
    usage: bin/remove URL


    bin/search

    Searches for documents matching search criteria. The search request is given as a single argument: put it in a quoted string if it contains several words, and precede each quote character inside the request with a backslash (\).
    See the search help page for help request description.
    usage: bin/search "request"


    bin/bonus

    Views a document's bonus.
    usage: bin/bonus URL


    bin/change_bonus

    Sets a document's bonus.
    usage: bin/change_bonus URL bonus


    bin/add_commercials

    Adds commercial specifications.
    usage: bin/add_commercials [commercial_file]
    If no file is specified, the standard input is read instead.
    The commercial file is made of commercial specification rows.
    A commercial specification row has four fields, separated by hash signs (#).
    The first field is the word to be bought.
    The second field is the HTML text to insert as advertising when the word is present in a search request (ideally no word is shared by several commercials; otherwise a random choice, weighted by the price paid, is made).
    The third field is the price paid.
    The fourth field is the owner name or identifier (used for administrative operations).
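    As an illustration, a hypothetical row for a word SHOES, bought for 100 by an owner called acme, could look like this (all the values are invented):
    SHOES#<A HREF="http://www.acme.example/">Acme shoes, the best shoes!</A>#100#acme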


    bin/commercials

    Lists the commercials attached to a specified word.
    usage: bin/commercials WORD
    The WORD parameter must be an uppercase string with no accents.


    bin/remove_commercials

    Removes commercials according to commercial specification rows. The same format as for bin/add_commercials is used, but the price field is not relevant.
    usage: bin/remove_commercials [commercial_file]


    bin/status

    Shows the database status: it indicates whether the database daemon is alive. If the database daemon is not alive, the program tries to restart it automatically.


    www/search.cgi

    Searching CGI (used by www/search.html). It is the web equivalent to bin/search.


    www/addurl.cgi

    Site proposing CGI (used by www/add_url.html). It inserts the new URL into the proposals table with the confirmed field set to 0.


    www/checkurl.cgi

    Document URL presence checking CGI (used by www/check_url.html). It tests whether a URL references an indexed document, is present in the proposals table, or is not known at all.






    IMPLEMENTATION


    Working

    As announced in the presentation document, Vagabond depends on 3 main technologies:
  • a Unix system (Solaris was used for this development), for remote commands and reliable, well designed features
  • the MySQL database
  • the EGCS (g++ compatible) C++ compiler with the STL (Standard Template Library) for strings and containers
    Vagabond is a large set of programs; nevertheless, its search engine functionality revolves around two main components: the collector robot and the document searcher.



    The robot

    The robot, starting from a given URL, connects to web servers directly using the socket API. It speaks the HTTP protocol: it sends a request to retrieve the desired file, then reads the response headers and body. The headers are briefly analysed to determine whether the server supports keep-alive connections, whether the document is HTML, and its last modification date.
    Then, the text body is used to construct a "rich HTML document". A "rich HTML document" contains various pieces of information so that the robot and the database layer can process it easily:
    _ the URL referencing it (simply the one passed as argument to the constructor)
    _ the last modification date
    _ the title
    _ the summary: the beginning of the document text, shown with the URL in search results
    _ the full HTML document size
    _ the list of referenced URLs
    _ the word mapping: for each word present, its location type (URL, body, ...), its repetition count, and its occurrence positions
    The robot permanently maintains two lists: the retrieved documents and the documents still to retrieve (computed from the referenced URL lists of the retrieved documents). When the caller asks for the next available document, the robot first looks in the retrieved document list. If it is empty, it retrieves the first document of the to-retrieve list: this adds one document to the retrieved list, and zero or more new URLs to the to-retrieve list. The robot avoids direct and indirect cyclic references by maintaining a list of all the visited URLs.
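    A minimal sketch of this loop, with hypothetical class and member names (the real code is in robot.cc/hh and does real HTTP retrieval):

    // Minimal sketch of the robot loop (hypothetical names, no real HTTP here).
    #include <list>
    #include <set>
    #include <string>

    struct RichHtmlDocument {
        std::string url;
        std::list<std::string> referenced_urls;
        // title, summary, last modification date, size, word mapping, ...
    };

    // Stand-in for the real retrieval: connect, send the HTTP request, process the body.
    static RichHtmlDocument retrieve(const std::string& url)
    {
        RichHtmlDocument doc; doc.url = url; return doc;
    }

    class Robot {
        std::list<RichHtmlDocument> retrieved;    // documents already retrieved
        std::list<std::string>      to_retrieve;  // URLs found but not retrieved yet
        std::set<std::string>       visited;      // avoids direct and indirect cycles
    public:
        Robot(const std::string& start_url) { to_retrieve.push_back(start_url); }

        bool next_document(RichHtmlDocument& out)
        {
            while (retrieved.empty() && !to_retrieve.empty()) {
                std::string url = to_retrieve.front();
                to_retrieve.pop_front();
                if (!visited.insert(url).second) continue;     // URL already seen
                RichHtmlDocument doc = retrieve(url);
                to_retrieve.insert(to_retrieve.end(),          // enqueue referenced URLs
                                   doc.referenced_urls.begin(),
                                   doc.referenced_urls.end());
                retrieved.push_back(doc);
            }
            if (retrieved.empty()) return false;
            out = retrieved.front();
            retrieved.pop_front();
            return true;
        }
    };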



    The html document processing

    HTML document processing is important for a robot, but not as critical as for a web browser. We already accept that not every page in the world can be handled perfectly. The aim of the robot is to get as many documents as possible and to structure the document words as well as possible. Thus, the two main processing tasks are word structuring and referenced URL finding.


    Word structuring

    An HTML document is roughly a header and a body section, the latter being composed of text and tags. Consequently, the first task is to extract the text information and rebuild a continuous text. The tags are lightly processed to decide whether they must be suppressed or converted to a separator character (e.g. <B> is suppressed, while <BR> is replaced by a separator character).
    Then, the last step, using a list of separator characters, is to split the continuous text into a word sequence. The same kind of processing is applied to the URL, the title, and the keyword information.
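    A minimal sketch of this tag stripping and word splitting, under the simplifying assumption that every tag becomes a separator (the real code suppresses some tags, e.g. <B>, and uses an explicit separator list):

    // Minimal sketch: remove the tags, then split the remaining text into words.
    #include <cctype>
    #include <string>
    #include <vector>

    static std::vector<std::string> words_of(const std::string& html)
    {
        // 1. rebuild a continuous text, turning every tag into a separator
        std::string text;
        bool in_tag = false;
        for (std::string::size_type i = 0; i < html.size(); ++i) {
            if (html[i] == '<')      { in_tag = true; text += ' '; }
            else if (html[i] == '>') { in_tag = false; }
            else if (!in_tag)        { text += html[i]; }
        }
        // 2. split the continuous text into a word sequence
        std::vector<std::string> words;
        std::string current;
        for (std::string::size_type i = 0; i <= text.size(); ++i) {
            if (i < text.size() && std::isalnum((unsigned char)text[i]))
                current += text[i];
            else if (!current.empty()) { words.push_back(current); current.clear(); }
        }
        return words;
    }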


    URL finding

    The referenced URL finding is linked to the word structuring. Whatever the kind of reference (link, area, frame, ...), it can only appear inside a tag. Thus, when the tags are separated from the text, they are analysed to find referenced URLs.



    The database storage

    Once a "rich HTML document" has been retrieved, it is stored in the database. For modifications, the database must be accessed with the root login. The database mainly contains two tables used for collecting and searching: the "documents" table and the "words" table.
    The "documents" table contains the document information displayed when a document matches a search request: the URL, the title, the summary, the last modification date, and the size. In addition, to ease database modification, it contains the list of all the words present in the document.
    The "words" table is a simple association from a word to a string-encoded piece of information. The encoded information is mostly the list of the documents using the word. Along with each document key, the location kind, the repetition count, and the precise positions are encoded too.
    Consequently, to add a document, the database interface must insert a document row and update all the word rows affected by this document.



    The search

    As defined in Vagabond's search specification, different kinds of search are possible. Nevertheless, a word search, a sentence search, and a location-restricted search can all be abstracted as a term. A request then consists of a set of optional terms, a set of obligatory terms, and a set of forbidden terms. Each of the three sets may be empty, but there must always be either optional terms or obligatory terms.
    Before searching for documents, all the information related to the words present in the search request is retrieved from the database. The goal is then to build a set of document indexes, retrieve the document information from these indexes, and return the list to the requester.
    To build the document set, if the set of obligatory terms is not empty, we compute the intersection of the documents matching each term. In practice, we build the document set from the first term, and remove documents whenever they do not match one of the next terms.
    If the set of obligatory terms is empty, we build the document set from the optional terms, proceeding the opposite way: we compute the union of the documents matching each term.
    If the set of forbidden terms is not empty, we proceed with a difference: each document matching a term of the forbidden set is removed from the document set.
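    A minimal sketch of this set building with STL sets (matching_documents is a hypothetical stand-in for the real term matching):

    // Minimal sketch of the document set building (hypothetical helper names).
    #include <set>
    #include <vector>

    typedef std::set<int> DocSet;   // sets of document keys
    typedef int Term;               // stands for a word, sentence or located term

    // Hypothetical stand-in: the real version decodes the "words" table information.
    static DocSet matching_documents(const Term& t) { DocSet s; s.insert(t); return s; }

    static DocSet build_document_set(const std::vector<Term>& obligatory,
                                     const std::vector<Term>& optional,
                                     const std::vector<Term>& forbidden)
    {
        DocSet result;
        if (!obligatory.empty()) {
            // intersection: start from the first term, drop non-matching documents
            result = matching_documents(obligatory[0]);
            for (std::vector<Term>::size_type i = 1; i < obligatory.size(); ++i) {
                DocSet keep = matching_documents(obligatory[i]);
                for (DocSet::iterator it = result.begin(); it != result.end(); )
                    if (keep.count(*it) == 0) result.erase(it++); else ++it;
            }
        } else {
            // union of the documents matching each optional term
            for (std::vector<Term>::size_type i = 0; i < optional.size(); ++i) {
                DocSet docs = matching_documents(optional[i]);
                result.insert(docs.begin(), docs.end());
            }
        }
        // difference: drop the documents matching a forbidden term
        for (std::vector<Term>::size_type i = 0; i < forbidden.size(); ++i) {
            DocSet docs = matching_documents(forbidden[i]);
            for (DocSet::const_iterator it = docs.begin(); it != docs.end(); ++it)
                result.erase(*it);
        }
        return result;
    }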
    After these operations, we have the document set corresponding to the request. The last step consists in ordering the documents. For each document, a weight is computed by a formula we specified:
    doc_weight <- 0
    For each optional or obligatory term
        doc_weight <- doc_weight+TERM_BONUS
        For each location type matched in the document
           doc_weight <- doc_weight+location_weight

    TERM_BONUS is 1000, to give importance to match completeness.
    location_weight is as follows:
    _ in the body: 1
    _ in the keywords: 10
    _ in the title: 100
    _ in the URL file part: 1000
    _ in the URL server part: 2000
    _ in the URL when the file is the site root: 4000
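    A minimal sketch of the weight computation above (the types are simplified and ours, not Vagabond's):

    // Minimal sketch of the document weight formula (simplified types).
    #include <vector>

    static const int TERM_BONUS = 1000;   // rewards documents matching many terms

    // For one document: one entry per matched term, holding the weights of the
    // location types matched by that term (e.g. {1, 100} for body and title).
    typedef std::vector< std::vector<int> > MatchedTerms;

    static int document_weight(const MatchedTerms& terms)
    {
        int weight = 0;
        for (MatchedTerms::size_type t = 0; t < terms.size(); ++t) {
            weight += TERM_BONUS;                  // one bonus per matched term
            for (std::vector<int>::size_type l = 0; l < terms[t].size(); ++l)
                weight += terms[t][l];             // plus each location weight
        }
        return weight;
    }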

    The document-matches-a-term operation is abstracted into a function. When the term is a word, it is easy, since the word information directly tells whether the word is present in the document, and in which location kind.
    When the term is a sentence, the complexity is higher. The first word of the sentence must be present in the document. Then, for each of its positions, we check whether the following words of the sentence match the document at the expected positions (see the sketch below).
    The time cost is not excessive, but it shows the performance difference between word matching and sentence matching. Determining whether a word of a given kind is present in a document takes constant time. In contrast, the worst-case time to determine whether a sentence of a given kind is present is proportional to the number of words in the sentence multiplied by the number of occurrences of its first word. In practice, the complexity is nearly proportional to the sum rather than the product, since sentence matching usually fails at the beginning of the sentence.
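    A minimal sketch of the sentence check, assuming each sentence word comes with the sorted list of its positions in the document:

    // Minimal sketch of sentence matching from word positions (simplified types).
    #include <algorithm>
    #include <vector>

    // positions[i] holds the occurrence positions of the i-th sentence word, sorted.
    static bool sentence_matches(const std::vector< std::vector<int> >& positions)
    {
        if (positions.empty() || positions[0].empty()) return false;
        // try every occurrence of the first sentence word
        for (std::vector<int>::size_type o = 0; o < positions[0].size(); ++o) {
            int start = positions[0][o];
            bool ok = true;
            // the i-th sentence word must appear i positions after the first one
            for (std::vector< std::vector<int> >::size_type i = 1; ok && i < positions.size(); ++i)
                ok = std::binary_search(positions[i].begin(), positions[i].end(),
                                        start + (int)i);
            if (ok) return true;
        }
        return false;
    }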



    The database encodings

    Two text encodings are used in the database.
    The first and simplest is the document word list encoding. It is simply an integer list: only the word keys are stored. The list is converted to text by writing the integers preceded by the list size (e.g. (1,4,45,1) => "4 1 4 45 1").
    The second encoding is the word information. It uses the same integer-list-to-string conversion. Since the list size is stored first, lists can be stored consecutively and recursively. The word information is structured as follows:
    list ( document number , list ( location kind , list (positions) ) )
    For example:
    (1 , (text=1 , (1,5,6,19,35) , title=100 , (4)) , 5 , (text=1 , (2,3,4) , keywords=10 , (1) , url_file=1000 , (2))) => "2 1 2 1 5 1 5 6 19 35 100 1 4 5 3 1 3 2 3 4 10 1 1 1000 1 2"
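    A minimal sketch of this encoding and decoding (the real code is in the database interface; the names are ours):

    // Minimal sketch of the "size followed by values" integer list encoding.
    #include <sstream>
    #include <string>
    #include <vector>

    static std::string encode(const std::vector<int>& values)
    {
        std::ostringstream out;
        out << values.size();                      // the list size comes first...
        for (std::vector<int>::size_type i = 0; i < values.size(); ++i)
            out << ' ' << values[i];               // ...then the values
        return out.str();                          // (1,4,45,1) => "4 1 4 45 1"
    }

    // Reads one list from the stream; since the size comes first, lists can be
    // stored consecutively and recursively, as in the word information encoding.
    static std::vector<int> decode(std::istringstream& in)
    {
        std::vector<int>::size_type size = 0;
        in >> size;
        std::vector<int> values(size);
        for (std::vector<int>::size_type i = 0; i < size; ++i)
            in >> values[i];
        return values;
    }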



    Sources description

    Listing the source directory gives the following:
    dep*
    Makefile
    admin/
    bin/
    common/
    robot/
    web/
    www@
    Makefile contains the compilation rules, and dep is a tool for generating well-formed dependency rules.
    bin/ contains the binaries for administrative tasks, www/ is a link to the folder containing the web pages and the CGI binaries, and the 4 other directories contain the source files.

    vagabond/prog/bin/

    vagabond/prog/bin/add_commercials@
    vagabond/prog/bin/admin*
    vagabond/prog/bin/bonus@
    vagabond/prog/bin/change_bonus@
    vagabond/prog/bin/commercials@
    vagabond/prog/bin/remove@
    vagabond/prog/bin/remove_commercials@
    vagabond/prog/bin/robot@
    vagabond/prog/bin/search@
    vagabond/prog/bin/status@
    vagabond/prog/bin/stop_daemon*
    vagabond/prog/bin/vagabond_daemon@
    vagabond/prog/bin/view@
    This folder contains the binary files. These programs have already been presented in a previous section.

    vagabond/prog/www/

    vagabond/prog/www/add_url.html
    vagabond/prog/www/addurl.cgi@
    vagabond/prog/www/cache/
    vagabond/prog/www/check_url.html
    vagabond/prog/www/checkurl.cgi@
    vagabond/prog/www/commercials/
    vagabond/prog/www/help.html
    vagabond/prog/www/search.cgi@
    vagabond/prog/www/search.html
    vagabond/prog/www/vagabond.gif
    vagabond/prog/www/web.cgi
    This folder contains the CGI binary files, the HTML pages, and other resources used by the CGIs or the pages.
    The CGI binary files have already been presented in a previous section.
    add_url.html, check_url.html, and search.html are the HTML pages using their respective CGIs.

    vagabond/prog/admin/

    vagabond/prog/admin/add_commercials.cc
    vagabond/prog/admin/add_commercials.hh
    vagabond/prog/admin/admin.cc
    vagabond/prog/admin/bonus.cc
    vagabond/prog/admin/bonus.hh
    vagabond/prog/admin/change_bonus.cc
    vagabond/prog/admin/change_bonus.hh
    vagabond/prog/admin/commercials.cc
    vagabond/prog/admin/commercials.hh
    vagabond/prog/admin/daemon.cc
    vagabond/prog/admin/daemon.hh
    vagabond/prog/admin/remove.cc
    vagabond/prog/admin/remove.hh
    vagabond/prog/admin/robot.cc
    vagabond/prog/admin/robot.hh
    vagabond/prog/admin/search.cc
    vagabond/prog/admin/search.hh
    vagabond/prog/admin/status.cc
    vagabond/prog/admin/status.hh
    vagabond/prog/admin/view.cc
    vagabond/prog/admin/view.hh
    This folder contains the source files for the bin/ directory.
    It is not necessary to describe each source file, but it is useful to know how a single binary provides multiple services. When admin is called, the main(int,char**) function in admin.cc is executed.
    The main function then takes the first argument (the program name) and extracts its basename (i.e. the name without the directory). This name is not admin but the name of the link used. A comparison with the known services is then performed, and the right service's main function is called with the arguments.

    vagabond/prog/web/

    vagabond/prog/web/add_url.cc
    vagabond/prog/web/add_url.hh
    vagabond/prog/web/cgic.cc
    vagabond/prog/web/cgic.hh
    vagabond/prog/web/check_url.cc
    vagabond/prog/web/check_url.hh
    vagabond/prog/web/search.cc
    vagabond/prog/web/search.hh
    vagabond/prog/web/web.cc
    This folder contains the source files for the CGIs in the www/ directory.
    The same technique as for admin is used to share one binary between multiple CGIs.
    The 3 CGIs use cgic, a handy library for retrieving CGI parameters and sending the CGI response.

    vagabond/prog/common/

    vagabond/prog/common/date.cc
    vagabond/prog/common/date.hh
    vagabond/prog/common/db.cc
    vagabond/prog/common/db.hh
    vagabond/prog/common/exception.hh
    vagabond/prog/common/html_document.cc
    vagabond/prog/common/html_document.hh
    vagabond/prog/common/http_url.cc
    vagabond/prog/common/http_url.hh
    vagabond/prog/common/misc.cc
    vagabond/prog/common/misc.hh
    vagabond/prog/common/search_request.cc
    vagabond/prog/common/search_request.hh
    vagabond/prog/common/starter.cc
    vagabond/prog/common/starter.hh
    vagabond/prog/common/text_line.cc
    vagabond/prog/common/text_line.hh
    This folder contains files with miscellaneous purposes.
    date.cc/hh provides a date interface for conversions between dates and text.
    db.cc/hh is the database interface. It communicates with the caller through processed HTML documents, search requests and URLs. All the database storage and searching is done here. Commercials and proposals are also interfaced here.
    exception.hh provides simple named exceptions.
    html_document.cc/hh provides storage classes for processed HTML documents and for the simpler HTML documents returned by a search request.
    http_url.cc/hh is an interface for conversions between HTTP URLs and text.
    misc.cc/hh contains unclassified handy functions.
    search_request.cc/hh is an interface for parsing complex search requests from text.
    starter.cc/hh reports whether the database is alive, and can restart it.
    text_line.cc/hh enables simple text line retrieval from a file descriptor (e.g. a network file descriptor).

    vagabond/prog/robot/

    vagabond/prog/robot/client_connection.cc
    vagabond/prog/robot/client_connection.hh
    vagabond/prog/robot/connection.cc
    vagabond/prog/robot/connection.hh
    vagabond/prog/robot/http_client_connection.cc
    vagabond/prog/robot/http_client_connection.hh
    vagabond/prog/robot/robot.cc
    vagabond/prog/robot/robot.hh
    This folder contains the robot class and classes for network connections.
    connection.cc/hh is an interface for connections, initialized from an open file descriptor.
    client_connection.cc/hh is an interface for connections opened as a client to a server.
    http_client_connection.cc/hh is an interface for retrieving documents through the HTTP protocol.
    robot.cc/hh is the HTML document collector robot.





    PERFORMANCES

    The first version of Vagabond was a fully database-based search engine: associations were made through keys shared between database rows. With this strategy, the HTTP server was completely relieved of the CPU time spent on searching. On the other hand, the storage and the response time were not acceptable: the storage requirement was 6 times the document sizes, and the response times were only acceptable for single-word searches. The latter point could have been improved by database optimizations such as indexes, but no sentence search was even attempted. One can easily imagine the storage requirements and response times with the word positions in the database and sentence matching requests.
    After discussing with our teachers, we tried their solution, encoding the word information and post-processing the database results, even though we were not very convinced. The results were surprisingly efficient. We now use more CPU on the web server, but the storage requirement is only a little above the document sizes (+30%), and the response time is acceptable even for complex requests with multiple sentences (less than 1 second for word searches, and usually less than 10 seconds for complex requests).
    All these tests were made with a database indexing nearly the whole ESSI web site (official and student pages, thanks to link following).
    Unfortunately, we must admit a lack of efficiency in document adding and database locking.
    Storing all the indexing information of a word as a single string quickly makes some of these strings very large. Understandably, some words are indexed in most of the documents (e.g. 'A', 'I', 'THE', ...). With the string encoding strategy, the string must be decoded and re-encoded for every word of a document. For the most frequent words, when the database contains many documents (over 10000, for example), this operation can take over one second. Thus, if the document contains many words, inserting or removing it can take much more than one minute. This lack of performance only affects the administrator, though, not the user performing search requests.
    The other lack of efficiency concerns the MySQL database. The only available locking is on a whole table, for reading or writing. Consequently, two documents cannot be inserted at the same time: they alternately get write access to the database.





    KNOWLEDGE

    The knowledge we gained can be classified into two categories: new knowledge and improved knowledge.

    New knowledge:

  • database interfacing from a compiled language
  • web robot programming
  • HTML parsing
  • document indexing


Improved knowledge:

  • network programming with the socket API
  • HTTP protocol implementation
  • programming with the STL
  • Unix programming
  • SQL querying





    CONTINUATION AND HOPES

    This project represents a huge amount of work. The subject really motivated us, since web search engines are heavily used nowadays.
    Vagabond has features that not all web search engines have. For example, Vagabond performs sentence matching and offers commercial space, whereas Yahoo and HotBot do not (AltaVista does both, and Google performs sentence matching).
    Its daemon ensures permanent consistency, whereas htsearch (the ESSI web search engine) is still initialized with old documents.
    Nevertheless, Vagabond still lacks performance and would benefit from a dedicated database and a multithreaded collector robot.
    We think it could be improved in a new internet project next year, and that it could already be used instead of htsearch as the ESSI web search engine (we would take care of installing it).




    Gilles Audoly
    Frank Celestini