# Run
When developing locally, this directory will contain run-time data required for the search engine. In a clean check-out, it only contains the tools required to bootstrap this directory structure.
## Requirements
While the system is designed to run on bare metal in production, for local development you're strongly encouraged to use Docker or Podman. They're a bit of a pain to install, but if you follow this guide you're on the right track.
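If you're unsure whether your container tooling is ready, a quick sanity check looks something like the sketch below (shown for Docker; Podman users would use `podman` and a compose-compatible wrapper instead):

```shell
# Check that the container runtime is installed and the daemon is reachable
$ docker --version
$ docker info

# The steps below use docker-compose, so make sure it's on your PATH as well
$ docker-compose --version
```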
## Set up
To go from a clean check-out of the git repo to a running search engine, follow these steps. All commands are assumed to be run from the project root.
- Run the one-time setup. It will create the basic runtime directory structure and download some models and data that don't come with the git repo.

  ```shell
  $ run/setup.sh
  ```
- Compile the project and build the Docker images.

  ```shell
  $ ./gradlew assemble docker
  ```
- Download a sample of crawl data, process it and load the metadata into the database. The data is only downloaded once. Grab a cup of coffee; this takes a few minutes. This step needs to be re-run whenever the crawler or processor has changed.

  ```shell
  $ docker-compose up -d mariadb
  $ run/reconvert.sh
  ```
- Bring the system online. We'll run it in the foreground in the terminal this time, since it's educational to see the logs. Add `-d` to run it in the background.

  ```shell
  $ docker-compose up
  ```
- Since we've just processed new crawl data, the system needs to construct static indexes. Wait for the line 'Auto-conversion finished!'
When all is done, it should be possible to visit http://localhost:8080 and try a few searches!
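If the search page doesn't come up, a rough way to see what state the system is in is to list the compose services and poke the front end over HTTP. This is only a sanity-check sketch; the service names and port mapping come from the project's docker-compose.yml, so adjust to what you find there.

```shell
# List the containers started by docker-compose and their current state
$ docker-compose ps

# Tail the logs (name a single service after 'logs' to narrow it down)
$ docker-compose logs --tail=50

# Check that something is answering on port 8080
$ curl -I http://localhost:8080
```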
## Other Crawl Data
By default, `reconvert.sh` will load the medium dataset. This is appropriate for a demo, but other datasets also exist.
| Set | Description |
|-----|-------------|
| s   | 1000 domains, suitable for low-end machines |
| m   | 2000 domains |
| l   | 5000 domains |
| xl  | 50,000 domains, basically pre-prod. Warning: 5h+ processing time |
To switch datasets, run e.g.

```shell
$ docker-compose up -d mariadb
$ ./run/reconvert.sh l
```
## Experiment Runner
The script `experiment.sh` is a launcher for the experiment runner, which is useful when evaluating new algorithms for processing crawl data.