# Run
When developing locally, this directory will contain run-time data required for the search engine. In a clean check-out, it only contains the tools required to bootstrap this directory structure.
## Requirements
While the system is designed to run on bare metal in production, for local development you're strongly encouraged to use Docker or Podman. They're a bit of a pain to install, but if you follow this guide you're on the right track.
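If you're unsure whether your container tooling is ready, a quick sanity check looks something like the sketch below (shown for Docker; Podman users would use `podman` and a compose-compatible wrapper instead):

```shell
# Check that the container runtime is installed and the daemon is reachable
$ docker --version
$ docker info

# The steps below use docker-compose, so make sure it's on your PATH as well
$ docker-compose --version
```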
## Set up
To go from a clean check-out of the git repo to a running search engine, follow these steps. All commands are assumed to be run from the project root.
- Run the one-time setup. It will create the basic runtime directory structure and download some models and data that don't come with the git repo.

  ```shell
  $ run/setup.sh
  ```
- Compile the project and build the Docker images.

  ```shell
  $ ./gradlew assemble docker
  ```
- Download a sample of crawl data, process it and load the metadata into the database. The data is only downloaded once. Grab a cup of coffee; this takes a few minutes. This step needs to be re-run whenever the crawler or processor has changed.

  ```shell
  $ docker-compose up -d mariadb
  $ run/reconvert.sh
  ```
- Bring the system online. We'll run it in the foreground in the terminal this time, since it's educational to see the logs. Add `-d` to run it in the background.

  ```shell
  $ docker-compose up
  ```
- Since we've just processed new crawl data, the system needs to construct static indexes. Wait for the line 'Auto-conversion finished!'
When all is done, it should be possible to visit http://localhost:8080 and try a few searches!
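If the search page doesn't come up, a rough way to see what state the system is in is to list the compose services and poke the front end over HTTP. This is only a sanity-check sketch; the service names and port mapping come from the project's docker-compose.yml, so adjust to what you find there.

```shell
# List the containers started by docker-compose and their current state
$ docker-compose ps

# Tail the logs (name a single service after 'logs' to narrow it down)
$ docker-compose logs --tail=50

# Check that something is answering on port 8080
$ curl -I http://localhost:8080
```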
## Other Crawl Data
By default, `reconvert.sh` will load the medium dataset. This is appropriate for a demo, but other datasets also exist.
| Set | Description |
|-----|-------------|
| s   | 1000 domains, suitable for low-end machines |
| m   | 2000 domains |
| l   | 5000 domains |
| xl  | 50,000 domains, basically pre-prod. Warning: 5h+ processing time |
To switch datasets, run e.g.

```shell
$ docker-compose up -d mariadb
$ ./run/reconvert.sh l
```
## Experiment Runner
The script `experiment.sh` is a launcher for the experiment runner, which is useful when evaluating new algorithms for processing crawl data.