# Run
When developing locally, this directory will contain run-time data required for
the search engine. In a clean check-out, it only contains the tools required to
bootstrap this directory structure.

## Requirements

While the system is designed to run on bare metal in production,
for local development you're strongly encouraged to use docker
or podman. These are a bit of a pain to install, but if you follow
[this guide](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository)
you're on the right track.

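Once installed, a quick sanity check (just a suggestion, not part of the setup scripts) is to verify that both tools respond:

```shell
# Sanity check: both commands should print a version string.
# Podman users can typically substitute podman / podman-compose here.
$ docker --version
$ docker-compose --version
```
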
## Set up

To go from a clean checkout of the git repo to a running search engine,
follow these steps. You're assumed to be in the project root the whole time.

1. Run the one-time setup; it will create the
basic runtime directory structure and download some models and data that don't
come with the git repo.

```
$ run/setup.sh
```

2. Compile the project and build the docker images.

```
$ ./gradlew assemble docker
```

3. Download a sample of crawl data, process it, and load the metadata
into the database. The data is only downloaded once. Grab a cup of coffee; this takes a few minutes.
This step needs to be repeated whenever the crawler or processor has changed.

```
$ docker-compose up -d mariadb
$ run/reconvert.sh
```

4. Bring the system online. We'll run it in the foreground in the terminal this time
because it's educational to see the logs. Add `-d` to run in the background.

```
$ docker-compose up
```

5. Since we've just processed new crawl data, the system needs to construct static
indexes. Wait for the line 'Auto-conversion finished!' to appear in the log output.

When all is done, it should be possible to visit
[http://localhost:8080](http://localhost:8080) and try a few searches!
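If you'd rather check from the command line first, a simple smoke test (just a suggestion, not part of the setup scripts) is to request the front page:

```shell
# Hypothetical smoke test: the front end should answer with an HTTP response
$ curl -I http://localhost:8080
```
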
## Other Crawl Data

By default, `reconvert.sh` will load the medium dataset. This is appropriate for a demo,
but other datasets also exist.

| Set | Description                                                                 |
|-----|-----------------------------------------------------------------------------|
| s   | 1,000 domains, suitable for low-end machines                                |
| m   | 2,000 domains                                                               |
| l   | 5,000 domains                                                               |
| xl  | 50,000 domains, basically pre-prod.<br><b>Warning</b>: 5h+ processing time |

To switch datasets, run e.g.

```shell
$ docker-compose up -d mariadb
$ ./run/reconvert.sh l
```

## Experiment Runner

The script `experiment.sh` is a launcher for the experiment runner, which is useful when
evaluating new algorithms for processing crawl data.
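The arguments it accepts are defined by the script itself; as a purely hypothetical sketch, assuming the experiment name is passed as an argument, an invocation might look like:

```shell
# Hypothetical invocation; consult experiment.sh for the arguments it actually expects
$ run/experiment.sh my-experiment
```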