# Run
When developing locally, this directory will contain run-time data required for
the search engine. In a clean check-out, it only contains the tools required to
bootstrap this directory structure.
## Requirements
While the system is designed to run on bare metal in production,
for local development you're strongly encouraged to use Docker
or Podman. These are a bit of a pain to install, but if you follow
[this guide](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository)
you're on the right track.
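
To check whether either runtime is already installed, something along these lines can help. It is only a minimal sketch: the helper name `first_available_command` is made up for illustration and is not part of this repo's tooling.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with this repo: print the first of the
# given command names that can be found on PATH, or fail if none is found.
first_available_command() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    echo "none of these commands were found on PATH: $*" >&2
    return 1
}

# Assumed usage:
#   runtime=$(first_available_command docker podman) || exit 1
#   echo "Using $runtime"
```

The remaining steps in this document use `docker-compose`, so finding `podman` alone does not guarantee they work unchanged.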
## Set up
To go from a clean checkout of the git repo to a running search engine,
follow these steps. All commands are assumed to be run from the project root.
1. Run the one-time setup. It creates the basic runtime directory structure and
downloads some models and data that don't come with the git repo.
```
$ run/setup.sh
```
2. Compile the project and build the Docker images.
```
$ ./gradlew assemble docker
```
3. Download a sample of crawl data, process it, and load the metadata
into the database. The data is only downloaded once. Grab a cup of coffee, this takes a few minutes.
This step needs to be repeated whenever the crawler or processor has changed.
```
$ docker-compose up -d mariadb
$ run/reconvert.sh
```
4. Bring the system online. We'll run it in the foreground in the terminal this time
because it's educational to see the logs. Add `-d` to run in the background.
```
$ docker-compose up
```
5. Since we've just processed new crawl data, the system needs to construct static
indexes. Wait for the line `Auto-conversion finished!` to appear in the log output.

When all is done, it should be possible to visit
[http://localhost:8080](http://localhost:8080) and try a few searches!
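
If the system was started in the background with `-d`, waiting for that log line can be scripted. The following is a rough sketch rather than something shipped in `run/`: the helper name `wait_for_line` is made up, and it assumes the message appears verbatim in the combined `docker-compose logs` output.

```shell
#!/bin/sh
# Hypothetical helper: read log lines from stdin and succeed once a line
# containing the given fixed string has been seen. Fails if the input ends
# without a match.
wait_for_line() {
    grep -q -F -- "$1"
}

# Assumed usage from the project root, after `docker-compose up -d`:
#   docker-compose logs -f | wait_for_line 'Auto-conversion finished!' \
#       && echo "Indexes are ready, try http://localhost:8080"
```

Because `logs -f` keeps following the output, the pipeline may only return once the services write another line after the match.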
## Other Crawl Data
By default, `reconvert.sh` will load the medium dataset (`m`). This is appropriate for a demo,
but other datasets also exist.

| Set | Description                                                                |
|-----|----------------------------------------------------------------------------|
| s   | 1,000 domains, suitable for low-end machines                               |
| m   | 2,000 domains                                                              |
| l   | 5,000 domains                                                              |
| xl  | 50,000 domains, basically pre-prod.<br><b>Warning</b>: 5h+ processing time |

To switch datasets, pass the set name to `reconvert.sh`, e.g.
```shell
$ docker-compose up -d mariadb
$ ./run/reconvert.sh l
```
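
When scripting this, it may be worth rejecting unknown set names before starting a long conversion. The sketch below is illustrative only: `is_known_dataset` is a made-up helper, and how `reconvert.sh` itself reacts to an unrecognized name is not covered here.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for the set names listed in the table
# above (s, m, l, xl).
is_known_dataset() {
    case "$1" in
        s|m|l|xl) return 0 ;;
        *)        return 1 ;;
    esac
}

# Assumed usage from the project root:
#   set_name=l
#   is_known_dataset "$set_name" && ./run/reconvert.sh "$set_name"
```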
## Experiment Runner
The script `experiment.sh` is a launcher for the experiment runner, which is useful when
evaluating new algorithms for processing crawl data.