Run

When developing locally, this directory will contain run-time data required for the search engine. In a clean check-out, it only contains the tools required to bootstrap this directory structure.
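For orientation, the bootstrap tooling that does ship with the check-out is roughly the following (descriptions taken from the repository listing and the sections below):

  setup.sh         one-time setup script (step 1 below)
  reconvert.sh     downloads and processes sample crawl data (step 3 below)
  experiment.sh    launcher for the experiment runner (see the last section)
  nginx-site.conf  nginx configuration for the local setup
  template/conf    configuration templates
  test-data        test assets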

Requirements

While the system is designed to run bare metal in production, for local development you're strongly encouraged to use Docker or Podman. They can be a bit of a pain to install, but if you follow the official installation guide for your platform you're on the right track.
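To check that the prerequisites are in place, the following should print version numbers (substitute the podman equivalents if you went that route):

$ docker --version
$ docker-compose --version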

Set up

To go from a clean check-out of the git repo to a running search engine, follow these steps. All commands are assumed to be run from the project root.

  1. Run the one-time setup; it creates the basic runtime directory structure and downloads some models and data that don't come with the git repo.
$ run/setup.sh
  2. Compile the project and build the Docker images.
$ ./gradlew assemble docker
  3. Download a sample of crawl data, process it, and load the metadata into the database. The data is only downloaded once. Grab a cup of coffee; this takes a few minutes. This needs to be redone whenever the crawler or processor has changed.
$ docker-compose up -d mariadb
$ run/reconvert.sh
  4. Bring the system online. We'll run it in the foreground in the terminal this time, because it's educational to see the logs. Add -d to run it in the background.
$ docker-compose up
  5. Since we've just processed new crawl data, the system needs to construct static indexes. Wait for the line 'Auto-conversion finished!' (a log-tailing sketch for detached runs follows this list).
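If you opted to run the system detached with -d, the same line can be waited for by following the logs; a minimal sketch:

$ docker-compose up -d
$ docker-compose logs -f | grep -m 1 'Auto-conversion finished!'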

When all is done, it should be possible to visit http://localhost:8080 and try a few searches!
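For a quick smoke test from the command line instead of the browser, something like this should return an HTTP response once the front-end is up:

$ curl -I http://localhost:8080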

Other Crawl Data

By default, reconvert.sh will load the medium ('m') dataset. This is appropriate for a demo, but other datasets also exist.

Set  Description
s    1,000 domains, suitable for low-end machines
m    2,000 domains
l    5,000 domains
xl   50,000 domains, basically pre-prod. Warning: 5h+ processing time.

To switch datasets, run e.g.

$ docker-compose up -d mariadb
$ ./run/reconvert.sh l
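After reprocessing finishes, bring the system back up as in step 4 of the setup; as with the initial run, wait for the auto-conversion to finish before searching:

$ docker-compose up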

Experiment Runner

The script experiment.sh is a launcher for the experiment runner, a tool that is useful for evaluating new algorithms for processing crawl data.
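Like the other helpers here, it is run from the project root; its arguments aren't documented in this readme, so check the script itself for what it expects:

$ run/experiment.sh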