CatgirlIntelligenceAgency/run/readme.md

2.5 KiB

Run

When developing locally, this directory will contain run-time data required for the search engine. In a clean check-out, it only contains the tools required to bootstrap this directory structure.

Requirements

While the system is designed to run bare metal in production, for local development, you're strongly encouraged to use docker or podman. These are a bit of a pain to install, but if you follow this guide you're on the right track.

Set up

To go from a clean check out of the git repo to a running search engine, follow these steps. You're assumed to sit in the project root the whole time.

  1. Run the one-time setup, it will create the basic runtime directory structure and download some models and data that doesn't come with the git repo because git deals poorly with large binary files.
$ run/setup.sh
  1. Compile the project and build docker images
$ ./gradlew assemble docker
  1. Initialize the database
$ docker-compose up -d mariadb
$ ./gradlew flywayMigrate
  1. Bring the system online. We'll run it in the foreground in the terminal this time because it's educational to see the logs. Add -d to run in the background.
$ docker-compose up
  1. You should now be able to access the system.
Address Description
https://localhost:8080/ User-facing GUI
https://localhost:8081/ Operator's GUI
  1. Download Sample Data

A script is available for downloading sample data. The script will download the data from https://downloads.marginalia.nu/ and extract it to the correct location.

The system will pick the data up automatically.

$ run/download-samples l

Four sets are available:

Name Description
s Small set, 1000 domains
m Medium set, 2000 domains
l Large set, 5000 domains
xl Extra large set, 50,000 domains

Warning: The XL set is intended to provide a large amount of data for setting up a pre-production environment. It may be hard to run on a smaller machine. It's barely runnable on a 32GB machine; and total processing time is around 5 hours.

The 'l' set is a good compromise between size and processing time and should work on most machines.

Experiment Runner

The script experiment.sh is a launcher for the experiment runner, which is useful when evaluating new algorithms in processing crawl data.