Viktor Lofgren cdfe284f9a (file storage) File Storage Type for EXPORT data	2023-08-05 14:45:03 +02:00
dist	(control, WIP) MQFSM and ProcessService are sitting in a tree	2023-07-11 17:08:43 +02:00
env	Fix environment variables to processes so jmc works	2023-07-31 10:32:23 +02:00
template/conf	(conf) Change default user-agent to not associate it with the project; remove unused disks.properties file.	2023-08-01 17:34:25 +02:00
test-data	Make the code run properly without WMSA_HOME set, adding missing test assets.	2023-03-05 13:47:40 +01:00
.gitignore	Restructuring the git repo	2023-03-04 13:19:01 +01:00
download-samples.sh	(scripts|docs) Update scripts and documentations for the new operator's gui and file storage workflows.	2023-08-01 22:47:37 +02:00
experiment.sh	Tell experiment runner to only process some domains.	2023-06-20 14:14:01 +02:00
nginx-site.conf	(run) Reduce nginx access log noise for local setup	2023-07-11 23:11:34 +02:00
readme.md	(scripts|docs) Update scripts and documentations for the new operator's gui and file storage workflows.	2023-08-01 22:47:37 +02:00
setup.sh	(file storage) File Storage Type for EXPORT data	2023-08-05 14:45:03 +02:00

readme.md

Run

When developing locally, this directory will contain the runtime data required for the search engine. In a clean checkout, it only contains the tools required to bootstrap this directory structure.

Requirements

While the system is designed to run on bare metal in production, for local development you're strongly encouraged to use docker or podman. These are a bit of a pain to install, but if you follow this guide you're on the right track.

Set up

To go from a clean checkout of the git repo to a running search engine, follow these steps. All commands assume you are in the project root the whole time.

Run the one-time setup. It will create the basic runtime directory structure and download some models and data that don't come with the git repo, because git deals poorly with large binary files.

$ run/setup.sh

Compile the project and build docker images

$ ./gradlew assemble docker

Initialize the database

$ docker-compose up -d mariadb
$ ./gradlew flywayMigrate

Bring the system online. We'll run it in the foreground in the terminal this time because it's educational to see the logs. Add -d to run in the background.

$ docker-compose up
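
The steps above can be chained into one script that stops at the first failure. This is a minimal sketch, not part of the repository: the function names run_step and bootstrap and the DRY_RUN variable are hypothetical, and the commands are exactly the ones listed above, run from the project root.

```shell
# run_step prints a command, then executes it unless DRY_RUN=1.
# DRY_RUN is a hypothetical variable used only by this sketch.
run_step() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# bootstrap runs the documented setup steps in order.
# The && chain stops at the first step that fails.
bootstrap() {
    run_step run/setup.sh &&
    run_step ./gradlew assemble docker &&
    run_step docker-compose up -d mariadb &&
    run_step ./gradlew flywayMigrate &&
    run_step docker-compose up
}
```

Calling bootstrap with DRY_RUN=1 only prints the five commands, which is a cheap way to check the order before running anything. The last step stays in the foreground, as in the manual instructions.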

You should now be able to access the system.

Address	Description
https://localhost:8080/	User-facing GUI
https://localhost:8081/	Operator's GUI
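
To confirm that both addresses respond, something like the following may help. It is a sketch under assumptions: curl is installed, the local setup serves a self-signed certificate (hence -k), and any 2xx or 3xx status counts as "up". The function names classify_status and check_endpoint are hypothetical.

```shell
# classify_status maps an HTTP status code to "up" or "down".
# 2xx and 3xx count as up; everything else, including the 000 that
# curl reports when the connection fails, counts as down.
classify_status() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) echo up ;;
        *) echo down ;;
    esac
}

# check_endpoint prints the URL followed by "up" or "down".
# -k skips certificate verification (assumes a self-signed local cert),
# -s silences progress output, -o /dev/null discards the response body,
# and -w '%{http_code}' prints only the status code.
check_endpoint() {
    code="$(curl -k -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")"
    echo "$1 $(classify_status "$code")"
}
```

Usage would be check_endpoint https://localhost:8080/ for the user-facing GUI and check_endpoint https://localhost:8081/ for the operator's GUI.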

Download Sample Data

A script is available for downloading sample data. The script will download the data from https://downloads.marginalia.nu/ and extract it to the correct location.

The system will pick the data up automatically.

$ run/download-samples.sh l

Four sets are available:

Name	Description
s	Small set, 1000 domains
m	Medium set, 2000 domains
l	Large set, 5000 domains
xl	Extra large set, 50,000 domains

Warning: The XL set is intended to provide a large amount of data for setting up a pre-production environment. It may be hard to run on a smaller machine: it is barely runnable on a 32GB machine, and total processing time is around 5 hours.

The 'l' set is a good compromise between size and processing time and should work on most machines.
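
A small wrapper can reject a mistyped set name before any download starts. This is a sketch only: sample_domains and download_set are hypothetical helper names, the domain counts are copied from the table above, and the script is assumed to be run/download-samples.sh invoked from the project root.

```shell
# sample_domains prints the number of domains in a named sample set,
# using the figures from the table above. Unknown names fail.
sample_domains() {
    case "$1" in
        s)  echo 1000 ;;
        m)  echo 2000 ;;
        l)  echo 5000 ;;
        xl) echo 50000 ;;
        *)  echo "unknown sample set: $1" >&2; return 1 ;;
    esac
}

# download_set validates the set name, then passes it to the download script.
download_set() {
    count="$(sample_domains "$1")" || return 1
    echo "downloading set '$1' ($count domains)"
    run/download-samples.sh "$1"
}
```

For example, download_set l would fetch the large set, while download_set xxl would fail without contacting the download server.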

Experiment Runner

The script experiment.sh is a launcher for the experiment runner, which is useful when evaluating new algorithms for processing crawl data.