Viktor Lofgren dbe9235f3a (*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.		2023-09-24 10:38:59 +02:00
dist	(control, WIP) MQFSM and ProcessService are sitting in a tree	2023-07-11 17:08:43 +02:00
env	(process) Propagate environment JVM params to the index constructor	2023-09-01 15:39:42 +02:00
template/conf	(conf) Change default user-agent to not associate it with the project; remove unused disks.properties file.	2023-08-01 17:34:25 +02:00
test-data	Make the code run properly without WMSA_HOME set, adding missing test assets.	2023-03-05 13:47:40 +01:00
.gitignore	Restructuring the git repo	2023-03-04 13:19:01 +01:00
download-samples.sh	(scripts|docs) Update scripts and documentations for the new operator's gui and file storage workflows.	2023-08-01 22:47:37 +02:00
experiment.sh	(keyword-extraction) Fix bug leading to position data missing on some keywords.	2023-09-02 14:48:55 +02:00
nginx-site.conf	(run) Reduce nginx access log noise for local setup	2023-07-11 23:11:34 +02:00
readme.md	(*) Upgrade to JDK21 with preview enabled.	2023-09-24 10:38:59 +02:00
setup.sh	(index,control) Recoverable index backups	2023-08-25 14:57:43 +02:00

readme.md

Run

When developing locally, this directory will contain run-time data required for the search engine. In a clean check-out, it only contains the tools required to bootstrap this directory structure.

Requirements

While the system is designed to run bare metal in production, for local development, you're strongly encouraged to use docker or podman. These are a bit of a pain to install, but if you follow this guide you're on the right track.

The system requires JDK21+, and uses preview features.
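Before going further, it can be worth confirming which JDK is on the PATH. The following is a minimal sketch, not part of the project's tooling: `parse_java_major` is a hypothetical helper that assumes the usual `java -version` banner format (a quoted version string on the first line).

```shell
#!/bin/sh
# Hypothetical helper: extract the major version from `java -version` output.
# Assumes the banner contains: version "<version string>"
parse_java_major() {
    # Pull out the quoted version string, e.g. 21.0.1 or 1.8.0_292
    version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    case "$version" in
        # Legacy numbering scheme: 1.8.0_292 means Java 8
        1.*) printf '%s\n' "$version" | cut -d. -f2 ;;
        # Modern scheme: 21.0.1 -> 21, 21-ea -> 21
        *)   printf '%s\n' "$version" | cut -d. -f1 | cut -d- -f1 ;;
    esac
}

if command -v java >/dev/null 2>&1; then
    # java prints its version banner on stderr
    major=$(parse_java_major "$(java -version 2>&1)")
    if [ "${major:-0}" -ge 21 ]; then
        echo "JDK $major found"
    else
        echo "JDK 21+ required, found: ${major:-unknown}"
    fi
else
    echo "java not found on PATH"
fi
```

This only checks the version number; it does not verify that preview features are enabled, which is handled by the build configuration.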

Set up

To go from a clean check-out of the git repo to a running search engine, follow these steps. You're assumed to sit in the project root the whole time.

1. Run the one-time setup. It creates the basic runtime directory structure and downloads some models and data that don't come with the git repo, because git deals poorly with large binary files.

$ run/setup.sh

2. Compile the project and build docker images

$ ./gradlew dist docker

The dist task is required for the processes to start and run.
The docker task is required for the services to start and run.

3. Initialize the database

$ docker-compose up -d mariadb
$ ./gradlew flywayMigrate
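Because `docker-compose up -d` returns as soon as the container is started, the migration may run before the database accepts connections. This is an assumption about timing, not something the project documents. If it happens, a small retry wrapper is one way around it. The `retry` function below is a hypothetical helper, not part of the repo.

```shell
#!/bin/sh
# Hypothetical helper: retry ATTEMPTS DELAY COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        echo "attempt $n failed, retrying in ${delay}s: $*" >&2
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example (commented out so the snippet has no side effects when sourced):
# docker-compose up -d mariadb
# retry 10 5 ./gradlew flywayMigrate
```

The attempt count and delay in the example are arbitrary; adjust them to how long the database takes to start on your machine.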

4. Bring the system online. We'll run it in the foreground in the terminal this time because it's educational to see the logs. Add -d to run in the background.

$ docker-compose up

5. You should now be able to access the system.

Address	Description
http://localhost:8080/	User-facing GUI
http://localhost:8081/	Operator's GUI

Note that the operator's GUI does not perform any sort of authentication. Preferably don't expose it publicly, but if you absolutely must, use a proxy or Basic Auth to add security.
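To check from the terminal that both GUIs respond, something like the following sketch can be used. It assumes `curl` is installed and that both addresses answer with a non-error HTTP status at the root path. `check_endpoint` and `check_all` are hypothetical helper names, and the `CURL` variable exists only so the command can be swapped out.

```shell
#!/bin/sh
# CURL can be overridden (for example with a stub); defaults to the real curl.
CURL=${CURL:-curl}

# check_endpoint URL: succeed if URL answers without an HTTP error within 5 seconds.
check_endpoint() {
    if $CURL --silent --fail --max-time 5 --output /dev/null "$1"; then
        echo "up:   $1"
    else
        echo "down: $1"
        return 1
    fi
}

# check_all: probe both addresses from the table; fail if either is down.
check_all() {
    status=0
    check_endpoint http://localhost:8080/ || status=1
    check_endpoint http://localhost:8081/ || status=1
    return $status
}

# Example: check_all && echo "system is online"
```

The services can take a while to start, so a "down" result right after `docker-compose up` does not necessarily indicate a problem.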

6. Download Sample Data

A script is available for downloading sample data. The script will download the data from https://downloads.marginalia.nu/ and extract it to the correct location.

The system will pick the data up automatically.

$ run/download-samples.sh l

Four sets are available:

Name	Description
s	Small set, 1000 domains
m	Medium set, 2000 domains
l	Large set, 5000 domains
xl	Extra large set, 50,000 domains

Warning: The XL set is intended to provide a large amount of data for setting up a pre-production environment. It may be hard to run on a smaller machine. It's barely runnable on a 32GB machine, and total processing time is around 5 hours.

The 'l' set is a good compromise between size and processing time and should work on most machines.
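If the download is scripted, it can help to validate the set name before starting a long transfer. The sketch below is illustrative only: `sample_set_domains` and `download_set` are hypothetical wrappers, the domain counts are copied from the table above, and the script path assumes the `download-samples.sh` file in the `run` directory.

```shell
#!/bin/sh
# sample_set_domains NAME: print the domain count for a sample set,
# or fail if NAME is not one of the four documented sets.
sample_set_domains() {
    case "$1" in
        s)  echo 1000 ;;
        m)  echo 2000 ;;
        l)  echo 5000 ;;
        xl) echo 50000 ;;
        *)  echo "unknown sample set: '$1' (expected s, m, l or xl)" >&2
            return 1 ;;
    esac
}

# download_set NAME: validate NAME, then hand it to the download script.
# Must be run from the project root.
download_set() {
    domains=$(sample_set_domains "$1") || return 1
    echo "downloading set '$1' ($domains domains)"
    run/download-samples.sh "$1"
}

# Example: download_set l
```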

7. Process the data

Bring the system online if it isn't (see step 4), then go to the operator's GUI (see step 5).

Go to Storage
Go to Crawl Data
Find the data set you want to process and click [Info]
Click [Process and load]

This will take anywhere from a few minutes to a few hours, depending on which data set you downloaded. You can monitor the progress from the Overview tab under Processes.

First the CONVERTER is expected to run; this will process the data into a format that can easily be inserted into the database and index.

Next the LOADER will run; this will insert the data into the database and index.

Next the link database will repartition itself, and finally the index will be reconstructed. You can view the progress of these steps in the Jobs listing.

8. Run the system

Once all this is done, you can go to the user-facing GUI (see step 5) and try a search.

Important! Use the 'No Ranking' option when running locally, since you'll very likely not have enough links for the ranking algorithm to perform well.

Experiment Runner

The script experiment.sh is a launcher for the experiment runner, which is useful when evaluating new algorithms in processing crawl data.