# Run

When developing locally, this directory will contain run-time data required for
the search engine. In a clean check-out, it only contains the tools required to
bootstrap this directory structure.

## Requirements

**Docker** - It is a bit of a pain to install, but if you follow
[this guide](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository) you're on the right track for Ubuntu-like systems.
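
Once installed, a quick sanity check along these lines can confirm the tools are in place (a sketch; it prints a status line per tool rather than failing outright):

```shell
# Sanity check: confirm Docker and Compose are installed and the daemon is reachable.
check_cmd() {
  if command -v "$1" > /dev/null; then
    echo "$1: found"
  else
    echo "$1: MISSING"
  fi
}
check_cmd docker
check_cmd docker-compose
docker info > /dev/null 2>&1 && echo "Docker daemon is reachable" || echo "Docker daemon not reachable"
```

If the daemon check fails while the binary is found, you may need to add your user to the `docker` group or run via `sudo`.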

**JDK 21** - The code uses Java 21 preview features.

The civilized way of installing this is to use [SDKMAN](https://sdkman.io/);
graalce is a good distribution choice, but it doesn't matter too much.
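
With SDKMAN installed, the steps are roughly as follows (the version identifier below is only an example; list the currently available ones first):

```shell
# List available Java distributions, then install and select a Java 21 build.
# "21.0.2-graalce" is an illustrative identifier; pick a current one from the list.
sdk list java
sdk install java 21.0.2-graalce
sdk use java 21.0.2-graalce
java --version
```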

## Set up

To go from a clean check-out of the git repo to a running search engine,
follow these steps. This assumes a test deployment. For a production-like
setup... (TODO: write a guide for this).

You're assumed to sit in the project root the whole time.

### 1. Run the one-time setup

This will create the basic runtime directory structure and download some models and
data that don't come with the git repo, because git deals poorly with large binary files.

```shell
$ run/setup.sh
```

### 2. Compile the project and build docker images

```shell
$ ./gradlew docker
```

### 3. Initialize the database

Before the system can be brought online, the database needs to be initialized. To do this,
bring up the database in the background, and run the Flyway migration tool:

```shell
$ docker-compose up -d mariadb
$ ./gradlew flywayMigrate
```

### 4. Bring the system online

We'll run it in the foreground in the terminal this time, because it's educational to see the logs.
Add `-d` to run in the background.

```shell
$ docker-compose up
```

There are two docker-compose files available, `docker-compose.yml` and `docker-compose-barebones.yml`.
The latter is a stripped-down version that only runs the bare minimum required by the system,
e.g. for running a whitelabel version. The former is the full system, with all the frills of
Marginalia Search, and is the one used by default.

To start the barebones version, run:

```shell
$ docker-compose -f docker-compose-barebones.yml up
```

### 5. You should now be able to access the system

By default, the docker-compose file publishes the following ports:

| Address                | Description     |
|------------------------|-----------------|
| http://localhost:8080/ | User-facing GUI |
| http://localhost:8081/ | Operator's GUI  |
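
A quick reachability check might look like this (a sketch; it assumes the stack from step 4 is running, and prints a status line instead of failing):

```shell
# Probe the published GUIs; -sf makes curl silent and fail on HTTP errors.
check_port() {
  if curl -sf -o /dev/null "http://localhost:$1/"; then
    echo "port $1: up"
  else
    echo "port $1: not responding"
  fi
}
check_port 8080   # user-facing GUI
check_port 8081   # operator's GUI
```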

Note that the operator's GUI does not perform any sort of authentication.
Preferably don't expose it publicly, but if you absolutely must, use a proxy or
Basic Auth to add security.
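
As an illustration of the proxy approach, a minimal nginx fragment with Basic Auth in front of the operator's GUI could look like this (the hostname and credentials path are hypothetical; create the credentials file first, e.g. with `htpasswd -c /etc/nginx/.htpasswd operator`):

```nginx
# Hypothetical nginx reverse proxy adding Basic Auth to the operator's GUI.
server {
    listen 80;
    server_name operator.example.com;   # placeholder hostname

    location / {
        auth_basic           "Operator GUI";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://localhost:8081/;
    }
}
```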

### 6. Download Sample Data

A script is available for downloading sample data. The script will download the
data from https://downloads.marginalia.nu/ and extract it to the correct location.

The system will pick the data up automatically.

```shell
$ run/download-samples.sh l
```

Four sets are available:

| Name | Description                     |
|------|---------------------------------|
| s    | Small set, 1000 domains         |
| m    | Medium set, 2000 domains        |
| l    | Large set, 5000 domains         |
| xl   | Extra large set, 50,000 domains |

Warning: The XL set is intended to provide a large amount of data for
setting up a pre-production environment. It may be hard to run on a smaller
machine, and on most machines it will take several hours to process.

The 'm' or 'l' sets are a good compromise between size and processing time,
and should work on most machines.

### 7. Process the data

Bring the system online if it isn't already (see step 4), then go to the operator's
GUI (see step 5).

* Go to `Node 1 -> Storage -> Crawl Data`
* Hit the toggle to set your crawl data to be active
* Go to `Actions -> Process Crawl Data -> [Trigger Reprocessing]`

This will take anywhere from a few minutes to a few hours, depending on which
data set you downloaded. You can monitor the progress from the `Overview` tab.

First the CONVERTER is expected to run; this will process the data into a format
that can easily be inserted into the database and index.

Next the LOADER will run; this will insert the data into the database and index.

Then the link database will repartition itself, and finally the index will be
reconstructed. You can view the progress of these steps in the `Jobs` listing.

### 8. Run the system

Once all this is done, you can go to the user-facing GUI (see step 5) and try
a search.

Important! Use the 'No Ranking' option when running locally, since you'll very
likely not have enough links for the ranking algorithm to perform well.
2023-03-28 16:58:46 +02:00
|
|
|
## Experiment Runner
|
|
|
|
|
|
|
|
The script `experiment.sh` is a launcher for the experiment runner, which is useful when
|
2023-08-12 15:39:28 +02:00
|
|
|
evaluating new algorithms in processing crawl data.
|