# Run

When developing locally, this directory will contain run-time data required for
the search engine. In a clean check-out, it only contains the tools required to 
bootstrap this directory structure.

## Requirements

While the system is designed to run on bare metal in production,
for local development you're strongly encouraged to use Docker
or Podman. These are a bit of a pain to install, but following
[this guide](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository) for Ubuntu will put you on the right track.

The system requires JDK 21 or later, and uses Java 21 preview features. Gradle complains
a bit about this since Java 21 is not currently supported by it, but the build works anyway.
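
As a quick pre-flight check, the major version can be read from the first line of `java -version`. The sketch below is not part of the repo: `java_major` is a hypothetical helper name, and it assumes the usual `version "..."` banner appears on the first line of output.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): print the major version found in
# a `java -version` banner line such as: openjdk version "21.0.1" 2023-10-17
java_major() {
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    case "$ver" in
        1.*) ver=${ver#1.} ;;   # legacy scheme: "1.8.0_292" means Java 8
    esac
    # Drop everything from the first '.', '+', '_' or '-' onwards.
    printf '%s\n' "${ver%%[.+_-]*}"
}

# Example use (java prints its banner on stderr):
#   major=$(java_major "$(java -version 2>&1 | head -n 1)")
#   [ "${major:-0}" -ge 21 ] || echo "JDK 21 or later is required" >&2
```

The parsing is kept separate from the call to `java` so that it can be tried on a pasted banner line without a JDK installed.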

## Set up

To go from a clean checkout of the git repo to a running search engine,
follow these steps.  This assumes a test deployment.  For a production-like
setup... (TODO: write a guide for this).

All commands below assume you are in the project root.

### 1. Run the one-time setup

This creates the basic runtime directory structure and downloads some models and
data that don't come with the git repo, because git deals poorly with large binary files.

```shell
$ run/setup.sh
```

### 2. Compile the project and build docker images

```shell
$ ./gradlew docker
```

### 3. Initialize the database

Before the system can be brought online, the database needs to be initialized.  To do this,
bring up the database in the background, and run the flyway migration tool.

```shell
$ docker-compose up -d mariadb
$ ./gradlew flywayMigrate
```
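
`docker-compose up -d` returns once the container has been started, so the database may not be accepting connections yet when the migration runs. If `flywayMigrate` fails on the first attempt, retrying after a short pause is one workaround. The `retry` function below is a hypothetical helper, not something the repo provides, and the attempt count and delay in the example are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): run a command up to $1 times,
# sleeping $2 seconds between failed attempts.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example use, from the project root:
#   docker-compose up -d mariadb
#   retry 10 3 ./gradlew flywayMigrate
```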

### 4. Bring the system online

We'll run it in the foreground in the terminal this time because it's educational to see the logs. 
Add `-d` to run in the background.

```shell
$ docker-compose up
```

### 5. Access the system

By default, the docker-compose file publishes the following ports:

| Address                 | Description      |
|-------------------------|------------------|
| http://localhost:8080/ | User-facing GUI  |
| http://localhost:8081/ | Operator's GUI   |

Note that the operator's GUI does not perform any sort of authentication.  
Preferably don't expose it publicly, but if you absolutely must, use a proxy or 
Basic Auth to add security.
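
To confirm from the shell that both ports are reachable, something like the following can be used. This is only a sketch: `probe` is a hypothetical helper, it relies on `curl` being installed, and it checks that something answers HTTP at the address, not that the response is healthy.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): print "up" if anything answers
# HTTP at the given URL within 5 seconds, "down" otherwise.
probe() {
    if curl -s -o /dev/null --max-time 5 "$1"; then
        echo "up"
    else
        echo "down"
    fi
}

# Example use with the default ports from the table above:
#   for url in http://localhost:8080/ http://localhost:8081/; do
#       echo "$url $(probe "$url")"
#   done
```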

### 6. Download sample data

A script is available for downloading sample data. The script will download the
data from https://downloads.marginalia.nu/ and extract it to the correct location.

The system will pick the data up automatically.

```shell
$ run/download-samples.sh l
```

Four sets are available:

| Name | Description                     |
|------|---------------------------------|
| s    | Small set, 1000 domains         |
| m    | Medium set, 2000 domains        |
| l    | Large set, 5000 domains         |
| xl   | Extra large set, 50,000 domains |

Warning: The 'xl' set is intended to provide a large amount of data for
setting up a pre-production environment. It may be hard to run on a smaller
machine, and on most machines it will take several hours to process.

The 'm' or 'l' sets are a good compromise between size and processing time 
and should work on most machines.
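
When scripting the download, it can help to validate the set name before invoking the script. The wrapper below is a sketch: `sample_set_domains` is a hypothetical helper whose numbers simply mirror the table above, and how `run/download-samples.sh` itself reacts to an unknown name is not covered here.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): print the approximate number of
# domains in a sample set, or fail for a name that is not in the table.
sample_set_domains() {
    case "$1" in
        s)  echo 1000 ;;
        m)  echo 2000 ;;
        l)  echo 5000 ;;
        xl) echo 50000 ;;
        *)  echo "unknown sample set: $1 (expected s, m, l or xl)" >&2
            return 1 ;;
    esac
}

# Example use, from the project root:
#   set_name=m
#   if count=$(sample_set_domains "$set_name"); then
#       echo "downloading the '$set_name' set ($count domains)"
#       run/download-samples.sh "$set_name"
#   fi
```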

### 7. Process the data

Bring the system online if it isn't already (see step 4), then go to the operator's
GUI (see step 5).

* Go to `Node 1 -> Storage -> Crawl Data`
* Hit the toggle to set your crawl data to be active
* Go to `Actions -> Process Crawl Data -> [Trigger Reprocessing]`

This will take anywhere from a few minutes to a few hours depending on which
data set you downloaded.  You can monitor the progress from the `Overview` tab.

First the CONVERTER is expected to run; this will process the data into a format 
that can easily be inserted into the database and index.

Next the LOADER will run; this will insert the data into the database and index.

Next the link database will repartition itself, and finally the index will be
reconstructed.  You can view the progress of these steps in the `Jobs` listing.

### 8. Run the system

Once all this is done, you can go to the user-facing GUI (see step 5) and try
a search.  

Important! Use the 'No Ranking' option when running locally, since you'll very
likely not have enough links for the ranking algorithm to perform well.

## Experiment Runner

The script `experiment.sh` is a launcher for the experiment runner, which is useful when
evaluating new algorithms for processing crawl data.