(doc) Migrate documentation https://docs.marginalia.nu/

This commit is contained in:
Viktor Lofgren 2024-01-22 19:40:08 +01:00
parent a6d257df5b
commit 562012fb22
19 changed files with 46 additions and 1026 deletions

View File

@ -13,12 +13,21 @@ The long term plan is to refine the search engine so that it provide enough publ
that the project can be funded through grants, donations and commercial API licenses
(non-commercial share-alike is always free).
The system can be run either as a copy of Marginalia Search, or as a white-label search engine
for your own data (either crawled or side-loaded). At present the logic isn't very configurable, and a lot of the judgements
made are based on the Marginalia project's goals, but additional configurability is being
worked on!
## Set up
Further documentation is available at [🌎 https://docs.marginalia.nu/](https://docs.marginalia.nu/).
Before compiling, it's necessary to run [⚙️ run/setup.sh](run/setup.sh).
This will download supplementary model data that is necessary to run the code.
These are also necessary to run the tests.
To set up a local test environment, follow the instructions in [📄 run/readme.md](run/readme.md)!
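For example, a minimal local bootstrap might look like this (the Gradle task is only an illustration; run whatever build or test tasks you need):

```shell
$ run/setup.sh   # downloads the supplementary model data
$ ./gradlew test # the model data is also required by the test suite
```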
## Hardware Requirements

View File

@ -1,115 +0,0 @@
# Crawling
## WARNING
Please don't run the crawler unless you intend to actually operate a public-facing
search engine! For testing, use crawl sets from [downloads.marginalia.nu](https://downloads.marginalia.nu/) instead;
or if you wish to play with the crawler, crawl a small set of domains from people who are
ok with it: use your own, your friends', or any subdomain of marginalia.nu.
See the documentation in run/ for more information on how to load sample data!
Reckless crawling annoys webmasters and makes it harder to run an independent search engine.
Crawling from a domestic IP address is also likely to put you on a greylist
of probable bots. You will solve CAPTCHAs for almost every website you visit
for weeks, and may be permanently blocked from a few IPs.
## Prerequisites
You probably want to run a local bind resolver to speed up DNS lookups and reduce the amount of
DNS traffic.
These processes require a lot of disk space. It's strongly recommended to use a dedicated disk for
the index storage subdirectory; it doesn't need to be extremely fast, but it should be a few terabytes in size.
It should be mounted with `noatime`. It may be a good idea to format the disk with a block size of 4096 bytes, which will reduce the amount of disk space used by the crawler.
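As a rough sketch (assuming ext4, a device named `/dev/sdXN`, and a placeholder mount point; adjust all of these to your own setup):

```shell
# format with a 4096-byte block size, then mount with noatime
$ mkfs.ext4 -b 4096 /dev/sdXN
$ mount -o noatime /dev/sdXN /path/to/index-storage
```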
Make sure you configure the user-agent properly. This will be used to identify the crawler,
and is matched against the robots.txt file. The crawler will not crawl sites that don't allow it.
See [wiki://Robots_exclusion_standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard) for more information
about robots.txt. The user agent can be configured in `conf/properties/system.properties`; see the
[system-properties](system-properties.md) documentation for details.
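A sketch of the relevant entries in `conf/properties/system.properties` (the values are placeholders; the property names are documented in [system-properties](system-properties.md)):

```properties
# full user-agent string sent with each request
crawler.userAgentString=example-search-crawler (+https://search.example.com/about)
# identifier matched against robots.txt rules
crawler.userAgentIdentifier=example-search-crawler
```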
## Setup
Ensure that the system is running, and go to http://localhost:8081.
With the default test configuration, the system is configured to
store data in `node-1/storage`.
## Fresh Crawl
While a running search engine can use the link database to figure out which websites to visit, a clean
system does not know of any links. To bootstrap a crawl, a crawl specification needs to be created to
seed the domain database.
Go to `Nodes->Node 1->Actions->New Crawl`
![img](images/new_crawl.png)
Click the link that says 'New Spec' to arrive at a form for creating a new specification:
![img](images/new_spec.png)
Fill out the form with a description and a link to a domain list. The domain list is a text file
with one domain per line; blank lines and comments starting with `#` are ignored. You can use
GitHub raw links for this purpose. For test purposes, you can use this link:
`https://downloads.marginalia.nu/domain-list-test.txt`, which will create a crawl for a few
of marginalia.nu's subdomains.
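For reference, a hand-written domain list might look like this (hypothetical contents):

```
# comments and blank lines are ignored

www.marginalia.nu
memex.marginalia.nu
search.marginalia.nu
```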
If you aren't redirected there automatically after submitting the form, go back to the `New Crawl` page under Node 1 -> Actions.
Your new specification should now be listed.
Check the box next to it, and click `[Trigger New Crawl]`.
![img](images/new_crawl2.png)
This will start the crawling process. Crawling may take a while, depending on the size
of the domain list and the size of the websites.
![img](images/crawl_in_progress.png)
Eventually a progress bar will show up, and the crawl will start. When it reaches 100%, the crawl is done.
You can also monitor the `Events Summary` table on the same page to see what happened after the fact.
It is expected that the crawl will stall out toward the end of the process; this is a statistical effect, since
the largest websites take the longest to finish and tend to be the ones lingering at 99% or so completion. The
crawler has a timeout of 5 hours: if no new domains finish crawling within that window, it will stop, to prevent
crawler traps from stalling the crawl indefinitely.
**Be sure to read the section on re-crawling!**
## Converting
Once the crawl is done, the data needs to be processed before it's searchable. This is done by going to
`Nodes->Node 1->Actions->Process Crawl Data`.
![Conversion screenshot](images/convert.png)
This will start the conversion process, which will again take a while depending on the size of the crawl.
A progress bar shows the progress; when it reaches 100%, the conversion is done, and the data will begin
loading automatically. A cascade of actions is performed in sequence, leading to the data being loaded into the
search engine and an index being constructed. This is all automatic, but depending on the size of the crawl data,
it may take a while.
When an event `INDEX-SWITCH-OK` is logged in the `Event Summary` table, the data is ready to be searched.
## Re-crawling
The crawl-spec workflow above is a one-off process to bootstrap the search engine. To keep the search engine up to date,
it is preferable to do a re-crawl, which tries to reduce the amount of data that needs to be fetched.
To trigger a Recrawl, go to `Nodes->Node 1->Actions->Re-crawl`. This will bring you to a page that looks similar to the
first crawl page, where you can select a set of crawl data to use as a source. Select the crawl data you want, and
press `[Trigger Recrawl]`.
Crawling will proceed as before, but this time the crawler will try to fetch only the data that has changed since the
last crawl, while growing the number of documents per domain by a configurable percentage (see `crawler.crawlSetGrowthFactor`
in [system-properties](system-properties.md)). This will typically be much faster than the initial crawl.
### Growing the crawl set
The re-crawl will also pull new domains from the `New Domains` dataset, which is a URL configurable under
`[Top Menu] -> System -> Data Sets`. If a new domain is found, it will be assigned to the present node and crawled during
the re-crawl.
![Datasets screenshot](images/datasets.png)

(Ten binary image files removed: the screenshots referenced by the deleted documentation pages.)

View File

@ -3,12 +3,11 @@
A lot of the architectural description is sprinkled into the code repository closer to the code.
Start in [📁 ../code/](../code/) and poke around.
Operational documentation is available at [🌎 https://docs.marginalia.nu/](https://docs.marginalia.nu/).
## Operations
* [System Properties](system-properties.md) - JVM property flags
## How-To
* [Sideloading How-To](sideloading-howto.md) - How to sideload various data sets
* [Parquet How-To](parquet-howto.md) - Useful tips in working with Parquet files
## Set-up

View File

@ -1,211 +0,0 @@
# Sideloading How-To
Some websites are much larger than others; these include
Wikipedia, Stack Overflow, and a few others. They are so
large that they are impractical to crawl in the traditional fashion,
but luckily they make data dumps available that can be processed
and loaded into the search engine through other means.
To this end, it's possible to sideload data into the search engine
from other sources than the web crawler.
## Index Nodes
In practice, if you want to sideload data, you need to do it on
a separate index node. Index nodes are separate instances of the
index software. The default configuration is to have two index nodes,
one for the web crawler, and one for sideloaded data.
The need for a separate node is due to incompatibilities in the workflows.
It is also a good idea in general, as very large domains can easily be so large that the entire time budget
for a query is spent sifting through documents from that one domain. This is
especially true of something like Wikipedia, which has a lot of documents at
least tangentially related to any given topic.
This how-to assumes that you are operating on index-node 2.
## Notes on the upload directory
This is written assuming that the system is installed with the `install.sh`
script, which deploys the system with docker-compose, and has a directory
structure like
```
...
index-1/backup/
index-1/index/
index-1/storage/
index-1/uploads/
index-1/work/
index-2/backup/
index-2/index/
index-2/storage/
index-2/uploads/
index-2/work/
...
```
We're going to be putting files in the **uploads** directories. If you have installed
the system in some other way, or changed the configuration significantly, you need
to adjust the paths accordingly.
## Sideloading
The sideloading actions are available through the Actions menu of each node.
![Sideload menu](images/sideload_menu.png)
## Sideloading WARCs
WARC files are the standard format for web archives. They can be created with e.g. wget.
The Marginalia software can read WARC files directly and sideload them into the index,
as long as each WARC file contains only one domain.
Let's for example archive www.marginalia.nu (I own this domain, so feel free to try this at home)
```bash
$ wget -r --warc-file=marginalia www.marginalia.nu
```
**Note** If you intend to do this on other websites, you should probably add a `--wait` parameter to wget,
e.g. `wget --wait=1 -r --warc-file=...` to avoid hammering the website with requests and getting blocked.
This will take a moment, and create a file called `marginalia.warc.gz`. We move it to the
upload directory of the index node, and sideload it through the Actions menu.
```bash
$ mkdir -p index-2/uploads/marginalia-warc
$ mv marginalia.warc.gz index-2/uploads/marginalia-warc
```
Go to the Actions menu, and select the "Sideload WARC" action. This will show a list of
subdirectories in the Uploads directory. Select the directory containing the WARC file, and
click "Sideload".
![Sideload WARC screenshot](images/sideload_warc.png)
This should take you to the node overview, where you can see the progress of the sideloading.
It will take a moment, as the WARC file is being processed.
![Processing in progress](images/convert_2.png)
It will not be loaded automatically. This is to permit you to sideload multiple sources.
When you are ready to load it, go to the Actions menu, and select "Load Crawl Data".
![Load Crawl Data](images/load_warc.png)
Select all the sources you want to load, and click "Load". This will load the data into the
index, and make it available for searching.
## Sideloading Wikipedia
Due to licensing incompatibilities with OpenZim's GPL-2 and AGPL, the workflow
depends on using the conversion process from [https://encyclopedia.marginalia.nu/](https://encyclopedia.marginalia.nu/)
to pre-digest the data.
Build the [encyclopedia.marginalia.nu Code](https://github.com/MarginaliaSearch/encyclopedia.marginalia.nu)
and follow the instructions for downloading a ZIM file, and then run something like
```bash
$ ./encyclopedia convert file.zim articles.db
```
This db-file can be processed and loaded into the search engine through the
Actions view.
FIXME: It will currently only point to en.wikipedia.org; this should be made configurable.
## Sideloading a directory tree
For relatively small websites, ad-hoc side-loading is available directly from a
folder structure on the hard drive. This is intended for loading manuals,
documentation and similar data sets that are large and slowly changing.
A website can be archived with wget, like this
```bash
UA="search.marginalia.nu" \
DOMAIN="www.example.com" \
wget -nc -x --continue -w 1 -r -U ${UA} -A "html" ${DOMAIN}
```
After doing this to a bunch of websites, create a YAML file something like this:
```yaml
sources:
- name: jdk-20
dir: "jdk-20/"
domainName: "docs.oracle.com"
baseUrl: "https://docs.oracle.com/en/java/javase/20/docs"
keywords:
- "java"
- "docs"
- "documentation"
- "javadoc"
- name: python3
dir: "python-3.11.5/"
domainName: "docs.python.org"
baseUrl: "https://docs.python.org/3/"
keywords:
- "python"
- "docs"
- "documentation"
- name: mariadb.com
dir: "mariadb.com/"
domainName: "mariadb.com"
baseUrl: "https://mariadb.com/"
keywords:
- "sql"
- "docs"
- "mariadb"
- "mysql"
```
|parameter|description|
|----|----|
|name|Purely informative|
|dir|Path of website contents relative to the location of the yaml file|
|domainName|The domain name of the website|
|baseUrl|This URL will be prefixed to the contents of `dir`|
|keywords|These supplemental keywords will be injected in each document|
The directory structure corresponding to the above might look like
```
docs-index.yaml
jdk-20/
jdk-20/resources/
jdk-20/api/
jdk-20/api/[...]
jdk-20/specs/
jdk-20/specs/[...]
jdk-20/index.html
mariadb.com
mariadb.com/kb/
mariadb.com/kb/[...]
python-3.11.5
python-3.11.5/genindex-B.html
python-3.11.5/library/
python-3.11.5/distutils/
python-3.11.5/[...]
[...]
```
This yaml-file can be processed and loaded into the search engine through the
Actions view.
## Sideloading Stack Overflow/Stackexchange
Stackexchange makes dumps available on Archive.org. These are unfortunately in a format that
needs some heavy-handed pre-processing before they can be loaded. A tool is available for
this in [tools/stackexchange-converter](../code/tools/stackexchange-converter).
After running `gradlew dist`, the tool is found in `build/dist/stackexchange-converter`.
Follow the instructions in the stackexchange-converter readme to convert the
stackexchange xml.7z files to sqlite db-files.
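A sketch of the build step (the exact conversion command is described in the converter's own readme and is not reproduced here):

```shell
$ ./gradlew dist
$ ls build/dist/stackexchange-converter   # the converter is unpacked here
```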
A directory with such db-files can be processed and loaded into the
search engine through the Actions view.

View File

@ -1,42 +0,0 @@
# System Properties
These are JVM system properties used by each service. These properties can either
be loaded from a file or passed in as command line arguments, using `$JAVA_OPTS`.
The system will look for a properties file in `conf/properties/system.properties`,
within the install dir, as specified by `$WMSA_HOME`.
A template is available in [../run/template/conf/properties/system.properties](../run/template/conf/properties/system.properties).
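For illustration (the flags and values below are arbitrary examples taken from the tables that follow), a property can either be set in the properties file or passed as a `-D` argument through `$JAVA_OPTS`:

```shell
# as JVM arguments:
JAVA_OPTS="-Dcrawler.poolSize=256 -Dflyway.disable=true"

# or equivalently, in conf/properties/system.properties:
#   crawler.poolSize=256
#   flyway.disable=true
```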
## Global
| flag | values | description |
|-------------|------------|--------------------------------------|
| blacklist.disable | boolean | Disables the IP blacklist |
| flyway.disable | boolean | Disables automatic Flyway migrations |
## Crawler Properties
| flag | values | description |
|------------------------------|------------|---------------------------------------------------------------------------------------------|
| crawler.userAgentString | string | Sets the user agent string used by the crawler |
| crawler.userAgentIdentifier | string | Sets the user agent identifier used by the crawler, e.g. what it looks for in robots.txt |
| crawler.poolSize | integer | Sets the number of threads used by the crawler, more is faster, but uses more RAM |
| crawler.initialUrlsPerDomain | integer | Sets the initial number of URLs to crawl per domain (when crawling from spec) |
| crawler.maxUrlsPerDomain | integer | Sets the maximum number of URLs to crawl per domain (when recrawling) |
| crawler.minUrlsPerDomain | integer | Sets the minimum number of URLs to crawl per domain (when recrawling) |
| crawler.crawlSetGrowthFactor | double | If 100 documents were fetched last crawl, increase the goal to 100 x (this value) this time |
| ip-blocklist.disabled | boolean | Disables the IP blocklist |
## Converter Properties
| flag | values | description |
|-----------------------------|------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| converter.sideloadThreshold | integer | Threshold value, in number of documents per domain, where a simpler processing method is used which uses less RAM. 10,000 is a good value for ~32GB RAM |
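For example, on a machine with roughly 32 GB of RAM the properties file might contain (illustrative value taken from the description above):

```properties
converter.sideloadThreshold=10000
```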
## Marginalia Application Specific
| flag | values | description |
|---------------------------|------------|---------------------------------------------------------------|
| search.websiteUrl | string | Overrides the website URL used in rendering |
| control.hideMarginaliaApp | boolean | Hides the Marginalia application from the control GUI results |

View File

@ -1,181 +0,0 @@
# This is the barebones docker-compose file for the Marginalia Search Engine.
#
# It starts a stripped-down version of the search engine, with only the essential
# services running, including the database, the query service, the control service,
# and a single index and executor node.
#
# It is a good starting point for setting up a white-label search engine that does not
# have Marginalia's GUI. The Query Service presents a simple search box, that also talks
# JSON, so you can use it as a backend for your own search interface.
x-svc: &service
env_file:
- "run/env/service.env"
volumes:
- conf:/wmsa/conf:ro
- model:/wmsa/model
- data:/wmsa/data
- logs:/var/log/wmsa
networks:
- wmsa
depends_on:
- mariadb
labels:
- "__meta_docker_port_private=7000"
x-p1: &partition-1
env_file:
- "run/env/service.env"
volumes:
- conf:/wmsa/conf:ro
- model:/wmsa/model
- data:/wmsa/data
- logs:/var/log/wmsa
- index-1:/idx
- work-1:/work
- backup-1:/backup
- samples-1:/storage
- uploads-1:/uploads
networks:
- wmsa
depends_on:
- mariadb
environment:
- "WMSA_SERVICE_NODE=1"
services:
index-service-1:
<<: *partition-1
image: "marginalia/index-service"
container_name: "index-service-1"
executor-service-1:
<<: *partition-1
image: "marginalia/executor-service"
container_name: "executor-service-1"
query-service:
<<: *service
image: "marginalia/query-service"
container_name: "query-service"
expose:
- 80
labels:
- "traefik.enable=true"
- "traefik.http.routers.search-service.rule=PathPrefix(`/`)"
- "traefik.http.routers.search-service.entrypoints=search"
- "traefik.http.routers.search-service.middlewares=add-xpublic"
- "traefik.http.routers.search-service.middlewares=add-public"
- "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
- "traefik.http.middlewares.add-public.addprefix.prefix=/public"
control-service:
<<: *service
image: "marginalia/control-service"
container_name: "control-service"
expose:
- 80
labels:
- "traefik.enable=true"
- "traefik.http.routers.control-service.rule=PathPrefix(`/`)"
- "traefik.http.routers.control-service.entrypoints=control"
- "traefik.http.routers.control-service.middlewares=add-xpublic"
- "traefik.http.routers.control-service.middlewares=add-public"
- "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
- "traefik.http.middlewares.add-public.addprefix.prefix=/public"
mariadb:
image: "mariadb:lts"
container_name: "mariadb"
env_file: "run/env/mariadb.env"
command: ['mysqld', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci']
ports:
- "127.0.0.1:3306:3306/tcp"
healthcheck:
test: mysqladmin ping -h 127.0.0.1 -u $$MARIADB_USER --password=$$MARIADB_PASSWORD
start_period: 5s
interval: 5s
timeout: 5s
retries: 60
volumes:
- db:/var/lib/mysql
- "./code/common/db/src/main/resources/sql/current/:/docker-entrypoint-initdb.d/"
networks:
- wmsa
traefik:
image: "traefik:v2.10"
container_name: "traefik"
command:
#- "--log.level=DEBUG"
- "--api.insecure=true"
- "--providers.docker=true"
- "--providers.docker.exposedbydefault=false"
- "--entrypoints.search.address=:80"
- "--entrypoints.control.address=:81"
ports:
- "127.0.0.1:8080:80"
- "127.0.0.1:8081:81"
- "127.0.0.1:8090:8080"
volumes:
- "/var/run/docker.sock:/var/run/docker.sock:ro"
networks:
- wmsa
networks:
wmsa:
volumes:
db:
driver: local
driver_opts:
type: none
o: bind
device: run/db
logs:
driver: local
driver_opts:
type: none
o: bind
device: run/logs
model:
driver: local
driver_opts:
type: none
o: bind
device: run/model
conf:
driver: local
driver_opts:
type: none
o: bind
device: run/conf
data:
driver: local
driver_opts:
type: none
o: bind
device: run/data
samples-1:
driver: local
driver_opts:
type: none
o: bind
device: run/node-1/samples
index-1:
driver: local
driver_opts:
type: none
o: bind
device: run/node-1/index
work-1:
driver: local
driver_opts:
type: none
o: bind
device: run/node-1/work
backup-1:
driver: local
driver_opts:
type: none
o: bind
device: run/node-1/backup
uploads-1:
driver: local
driver_opts:
type: none
o: bind
device: run/node-1/uploads

View File

@ -1,315 +0,0 @@
# This is the full docker-compose.yml file for the Marginalia Search Engine.
#
# It starts all the services, including the GUI, the database, the query service,
# two nodes for demo purposes, as well as a bunch of peripheral services that are
# application specific.
#
x-svc: &service
env_file:
- "run/env/service.env"
volumes:
- conf:/wmsa/conf:ro
- model:/wmsa/model
- data:/wmsa/data
- logs:/var/log/wmsa
networks:
- wmsa
labels:
- "__meta_docker_port_private=7000"
x-p1: &partition-1
env_file:
- "run/env/service.env"
volumes:
- conf:/wmsa/conf:ro
- model:/wmsa/model
- data:/wmsa/data
- logs:/var/log/wmsa
- index-1:/idx
- work-1:/work
- backup-1:/backup
- samples-1:/storage
- uploads-1:/uploads
networks:
- wmsa
depends_on:
- mariadb
environment:
- "WMSA_SERVICE_NODE=1"
x-p2: &partition-2
env_file:
- "run/env/service.env"
volumes:
- conf:/wmsa/conf:ro
- model:/wmsa/model
- data:/wmsa/data
- logs:/var/log/wmsa
- index-2:/idx
- work-2:/work
- backup-2:/backup
- samples-2:/storage
- uploads-2:/uploads
networks:
- wmsa
depends_on:
mariadb:
condition: service_healthy
environment:
- "WMSA_SERVICE_NODE=2"
services:
index-service-1:
<<: *partition-1
image: "marginalia/index-service"
container_name: "index-service-1"
executor-service-1:
<<: *partition-1
image: "marginalia/executor-service"
container_name: "executor-service-1"
index-service-2:
<<: *partition-2
image: "marginalia/index-service"
container_name: "index-service-2"
executor-service-2:
<<: *partition-2
image: "marginalia/executor-service"
container_name: "executor-service-2"
query-service:
<<: *service
image: "marginalia/query-service"
container_name: "query-service"
search-service:
<<: *service
image: "marginalia/search-service"
container_name: "search-service"
expose:
- 80
labels:
- "traefik.enable=true"
- "traefik.http.routers.search-service.rule=PathPrefix(`/`)"
- "traefik.http.routers.search-service.entrypoints=search"
- "traefik.http.routers.search-service.middlewares=add-xpublic"
- "traefik.http.routers.search-service.middlewares=add-public"
- "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
- "traefik.http.middlewares.add-public.addprefix.prefix=/public"
assistant-service:
<<: *service
image: "marginalia/assistant-service"
container_name: "assistant-service"
expose:
- 80
labels:
- "traefik.enable=true"
- "traefik.http.routers.assistant-service-screenshot.rule=PathPrefix(`/screenshot`)"
- "traefik.http.routers.assistant-service-screenshot.entrypoints=search,dating"
- "traefik.http.routers.assistant-service-screenshot.middlewares=add-xpublic"
- "traefik.http.routers.assistant-service-screenshot.middlewares=add-public"
- "traefik.http.routers.assistant-service-suggest.rule=PathPrefix(`/suggest`)"
- "traefik.http.routers.assistant-service-suggest.entrypoints=search"
- "traefik.http.routers.assistant-service-suggest.middlewares=add-xpublic"
- "traefik.http.routers.assistant-service-suggest.middlewares=add-public"
- "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
- "traefik.http.middlewares.add-public.addprefix.prefix=/public"
api-service:
<<: *service
image: "marginalia/api-service"
container_name: "api-service"
expose:
- "80"
labels:
- "traefik.enable=true"
- "traefik.http.routers.api-service.rule=PathPrefix(`/`)"
- "traefik.http.routers.api-service.entrypoints=api"
- "traefik.http.routers.api-service.middlewares=add-xpublic"
- "traefik.http.routers.api-service.middlewares=add-public"
- "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
- "traefik.http.middlewares.add-public.addprefix.prefix=/public"
dating-service:
<<: *service
image: "marginalia/dating-service"
container_name: "dating-service"
expose:
- 80
labels:
- "traefik.enable=true"
- "traefik.http.routers.dating-service.rule=PathPrefix(`/`)"
- "traefik.http.routers.dating-service.entrypoints=dating"
- "traefik.http.routers.dating-service.middlewares=add-xpublic"
- "traefik.http.routers.dating-service.middlewares=add-public"
- "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
- "traefik.http.middlewares.add-public.addprefix.prefix=/public"
explorer-service:
<<: *service
image: "marginalia/explorer-service"
container_name: "explorer-service"
expose:
- 80
labels:
- "traefik.enable=true"
- "traefik.http.routers.explorer-service.rule=PathPrefix(`/`)"
- "traefik.http.routers.explorer-service.entrypoints=explore"
- "traefik.http.routers.explorer-service.middlewares=add-xpublic"
- "traefik.http.routers.explorer-service.middlewares=add-public"
- "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
- "traefik.http.middlewares.add-public.addprefix.prefix=/public"
control-service:
<<: *service
image: "marginalia/control-service"
container_name: "control-service"
expose:
- 80
labels:
- "traefik.enable=true"
- "traefik.http.routers.control-service.rule=PathPrefix(`/`)"
- "traefik.http.routers.control-service.entrypoints=control"
- "traefik.http.routers.control-service.middlewares=add-xpublic"
- "traefik.http.routers.control-service.middlewares=add-public"
- "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
- "traefik.http.middlewares.add-public.addprefix.prefix=/public"
mariadb:
image: "mariadb:lts"
container_name: "mariadb"
env_file: "run/env/mariadb.env"
command: ['mysqld', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci']
ports:
- "127.0.0.1:3306:3306/tcp"
healthcheck:
test: mysqladmin ping -h 127.0.0.1 -u $$MARIADB_USER --password=$$MARIADB_PASSWORD
start_period: 5s
interval: 5s
timeout: 5s
retries: 60
volumes:
- db:/var/lib/mysql
- "./code/common/db/src/main/resources/sql/current/:/docker-entrypoint-initdb.d/"
networks:
- wmsa
traefik:
image: "traefik:v2.10"
container_name: "traefik"
command:
#- "--log.level=DEBUG"
- "--api.insecure=true"
- "--providers.docker=true"
- "--providers.docker.exposedbydefault=false"
- "--entrypoints.search.address=:80"
- "--entrypoints.control.address=:81"
- "--entrypoints.api.address=:82"
- "--entrypoints.dating.address=:83"
- "--entrypoints.explore.address=:84"
ports:
- "127.0.0.1:8080:80"
- "127.0.0.1:8081:81"
- "127.0.0.1:8082:82"
- "127.0.0.1:8083:83"
- "127.0.0.1:8084:84"
- "127.0.0.1:8090:8080"
volumes:
- "/var/run/docker.sock:/var/run/docker.sock:ro"
networks:
- wmsa
prometheus:
image: "prom/prometheus"
container_name: "prometheus"
command:
- "--config.file=/etc/prometheus/prometheus.yml"
ports:
- "127.0.0.1:8091:9090"
volumes:
- "./run/prometheus.yml:/etc/prometheus/prometheus.yml"
- "/var/run/docker.sock:/var/run/docker.sock:ro"
networks:
- wmsa
networks:
wmsa:
volumes:
db:
driver: local
driver_opts:
type: none
o: bind
device: run/db
logs:
driver: local
driver_opts:
type: none
o: bind
device: run/logs
model:
driver: local
driver_opts:
type: none
o: bind
device: run/model
conf:
driver: local
driver_opts:
type: none
o: bind
device: run/conf
data:
driver: local
driver_opts:
type: none
o: bind
device: run/data
samples-1:
driver: local
driver_opts:
type: none
o: bind
device: run/node-1/samples
index-1:
driver: local
driver_opts:
type: none
o: bind
device: run/node-1/index
work-1:
driver: local
driver_opts:
type: none
o: bind
device: run/node-1/work
backup-1:
driver: local
driver_opts:
type: none
o: bind
device: run/node-1/backup
uploads-1:
driver: local
driver_opts:
type: none
o: bind
device: run/node-1/uploads
samples-2:
driver: local
driver_opts:
type: none
o: bind
device: run/node-2/samples
index-2:
driver: local
driver_opts:
type: none
o: bind
device: run/node-2/index
work-2:
driver: local
driver_opts:
type: none
o: bind
device: run/node-2/work
backup-2:
driver: local
driver_opts:
type: none
o: bind
device: run/node-2/backup
uploads-2:
driver: local
driver_opts:
type: none
o: bind
device: run/node-2/uploads

View File

@ -1,59 +0,0 @@
#!/bin/bash
set -e
# Check if wget exists
if command -v wget &> /dev/null; then
dl_prg="wget -O"
elif command -v curl &> /dev/null; then
dl_prg="curl -o"
else
echo "Neither wget nor curl found, exiting .."
exit 1
fi
case "$1" in
"s"|"m"|"l"|"xl")
;;
*)
echo "Invalid argument. Must be one of 's', 'm', 'l' or 'xl'."
exit 1
;;
esac
SAMPLE_NAME=crawl-${1:-m}
SAMPLE_DIR="node-1/samples/${SAMPLE_NAME}/"
function download_model {
model=$1
url=$2
if [ ! -f $model ]; then
echo "** Downloading $url"
$dl_prg $model $url
fi
}
pushd $(dirname $0)
if [ -d ${SAMPLE_DIR} ]; then
echo "${SAMPLE_DIR} already exists; remove it if you want to re-download the sample"
fi
mkdir -p node-1/samples/
SAMPLE_TARBALL=samples/${SAMPLE_NAME}.tar.gz
download_model ${SAMPLE_TARBALL}.tmp https://downloads.marginalia.nu/${SAMPLE_TARBALL} && mv ${SAMPLE_TARBALL}.tmp ${SAMPLE_TARBALL}
if [ ! -f ${SAMPLE_TARBALL} ]; then
echo "!! Failed"
exit 255
fi
mkdir -p ${SAMPLE_DIR}
tar zxf ${SAMPLE_TARBALL} --strip-components=1 -C ${SAMPLE_DIR}
cat > "${SAMPLE_DIR}/marginalia-manifest.json" <<EOF
{ "description": "Sample data set ${SAMPLE_NAME}", "type": "CRAWL_DATA" }
EOF
popd

View File

@ -1,8 +1,10 @@
# Run
This directory is a staging area for running the system. It contains scripts
and templates for installing the system on a server, and for running it locally.
See [https://docs.marginalia.nu/](https://docs.marginalia.nu/) for additional
documentation.
## Requirements
@ -16,8 +18,7 @@ graalce is a good distribution choice but it doesn't matter too much.
## Set up
To go from a clean check out of the git repo to a running search engine,
follow these steps.
You're assumed to sit in the project root the whole time.
@ -35,106 +36,40 @@ $ run/setup.sh
```shell
$ ./gradlew docker
```
### 3. Install the system
```shell
$ run/install.sh <install-directory>
```
The install script will prompt you for which installation mode you want to use. The options are:
1. Barebones - This will install a white-label search engine with no data. You can
use this to index your own data. It disables and hides functionality that is strongly
related to the Marginalia project, such as the Marginalia GUI.
2. Full Marginalia Search instance - This will install an instance of the search engine
configured like [search.marginalia.nu](https://search.marginalia.nu). This is useful
for local development and testing.
It will also prompt you for account details for a new mariadb instance, which will be
created for you. The database will be initialized with the schema and data required
for the search engine to run.
After filling out all the details, the script will copy the installation files to the
specified directory.
### 4. Run the system
```shell
$ cd install_directory
$ docker-compose up -d
# To see the logs:
$ docker-compose logs -f
```
There are two docker-compose files available, `docker-compose.yml` and `docker-compose-barebones.yml`;
the latter is a stripped-down version that only runs the bare minimum required to run the system, e.g. for
running a white-label version of the system. The former is the full system with all the frills of
Marginalia Search, and is the one used by default.
You can now access a search interface at `http://localhost:8080`, and the admin interface
at `http://localhost:8081/`.
To start the barebones version, run:
```shell
$ docker-compose -f docker-compose-barebones.yml up
```
### 5. You should now be able to access the system.
By default, the docker-compose file publishes the following ports:
| Address | Description |
|-------------------------|------------------|
| http://localhost:8080/ | User-facing GUI |
| http://localhost:8081/ | Operator's GUI |
Note that the operator's GUI does not perform any sort of authentication.
Preferably don't expose it publicly, but if you absolutely must, use a proxy or
Basic Auth to add security.
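As one possible approach, here is a hedged sketch of HTTP Basic Auth using the traefik setup from the docker-compose files in this repo (the middleware name and password hash are placeholders; generate a real hash with `htpasswd`, and combine the middleware with any others already attached to the router):

```yaml
control-service:
  labels:
    - "traefik.enable=true"
    # declare a basicauth middleware; '$' characters in the hash must be doubled in docker-compose
    - "traefik.http.middlewares.control-auth.basicauth.users=admin:$$apr1$$examplehash$$replaceme"
    # attach the middleware to the control-service router
    - "traefik.http.routers.control-service.middlewares=control-auth"
```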
### 6. Download Sample Data
A script is available for downloading sample data. The script will download the
data from https://downloads.marginalia.nu/ and extract it to the correct location.
The system will pick the data up automatically.
```shell
$ run/download-samples.sh l
```
Four sets are available:
| Name | Description |
|------|---------------------------------|
| s | Small set, 1000 domains |
| m | Medium set, 2000 domains |
| l | Large set, 5000 domains |
| xl | Extra large set, 50,000 domains |
Warning: The XL set is intended to provide a large amount of data for
setting up a pre-production environment. It may be hard to run on a smaller
machine and will on most machines take several hours to process.
The 'm' or 'l' sets are a good compromise between size and processing time
and should work on most machines.
### 7. Process the data
Bring the system online if it isn't (see step 4), then go to the operator's
GUI (see step 5).
* Go to `Node 1 -> Storage -> Crawl Data`
* Hit the toggle to set your crawl data to be active
* Go to `Actions -> Process Crawl Data -> [Trigger Reprocessing]`
This will take anywhere from a few minutes to a few hours depending on which
data set you downloaded. You can monitor the progress from the `Overview` tab.
First the CONVERTER is expected to run; this will process the data into a format
that can easily be inserted into the database and index.
Next the LOADER will run; this will insert the data into the database and index.
Next the link database will repartition itself, and finally the index will be
reconstructed. You can view the progress of these steps in the `Jobs` listing.
### 8. Run the system
Once all this is done, you can go to the user-facing GUI (see step 5) and try
a search.
Important! Use the 'No Ranking' option when running locally, since you'll very
likely not have enough links for the ranking algorithm to perform well.
## Experiment Runner
The script `experiment.sh` is a launcher for the experiment runner, which is useful when
evaluating new algorithms in processing crawl data.
There is no data in the system yet. To load data into the system,
see the guide at [https://docs.marginalia.nu/](https://docs.marginalia.nu/).