(doc) Migrate documentation https://docs.marginalia.nu/
README.md
@@ -13,12 +13,21 @@ The long term plan is to refine the search engine so that it provides enough publ
that the project can be funded through grants, donations and commercial API licenses
(non-commercial share-alike is always free).

The system can be run either as a copy of Marginalia Search, or as a white-label search engine
for your own data (either crawled or side-loaded). At present the logic isn't very configurable, and a lot of the judgements
made are based on the Marginalia project's goals, but additional configurability is being
worked on!

## Set up

Start by running [⚙️ run/setup.sh](run/setup.sh). This will download supplementary model data that is necessary to run the code.
To set up a local test environment, follow the instructions in [📄 run/readme.md](run/readme.md)!

Further documentation is available at [🌎 https://docs.marginalia.nu/](https://docs.marginalia.nu/).

Before compiling, it's necessary to run [⚙️ run/setup.sh](run/setup.sh).
This will download supplementary model data that is necessary to run the code.
These are also necessary to run the tests.

To set up a local test environment, follow the instructions in [📄 run/readme.md](run/readme.md)!
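
In practice, setting up from a clean checkout just means running the setup script from the repository root, for example:

```shell
$ run/setup.sh
```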

## Hardware Requirements

doc/crawling.md
@@ -1,115 +0,0 @@
# Crawling

## WARNING

Please don't run the crawler unless you intend to actually operate a public
facing search engine! For testing, use crawl sets from [downloads.marginalia.nu](https://downloads.marginalia.nu/) instead;
or if you wish to play with the crawler, crawl a small set of domains from people who are
ok with it: use your own, your friends', or any subdomain of marginalia.nu.

See the documentation in run/ for more information on how to load sample data!

Reckless crawling annoys webmasters and makes it harder to run an independent search engine.
Crawling from a domestic IP address is also likely to put you on a greylist
of probable bots. You will solve CAPTCHAs for almost every website you visit
for weeks, and may be permanently blocked from a few IPs.

## Prerequisites

You probably want to run a local bind resolver to speed up DNS lookups and reduce the amount of
DNS traffic.
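
A minimal sketch of such a setup on a Debian-like system (package name, service name and paths are assumptions; any caching resolver such as bind9 or unbound works):

```shell
# Install a local caching resolver and start it.
$ sudo apt install bind9
$ sudo systemctl enable --now bind9
# Then make 127.0.0.1 the first nameserver, e.g. in /etc/resolv.conf:
#   nameserver 127.0.0.1
```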

These processes require a lot of disk space. It's strongly recommended to use a dedicated disk for
the index storage subdirectory. It doesn't need to be extremely fast, but it should be a few terabytes in size.

It should be mounted with `noatime`. It may be a good idea to format the disk with a block size of 4096 bytes. This will reduce the amount of disk space used by the crawler.
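
A sketch of such a disk setup (the device name and mount point are placeholders; adjust for your system):

```shell
# Format the dedicated disk with a 4096-byte block size and mount it noatime.
$ sudo mkfs.ext4 -b 4096 /dev/sdb1
$ sudo mkdir -p /mnt/marginalia-index
$ sudo mount -o noatime /dev/sdb1 /mnt/marginalia-index
# Add a matching line to /etc/fstab to make the mount persistent, e.g.:
#   /dev/sdb1  /mnt/marginalia-index  ext4  defaults,noatime  0  2
```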

Make sure you configure the user-agent properly. This will be used to identify the crawler,
and is matched against the robots.txt file. The crawler will not crawl sites that don't allow it.
See [wiki://Robots_exclusion_standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard) for more information
about robots.txt; the user agent can be configured in conf/properties/system.properties; see the
[system-properties](system-properties.md) documentation for more information.
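
As a sketch, the relevant entries in `conf/properties/system.properties` might look like this (the values are placeholders; the property names are described in the [system-properties](system-properties.md) documentation):

```
# Full user agent string sent with each request (placeholder value)
crawler.userAgentString=search.example.com crawler, see https://example.com/crawler
# Token the crawler looks for in robots.txt (placeholder value)
crawler.userAgentIdentifier=example-crawler
```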

## Setup

Ensure that the system is running and go to http://localhost:8081.

With the default test configuration, the system is configured to
store data in `node-1/storage`.

## Fresh Crawl

While a running search engine can use the link database to figure out which websites to visit, a clean
system does not know of any links. To bootstrap a crawl, a crawl specification needs to be created to
seed the domain database.

Go to `Nodes->Node 1->Actions->New Crawl`

![img](images/new_crawl.png)

Click the link that says 'New Spec' to arrive at a form for creating a new specification:

![img](images/new_spec.png)

Fill out the form with a description and a link to a domain list. The domain list is a text file
with one domain per line, with blank lines and comments starting with `#` ignored. You can use
github raw links for this purpose. For test purposes, you can use this link:
`https://downloads.marginalia.nu/domain-list-test.txt`, which will create a crawl for a few
of marginalia.nu's subdomains.
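
For reference, a domain list in the described format might look like this (the listed subdomains are illustrative):

```
# Comments and blank lines are ignored
www.marginalia.nu
search.marginalia.nu
memex.marginalia.nu
```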

If you aren't redirected there automatically, go back to the `New Crawl` page under Node 1 -> Actions.
Your new specification should now be listed.

Check the box next to it, and click `[Trigger New Crawl]`.

![img](images/new_crawl2.png)

This will start the crawling process. Crawling may take a while, depending on the size
of the domain list and the size of the websites.

![img](images/crawl_in_progress.png)

Eventually a progress bar will show up, and the crawl will start. When it reaches 100%, the crawl is done.
You can also monitor the `Events Summary` table on the same page to see what happened after the fact.

It is expected that the crawl will stall out toward the end of the process; this is a statistical effect, since
the largest websites take the longest to finish, and tend to be the ones lingering at 99% or so completion. The
crawler has a timeout of 5 hours: if no new domains finish crawling within that window, it will stop, to prevent crawler traps
from stalling the crawl indefinitely.

**Be sure to read the section on re-crawling!**

## Converting

Once the crawl is done, the data needs to be processed before it's searchable. This is done by going to
`Nodes->Node 1->Actions->Process Crawl Data`.

![Conversion screenshot](images/convert.png)

This will start the conversion process. This will again take a while, depending on the size of the crawl.
The progress bar will show the progress. When it reaches 100%, the conversion is done, and the data will begin
loading automatically. A cascade of actions is performed in sequence, leading to the data being loaded into the
search engine and an index being constructed. This is all automatic, but depending on the size of the crawl data,
may take a while.

When an event `INDEX-SWITCH-OK` is logged in the `Event Summary` table, the data is ready to be searched.

## Re-crawling

The work flow with a crawl spec was a one-off process to bootstrap the search engine. To keep the search engine up to date,
it is preferable to do a recrawl. This will try to reduce the amount of data that needs to be fetched.

To trigger a Recrawl, go to `Nodes->Node 1->Actions->Re-crawl`. This will bring you to a page that looks similar to the
first crawl page, where you can select a set of crawl data to use as a source. Select the crawl data you want, and
press `[Trigger Recrawl]`.

Crawling will proceed as before, but this time, the crawler will try to fetch only the data that has changed since the
last crawl, while growing the number of documents fetched per domain by a configurable factor (see `crawler.crawlSetGrowthFactor`).
This will typically be much faster than the initial crawl.

### Growing the crawl set

The re-crawl will also pull new domains from the `New Domains` dataset, which is a URL configurable in
`[Top Menu] -> System -> Data Sets`. If a new domain is found, it will be assigned to the present node, and crawled in
the re-crawl.

![Datasets screenshot](images/datasets.png)
(10 binary image files removed)
@@ -3,12 +3,11 @@
A lot of the architectural description is sprinkled into the code repository closer to the code.
Start in [📁 ../code/](../code/) and poke around.

Operational documentation is available at [🌎 https://docs.marginalia.nu/](https://docs.marginalia.nu/).

## Operations

* [System Properties](system-properties.md) - JVM property flags

## How-To
* [Sideloading How-To](sideloading-howto.md) - How to sideload various data sets
* [Parquet How-To](parquet-howto.md) - Useful tips for working with Parquet files

## Set-up

@@ -1,211 +0,0 @@
# Sideloading How-To

Some websites are much larger than others; this includes
Wikipedia, Stack Overflow, and a few others. They are so
large they are impractical to crawl in the traditional fashion,
but luckily they make available data dumps that can be processed
and loaded into the search engine through other means.

To this end, it's possible to sideload data into the search engine
from sources other than the web crawler.

## Index Nodes

In practice, if you want to sideload data, you need to do it on
a separate index node. Index nodes are separate instances of the
index software. The default configuration is to have two index nodes,
one for the web crawler, and one for sideloaded data.

The need for a separate node is due to incompatibilities in the work flows.

It is also a good idea in general, as very large domains can easily be so large that the entire time budget
for the query is spent sifting through documents from that one domain; this is
especially true with something like Wikipedia, which has a lot of documents at
least tangentially related to any given topic.

This how-to assumes that you are operating on index-node 2.

## Notes on the upload directory

This is written assuming that the system is installed with the `install.sh`
script, which deploys the system with docker-compose, and has a directory
structure like

```
...
index-1/backup/
index-1/index/
index-1/storage/
index-1/uploads/
index-1/work/
index-2/backup/
index-2/index/
index-2/storage/
index-2/uploads/
index-2/work/
...
```

We're going to be putting files in the **uploads** directories. If you have installed
the system in some other way, or changed the configuration significantly, you need
to adjust the paths accordingly.

## Sideloading

The sideloading actions are available through the Actions menu of each node.

![Sideload menu](images/sideload_menu.png)

## Sideloading WARCs

WARC files are the standard format for web archives. They can be created e.g. with wget.
The Marginalia software can read WARC files directly, and sideload them into the index,
as long as each WARC file contains only one domain.

Let's for example archive www.marginalia.nu (I own this domain, so feel free to try this at home):

```bash
$ wget -r --warc-file=marginalia www.marginalia.nu
```

**Note**: If you intend to do this on other websites, you should probably add a `--wait` parameter to wget,
e.g. `wget --wait=1 -r --warc-file=...` to avoid hammering the website with requests and getting blocked.

This will take a moment, and create a file called `marginalia.warc.gz`. We move it to the
upload directory of the index node, and sideload it through the Actions menu.

```bash
$ mkdir -p index-2/uploads/marginalia-warc
$ mv marginalia.warc.gz index-2/uploads/marginalia-warc
```

Go to the Actions menu, and select the "Sideload WARC" action. This will show a list of
subdirectories in the Uploads directory. Select the directory containing the WARC file, and
click "Sideload".

![Sideload WARC screenshot](images/sideload_warc.png)

This should take you to the node overview, where you can see the progress of the sideloading.
It will take a moment, as the WARC file is being processed.

![Processing in progress](images/convert_2.png)

It will not be loaded automatically. This is to permit you to sideload multiple sources.

When you are ready to load it, go to the Actions menu, and select "Load Crawl Data".

![Load Crawl Data](images/load_warc.png)

Select all the sources you want to load, and click "Load". This will load the data into the
index, and make it available for searching.

## Sideloading Wikipedia

Due to licensing incompatibilities with OpenZim's GPL-2 and AGPL, the workflow
depends on using the conversion process from [https://encyclopedia.marginalia.nu/](https://encyclopedia.marginalia.nu/)
to pre-digest the data.

Build the [encyclopedia.marginalia.nu Code](https://github.com/MarginaliaSearch/encyclopedia.marginalia.nu)
and follow the instructions for downloading a ZIM file, and then run something like

```bash
$ ./encyclopedia convert file.zim articles.db
```

This db-file can be processed and loaded into the search engine through the
Actions view.
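
As with the WARC example above, this presumably means placing the file in the node's uploads directory first; a sketch (the subdirectory name is an arbitrary choice):

```bash
$ mkdir -p index-2/uploads/wikipedia
$ mv articles.db index-2/uploads/wikipedia
```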

FIXME: It will currently only point to en.wikipedia.org; this should be
made configurable.

## Sideloading a directory tree

For relatively small websites, ad-hoc side-loading is available directly from a
folder structure on the hard drive. This is intended for loading manuals,
documentation and similar data sets that are large and slowly changing.

A website can be archived with wget, like this:

```bash
UA="search.marginalia.nu" \
DOMAIN="www.example.com" \
wget -nc -x --continue -w 1 -r -U ${UA} -A "html" ${DOMAIN}
```

After doing this to a bunch of websites, create a YAML file something like this:

```yaml
sources:
- name: jdk-20
  dir: "jdk-20/"
  domainName: "docs.oracle.com"
  baseUrl: "https://docs.oracle.com/en/java/javase/20/docs"
  keywords:
  - "java"
  - "docs"
  - "documentation"
  - "javadoc"
- name: python3
  dir: "python-3.11.5/"
  domainName: "docs.python.org"
  baseUrl: "https://docs.python.org/3/"
  keywords:
  - "python"
  - "docs"
  - "documentation"
- name: mariadb.com
  dir: "mariadb.com/"
  domainName: "mariadb.com"
  baseUrl: "https://mariadb.com/"
  keywords:
  - "sql"
  - "docs"
  - "mariadb"
  - "mysql"
```

| parameter  | description                                                          |
|------------|----------------------------------------------------------------------|
| name       | Purely informative                                                   |
| dir        | Path of website contents relative to the location of the yaml file  |
| domainName | The domain name of the website                                       |
| baseUrl    | This URL will be prefixed to the contents of `dir`                   |
| keywords   | These supplemental keywords will be injected in each document        |

The directory structure corresponding to the above might look like

```
docs-index.yaml
jdk-20/
jdk-20/resources/
jdk-20/api/
jdk-20/api/[...]
jdk-20/specs/
jdk-20/specs/[...]
jdk-20/index.html
mariadb.com
mariadb.com/kb/
mariadb.com/kb/[...]
python-3.11.5
python-3.11.5/genindex-B.html
python-3.11.5/library/
python-3.11.5/distutils/
python-3.11.5/[...]
[...]
```

This yaml-file can be processed and loaded into the search engine through the
Actions view.
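
Again, the yaml file and the archived directories presumably go into the node's uploads directory before triggering the action; a sketch (the subdirectory name is arbitrary):

```bash
$ mkdir -p index-2/uploads/docs
$ mv docs-index.yaml jdk-20/ python-3.11.5/ mariadb.com/ index-2/uploads/docs/
```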

## Sideloading Stack Overflow/Stackexchange

Stackexchange makes dumps available on Archive.org. These are unfortunately in a format that
needs some heavy-handed pre-processing before they can be loaded. A tool is available for
this in [tools/stackexchange-converter](../code/tools/stackexchange-converter).

After running `gradlew dist`, this tool is found in `build/dist/stackexchange-converter`.
Follow the instructions in the stackexchange-converter readme, and
convert the stackexchange xml.7z-files to sqlite db-files.

A directory with such db-files can be processed and loaded into the
search engine through the Actions view.
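
A rough sketch of the steps, assuming the converter's exact arguments are taken from its readme, and the same uploads-directory convention as above:

```bash
# Build the distribution; the converter ends up under build/dist/
$ ./gradlew dist
# Run the converter as described in its readme (arguments elided here),
# then stage the resulting sqlite db-files for loading:
$ mkdir -p index-2/uploads/stackexchange
$ mv *.db index-2/uploads/stackexchange/
```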
@@ -1,42 +0,0 @@
# System Properties

These are JVM system properties used by each service. These properties can either
be loaded from a file or passed in as command line arguments, using `$JAVA_OPTS`.

The system will look for a properties file in `conf/properties/system.properties`,
within the install dir, as specified by `$WMSA_HOME`.

A template is available in [../run/template/conf/properties/system.properties](../run/template/conf/properties/system.properties).
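
To illustrate, the two mechanisms might look like this; the chosen flags and values are just examples drawn from the tables below:

```
# conf/properties/system.properties
flyway.disable=false
crawler.poolSize=256
crawler.userAgentIdentifier=example-crawler
```

or, equivalently, passed as `-D` flags via `$JAVA_OPTS`:

```shell
$ export JAVA_OPTS="-Dflyway.disable=false -Dcrawler.poolSize=256"
```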

## Global

| flag              | values  | description                          |
|-------------------|---------|--------------------------------------|
| blacklist.disable | boolean | Disables the IP blacklist            |
| flyway.disable    | boolean | Disables automatic Flyway migrations |

## Crawler Properties

| flag                         | values  | description                                                                                  |
|------------------------------|---------|----------------------------------------------------------------------------------------------|
| crawler.userAgentString      | string  | Sets the user agent string used by the crawler                                               |
| crawler.userAgentIdentifier  | string  | Sets the user agent identifier used by the crawler, e.g. what it looks for in robots.txt     |
| crawler.poolSize             | integer | Sets the number of threads used by the crawler; more is faster, but uses more RAM            |
| crawler.initialUrlsPerDomain | integer | Sets the initial number of URLs to crawl per domain (when crawling from spec)                |
| crawler.maxUrlsPerDomain     | integer | Sets the maximum number of URLs to crawl per domain (when recrawling)                        |
| crawler.minUrlsPerDomain     | integer | Sets the minimum number of URLs to crawl per domain (when recrawling)                        |
| crawler.crawlSetGrowthFactor | double  | If 100 documents were fetched last crawl, increase the goal to 100 x (this value) this time  |
| ip-blocklist.disabled        | boolean | Disables the IP blocklist                                                                    |

## Converter Properties

| flag                        | values  | description                                                                                                                                              |
|-----------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| converter.sideloadThreshold | integer | Threshold value, in number of documents per domain, where a simpler processing method is used which uses less RAM. 10,000 is a good value for ~32GB RAM |

## Marginalia Application Specific

| flag                      | values  | description                                                    |
|---------------------------|---------|----------------------------------------------------------------|
| search.websiteUrl         | string  | Overrides the website URL used in rendering                    |
| control.hideMarginaliaApp | boolean | Hides the Marginalia application from the control GUI results  |
@@ -1,181 +0,0 @@
# This is the barebones docker-compose file for the Marginalia Search Engine.
#
# It starts a stripped-down version of the search engine, with only the essential
# services running, including the database, the query service, the control service,
# and a single index and executor node.
#
# It is a good starting point for setting up a white-label search engine that does not
# have Marginalia's GUI. The Query Service presents a simple search box, that also talks
# JSON, so you can use it as a backend for your own search interface.

x-svc: &service
  env_file:
    - "run/env/service.env"
  volumes:
    - conf:/wmsa/conf:ro
    - model:/wmsa/model
    - data:/wmsa/data
    - logs:/var/log/wmsa
  networks:
    - wmsa
  depends_on:
    - mariadb
  labels:
    - "__meta_docker_port_private=7000"
x-p1: &partition-1
  env_file:
    - "run/env/service.env"
  volumes:
    - conf:/wmsa/conf:ro
    - model:/wmsa/model
    - data:/wmsa/data
    - logs:/var/log/wmsa
    - index-1:/idx
    - work-1:/work
    - backup-1:/backup
    - samples-1:/storage
    - uploads-1:/uploads
  networks:
    - wmsa
  depends_on:
    - mariadb
  environment:
    - "WMSA_SERVICE_NODE=1"

services:
  index-service-1:
    <<: *partition-1
    image: "marginalia/index-service"
    container_name: "index-service-1"
  executor-service-1:
    <<: *partition-1
    image: "marginalia/executor-service"
    container_name: "executor-service-1"
  query-service:
    <<: *service
    image: "marginalia/query-service"
    container_name: "query-service"
    expose:
      - 80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.search-service.rule=PathPrefix(`/`)"
      - "traefik.http.routers.search-service.entrypoints=search"
      - "traefik.http.routers.search-service.middlewares=add-xpublic"
      - "traefik.http.routers.search-service.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  control-service:
    <<: *service
    image: "marginalia/control-service"
    container_name: "control-service"
    expose:
      - 80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.control-service.rule=PathPrefix(`/`)"
      - "traefik.http.routers.control-service.entrypoints=control"
      - "traefik.http.routers.control-service.middlewares=add-xpublic"
      - "traefik.http.routers.control-service.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  mariadb:
    image: "mariadb:lts"
    container_name: "mariadb"
    env_file: "run/env/mariadb.env"
    command: ['mysqld', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci']
    ports:
      - "127.0.0.1:3306:3306/tcp"
    healthcheck:
      test: mysqladmin ping -h 127.0.0.1 -u $$MARIADB_USER --password=$$MARIADB_PASSWORD
      start_period: 5s
      interval: 5s
      timeout: 5s
      retries: 60
    volumes:
      - db:/var/lib/mysql
      - "./code/common/db/src/main/resources/sql/current/:/docker-entrypoint-initdb.d/"
    networks:
      - wmsa
  traefik:
    image: "traefik:v2.10"
    container_name: "traefik"
    command:
      #- "--log.level=DEBUG"
      - "--api.insecure=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.search.address=:80"
      - "--entrypoints.control.address=:81"
    ports:
      - "127.0.0.1:8080:80"
      - "127.0.0.1:8081:81"
      - "127.0.0.1:8090:8080"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    networks:
      - wmsa
networks:
  wmsa:
volumes:
  db:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/db
  logs:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/logs
  model:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/model
  conf:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/conf
  data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/data
  samples-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/samples
  index-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/index
  work-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/work
  backup-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/backup
  uploads-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/uploads
@@ -1,315 +0,0 @@
# This is the full docker-compose.yml file for the Marginalia Search Engine.
#
# It starts all the services, including the GUI, the database, the query service,
# two nodes for demo purposes, as well as a bunch of peripheral services that are
# application specific.
#

x-svc: &service
  env_file:
    - "run/env/service.env"
  volumes:
    - conf:/wmsa/conf:ro
    - model:/wmsa/model
    - data:/wmsa/data
    - logs:/var/log/wmsa
  networks:
    - wmsa
  labels:
    - "__meta_docker_port_private=7000"
x-p1: &partition-1
  env_file:
    - "run/env/service.env"
  volumes:
    - conf:/wmsa/conf:ro
    - model:/wmsa/model
    - data:/wmsa/data
    - logs:/var/log/wmsa
    - index-1:/idx
    - work-1:/work
    - backup-1:/backup
    - samples-1:/storage
    - uploads-1:/uploads
  networks:
    - wmsa
  depends_on:
    - mariadb
  environment:
    - "WMSA_SERVICE_NODE=1"
x-p2: &partition-2
  env_file:
    - "run/env/service.env"
  volumes:
    - conf:/wmsa/conf:ro
    - model:/wmsa/model
    - data:/wmsa/data
    - logs:/var/log/wmsa
    - index-2:/idx
    - work-2:/work
    - backup-2:/backup
    - samples-2:/storage
    - uploads-2:/uploads
  networks:
    - wmsa
  depends_on:
    mariadb:
      condition: service_healthy
  environment:
    - "WMSA_SERVICE_NODE=2"

services:
  index-service-1:
    <<: *partition-1
    image: "marginalia/index-service"
    container_name: "index-service-1"
  executor-service-1:
    <<: *partition-1
    image: "marginalia/executor-service"
    container_name: "executor-service-1"
  index-service-2:
    <<: *partition-2
    image: "marginalia/index-service"
    container_name: "index-service-2"
  executor-service-2:
    <<: *partition-2
    image: "marginalia/executor-service"
    container_name: "executor-service-2"
  query-service:
    <<: *service
    image: "marginalia/query-service"
    container_name: "query-service"
  search-service:
    <<: *service
    image: "marginalia/search-service"
    container_name: "search-service"
    expose:
      - 80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.search-service.rule=PathPrefix(`/`)"
      - "traefik.http.routers.search-service.entrypoints=search"
      - "traefik.http.routers.search-service.middlewares=add-xpublic"
      - "traefik.http.routers.search-service.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  assistant-service:
    <<: *service
    image: "marginalia/assistant-service"
    container_name: "assistant-service"
    expose:
      - 80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.assistant-service-screenshot.rule=PathPrefix(`/screenshot`)"
      - "traefik.http.routers.assistant-service-screenshot.entrypoints=search,dating"
      - "traefik.http.routers.assistant-service-screenshot.middlewares=add-xpublic"
      - "traefik.http.routers.assistant-service-screenshot.middlewares=add-public"
      - "traefik.http.routers.assistant-service-suggest.rule=PathPrefix(`/suggest`)"
      - "traefik.http.routers.assistant-service-suggest.entrypoints=search"
      - "traefik.http.routers.assistant-service-suggest.middlewares=add-xpublic"
      - "traefik.http.routers.assistant-service-suggest.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  api-service:
    <<: *service
    image: "marginalia/api-service"
    container_name: "api-service"
    expose:
      - "80"
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.api-service.rule=PathPrefix(`/`)"
      - "traefik.http.routers.api-service.entrypoints=api"
      - "traefik.http.routers.api-service.middlewares=add-xpublic"
      - "traefik.http.routers.api-service.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  dating-service:
    <<: *service
    image: "marginalia/dating-service"
    container_name: "dating-service"
    expose:
      - 80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.dating-service.rule=PathPrefix(`/`)"
      - "traefik.http.routers.dating-service.entrypoints=dating"
      - "traefik.http.routers.dating-service.middlewares=add-xpublic"
      - "traefik.http.routers.dating-service.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  explorer-service:
    <<: *service
    image: "marginalia/explorer-service"
    container_name: "explorer-service"
    expose:
      - 80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.explorer-service.rule=PathPrefix(`/`)"
      - "traefik.http.routers.explorer-service.entrypoints=explore"
      - "traefik.http.routers.explorer-service.middlewares=add-xpublic"
      - "traefik.http.routers.explorer-service.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  control-service:
    <<: *service
    image: "marginalia/control-service"
    container_name: "control-service"
    expose:
      - 80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.control-service.rule=PathPrefix(`/`)"
      - "traefik.http.routers.control-service.entrypoints=control"
      - "traefik.http.routers.control-service.middlewares=add-xpublic"
      - "traefik.http.routers.control-service.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  mariadb:
    image: "mariadb:lts"
    container_name: "mariadb"
    env_file: "run/env/mariadb.env"
    command: ['mysqld', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci']
    ports:
      - "127.0.0.1:3306:3306/tcp"
    healthcheck:
      test: mysqladmin ping -h 127.0.0.1 -u $$MARIADB_USER --password=$$MARIADB_PASSWORD
      start_period: 5s
      interval: 5s
      timeout: 5s
      retries: 60
    volumes:
      - db:/var/lib/mysql
      - "./code/common/db/src/main/resources/sql/current/:/docker-entrypoint-initdb.d/"
    networks:
      - wmsa
  traefik:
    image: "traefik:v2.10"
    container_name: "traefik"
    command:
      #- "--log.level=DEBUG"
      - "--api.insecure=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.search.address=:80"
      - "--entrypoints.control.address=:81"
      - "--entrypoints.api.address=:82"
      - "--entrypoints.dating.address=:83"
      - "--entrypoints.explore.address=:84"
    ports:
      - "127.0.0.1:8080:80"
      - "127.0.0.1:8081:81"
      - "127.0.0.1:8082:82"
      - "127.0.0.1:8083:83"
      - "127.0.0.1:8084:84"
      - "127.0.0.1:8090:8080"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    networks:
      - wmsa
  prometheus:
    image: "prom/prometheus"
    container_name: "prometheus"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
    ports:
      - "127.0.0.1:8091:9090"
    volumes:
      - "./run/prometheus.yml:/etc/prometheus/prometheus.yml"
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    networks:
      - wmsa
networks:
  wmsa:
volumes:
  db:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/db
  logs:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/logs
  model:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/model
  conf:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/conf
  data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/data
  samples-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/samples
  index-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/index
  work-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/work
  backup-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/backup
  uploads-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/uploads
  samples-2:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-2/samples
  index-2:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-2/index
  work-2:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-2/work
  backup-2:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-2/backup
  uploads-2:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-2/uploads
@@ -1,59 +0,0 @@
#!/bin/bash

set -e

# Check if wget exists
if command -v wget &> /dev/null; then
  dl_prg="wget -O"
elif command -v curl &> /dev/null; then
  dl_prg="curl -o"
else
  echo "Neither wget nor curl found, exiting .."
  exit 1
fi

case "$1" in
  "s"|"m"|"l"|"xl")
    ;;
  *)
    echo "Invalid argument. Must be one of 's', 'm', 'l' or 'xl'."
    exit 1
    ;;
esac

SAMPLE_NAME=crawl-${1:-m}
SAMPLE_DIR="node-1/samples/${SAMPLE_NAME}/"

function download_model {
  model=$1
  url=$2

  if [ ! -f $model ]; then
    echo "** Downloading $url"
    $dl_prg $model $url
  fi
}

pushd $(dirname $0)

if [ -d ${SAMPLE_DIR} ]; then
  echo "${SAMPLE_DIR} already exists; remove it if you want to re-download the sample"
fi

mkdir -p node-1/samples/
SAMPLE_TARBALL=samples/${SAMPLE_NAME}.tar.gz
download_model ${SAMPLE_TARBALL}.tmp https://downloads.marginalia.nu/${SAMPLE_TARBALL} && mv ${SAMPLE_TARBALL}.tmp ${SAMPLE_TARBALL}

if [ ! -f ${SAMPLE_TARBALL} ]; then
  echo "!! Failed"
  exit 255
fi

mkdir -p ${SAMPLE_DIR}
tar zxf ${SAMPLE_TARBALL} --strip-components=1 -C ${SAMPLE_DIR}

cat > "${SAMPLE_DIR}/marginalia-manifest.json" <<EOF
{ "description": "Sample data set ${SAMPLE_NAME}", "type": "CRAWL_DATA" }
EOF

popd
run/readme.md
@@ -1,8 +1,10 @@
# Run

When developing locally, this directory will contain run-time data required for
the search engine. In a clean check-out, it only contains the tools required to
bootstrap this directory structure.
This directory is a staging area for running the system. It contains scripts
and templates for installing the system on a server, and for running it locally.

See [https://docs.marginalia.nu/](https://docs.marginalia.nu/) for additional
documentation.

## Requirements

@@ -16,8 +18,7 @@ graalce is a good distribution choice but it doesn't matter too much.
## Set up

To go from a clean check out of the git repo to a running search engine,
follow these steps. This assumes a test deployment. For a production-like
setup... (TODO: write a guide for this).
follow these steps.

You're assumed to sit in the project root the whole time.

@@ -35,106 +36,40 @@ $ run/setup.sh
```shell
$ ./gradlew docker
```

### 3. Initialize the database

Before the system can be brought online, the database needs to be initialized. To do this,
bring up the database in the background, and run the flyway migration tool.
### 3. Install the system

```shell
$ docker-compose up -d mariadb
$ ./gradlew flywayMigrate
$ run/install.sh <install-directory>
```

### 4. Bring the system online.
To install the system, you need to run the install script. It will prompt
you for which installation mode you want to use. The options are:

We'll run it in the foreground in the terminal this time because it's educational to see the logs.
Add `-d` to run in the background.
1. Barebones - This will install a white-label search engine with no data. You can
   use this to index your own data. It disables and hides functionality that is strongly
   related to the Marginalia project, such as the Marginalia GUI.
2. Full Marginalia Search instance - This will install an instance of the search engine
   configured like [search.marginalia.nu](https://search.marginalia.nu). This is useful
   for local development and testing.

It will also prompt you for account details for a new mariadb instance, which will be
created for you. The database will be initialized with the schema and data required
for the search engine to run.

After filling out all the details, the script will copy the installation files to the
specified directory.

### 4. Run the system

```shell
$ docker-compose up
$ cd install_directory
$ docker-compose up -d
# To see the logs:
$ docker-compose logs -f
```

There are two docker-compose files available, `docker-compose.yml` and `docker-compose-barebones.yml`;
the latter is a stripped-down version that only runs the bare minimum required to run the system, e.g. for
running a white-label version of the system. The former is the full system with all the frills of
Marginalia Search, and is the one used by default.
You can now access a search interface at `http://localhost:8080`, and the admin interface
at `http://localhost:8081/`.

To start the barebones version, run:

```shell
$ docker-compose -f docker-compose-barebones.yml up
```

### 5. You should now be able to access the system.

By default, the docker-compose file publishes the following ports:

| Address                | Description     |
|------------------------|-----------------|
| http://localhost:8080/ | User-facing GUI |
| http://localhost:8081/ | Operator's GUI  |

Note that the operator's GUI does not perform any sort of authentication.
Preferably don't expose it publicly, but if you absolutely must, use a proxy or
Basic Auth to add security.

### 6. Download Sample Data

A script is available for downloading sample data. The script will download the
data from https://downloads.marginalia.nu/ and extract it to the correct location.

The system will pick the data up automatically.

```shell
$ run/download-samples.sh l
```

Four sets are available:

| Name | Description                     |
|------|---------------------------------|
| s    | Small set, 1000 domains         |
| m    | Medium set, 2000 domains        |
| l    | Large set, 5000 domains         |
| xl   | Extra large set, 50,000 domains |

Warning: The XL set is intended to provide a large amount of data for
setting up a pre-production environment. It may be hard to run on a smaller
machine and will on most machines take several hours to process.

The 'm' or 'l' sets are a good compromise between size and processing time
and should work on most machines.

### 7. Process the data

Bring the system online if it isn't (see step 4), then go to the operator's
GUI (see step 5).

* Go to `Node 1 -> Storage -> Crawl Data`
* Hit the toggle to set your crawl data to be active
* Go to `Actions -> Process Crawl Data -> [Trigger Reprocessing]`

This will take anywhere from a few minutes to a few hours depending on which
data set you downloaded. You can monitor the progress from the `Overview` tab.

First the CONVERTER is expected to run; this will process the data into a format
that can easily be inserted into the database and index.

Next the LOADER will run; this will insert the data into the database and index.

Next the link database will repartition itself, and finally the index will be
reconstructed. You can view the progress of these steps in the `Jobs` listing.

### 8. Run the system

Once all this is done, you can go to the user-facing GUI (see step 5) and try
a search.

Important! Use the 'No Ranking' option when running locally, since you'll very
likely not have enough links for the ranking algorithm to perform well.

## Experiment Runner

The script `experiment.sh` is a launcher for the experiment runner, which is useful when
evaluating new algorithms in processing crawl data.
There is no data in the system yet. To load data into the system,
see the guide at [https://docs.marginalia.nu/](https://docs.marginalia.nu/).