(doc) Migrate documentation https://docs.marginalia.nu/
README.md
@@ -13,12 +13,21 @@ The long term plan is to refine the search engine so that it provide enough publ
that the project can be funded through grants, donations and commercial API licenses
(non-commercial share-alike is always free).

The system can be run either as a copy of Marginalia Search, or as a white-label search engine
for your own data (either crawled or side-loaded). At present the logic isn't very configurable,
and a lot of the judgements made are based on the Marginalia project's goals, but additional
configurability is being worked on!
## Set up

To set up a local test environment, follow the instructions in [📄 run/readme.md](run/readme.md)!

Further documentation is available at [🌎 https://docs.marginalia.nu/](https://docs.marginalia.nu/).

Before compiling, it's necessary to run [⚙️ run/setup.sh](run/setup.sh).
This will download supplementary model data that is necessary to run the code.
These are also necessary to run the tests.
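
In practice this amounts to something like the following (a rough sketch, assuming a POSIX shell in the repository root; the exact Gradle tasks you need depend on what you are building):

```shell
$ run/setup.sh    # fetch the supplementary model data
$ ./gradlew test  # the model data is also required by the test suite
```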

## Hardware Requirements
doc/crawling.md
@@ -1,115 +0,0 @@
# Crawling

## WARNING

Please don't run the crawler unless you intend to actually operate a public
facing search engine! For testing, use crawl sets from [downloads.marginalia.nu](https://downloads.marginalia.nu/) instead;
or, if you wish to play with the crawler, crawl a small set of domains from people who are
ok with it: use your own, your friends', or any subdomain from marginalia.nu.

See the documentation in run/ for more information on how to load sample data!

Reckless crawling annoys webmasters and makes it harder to run an independent search engine.
Crawling from a domestic IP address is also likely to put you on a greylist
of probable bots. You will solve CAPTCHAs for almost every website you visit
for weeks, and may be permanently blocked from a few IPs.

## Prerequisites

You probably want to run a local bind resolver to speed up DNS lookups and reduce the amount of
DNS traffic.

These processes require a lot of disk space. It's strongly recommended to use a dedicated disk for
the index storage subdirectory; it doesn't need to be extremely fast, but it should be a few terabytes in size.
It should be mounted with `noatime`. It may be a good idea to format the disk with a block size of 4096 bytes.
This will reduce the amount of disk space used by the crawler.
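
For example, a dedicated index disk might be prepared roughly like this (a sketch only; the device name, filesystem and mount point are assumptions to adapt to your own setup):

```bash
# Assumed device and mount point -- substitute your own
$ mkfs.ext4 -b 4096 /dev/sdb1
$ mount -o noatime /dev/sdb1 /path/to/index/storage
# Add a matching noatime entry to /etc/fstab to make the mount persistent
```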

Make sure you configure the user-agent properly. This will be used to identify the crawler,
and is matched against the robots.txt file. The crawler will not crawl sites that don't allow it.
See [wiki://Robots_exclusion_standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard) for more information
about robots.txt. The user agent can be configured in conf/properties/system.properties; see the
[system-properties](system-properties.md) documentation for more information.
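
A minimal sketch of the relevant part of `conf/properties/system.properties` might look like the following (the values here are placeholders; the property names are listed in the system-properties documentation):

```
# Identifies the crawler in HTTP requests
crawler.userAgentString=examplebot (+https://www.example.com/bot)
# The token the crawler looks for when interpreting robots.txt
crawler.userAgentIdentifier=examplebot
```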

## Setup

Ensure that the system is running and go to http://localhost:8081.

With the default test configuration, the system is configured to
store data in `node-1/storage`.

## Fresh Crawl

While a running search engine can use the link database to figure out which websites to visit, a clean
system does not know of any links. To bootstrap a crawl, a crawl specification needs to be created to
seed the domain database.

Go to `Nodes->Node 1->Actions->New Crawl`

![img](images/new_crawl.png)

Click the link that says 'New Spec' to arrive at a form for creating a new specification:

![img](images/new_spec.png)

Fill out the form with a description and a link to a domain list. The domain list is a text file
with one domain per line; blank lines and comments starting with `#` are ignored. You can use
GitHub raw links for this purpose. For test purposes, you can use this link:
`https://downloads.marginalia.nu/domain-list-test.txt`, which will create a crawl for a few
of marginalia.nu's subdomains.
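
A domain list of your own can be as simple as the following (a hypothetical example; any plain-text list of domains you are allowed to crawl will do):

```
# comments and blank lines are ignored

www.marginalia.nu
search.marginalia.nu
docs.marginalia.nu
```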

If you aren't redirected there automatically, go back to the `New Crawl` page under Node 1 -> Actions.
Your new specification should now be listed.

Check the box next to it, and click `[Trigger New Crawl]`.

![img](images/new_crawl2.png)

This will start the crawling process. Crawling may take a while, depending on the size
of the domain list and the size of the websites.

![img](images/crawl_in_progress.png)

Eventually a progress bar will show up, and the crawl will start. When it reaches 100%, the crawl is done.
You can also monitor the `Events Summary` table on the same page to see what happened after the fact.

It is expected that the crawl will stall out toward the end of the process. This is a statistical effect, since
the largest websites take the longest to finish and tend to be the ones lingering at 99% or so completion. The
crawler has a timeout of 5 hours: if no new domains have finished crawling within that time, it will stop, to prevent
crawler traps from stalling the crawl indefinitely.

**Be sure to read the section on re-crawling!**

## Converting

Once the crawl is done, the data needs to be processed before it's searchable. This is done by going to
`Nodes->Node 1->Actions->Process Crawl Data`.

![Conversion screenshot](images/convert.png)

This will start the conversion process. This will again take a while, depending on the size of the crawl.
The progress bar will show the progress. When it reaches 100%, the conversion is done, and the data will begin
loading automatically. A cascade of actions is performed in sequence, leading to the data being loaded into the
search engine and an index being constructed. This is all automatic but, depending on the size of the crawl data,
may take a while.

When an event `INDEX-SWITCH-OK` is logged in the `Event Summary` table, the data is ready to be searched.

## Re-crawling

The workflow with a crawl spec is a one-off process to bootstrap the search engine. To keep the search engine up to date,
it is preferable to do a re-crawl. This will try to reduce the amount of data that needs to be fetched.

To trigger a re-crawl, go to `Nodes->Node 1->Actions->Re-crawl`. This will bring you to a page that looks similar to the
first crawl page, where you can select a set of crawl data to use as a source. Select the crawl data you want, and
press `[Trigger Recrawl]`.

Crawling will proceed as before, but this time the crawler will try to fetch only the data that has changed since the
last crawl, increasing the number of documents by a percentage. This will typically be much faster than the initial crawl.

### Growing the crawl set

The re-crawl will also pull new domains from the `New Domains` dataset, which is a URL configurable in
`[Top Menu] -> System -> Data Sets`. If a new domain is found, it will be assigned to the present node, and crawled in
the re-crawl.

![Datasets screenshot](images/datasets.png)
@@ -3,12 +3,11 @@
A lot of the architectural description is sprinkled into the code repository closer to the code.
Start in [📁 ../code/](../code/) and poke around.

Operational documentation is available at [🌎 https://docs.marginalia.nu/](https://docs.marginalia.nu/).

## Operations

* [System Properties](system-properties.md) - JVM property flags

## How-To

* [Sideloading How-To](sideloading-howto.md) - How to sideload various data sets
* [Parquet How-To](parquet-howto.md) - Useful tips in working with Parquet files

## Set-up
@@ -1,211 +0,0 @@
# Sideloading How-To

Some websites are much larger than others; this includes
Wikipedia, Stack Overflow, and a few others. They are so
large that they are impractical to crawl in the traditional fashion,
but luckily they make available data dumps that can be processed
and loaded into the search engine through other means.

To this end, it's possible to sideload data into the search engine
from other sources than the web crawler.

## Index Nodes

In practice, if you want to sideload data, you need to do it on
a separate index node. Index nodes are separate instances of the
index software. The default configuration is to have two index nodes,
one for the web crawler, and one for sideloaded data.

The need for a separate node is due to incompatibilities in the work flows.

It is also a good idea in general, as a very large domain can easily be so large that the entire time budget
for a query is spent sifting through documents from that one domain. This is
especially true with something like Wikipedia, which has a lot of documents at
least tangentially related to any given topic.

This how-to assumes that you are operating on index node 2.

## Notes on the upload directory

This is written assuming that the system is installed with the `install.sh`
script, which deploys the system with docker-compose, and has a directory
structure like

```
...
index-1/backup/
index-1/index/
index-1/storage/
index-1/uploads/
index-1/work/
index-2/backup/
index-2/index/
index-2/storage/
index-2/uploads/
index-2/work/
...
```

We're going to be putting files in the **uploads** directories. If you have installed
the system in some other way, or changed the configuration significantly, you need
to adjust the paths accordingly.

## Sideloading

The sideloading actions are available through the Actions menu in each node.

![Sideload menu](images/sideload_menu.png)

## Sideloading WARCs

WARC files are the standard format for web archives. They can be created e.g. with wget.
The Marginalia software can read WARC files directly, and sideload them into the index,
as long as each WARC file contains only one domain.

Let's for example archive www.marginalia.nu (I own this domain, so feel free to try this at home):

```bash
$ wget -r --warc-file=marginalia www.marginalia.nu
```

**Note**: If you intend to do this on other websites, you should probably add a `--wait` parameter to wget,
e.g. `wget --wait=1 -r --warc-file=...`, to avoid hammering the website with requests and getting blocked.

This will take a moment, and create a file called `marginalia.warc.gz`. We move it to the
upload directory of the index node, and sideload it through the Actions menu.

```bash
$ mkdir -p index-2/uploads/marginalia-warc
$ mv marginalia.warc.gz index-2/uploads/marginalia-warc
```

Go to the Actions menu, and select the "Sideload WARC" action. This will show a list of
subdirectories in the Uploads directory. Select the directory containing the WARC file, and
click "Sideload".

![Sideload WARC screenshot](images/sideload_warc.png)

This should take you to the node overview, where you can see the progress of the sideloading.
It will take a moment, as the WARC file is being processed.

![Processing in progress](images/convert_2.png)

It will not be loaded automatically. This is to permit you to sideload multiple sources.

When you are ready to load it, go to the Actions menu, and select "Load Crawl Data".

![Load Crawl Data](images/load_warc.png)

Select all the sources you want to load, and click "Load". This will load the data into the
index, and make it available for searching.

## Sideloading Wikipedia

Due to licensing incompatibilities with OpenZim's GPL-2 and AGPL, the workflow
depends on using the conversion process from [https://encyclopedia.marginalia.nu/](https://encyclopedia.marginalia.nu/)
to pre-digest the data.

Build the [encyclopedia.marginalia.nu code](https://github.com/MarginaliaSearch/encyclopedia.marginalia.nu)
and follow the instructions for downloading a ZIM file, and then run something like

```bash
$ ./encyclopedia convert file.zim articles.db
```

This db-file can be processed and loaded into the search engine through the
Actions view.

FIXME: It will currently only point to en.wikipedia.org; this should be
made configurable.

## Sideloading a directory tree

For relatively small websites, ad-hoc side-loading is available directly from a
folder structure on the hard drive. This is intended for loading manuals,
documentation and similar data sets that are large and slowly changing.

A website can be archived with wget, like this:

```bash
UA="search.marginalia.nu" \
DOMAIN="www.example.com" \
wget -nc -x --continue -w 1 -r -U ${UA} -A "html" ${DOMAIN}
```

After doing this to a bunch of websites, create a YAML file something like this:

```yaml
sources:
- name: jdk-20
  dir: "jdk-20/"
  domainName: "docs.oracle.com"
  baseUrl: "https://docs.oracle.com/en/java/javase/20/docs"
  keywords:
  - "java"
  - "docs"
  - "documentation"
  - "javadoc"
- name: python3
  dir: "python-3.11.5/"
  domainName: "docs.python.org"
  baseUrl: "https://docs.python.org/3/"
  keywords:
  - "python"
  - "docs"
  - "documentation"
- name: mariadb.com
  dir: "mariadb.com/"
  domainName: "mariadb.com"
  baseUrl: "https://mariadb.com/"
  keywords:
  - "sql"
  - "docs"
  - "mariadb"
  - "mysql"
```

|parameter|description|
|----|----|
|name|Purely informative|
|dir|Path of website contents relative to the location of the yaml file|
|domainName|The domain name of the website|
|baseUrl|This URL will be prefixed to the contents of `dir`|
|keywords|These supplemental keywords will be injected in each document|

The directory structure corresponding to the above might look like

```
docs-index.yaml
jdk-20/
jdk-20/resources/
jdk-20/api/
jdk-20/api/[...]
jdk-20/specs/
jdk-20/specs/[...]
jdk-20/index.html
mariadb.com
mariadb.com/kb/
mariadb.com/kb/[...]
python-3.11.5
python-3.11.5/genindex-B.html
python-3.11.5/library/
python-3.11.5/distutils/
python-3.11.5/[...]
[...]
```

This yaml-file can be processed and loaded into the search engine through the
Actions view.

## Sideloading Stack Overflow/Stackexchange

Stackexchange makes dumps available on Archive.org. These are unfortunately in a format that
needs some heavy-handed pre-processing before they can be loaded. A tool is available for
this in [tools/stackexchange-converter](../code/tools/stackexchange-converter).

After running `gradlew dist`, this tool is found in `build/dist/stackexchange-converter`.
Follow the instructions in the stackexchange-converter readme, and
convert the stackexchange xml.7z-files to sqlite db-files.
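
Roughly, the flow looks like the following; the converter invocation here is hypothetical, so defer to the tool's readme for the actual launcher name and arguments:

```bash
$ ./gradlew dist
# Hypothetical invocation -- check the stackexchange-converter readme for the real arguments
$ build/dist/stackexchange-converter/bin/stackexchange-converter stackoverflow.com-Posts.7z stackoverflow.db
```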

A directory with such db-files can be processed and loaded into the
search engine through the Actions view.
@@ -1,42 +0,0 @@
# System Properties

These are JVM system properties used by each service. These properties can either
be loaded from a file or passed in as command line arguments, using `$JAVA_OPTS`.

The system will look for a properties file in `conf/properties/system.properties`,
within the install dir, as specified by `$WMSA_HOME`.

A template is available in [../run/template/conf/properties/system.properties](../run/template/conf/properties/system.properties).
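
For instance, to disable the IP blacklist you could do either of the following (a sketch; how `$JAVA_OPTS` reaches the services depends on how you deploy them):

```
# either as a line in conf/properties/system.properties ...
blacklist.disable=true

# ... or as a JVM argument appended to $JAVA_OPTS
-Dblacklist.disable=true
```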

## Global

| flag              | values  | description                          |
|-------------------|---------|--------------------------------------|
| blacklist.disable | boolean | Disables the IP blacklist            |
| flyway.disable    | boolean | Disables automatic Flyway migrations |

## Crawler Properties

| flag                         | values  | description                                                                                 |
|------------------------------|---------|---------------------------------------------------------------------------------------------|
| crawler.userAgentString      | string  | Sets the user agent string used by the crawler                                              |
| crawler.userAgentIdentifier  | string  | Sets the user agent identifier used by the crawler, e.g. what it looks for in robots.txt    |
| crawler.poolSize             | integer | Sets the number of threads used by the crawler; more is faster, but uses more RAM           |
| crawler.initialUrlsPerDomain | integer | Sets the initial number of URLs to crawl per domain (when crawling from spec)               |
| crawler.maxUrlsPerDomain     | integer | Sets the maximum number of URLs to crawl per domain (when recrawling)                       |
| crawler.minUrlsPerDomain     | integer | Sets the minimum number of URLs to crawl per domain (when recrawling)                       |
| crawler.crawlSetGrowthFactor | double  | If 100 documents were fetched last crawl, increase the goal to 100 x (this value) this time |
| ip-blocklist.disabled        | boolean | Disables the IP blocklist                                                                   |

## Converter Properties

| flag                        | values  | description                                                                                                                                           |
|-----------------------------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------|
| converter.sideloadThreshold | integer | Threshold value, in number of documents per domain, where a simpler processing method is used which uses less RAM. 10,000 is a good value for ~32 GB RAM |

## Marginalia Application Specific

| flag                      | values  | description                                                    |
|---------------------------|---------|----------------------------------------------------------------|
| search.websiteUrl         | string  | Overrides the website URL used in rendering                    |
| control.hideMarginaliaApp | boolean | Hides the Marginalia application from the control GUI results  |
@@ -1,181 +0,0 @@
# This is the barebones docker-compose file for the Marginalia Search Engine.
#
# It starts a stripped-down version of the search engine, with only the essential
# services running, including the database, the query service, the control service,
# and a single index and executor node.
#
# It is a good starting point for setting up a white-label search engine that does not
# have Marginalia's GUI. The Query Service presents a simple search box, that also talks
# JSON, so you can use it as a backend for your own search interface.

x-svc: &service
  env_file:
    - "run/env/service.env"
  volumes:
    - conf:/wmsa/conf:ro
    - model:/wmsa/model
    - data:/wmsa/data
    - logs:/var/log/wmsa
  networks:
    - wmsa
  depends_on:
    - mariadb
  labels:
    - "__meta_docker_port_private=7000"
x-p1: &partition-1
  env_file:
    - "run/env/service.env"
  volumes:
    - conf:/wmsa/conf:ro
    - model:/wmsa/model
    - data:/wmsa/data
    - logs:/var/log/wmsa
    - index-1:/idx
    - work-1:/work
    - backup-1:/backup
    - samples-1:/storage
    - uploads-1:/uploads
  networks:
    - wmsa
  depends_on:
    - mariadb
  environment:
    - "WMSA_SERVICE_NODE=1"

services:
  index-service-1:
    <<: *partition-1
    image: "marginalia/index-service"
    container_name: "index-service-1"
  executor-service-1:
    <<: *partition-1
    image: "marginalia/executor-service"
    container_name: "executor-service-1"
  query-service:
    <<: *service
    image: "marginalia/query-service"
    container_name: "query-service"
    expose:
      - 80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.search-service.rule=PathPrefix(`/`)"
      - "traefik.http.routers.search-service.entrypoints=search"
      - "traefik.http.routers.search-service.middlewares=add-xpublic"
      - "traefik.http.routers.search-service.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  control-service:
    <<: *service
    image: "marginalia/control-service"
    container_name: "control-service"
    expose:
      - 80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.control-service.rule=PathPrefix(`/`)"
      - "traefik.http.routers.control-service.entrypoints=control"
      - "traefik.http.routers.control-service.middlewares=add-xpublic"
      - "traefik.http.routers.control-service.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  mariadb:
    image: "mariadb:lts"
    container_name: "mariadb"
    env_file: "run/env/mariadb.env"
    command: ['mysqld', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci']
    ports:
      - "127.0.0.1:3306:3306/tcp"
    healthcheck:
      test: mysqladmin ping -h 127.0.0.1 -u $$MARIADB_USER --password=$$MARIADB_PASSWORD
      start_period: 5s
      interval: 5s
      timeout: 5s
      retries: 60
    volumes:
      - db:/var/lib/mysql
      - "./code/common/db/src/main/resources/sql/current/:/docker-entrypoint-initdb.d/"
    networks:
      - wmsa
  traefik:
    image: "traefik:v2.10"
    container_name: "traefik"
    command:
      #- "--log.level=DEBUG"
      - "--api.insecure=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.search.address=:80"
      - "--entrypoints.control.address=:81"
    ports:
      - "127.0.0.1:8080:80"
      - "127.0.0.1:8081:81"
      - "127.0.0.1:8090:8080"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    networks:
      - wmsa
networks:
  wmsa:
volumes:
  db:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/db
  logs:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/logs
  model:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/model
  conf:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/conf
  data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/data
  samples-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/samples
  index-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/index
  work-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/work
  backup-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/backup
  uploads-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/uploads
@@ -1,315 +0,0 @@
# This is the full docker-compose.yml file for the Marginalia Search Engine.
#
# It starts all the services, including the GUI, the database, the query service,
# two nodes for demo purposes, as well as a bunch of peripheral services that are
# application specific.
#

x-svc: &service
  env_file:
    - "run/env/service.env"
  volumes:
    - conf:/wmsa/conf:ro
    - model:/wmsa/model
    - data:/wmsa/data
    - logs:/var/log/wmsa
  networks:
    - wmsa
  labels:
    - "__meta_docker_port_private=7000"
x-p1: &partition-1
  env_file:
    - "run/env/service.env"
  volumes:
    - conf:/wmsa/conf:ro
    - model:/wmsa/model
    - data:/wmsa/data
    - logs:/var/log/wmsa
    - index-1:/idx
    - work-1:/work
    - backup-1:/backup
    - samples-1:/storage
    - uploads-1:/uploads
  networks:
    - wmsa
  depends_on:
    - mariadb
  environment:
    - "WMSA_SERVICE_NODE=1"
x-p2: &partition-2
  env_file:
    - "run/env/service.env"
  volumes:
    - conf:/wmsa/conf:ro
    - model:/wmsa/model
    - data:/wmsa/data
    - logs:/var/log/wmsa
    - index-2:/idx
    - work-2:/work
    - backup-2:/backup
    - samples-2:/storage
    - uploads-2:/uploads
  networks:
    - wmsa
  depends_on:
    mariadb:
      condition: service_healthy
  environment:
    - "WMSA_SERVICE_NODE=2"

services:
  index-service-1:
    <<: *partition-1
    image: "marginalia/index-service"
    container_name: "index-service-1"
  executor-service-1:
    <<: *partition-1
    image: "marginalia/executor-service"
    container_name: "executor-service-1"
  index-service-2:
    <<: *partition-2
    image: "marginalia/index-service"
    container_name: "index-service-2"
  executor-service-2:
    <<: *partition-2
    image: "marginalia/executor-service"
    container_name: "executor-service-2"
  query-service:
    <<: *service
    image: "marginalia/query-service"
    container_name: "query-service"
  search-service:
    <<: *service
    image: "marginalia/search-service"
    container_name: "search-service"
    expose:
      - 80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.search-service.rule=PathPrefix(`/`)"
      - "traefik.http.routers.search-service.entrypoints=search"
      - "traefik.http.routers.search-service.middlewares=add-xpublic"
      - "traefik.http.routers.search-service.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  assistant-service:
    <<: *service
    image: "marginalia/assistant-service"
    container_name: "assistant-service"
    expose:
      - 80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.assistant-service-screenshot.rule=PathPrefix(`/screenshot`)"
      - "traefik.http.routers.assistant-service-screenshot.entrypoints=search,dating"
      - "traefik.http.routers.assistant-service-screenshot.middlewares=add-xpublic"
      - "traefik.http.routers.assistant-service-screenshot.middlewares=add-public"
      - "traefik.http.routers.assistant-service-suggest.rule=PathPrefix(`/suggest`)"
      - "traefik.http.routers.assistant-service-suggest.entrypoints=search"
      - "traefik.http.routers.assistant-service-suggest.middlewares=add-xpublic"
      - "traefik.http.routers.assistant-service-suggest.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  api-service:
    <<: *service
    image: "marginalia/api-service"
    container_name: "api-service"
    expose:
      - "80"
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.api-service.rule=PathPrefix(`/`)"
      - "traefik.http.routers.api-service.entrypoints=api"
      - "traefik.http.routers.api-service.middlewares=add-xpublic"
      - "traefik.http.routers.api-service.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  dating-service:
    <<: *service
    image: "marginalia/dating-service"
    container_name: "dating-service"
    expose:
      - 80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.dating-service.rule=PathPrefix(`/`)"
      - "traefik.http.routers.dating-service.entrypoints=dating"
      - "traefik.http.routers.dating-service.middlewares=add-xpublic"
      - "traefik.http.routers.dating-service.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  explorer-service:
    <<: *service
    image: "marginalia/explorer-service"
    container_name: "explorer-service"
    expose:
      - 80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.explorer-service.rule=PathPrefix(`/`)"
      - "traefik.http.routers.explorer-service.entrypoints=explore"
      - "traefik.http.routers.explorer-service.middlewares=add-xpublic"
      - "traefik.http.routers.explorer-service.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  control-service:
    <<: *service
    image: "marginalia/control-service"
    container_name: "control-service"
    expose:
      - 80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.control-service.rule=PathPrefix(`/`)"
      - "traefik.http.routers.control-service.entrypoints=control"
      - "traefik.http.routers.control-service.middlewares=add-xpublic"
      - "traefik.http.routers.control-service.middlewares=add-public"
      - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1"
      - "traefik.http.middlewares.add-public.addprefix.prefix=/public"
  mariadb:
    image: "mariadb:lts"
    container_name: "mariadb"
    env_file: "run/env/mariadb.env"
    command: ['mysqld', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci']
    ports:
      - "127.0.0.1:3306:3306/tcp"
    healthcheck:
      test: mysqladmin ping -h 127.0.0.1 -u $$MARIADB_USER --password=$$MARIADB_PASSWORD
      start_period: 5s
      interval: 5s
      timeout: 5s
      retries: 60
    volumes:
      - db:/var/lib/mysql
      - "./code/common/db/src/main/resources/sql/current/:/docker-entrypoint-initdb.d/"
    networks:
      - wmsa
  traefik:
    image: "traefik:v2.10"
    container_name: "traefik"
    command:
      #- "--log.level=DEBUG"
      - "--api.insecure=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.search.address=:80"
      - "--entrypoints.control.address=:81"
      - "--entrypoints.api.address=:82"
      - "--entrypoints.dating.address=:83"
      - "--entrypoints.explore.address=:84"
    ports:
      - "127.0.0.1:8080:80"
      - "127.0.0.1:8081:81"
      - "127.0.0.1:8082:82"
      - "127.0.0.1:8083:83"
      - "127.0.0.1:8084:84"
      - "127.0.0.1:8090:8080"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    networks:
      - wmsa
  prometheus:
    image: "prom/prometheus"
    container_name: "prometheus"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
    ports:
      - "127.0.0.1:8091:9090"
    volumes:
      - "./run/prometheus.yml:/etc/prometheus/prometheus.yml"
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    networks:
      - wmsa
networks:
  wmsa:
volumes:
  db:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/db
  logs:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/logs
  model:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/model
  conf:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/conf
  data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/data
  samples-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/samples
  index-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/index
  work-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/work
  backup-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/backup
  uploads-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-1/uploads
  samples-2:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-2/samples
  index-2:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-2/index
  work-2:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-2/work
  backup-2:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-2/backup
  uploads-2:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: run/node-2/uploads
@@ -1,59 +0,0 @@
#!/bin/bash

set -e

# Check if wget exists
if command -v wget &> /dev/null; then
    dl_prg="wget -O"
elif command -v curl &> /dev/null; then
    dl_prg="curl -o"
else
    echo "Neither wget nor curl found, exiting .."
    exit 1
fi

case "$1" in
    "s"|"m"|"l"|"xl")
        ;;
    *)
        echo "Invalid argument. Must be one of 's', 'm', 'l' or 'xl'."
        exit 1
        ;;
esac

SAMPLE_NAME=crawl-${1:-m}
SAMPLE_DIR="node-1/samples/${SAMPLE_NAME}/"

function download_model {
    model=$1
    url=$2

    if [ ! -f $model ]; then
        echo "** Downloading $url"
        $dl_prg $model $url
    fi
}

pushd $(dirname $0)

if [ -d ${SAMPLE_DIR} ]; then
    echo "${SAMPLE_DIR} already exists; remove it if you want to re-download the sample"
fi

mkdir -p node-1/samples/
SAMPLE_TARBALL=samples/${SAMPLE_NAME}.tar.gz
download_model ${SAMPLE_TARBALL}.tmp https://downloads.marginalia.nu/${SAMPLE_TARBALL} && mv ${SAMPLE_TARBALL}.tmp ${SAMPLE_TARBALL}

if [ ! -f ${SAMPLE_TARBALL} ]; then
    echo "!! Failed"
    exit 255
fi

mkdir -p ${SAMPLE_DIR}
tar zxf ${SAMPLE_TARBALL} --strip-components=1 -C ${SAMPLE_DIR}
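
# Write a manifest so the system recognizes the extracted sample as crawl data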
cat > "${SAMPLE_DIR}/marginalia-manifest.json" <<EOF
{ "description": "Sample data set ${SAMPLE_NAME}", "type": "CRAWL_DATA" }
EOF

popd
run/readme.md
@@ -1,8 +1,10 @@
# Run

When developing locally, this directory will contain run-time data required for
the search engine. In a clean check-out, it only contains the tools required to
bootstrap this directory structure.

This directory is a staging area for running the system. It contains scripts
and templates for installing the system on a server, and for running it locally.

See [https://docs.marginalia.nu/](https://docs.marginalia.nu/) for additional
documentation.

## Requirements
@@ -16,8 +18,7 @@ graalce is a good distribution choice but it doesn't matter too much.
## Set up

To go from a clean check out of the git repo to a running search engine,
follow these steps. This assumes a test deployment. For a production-like
setup... (TODO: write a guide for this).

You're assumed to sit in the project root the whole time.
@@ -35,106 +36,40 @@ $ run/setup.sh

```shell
$ ./gradlew docker
```

### 3. Initialize the database

Before the system can be brought online, the database needs to be initialized. To do this,
bring up the database in the background, and run the flyway migration tool.

```shell
$ docker-compose up -d mariadb
$ ./gradlew flywayMigrate
```

### 3. Install the system

```shell
$ run/install.sh <install-directory>
```

To install the system, you need to run the install script. It will prompt
you for which installation mode you want to use. The options are:

1. Barebones - This will install a white-label search engine with no data. You can
   use this to index your own data. It disables and hides functionality that is strongly
   related to the Marginalia project, such as the Marginalia GUI.
2. Full Marginalia Search instance - This will install an instance of the search engine
   configured like [search.marginalia.nu](https://search.marginalia.nu). This is useful
   for local development and testing.

It will also prompt you for account details for a new mariadb instance, which will be
created for you. The database will be initialized with the schema and data required
for the search engine to run.

After filling out all the details, the script will copy the installation files to the
specified directory.

### 4. Bring the system online

We'll run it in the foreground in the terminal this time because it's educational to see the logs.
Add `-d` to run in the background.

```shell
$ docker-compose up
```

There are two docker-compose files available, `docker-compose.yml` and `docker-compose-barebones.yml`;
the latter is a stripped-down version that only runs the bare minimum required to run the system, e.g. for
running a white-label version of the system. The former is the full system with all the frills of
Marginalia Search, and is the one used by default.

To start the barebones version, run:

```shell
$ docker-compose -f docker-compose-barebones.yml up
```

### 4. Run the system

```shell
$ cd install_directory
$ docker-compose up -d
# To see the logs:
$ docker-compose logs -f
```

You can now access a search interface at `http://localhost:8080`, and the admin interface
at `http://localhost:8081/`.

There is no data in the system yet. To load data into the system,
see the guide at [https://docs.marginalia.nu/](https://docs.marginalia.nu/).

### 5. You should now be able to access the system

By default, the docker-compose file publishes the following ports:

| Address                 | Description     |
|-------------------------|-----------------|
| http://localhost:8080/  | User-facing GUI |
| http://localhost:8081/  | Operator's GUI  |

Note that the operator's GUI does not perform any sort of authentication.
Preferably don't expose it publicly, but if you absolutely must, use a proxy or
Basic Auth to add security.

### 6. Download Sample Data

A script is available for downloading sample data. The script will download the
data from https://downloads.marginalia.nu/ and extract it to the correct location.

The system will pick the data up automatically.

```shell
$ run/download-samples.sh l
```

Four sets are available:

| Name | Description                     |
|------|---------------------------------|
| s    | Small set, 1000 domains         |
| m    | Medium set, 2000 domains        |
| l    | Large set, 5000 domains         |
| xl   | Extra large set, 50,000 domains |

Warning: The XL set is intended to provide a large amount of data for
setting up a pre-production environment. It may be hard to run on a smaller
machine, and will on most machines take several hours to process.

The 'm' or 'l' sets are a good compromise between size and processing time
and should work on most machines.

### 7. Process the data

Bring the system online if it isn't (see step 4), then go to the operator's
GUI (see step 5).

* Go to `Node 1 -> Storage -> Crawl Data`
* Hit the toggle to set your crawl data to be active
* Go to `Actions -> Process Crawl Data -> [Trigger Reprocessing]`

This will take anywhere from a few minutes to a few hours depending on which
data set you downloaded. You can monitor the progress from the `Overview` tab.

First the CONVERTER is expected to run; this will process the data into a format
that can easily be inserted into the database and index.

Next the LOADER will run; this will insert the data into the database and index.

Next the link database will repartition itself, and finally the index will be
reconstructed. You can view the progress of these steps in the `Jobs` listing.

### 8. Run the system

Once all this is done, you can go to the user-facing GUI (see step 5) and try
a search.

Important! Use the 'No Ranking' option when running locally, since you'll very
likely not have enough links for the ranking algorithm to perform well.

## Experiment Runner

The script `experiment.sh` is a launcher for the experiment runner, which is useful when
evaluating new algorithms in processing crawl data.