diff --git a/README.md b/README.md
index 58f84e55..6a53f817 100644
--- a/README.md
+++ b/README.md
@@ -13,12 +13,21 @@ The long term plan is to refine the search engine so that it provide enough publ
 that the project can be funded through grants, donations and commercial API licenses
 (non-commercial share-alike is always free).
 
+The system can be run either as a copy of Marginalia Search, or as a white-label search engine
+for your own data (either crawled or side-loaded). At present the logic isn't very configurable, and a lot of the judgements
+made are based on the Marginalia project's goals, but additional configurability is being
+worked on!
+
 ## Set up
 
-Start by running [⚙️ run/setup.sh](run/setup.sh). This will download supplementary model data that is necessary to run the code.
+To set up a local test environment, follow the instructions in [📄 run/readme.md](run/readme.md)!
+
+Further documentation is available at [🌎 https://docs.marginalia.nu/](https://docs.marginalia.nu/).
+
+Before compiling, it's necessary to run [⚙️ run/setup.sh](run/setup.sh).
+This will download supplementary model data that is necessary to run the code.
 These are also necessary to run the tests.
-To set up a local test environment, follow the instructions in [📄 run/readme.md](run/readme.md)!
 
 ## Hardware Requirements
 
diff --git a/doc/crawling.md b/doc/crawling.md
deleted file mode 100644
index c1086ac5..00000000
--- a/doc/crawling.md
+++ /dev/null
@@ -1,115 +0,0 @@
-# Crawling
-
-## WARNING
-
-Please don't run the crawler unless you intend to actually operate a public-facing
-search engine! For testing, use crawl sets from [downloads.marginalia.nu](https://downloads.marginalia.nu/) instead;
-or, if you wish to play with the crawler, crawl a small set of domains whose owners are
-OK with it: your own, your friends', or any subdomain of marginalia.nu.
-
-See the documentation in run/ for more information on how to load sample data!
-
-Reckless crawling annoys webmasters and makes it harder to run an independent search engine.
-Crawling from a domestic IP address is also likely to put you on a greylist
-of probable bots. You will solve CAPTCHAs for almost every website you visit
-for weeks, and may be permanently blocked from a few IPs.
-
-## Prerequisites
-
-You probably want to run a local bind resolver to speed up DNS lookups and reduce the amount of
-DNS traffic.
-
-These processes require a lot of disk space. It's strongly recommended to use a dedicated disk for
-the index storage subdirectory; it doesn't need to be extremely fast, but it should be a few terabytes in size.
-
-It should be mounted with `noatime`. It may be a good idea to format the disk with a block size of 4096 bytes. This will reduce the amount of disk space used by the crawler.
-
-Make sure you configure the user-agent properly. This will be used to identify the crawler,
-and is matched against the robots.txt file. The crawler will not crawl sites that don't allow it.
-See [wiki://Robots_exclusion_standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard) for more information
-about robots.txt; the user agent can be configured in conf/properties/system.properties; see the
-[system-properties](system-properties.md) documentation for more information.
-
-## Setup
-
-Ensure that the system is running and go to http://localhost:8081.
-
-With the default test configuration, the system is configured to
-store data in `node-1/storage`.
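The prerequisites above call for configuring the user agent in `conf/properties/system.properties` before crawling. As a minimal, illustrative sketch — the property names come from the system-properties reference that is also removed in this diff, while the values are placeholder assumptions, not recommended defaults:

```properties
# conf/properties/system.properties -- illustrative values only
# Full user agent string sent with every request
crawler.userAgentString=search.example.com crawler (+https://search.example.com/about)
# Identifier matched against User-agent rules in robots.txt
crawler.userAgentIdentifier=search.example.com
```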
-
-## Fresh Crawl
-
-While a running search engine can use the link database to figure out which websites to visit, a clean
-system does not know of any links. To bootstrap a crawl, a crawl specification needs to be created to
-seed the domain database.
-
-Go to `Nodes->Node 1->Actions->New Crawl`
-
-![img](images/new_crawl.png)
-
-Click the link that says 'New Spec' to arrive at a form for creating a new specification:
-
-![img](images/new_spec.png)
-
-Fill out the form with a description and a link to a domain list. The domain list is a text file
-with one domain per line, with blank lines and comments starting with `#` ignored. You can use
-GitHub raw links for this purpose. For test purposes, you can use this link:
-`https://downloads.marginalia.nu/domain-list-test.txt`, which will create a crawl for a few
-of marginalia.nu's subdomains.
-
-If you aren't redirected there automatically, go back to the `New Crawl` page under Node 1 -> Actions.
-Your new specification should now be listed.
-
-Check the box next to it, and click `[Trigger New Crawl]`.
-
-![img](images/new_crawl2.png)
-
-This will start the crawling process. Crawling may take a while, depending on the size
-of the domain list and the size of the websites.
-
-![img](images/crawl_in_progress.png)
-
-Eventually a progress bar will show up, and the crawl will start. When it reaches 100%, the crawl is done.
-You can also monitor the `Events Summary` table on the same page to see what happened after the fact.
-
-It is expected that the crawl will stall out toward the end of the process. This is a statistical effect, since
-the largest websites take the longest to finish and tend to be the ones lingering at 99% or so completion. The
-crawler has a timeout of 5 hours: if no new domains finish crawling within that window, it will stop, to prevent crawler traps
-from stalling the crawl indefinitely.
-
-**Be sure to read the section on re-crawling!**
-
-## Converting
-
-Once the crawl is done, the data needs to be processed before it's searchable. This is done by going to
-`Nodes->Node 1->Actions->Process Crawl Data`.
-
-![Conversion screenshot](images/convert.png)
-
-This will start the conversion process. This will again take a while, depending on the size of the crawl.
-The progress bar will show the progress. When it reaches 100%, the conversion is done, and the data will begin
-loading automatically. A cascade of actions is performed in sequence, leading to the data being loaded into the
-search engine and an index being constructed. This is all automatic, but depending on the size of the crawl data,
-may take a while.
-
-When an event `INDEX-SWITCH-OK` is logged in the `Event Summary` table, the data is ready to be searched.
-
-## Re-crawling
-
-The workflow with a crawl spec was a one-off process to bootstrap the search engine. To keep the search engine up to date,
-it is preferable to do a re-crawl. This will try to reduce the amount of data that needs to be fetched.
-
-To trigger a re-crawl, go to `Nodes->Node 1->Actions->Re-crawl`. This will bring you to a page that looks similar to the
-first crawl page, where you can select a set of crawl data to use as a source. Select the crawl data you want, and
-press `[Trigger Recrawl]`.
-
-Crawling will proceed as before, but this time the crawler will try to fetch only the data that has changed since the
-last crawl, growing the number of fetched documents by a configurable percentage. This will typically be much faster than the initial crawl.
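The crawl specification form described in the Fresh Crawl section above expects a domain list: one domain per line, with blank lines and `#` comments ignored. A small illustrative list might look like this (the domains are placeholders):

```text
# Seed domains for a small test crawl
www.example.com
docs.example.com

# A friend's site that has agreed to be crawled
blog.example.org
```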
- -### Growing the crawl set - -The re-crawl will also pull new domains from the `New Domains` dataset, which is an URL configurable in -`[Top Menu] -> System -> Data Sets`. If a new domain is found, it will be assigned to the present node, and crawled in -the re-crawl. - -![Datasets screenshot](images/datasets.png) diff --git a/doc/images/convert.png b/doc/images/convert.png deleted file mode 100644 index 82ec7343..00000000 Binary files a/doc/images/convert.png and /dev/null differ diff --git a/doc/images/convert_2.png b/doc/images/convert_2.png deleted file mode 100644 index c5adb27d..00000000 Binary files a/doc/images/convert_2.png and /dev/null differ diff --git a/doc/images/crawl_in_progress.png b/doc/images/crawl_in_progress.png deleted file mode 100644 index ceb39056..00000000 Binary files a/doc/images/crawl_in_progress.png and /dev/null differ diff --git a/doc/images/datasets.png b/doc/images/datasets.png deleted file mode 100644 index a5bf0d87..00000000 Binary files a/doc/images/datasets.png and /dev/null differ diff --git a/doc/images/load_warc.png b/doc/images/load_warc.png deleted file mode 100644 index 5e0cedde..00000000 Binary files a/doc/images/load_warc.png and /dev/null differ diff --git a/doc/images/new_crawl.png b/doc/images/new_crawl.png deleted file mode 100644 index ae905cd6..00000000 Binary files a/doc/images/new_crawl.png and /dev/null differ diff --git a/doc/images/new_crawl2.png b/doc/images/new_crawl2.png deleted file mode 100644 index cc85acbe..00000000 Binary files a/doc/images/new_crawl2.png and /dev/null differ diff --git a/doc/images/new_spec.png b/doc/images/new_spec.png deleted file mode 100644 index 8b466e87..00000000 Binary files a/doc/images/new_spec.png and /dev/null differ diff --git a/doc/images/sideload_menu.png b/doc/images/sideload_menu.png deleted file mode 100644 index 6a85d076..00000000 Binary files a/doc/images/sideload_menu.png and /dev/null differ diff --git a/doc/images/sideload_warc.png b/doc/images/sideload_warc.png deleted file mode 100644 index dd763efc..00000000 Binary files a/doc/images/sideload_warc.png and /dev/null differ diff --git a/doc/readme.md b/doc/readme.md index bbc64105..082b14e7 100644 --- a/doc/readme.md +++ b/doc/readme.md @@ -3,12 +3,11 @@ A lot of the architectural description is sprinkled into the code repository closer to the code. Start in [📁 ../code/](../code/) and poke around. +Operational documentation is available at [🌎 https://docs.marginalia.nu/](https://docs.marginalia.nu/). + ## Operations -* [System Properties](system-properties.md) - JVM property flags - ## How-To -* [Sideloading How-To](sideloading-howto.md) - How to sideload various data sets * [Parquet How-To](parquet-howto.md) - Useful tips in working with Parquet files ## Set-up diff --git a/doc/sideloading-howto.md b/doc/sideloading-howto.md deleted file mode 100644 index 93a44981..00000000 --- a/doc/sideloading-howto.md +++ /dev/null @@ -1,211 +0,0 @@ -# Sideloading How-To - -Some websites are much larger than others, this includes -Wikipedia, Stack Overflow, and a few others. They are so -large they are impractical to crawl in the traditional fashion, -but luckily they make available data dumps that can be processed -and loaded into the search engine through other means. - -To this end, it's possible to sideload data into the search engine -from other sources than the web crawler. - -## Index Nodes - -In practice, if you want to sideload data, you need to do it on -a separate index node. Index nodes are separate instances of the -index software. 
The default configuration is to have two index nodes, -one for the web crawler, and one for sideloaded data. - -The need for a separate node is due to incompatibilities in the work flows. - -It is also a good idea in general, as very large domains can easily be so large that the entire time budget -for the query is spent sifting through documents from that one domain, this is -especially true with something like Wikipedia, which has a lot of documents at -least tangentially related to any given topic. - -This how-to assumes that you are operating on index-node 2. - -## Notes on the upload directory - -This is written assuming that the system is installed with the `install.sh` -script, which deploys the system with docker-compose, and has a directory -structure like - -``` -... -index-1/backup/ -index-1/index/ -index-1/storage/ -index-1/uploads/ -index-1/work/ -index-2/backup/ -index-2/index/ -index-2/storage/ -index-2/uploads/ -index-2/work/ -... -``` - -We're going to be putting files in the **uploads** directories. If you have installed -the system in some other way, or changed the configuration significantly, you need -to adjust the paths accordingly. - -## Sideloading - -The sideloading actions are available through Actions menu in each node. - -![Sideload menu](images/sideload_menu.png) - -## Sideloading WARCs - -WARC files are the standard format for web archives. They can be created e.g. with wget. -The Marginalia software can read WARC files directly, and sideload them into the index, -as long as each warc file contains only one domain. - -Let's for example archive www.marginalia.nu (I own this domain, so feel free to try this at home) - -```bash -$ wget -r --warc-file=marginalia www.marginalia.nu -``` - -**Note** If you intend to do this on other websites, you should probably add a `--wait` parameter to wget, -e.g. `wget --wait=1 -r --warc-file=...` to avoid hammering the website with requests and getting blocked. - -This will take a moment, and create a file called `marginalia.warc.gz`. We move it to the -upload directory of the index node, and sideload it through the Actions menu. - -```bash -$ mkdir -p index-2/uploads/marginalia-warc -$ mv marginalia.warc.gz index-2/uploads/marginalia-warc -``` - -Go to the Actions menu, and select the "Sideload WARC" action. This will show a list of -subdirectories in the Uploads directory. Select the directory containing the WARC file, and -click "Sideload". - -![Sideload WARC screenshot](images/sideload_warc.png) - -This should take you to the node overview, where you can see the progress of the sideloading. -It will take a moment, as the WARC file is being processed. - -![Processing in progress](images/convert_2.png) - -It will not be loaded automatically. This is to permit you to sideload multiple sources. - -When you are ready to load it, go to the Actions menu, and select "Load Crawl Data". - -![Load Crawl Data](images/load_warc.png) - -Select all the sources you want to load, and click "Load". This will load the data into the -index, and make it available for searching. - -## Sideloading Wikipedia - -Due to licensing incompatibilities with OpenZim's GPL-2 and AGPL, the workflow -depends on using the conversion process from [https://encyclopedia.marginalia.nu/](https://encyclopedia.marginalia.nu/) -to pre-digest the data. 
- -Build the [encyclopedia.marginalia.nu Code](https://github.com/MarginaliaSearch/encyclopedia.marginalia.nu) -and follow the instructions for downloading a ZIM file, and then run something like - -```$./encyclopedia convert file.zim articles.db``` - -This db-file can be processed and loaded into the search engine through the -Actions view. - -FIXME: It will currently only point to en.wikipedia.org, this should be -made configurable. - - -## Sideloading a directory tree - -For relatively small websites, ad-hoc side-loading is available directly from a -folder structure on the hard drive. This is intended for loading manuals, -documentation and similar data sets that are large and slowly changing. - -A website can be archived with wget, like this - -```bash -UA="search.marginalia.nu" \ -DOMAIN="www.example.com" \ -wget -nc -x --continue -w 1 -r -U ${UA} -A "html" ${DOMAIN} -``` - -After doing this to a bunch of websites, create a YAML file something like this: - -```yaml -sources: -- name: jdk-20 - dir: "jdk-20/" - domainName: "docs.oracle.com" - baseUrl: "https://docs.oracle.com/en/java/javase/20/docs" - keywords: - - "java" - - "docs" - - "documentation" - - "javadoc" -- name: python3 - dir: "python-3.11.5/" - domainName: "docs.python.org" - baseUrl: "https://docs.python.org/3/" - keywords: - - "python" - - "docs" - - "documentation" -- name: mariadb.com - dir: "mariadb.com/" - domainName: "mariadb.com" - baseUrl: "https://mariadb.com/" - keywords: - - "sql" - - "docs" - - "mariadb" - - "mysql" -``` - -|parameter|description| -|----|----| -|name|Purely informative| -|dir|Path of website contents relative to the location of the yaml file| -|domainName|The domain name of the website| -|baseUrl|This URL will be prefixed to the contents of `dir`| -|keywords|These supplemental keywords will be injected in each document| - -The directory structure corresponding to the above might look like - -``` -docs-index.yaml -jdk-20/ -jdk-20/resources/ -jdk-20/api/ -jdk-20/api/[...] -jdk-20/specs/ -jdk-20/specs/[...] -jdk-20/index.html -mariadb.com -mariadb.com/kb/ -mariadb.com/kb/[...] -python-3.11.5 -python-3.11.5/genindex-B.html -python-3.11.5/library/ -python-3.11.5/distutils/ -python-3.11.5/[...] -[...] -``` - -This yaml-file can be processed and loaded into the search engine through the -Actions view. - - -## Sideloading Stack Overflow/Stackexchange - -Stackexchange makes dumps available on Archive.org. These are unfortunately on a format that -needs some heavy-handed pre-processing before they can be loaded. A tool is available for -this in [tools/stackexchange-converter](../code/tools/stackexchange-converter). - -After running `gradlew dist`, this tool is found in `build/dist/stackexchange-converter`, -follow the instructions in the stackexchange-converter readme, and -convert the stackexchange xml.7z-files to sqlite db-files. - -A directory with such db-files can be processed and loaded into the -search engine through the Actions view. \ No newline at end of file diff --git a/doc/system-properties.md b/doc/system-properties.md deleted file mode 100644 index 0c825e7b..00000000 --- a/doc/system-properties.md +++ /dev/null @@ -1,42 +0,0 @@ -# System Properties - -These are JVM system properties used by each service. These properties can either -be loaded from a file or passed in as command line arguments, using `$JAVA_OPTS`. - -The system will look for a properties file in `conf/properties/system.properties`, -within the install dir, as specified by `$WMSA_HOME`. 
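As described just above, these properties can also be passed as command-line arguments through `$JAVA_OPTS` rather than the properties file. A hedged sketch using a few of the flags from the tables that follow; the specific values are assumptions for illustration:

```shell
# Illustrative only: override a few system properties via $JAVA_OPTS
export JAVA_OPTS="-Dcrawler.poolSize=32 -Dblacklist.disable=true -Dconverter.sideloadThreshold=10000"
```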
- -A template is available in [../run/template/conf/properties/system.properties](../run/template/conf/properties/system.properties). - -## Global - -| flag | values | description | -|-------------|------------|--------------------------------------| -| blacklist.disable | boolean | Disables the IP blacklist | -| flyway.disable | boolean | Disables automatic Flyway migrations | - -## Crawler Properties - -| flag | values | description | -|------------------------------|------------|---------------------------------------------------------------------------------------------| -| crawler.userAgentString | string | Sets the user agent string used by the crawler | -| crawler.userAgentIdentifier | string | Sets the user agent identifier used by the crawler, e.g. what it looks for in robots.txt | -| crawler.poolSize | integer | Sets the number of threads used by the crawler, more is faster, but uses more RAM | -| crawler.initialUrlsPerDomain | integer | Sets the initial number of URLs to crawl per domain (when crawling from spec) | -| crawler.maxUrlsPerDomain | integer | Sets the maximum number of URLs to crawl per domain (when recrawling) | -| crawler.minUrlsPerDomain | integer | Sets the minimum number of URLs to crawl per domain (when recrawling) | -| crawler.crawlSetGrowthFactor | double | If 100 documents were fetched last crawl, increase the goal to 100 x (this value) this time | -| ip-blocklist.disabled | boolean | Disables the IP blocklist | - -## Converter Properties - -| flag | values | description | -|-----------------------------|------------|----------------------------------------------------------------------------------------------------------------------------------------------------------| -| converter.sideloadThreshold | integer | Threshold value, in number of documents per domain, where a simpler processing method is used which uses less RAM. 10,000 is a good value for ~32GB RAM | - -# Marginalia Application Specific - -| flag | values | description | -|---------------------------|------------|---------------------------------------------------------------| -| search.websiteUrl | string | Overrides the website URL used in rendering | -| control.hideMarginaliaApp | boolean | Hides the Marginalia application from the control GUI results | diff --git a/docker-compose-barebones.yml b/docker-compose-barebones.yml deleted file mode 100644 index 9f1b3783..00000000 --- a/docker-compose-barebones.yml +++ /dev/null @@ -1,181 +0,0 @@ -# This is the barebones docker-compose file for the Marginalia Search Engine. -# -# It starts a stripped-down version of the search engine, with only the essential -# services running, including the database, the query service, the control service, -# and a single index and executor node. -# -# It is a good starting point for setting up a white-label search engine that does not -# have Marginalia's GUI. The Query Service presents a simple search box, that also talks -# JSON, so you can use it as a backend for your own search interface. 
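To sanity-check a barebones deployment like the one this file describes, something along these lines should work — a sketch, assuming the port mappings defined further down in this file, where Traefik publishes the search entrypoint on 127.0.0.1:8080:

```shell
# Start the stripped-down stack and verify the query service answers
docker-compose -f docker-compose-barebones.yml up -d
curl -s http://localhost:8080/ | head
```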
- - -x-svc: &service - env_file: - - "run/env/service.env" - volumes: - - conf:/wmsa/conf:ro - - model:/wmsa/model - - data:/wmsa/data - - logs:/var/log/wmsa - networks: - - wmsa - depends_on: - - mariadb - labels: - - "__meta_docker_port_private=7000" -x-p1: &partition-1 - env_file: - - "run/env/service.env" - volumes: - - conf:/wmsa/conf:ro - - model:/wmsa/model - - data:/wmsa/data - - logs:/var/log/wmsa - - index-1:/idx - - work-1:/work - - backup-1:/backup - - samples-1:/storage - - uploads-1:/uploads - networks: - - wmsa - depends_on: - - mariadb - environment: - - "WMSA_SERVICE_NODE=1" - -services: - index-service-1: - <<: *partition-1 - image: "marginalia/index-service" - container_name: "index-service-1" - executor-service-1: - <<: *partition-1 - image: "marginalia/executor-service" - container_name: "executor-service-1" - query-service: - <<: *service - image: "marginalia/query-service" - container_name: "query-service" - expose: - - 80 - labels: - - "traefik.enable=true" - - "traefik.http.routers.search-service.rule=PathPrefix(`/`)" - - "traefik.http.routers.search-service.entrypoints=search" - - "traefik.http.routers.search-service.middlewares=add-xpublic" - - "traefik.http.routers.search-service.middlewares=add-public" - - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1" - - "traefik.http.middlewares.add-public.addprefix.prefix=/public" - control-service: - <<: *service - image: "marginalia/control-service" - container_name: "control-service" - expose: - - 80 - labels: - - "traefik.enable=true" - - "traefik.http.routers.control-service.rule=PathPrefix(`/`)" - - "traefik.http.routers.control-service.entrypoints=control" - - "traefik.http.routers.control-service.middlewares=add-xpublic" - - "traefik.http.routers.control-service.middlewares=add-public" - - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1" - - "traefik.http.middlewares.add-public.addprefix.prefix=/public" - mariadb: - image: "mariadb:lts" - container_name: "mariadb" - env_file: "run/env/mariadb.env" - command: ['mysqld', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci'] - ports: - - "127.0.0.1:3306:3306/tcp" - healthcheck: - test: mysqladmin ping -h 127.0.0.1 -u $$MARIADB_USER --password=$$MARIADB_PASSWORD - start_period: 5s - interval: 5s - timeout: 5s - retries: 60 - volumes: - - db:/var/lib/mysql - - "./code/common/db/src/main/resources/sql/current/:/docker-entrypoint-initdb.d/" - networks: - - wmsa - traefik: - image: "traefik:v2.10" - container_name: "traefik" - command: - #- "--log.level=DEBUG" - - "--api.insecure=true" - - "--providers.docker=true" - - "--providers.docker.exposedbydefault=false" - - "--entrypoints.search.address=:80" - - "--entrypoints.control.address=:81" - ports: - - "127.0.0.1:8080:80" - - "127.0.0.1:8081:81" - - "127.0.0.1:8090:8080" - volumes: - - "/var/run/docker.sock:/var/run/docker.sock:ro" - networks: - - wmsa -networks: - wmsa: -volumes: - db: - driver: local - driver_opts: - type: none - o: bind - device: run/db - logs: - driver: local - driver_opts: - type: none - o: bind - device: run/logs - model: - driver: local - driver_opts: - type: none - o: bind - device: run/model - conf: - driver: local - driver_opts: - type: none - o: bind - device: run/conf - data: - driver: local - driver_opts: - type: none - o: bind - device: run/data - samples-1: - driver: local - driver_opts: - type: none - o: bind - device: run/node-1/samples - index-1: - driver: local - driver_opts: - type: none - o: bind - 
device: run/node-1/index - work-1: - driver: local - driver_opts: - type: none - o: bind - device: run/node-1/work - backup-1: - driver: local - driver_opts: - type: none - o: bind - device: run/node-1/backup - uploads-1: - driver: local - driver_opts: - type: none - o: bind - device: run/node-1/uploads \ No newline at end of file diff --git a/docker-compose.yml b/docker-compose.yml deleted file mode 100644 index 63c54f7f..00000000 --- a/docker-compose.yml +++ /dev/null @@ -1,315 +0,0 @@ -# This is the full docker-compose.yml file for the Marginalia Search Engine. -# -# It starts all the services, including the GUI, the database, the query service, -# two nodes for demo purposes, as well as a bunch of peripheral services that are -# application specific. -# - -x-svc: &service - env_file: - - "run/env/service.env" - volumes: - - conf:/wmsa/conf:ro - - model:/wmsa/model - - data:/wmsa/data - - logs:/var/log/wmsa - networks: - - wmsa - labels: - - "__meta_docker_port_private=7000" -x-p1: &partition-1 - env_file: - - "run/env/service.env" - volumes: - - conf:/wmsa/conf:ro - - model:/wmsa/model - - data:/wmsa/data - - logs:/var/log/wmsa - - index-1:/idx - - work-1:/work - - backup-1:/backup - - samples-1:/storage - - uploads-1:/uploads - networks: - - wmsa - depends_on: - - mariadb - environment: - - "WMSA_SERVICE_NODE=1" -x-p2: &partition-2 - env_file: - - "run/env/service.env" - volumes: - - conf:/wmsa/conf:ro - - model:/wmsa/model - - data:/wmsa/data - - logs:/var/log/wmsa - - index-2:/idx - - work-2:/work - - backup-2:/backup - - samples-2:/storage - - uploads-2:/uploads - networks: - - wmsa - depends_on: - mariadb: - condition: service_healthy - environment: - - "WMSA_SERVICE_NODE=2" - -services: - index-service-1: - <<: *partition-1 - image: "marginalia/index-service" - container_name: "index-service-1" - executor-service-1: - <<: *partition-1 - image: "marginalia/executor-service" - container_name: "executor-service-1" - index-service-2: - <<: *partition-2 - image: "marginalia/index-service" - container_name: "index-service-2" - executor-service-2: - <<: *partition-2 - image: "marginalia/executor-service" - container_name: "executor-service-2" - query-service: - <<: *service - image: "marginalia/query-service" - container_name: "query-service" - search-service: - <<: *service - image: "marginalia/search-service" - container_name: "search-service" - expose: - - 80 - labels: - - "traefik.enable=true" - - "traefik.http.routers.search-service.rule=PathPrefix(`/`)" - - "traefik.http.routers.search-service.entrypoints=search" - - "traefik.http.routers.search-service.middlewares=add-xpublic" - - "traefik.http.routers.search-service.middlewares=add-public" - - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1" - - "traefik.http.middlewares.add-public.addprefix.prefix=/public" - assistant-service: - <<: *service - image: "marginalia/assistant-service" - container_name: "assistant-service" - expose: - - 80 - labels: - - "traefik.enable=true" - - "traefik.http.routers.assistant-service-screenshot.rule=PathPrefix(`/screenshot`)" - - "traefik.http.routers.assistant-service-screenshot.entrypoints=search,dating" - - "traefik.http.routers.assistant-service-screenshot.middlewares=add-xpublic" - - "traefik.http.routers.assistant-service-screenshot.middlewares=add-public" - - "traefik.http.routers.assistant-service-suggest.rule=PathPrefix(`/suggest`)" - - "traefik.http.routers.assistant-service-suggest.entrypoints=search" - - 
"traefik.http.routers.assistant-service-suggest.middlewares=add-xpublic" - - "traefik.http.routers.assistant-service-suggest.middlewares=add-public" - - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1" - - "traefik.http.middlewares.add-public.addprefix.prefix=/public" - api-service: - <<: *service - image: "marginalia/api-service" - container_name: "api-service" - expose: - - "80" - labels: - - "traefik.enable=true" - - "traefik.http.routers.api-service.rule=PathPrefix(`/`)" - - "traefik.http.routers.api-service.entrypoints=api" - - "traefik.http.routers.api-service.middlewares=add-xpublic" - - "traefik.http.routers.api-service.middlewares=add-public" - - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1" - - "traefik.http.middlewares.add-public.addprefix.prefix=/public" - dating-service: - <<: *service - image: "marginalia/dating-service" - container_name: "dating-service" - expose: - - 80 - labels: - - "traefik.enable=true" - - "traefik.http.routers.dating-service.rule=PathPrefix(`/`)" - - "traefik.http.routers.dating-service.entrypoints=dating" - - "traefik.http.routers.dating-service.middlewares=add-xpublic" - - "traefik.http.routers.dating-service.middlewares=add-public" - - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1" - - "traefik.http.middlewares.add-public.addprefix.prefix=/public" - explorer-service: - <<: *service - image: "marginalia/explorer-service" - container_name: "explorer-service" - expose: - - 80 - labels: - - "traefik.enable=true" - - "traefik.http.routers.explorer-service.rule=PathPrefix(`/`)" - - "traefik.http.routers.explorer-service.entrypoints=explore" - - "traefik.http.routers.explorer-service.middlewares=add-xpublic" - - "traefik.http.routers.explorer-service.middlewares=add-public" - - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1" - - "traefik.http.middlewares.add-public.addprefix.prefix=/public" - control-service: - <<: *service - image: "marginalia/control-service" - container_name: "control-service" - expose: - - 80 - labels: - - "traefik.enable=true" - - "traefik.http.routers.control-service.rule=PathPrefix(`/`)" - - "traefik.http.routers.control-service.entrypoints=control" - - "traefik.http.routers.control-service.middlewares=add-xpublic" - - "traefik.http.routers.control-service.middlewares=add-public" - - "traefik.http.middlewares.add-xpublic.headers.customrequestheaders.X-Public=1" - - "traefik.http.middlewares.add-public.addprefix.prefix=/public" - mariadb: - image: "mariadb:lts" - container_name: "mariadb" - env_file: "run/env/mariadb.env" - command: ['mysqld', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci'] - ports: - - "127.0.0.1:3306:3306/tcp" - healthcheck: - test: mysqladmin ping -h 127.0.0.1 -u $$MARIADB_USER --password=$$MARIADB_PASSWORD - start_period: 5s - interval: 5s - timeout: 5s - retries: 60 - volumes: - - db:/var/lib/mysql - - "./code/common/db/src/main/resources/sql/current/:/docker-entrypoint-initdb.d/" - networks: - - wmsa - traefik: - image: "traefik:v2.10" - container_name: "traefik" - command: - #- "--log.level=DEBUG" - - "--api.insecure=true" - - "--providers.docker=true" - - "--providers.docker.exposedbydefault=false" - - "--entrypoints.search.address=:80" - - "--entrypoints.control.address=:81" - - "--entrypoints.api.address=:82" - - "--entrypoints.dating.address=:83" - - "--entrypoints.explore.address=:84" - ports: - - "127.0.0.1:8080:80" - - "127.0.0.1:8081:81" - - 
"127.0.0.1:8082:82" - - "127.0.0.1:8083:83" - - "127.0.0.1:8084:84" - - "127.0.0.1:8090:8080" - volumes: - - "/var/run/docker.sock:/var/run/docker.sock:ro" - networks: - - wmsa - prometheus: - image: "prom/prometheus" - container_name: "prometheus" - command: - - "--config.file=/etc/prometheus/prometheus.yml" - ports: - - "127.0.0.1:8091:9090" - volumes: - - "./run/prometheus.yml:/etc/prometheus/prometheus.yml" - - "/var/run/docker.sock:/var/run/docker.sock:ro" - networks: - - wmsa -networks: - wmsa: -volumes: - db: - driver: local - driver_opts: - type: none - o: bind - device: run/db - logs: - driver: local - driver_opts: - type: none - o: bind - device: run/logs - model: - driver: local - driver_opts: - type: none - o: bind - device: run/model - conf: - driver: local - driver_opts: - type: none - o: bind - device: run/conf - data: - driver: local - driver_opts: - type: none - o: bind - device: run/data - samples-1: - driver: local - driver_opts: - type: none - o: bind - device: run/node-1/samples - index-1: - driver: local - driver_opts: - type: none - o: bind - device: run/node-1/index - work-1: - driver: local - driver_opts: - type: none - o: bind - device: run/node-1/work - backup-1: - driver: local - driver_opts: - type: none - o: bind - device: run/node-1/backup - uploads-1: - driver: local - driver_opts: - type: none - o: bind - device: run/node-1/uploads - samples-2: - driver: local - driver_opts: - type: none - o: bind - device: run/node-2/samples - index-2: - driver: local - driver_opts: - type: none - o: bind - device: run/node-2/index - work-2: - driver: local - driver_opts: - type: none - o: bind - device: run/node-2/work - backup-2: - driver: local - driver_opts: - type: none - o: bind - device: run/node-2/backup - uploads-2: - driver: local - driver_opts: - type: none - o: bind - device: run/node-2/uploads \ No newline at end of file diff --git a/run/download-samples.sh b/run/download-samples.sh deleted file mode 100755 index bbae77e6..00000000 --- a/run/download-samples.sh +++ /dev/null @@ -1,59 +0,0 @@ -#!/bin/bash - -set -e - -# Check if wget exists -if command -v wget &> /dev/null; then - dl_prg="wget -O" -elif command -v curl &> /dev/null; then - dl_prg="curl -o" -else - echo "Neither wget nor curl found, exiting .." - exit 1 -fi - -case "$1" in -"s"|"m"|"l"|"xl") - ;; -*) - echo "Invalid argument. Must be one of 's', 'm', 'l' or 'xl'." - exit 1 - ;; -esac - -SAMPLE_NAME=crawl-${1:-m} -SAMPLE_DIR="node-1/samples/${SAMPLE_NAME}/" - -function download_model { - model=$1 - url=$2 - - if [ ! -f $model ]; then - echo "** Downloading $url" - $dl_prg $model $url - fi -} - -pushd $(dirname $0) - -if [ -d ${SAMPLE_DIR} ]; then - echo "${SAMPLE_DIR} already exists; remove it if you want to re-download the sample" -fi - -mkdir -p node-1/samples/ -SAMPLE_TARBALL=samples/${SAMPLE_NAME}.tar.gz -download_model ${SAMPLE_TARBALL}.tmp https://downloads.marginalia.nu/${SAMPLE_TARBALL} && mv ${SAMPLE_TARBALL}.tmp ${SAMPLE_TARBALL} - -if [ ! -f ${SAMPLE_TARBALL} ]; then - echo "!! Failed" - exit 255 -fi - -mkdir -p ${SAMPLE_DIR} -tar zxf ${SAMPLE_TARBALL} --strip-components=1 -C ${SAMPLE_DIR} - -cat > "${SAMPLE_DIR}/marginalia-manifest.json" < ``` -### 4. Bring the system online. +To install the system, you need to run the install script. It will prompt +you for which installation mode you want to use. The options are: -We'll run it in the foreground in the terminal this time because it's educational to see the logs. -Add `-d` to run in the background. +1. 
Barebones - This will install a white-label search engine with no data. You can + use this to index your own data. It disables and hides functionality that is strongly + related to the Marginalia project, such as the Marginalia GUI. +2. Full Marginalia Search instance - This will install an instance of the search engine + configured like [search.marginalia.nu](https://search.marginalia.nu). This is useful + for local development and testing. + +It will also prompt you for account details for a new mariadb instance, which will be +created for you. The database will be initialized with the schema and data required +for the search engine to run. + +After filling out all the details, the script will copy the installation files to the +specified directory. + +### 4. Run the system ```shell -$ docker-compose up +$ cd install_directory +$ docker-compose up -d +# To see the logs: +$ docker-compose logs -f ``` -There are two docker-compose files available, `docker-compose.yml` and `docker-compose-barebones.yml`; -the latter is a stripped down version that only runs the bare minimum required to run the system, for e.g. -running a whitelabel version of the system. The former is the full system with all the frills of -Marginalia Search, and is the one used by default. +You can now access a search interface at `http://localhost:8080`, and the admin interface +at `http://localhost:8081/`. -To start the barebones version, run: - -```shell -$ docker-compose -f docker-compose-barebones.yml up -``` - -### 5. You should now be able to access the system. - -By default, the docker-compose file publishes the following ports: - -| Address | Description | -|-------------------------|------------------| -| http://localhost:8080/ | User-facing GUI | -| http://localhost:8081/ | Operator's GUI | - -Note that the operator's GUI does not perform any sort of authentication. -Preferably don't expose it publicly, but if you absolutely must, use a proxy or -Basic Auth to add security. - -### 6. Download Sample Data - -A script is available for downloading sample data. The script will download the -data from https://downloads.marginalia.nu/ and extract it to the correct location. - -The system will pick the data up automatically. - -```shell -$ run/download-samples.sh l -``` - -Four sets are available: - -| Name | Description | -|------|---------------------------------| -| s | Small set, 1000 domains | -| m | Medium set, 2000 domains | -| l | Large set, 5000 domains | -| xl | Extra large set, 50,000 domains | - -Warning: The XL set is intended to provide a large amount of data for -setting up a pre-production environment. It may be hard to run on a smaller -machine and will on most machines take several hours to process. - -The 'm' or 'l' sets are a good compromise between size and processing time -and should work on most machines. - -### 7. Process the data - -Bring the system online if it isn't (see step 4), then go to the operator's -GUI (see step 5). - -* Go to `Node 1 -> Storage -> Crawl Data` -* Hit the toggle to set your crawl data to be active -* Go to `Actions -> Process Crawl Data -> [Trigger Reprocessing]` - -This will take anywhere between a few minutes to a few hours depending on which -data set you downloaded. You can monitor the progress from the `Overview` tab. - -First the CONVERTER is expected to run; this will process the data into a format -that can easily be inserted into the database and index. - -Next the LOADER will run; this will insert the data into the database and index. 
- -Next the link database will repartition itself, and finally the index will be -reconstructed. You can view the process of these steps in the `Jobs` listing. - -### 8. Run the system - -Once all this is done, you can go to the user-facing GUI (see step 5) and try -a search. - -Important! Use the 'No Ranking' option when running locally, since you'll very -likely not have enough links for the ranking algorithm to perform well. - -## Experiment Runner - -The script `experiment.sh` is a launcher for the experiment runner, which is useful when -evaluating new algorithms in processing crawl data. +There is no data in the system yet. To load data into the system, +see the guide at [https://docs.marginalia.nu/](https://docs.marginalia.nu/).
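As a quick way to verify that a freshly installed instance came up correctly, the following sketch checks the two interfaces mentioned above; the ports are taken from the readme, and `docker-compose ps`/`logs` are standard Compose commands:

```shell
# Run from the install directory after 'docker-compose up -d'
curl -sI http://localhost:8080/   # user-facing search interface
curl -sI http://localhost:8081/   # operator/admin interface
docker-compose ps                 # services should show as Up / healthy
docker-compose logs -f            # follow the logs if anything looks off
```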