(doc) Update Crawling Docs

Still a WIP, but now more accurately reflects the new GUI, with screenshots to boot!
Viktor Lofgren 2024-01-15 16:06:59 +01:00
parent fd1eec99b5
commit b9445d4f62
5 changed files with 35 additions and 35 deletions


@@ -3,79 +3,79 @@
This document is a draft.

## WARNING

Please don't run the crawler unless you intend to actually operate a public
facing search engine! For testing, use crawl sets from [downloads.marginalia.nu](https://downloads.marginalia.nu/) instead;
or, if you wish to play with the crawler, crawl a small set of domains from people who are
OK with it: use your own, your friends', or any subdomain of marginalia.nu.
See the documentation in run/ for more information on how to load sample data!

Reckless crawling annoys webmasters and makes it harder to run an independent search engine.
Crawling from a domestic IP address is also likely to put you on a greylist
of probable bots. You will solve CAPTCHAs for almost every website you visit
for weeks, and may be permanently blocked from a few IPs.

## Prerequisites

You probably want to run a local bind resolver to speed up DNS lookups and reduce the amount of
DNS traffic.
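
As a rough sketch of one way to do that (assuming a Debian/Ubuntu host; unbound is used here
only as an example, a caching bind9 setup works just as well):

```bash
# Install and start a local caching resolver (example: unbound on Debian/Ubuntu).
sudo apt install unbound
sudo systemctl enable --now unbound

# Then point the host's resolver at it, e.g. by making /etc/resolv.conf read:
#   nameserver 127.0.0.1
```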

These processes require a lot of disk space. It's strongly recommended to use a dedicated disk for
the index storage subdirectory; it doesn't need to be extremely fast, but it should be a few terabytes in size.
It should be mounted with `noatime`. It may be a good idea to format the disk with a block size of
4096 bytes. This will reduce the amount of disk space used by the crawler.
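
For illustration, something along these lines; the device name and mount point below are
placeholders for wherever your index storage actually lives:

```bash
# Format the dedicated disk with a 4096-byte block size and mount it noatime.
# /dev/sdX1 and /srv/index-storage are placeholders -- adjust to your setup.
sudo mkfs.ext4 -b 4096 /dev/sdX1
sudo mkdir -p /srv/index-storage
sudo mount -o noatime /dev/sdX1 /srv/index-storage

# Or persist it in /etc/fstab:
#   /dev/sdX1  /srv/index-storage  ext4  defaults,noatime  0  2
```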

Make sure you configure the user-agent properly. This will be used to identify the crawler,
and is matched against the robots.txt file. The crawler will not crawl sites that don't allow it.
See [wiki://Robots_exclusion_standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard) for more information
about robots.txt; the user agent can be configured in `conf/properties/system.properties`; see the
[system-properties](system-properties.md) documentation for more information.
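
For reference, robots.txt rules of roughly this shape are what the user-agent is matched against;
the `my-crawler` token below is purely illustrative, substitute whatever identifier you configured:

```
# Illustrative only: "my-crawler" is a placeholder user-agent token.
User-agent: my-crawler
Disallow: /

User-agent: *
Allow: /
```

A site with rules like these allows crawlers in general but asks `my-crawler` specifically to
stay away, and the crawler will respect that.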

## Setup

Ensure that the system is running and go to https://localhost:8081.

With the default test configuration, the system is configured to
store data in `node-1/storage`.

## Fresh Crawl

While a running search engine can use the link database to figure out which websites to visit, a clean
system does not know of any links. To bootstrap a crawl, a crawl specification needs to be created to
seed the domain database.

Go to `Nodes->Node 1->Actions->New Crawl`

![img](images/new_crawl.png)

Click the link that says 'New Spec' to arrive at a form for creating a new specification:

![img](images/new_spec.png)

Fill out the form with a description and a link to the domain list. The domain list is a plain text
file with one domain name per line; blank lines and comments starting with `#` are ignored. Make it
available over HTTP(S).
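
For illustration, a minimal domain list and a throwaway way to serve it over HTTP; the domain
names below are placeholders, and any static file server will do:

```bash
# domains.txt -- one domain per line; '#' comments and blank lines are ignored.
# The domains below are placeholders for your own starting set.
cat > domains.txt <<'EOF'
# my starting set
www.example.com
blog.example.org
EOF

# Serve it so the form can be pointed at http://<this-host>:8000/domains.txt
python3 -m http.server 8000
```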

## Crawling

If you aren't redirected there automatically, go back to the `New Crawl` page under Node 1 -> Actions.
Your new specification should now be listed.

Check the box next to it, and click `[Trigger New Crawl]`.

![img](images/new_crawl2.png)

This will start the crawling process. Crawling may take a while, depending on the size
of the domain list and the size of the websites.

![img](images/crawl_in_progress.png)

Eventually a progress bar will show up, and the crawl will start. When it reaches 100%, the crawl is done.
You can also monitor the `Events Summary` table on the same page to see what happened after the fact.

It is expected that the crawl will stall out toward the end of the process. This is a statistical effect:
the largest websites take the longest to finish and tend to be the ones lingering at 99% or so completion.
The crawler has a timeout of 5 hours; if no new domains finish crawling within that time, it will stop,
to prevent crawler traps from stalling the crawl indefinitely.

**Be sure to read the section on re-crawling!**

## Converting


New image files added:

* doc/images/new_crawl.png (18 KiB)
* doc/images/new_crawl2.png (13 KiB)
* doc/images/new_spec.png (33 KiB)