(doc) Update Crawling Docs

Still a WIP, but now more accurately reflects the new GUI, with screenshots to boot!
Viktor Lofgren 2024-01-15 16:08:01 +01:00
parent b9445d4f62
commit ce5ae1931d
2 changed files with 30 additions and 26 deletions


@@ -51,9 +51,13 @@ Go to `Nodes->Node 1->Actions->New Crawl`

 Click the link that says 'New Spec' to arrive at a form for creating a new specification:

-Fill out the form with a description and a link to the domain list.
-
-## Crawling
+![img](images/new_spec.png)
+
+Fill out the form with a description and a link to a domain list. The domain list is a text file
+with one domain per line, with blank lines and comments starting with `#` ignored. You can use
+GitHub raw links for this purpose. For test purposes, you can use this link:
+`https://downloads.marginalia.nu/domain-list-test.txt`, which will create a crawl for a few
+of marginalia.nu's subdomains.

 If you aren't redirected there automatically, go back to the `New Crawl` page under Node 1 -> Actions.

 Your new specification should now be listed.
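For reference, the domain list described in the added text above is plain text. A minimal sketch of such a file, with illustrative domain names (not necessarily the contents of the linked test file), might look like:

```
# comments start with '#' and are ignored
www.marginalia.nu
search.marginalia.nu

# blank lines are ignored too
memex.marginalia.nu
```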
@@ -79,35 +83,35 @@ from stalling the crawl indefinitely.

 ## Converting

-Once the crawl is finished, you can convert and load the data to a format that can be loaded into the database.
-
-First you'll want to go to Storage -> Crawl Data, and toggle the `State` field next to your new crawl
-data into `Active`. This will mark it as eligible for processing.
-
-Next, go to Actions -> Process Crawl Data, and click `[Trigger Reprocessing]`. Ensure your crawl data
-is visible in the list. This will start the automatic conversion and loading process, which can be followed
-in the `Overview` view.
-
-This process will take a while, and will run these discrete steps:
-
-* CONVERT the crawl data into a format that can be loaded into the database
-* LOAD, load posts into the mariadb database, construct an index journal and sqlite linkdb
-* Delete the processed data (optional; depending on node configuration)
-* Create a backup of the index journal to be loaded (can be restored later)
-* Repartition and create new domain rankings
-* Construct a new index
-  * Forward
-  * Full
-  * Priority
-* Switch to the new index
-
-All of this is automatic and most of it is visible in the `Overview` view.
-
-## Recrawling (IMPORTANT)
+Once the crawl is done, the data needs to be processed before it's searchable. This is done by going to
+`Nodes->Node 1->Actions->Process Crawl Data`.
+
+[screenshot here]
+
+This will start the conversion process. This will again take a while, depending on the size of the crawl.
+The progress bar will show the progress. When it reaches 100%, the conversion is done, and the data will begin
+loading automatically. A cascade of actions is performed in sequence, leading to the data being loaded into the
+search engine and an index being constructed. This is all automatic, but depending on the size of the crawl data,
+may take a while.
+
+When an event `INDEX-SWITCH-OK` is logged in the `Event Summary` table, the data is ready to be searched.
+
+## Re-crawling

 The work flow with a crawl spec was a one-off process to bootstrap the search engine. To keep the search engine up to date,
 it is preferable to do a recrawl. This will try to reduce the amount of data that needs to be fetched.

-To trigger a Recrawl, ensure your crawl data is set to active, and then go to Actions -> Trigger Recrawl,
-and click `[Trigger Recrawl]`. This will behave much like the old crawling step. Once done, it needs to be
-processed like the old crawl data.
+To trigger a recrawl, go to `Nodes->Node 1->Actions->Re-crawl`. This will bring you to a page that looks similar to the
+first crawl page, where you can select a set of crawl data to use as a source. Select the crawl data you want, and
+press `[Trigger Recrawl]`.
+
+Crawling will proceed as before, but this time, the crawler will try to fetch only the data that has changed since the
+last crawl, increasing the number of documents by a percentage. This will typically be much faster than the initial crawl.
+
+### Growing the crawl set
+
+The re-crawl will also pull new domains from the `New Domains` dataset, which is a URL configurable in
+`[Top Menu] -> System -> Data Sets`. If a new domain is found, it will be assigned to the present node, and crawled in
+the re-crawl.
+
+![Datasets screenshot](images/datasets.png)

BIN doc/images/datasets.png (new file, 44 KiB)