(doc) Update the crawler readmes, as they've grown stale.

Viktor Lofgren 2024-02-01 18:10:55 +01:00
parent d1e02569f4
commit d60c6b18d4
2 changed files with 14 additions and 15 deletions


@@ -4,11 +4,19 @@ The crawling process downloads HTML documents and saves them into per-domain snapshots. The crawler seeks out HTML documents,
and ignores other types of documents, such as PDFs. Crawling is done on a domain-by-domain basis, and the crawler
does not follow links to other domains within a single job.
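To make those scoping rules concrete, here is a minimal sketch (hypothetical class and method names, not the project's actual code) of the two checks involved: links are only followed within the job's own domain, and only HTML responses are kept.

```java
import java.net.URI;

// Minimal sketch (not the project's actual classes): illustrates the two
// filters described above -- stay within the domain of the current job,
// and only keep HTML responses.
class CrawlScopeFilter {
    private final String jobDomain;

    CrawlScopeFilter(String jobDomain) {
        this.jobDomain = jobDomain.toLowerCase();
    }

    /** Links to other domains are not followed within a single job. */
    boolean inScope(URI link) {
        String host = link.getHost();
        return host != null && host.toLowerCase().equals(jobDomain);
    }

    /** Only HTML documents are retained; PDFs etc. are ignored. */
    boolean isHtml(String contentTypeHeader) {
        if (contentTypeHeader == null) return false;
        String mime = contentTypeHeader.split(";", 2)[0].trim().toLowerCase();
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }
}
```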

The crawler stores data from in-progress crawls in a WARC file. Once the crawl is complete, the WARC file is
converted to a parquet file, which is then used by the [converting process](../converting-process/). The intermediate
WARC file is not used by any other process, but is kept so that the state of a crawl can be recovered after a crash or
other failure.
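A rough sketch of that lifecycle, using hypothetical stand-in types rather than the crawler's real classes, to make the control flow concrete:

```java
import java.io.IOException;
import java.nio.file.Path;

// Hypothetical stand-ins, not the crawler's real classes.
interface WarcRecorder extends AutoCloseable {
    void record(String url, byte[] response) throws IOException;
}

interface ParquetConverter {
    void convert(Path warcFile, Path parquetFile) throws IOException;
}

class CrawlTaskSketch {
    void run(WarcRecorder warc, ParquetConverter converter,
             Path warcFile, Path parquetFile) throws Exception {
        // 1. While the crawl is in progress, every fetched response is
        //    appended to the intermediate WARC file. If the process crashes
        //    here, the WARC file is what allows the crawl state to be recovered.
        try (warc) {
            // ... fetch URLs and call warc.record(url, responseBytes) ...
        }

        // 2. Once the crawl completes, the WARC file is converted into the
        //    parquet file that the converting process consumes. Only the
        //    parquet file is read by downstream processes.
        converter.convert(warcFile, parquetFile);
    }
}
```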

If so configured, these intermediate WARC files may be retained. This is not the default behavior, as the WARC format
is not very dense and the parquet files are much more efficient; however, the WARC files are useful for debugging and
for integration with other tools.
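A hypothetical illustration of that retention toggle (the real crawler's configuration mechanism and property names are not shown in this document): after conversion, the intermediate WARC file is either archived or discarded.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only; not the crawler's real retention code.
class WarcRetention {
    /** After conversion to parquet, either archive the WARC file or throw it away. */
    static void finish(Path warcFile, Path archiveDir, boolean retainWarcs) throws IOException {
        if (retainWarcs) {
            Files.createDirectories(archiveDir);
            Files.move(warcFile, archiveDir.resolve(warcFile.getFileName()),
                       StandardCopyOption.REPLACE_EXISTING);
        } else {
            Files.deleteIfExists(warcFile);
        }
    }
}
```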

## Robots Rules

A significant part of the crawler deals with `robots.txt` and similar rules, as well as rate-limiting headers, especially when these
are not served in a standard way (which is very common). [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html) as well as Google's [Robots.txt Specifications](https://developers.google.com/search/docs/advanced/robots/robots_txt) are good references.
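The robots.txt parsing itself is typically delegated to a library, but rate-limiting headers often need to be handled by hand. A sketch (not the project's actual implementation) of interpreting the standard `Retry-After` header, which may carry either a number of seconds or an HTTP date, and which is frequently malformed in the wild:

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Sketch only: interpret a Retry-After header value as a wait duration.
class RetryAfter {
    static Duration parse(String headerValue, Instant now) {
        if (headerValue == null || headerValue.isBlank()) {
            return Duration.ZERO;
        }
        String value = headerValue.trim();
        try {
            // Most common form: delay in seconds.
            return Duration.ofSeconds(Long.parseLong(value));
        } catch (NumberFormatException e) {
            // Fall through: the value may be an HTTP-date instead.
        }
        try {
            ZonedDateTime until = ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME);
            Duration wait = Duration.between(now, until.toInstant());
            return wait.isNegative() ? Duration.ZERO : wait;
        } catch (Exception e) {
            // Non-standard value (common in the wild); fall back to a modest default.
            return Duration.ofSeconds(10);
        }
    }
}
```
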
## Re-crawling
@@ -21,7 +29,6 @@ documents from each domain, to avoid wasting time and resources on domains that
On top of organic links, the crawler can use sitemaps and RSS feeds to discover new documents.
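As an illustration of that discovery step (not the project's actual code), a sitemap can be mined for its `<loc>` entries; RSS feeds can be treated similarly via their `<link>` elements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch: pull the <loc> entries out of sitemap XML.
class SitemapLinks {
    private static final Pattern LOC = Pattern.compile("<loc>\\s*(.*?)\\s*</loc>", Pattern.DOTALL);

    static List<String> extract(String sitemapXml) {
        List<String> urls = new ArrayList<>();
        Matcher m = LOC.matcher(sitemapXml);
        while (m.find()) {
            urls.add(m.group(1).trim());
        }
        return urls;
    }
}
```
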
## Central Classes
* [CrawlerMain](src/main/java/nu/marginalia/crawl/CrawlerMain.java) orchestrates the crawling.


@@ -2,10 +2,10 @@
## 1. Crawl Process

The [crawling-process](crawling-process/) fetches website contents, temporarily saving them as WARC files, and then
converts them into parquet models. Both are described in [crawling-model](../process-models/crawling-model/).

The operation is optionally defined by a [crawl specification](../process-models/crawl-spec), which can be created in the control GUI.
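Going by the schematic further down, a crawl specification essentially carries an id, a domain, and a list of seed URLs. A sketch of that shape (the real crawl-spec model may differ in naming and detail):

```java
import java.util.List;

// Sketch of what a crawl specification carries, per the schematic below;
// this is only to make the diagram concrete, not the real model class.
record CrawlSpecSketch(String id, String domain, List<String> urls) {

    static CrawlSpecSketch example() {
        return new CrawlSpecSketch("job-0001", "www.example.com",
                List.of("https://www.example.com/"));
    }
}
```
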
## 2. Converting Process
@@ -32,21 +32,13 @@ the data generated by the loader.
Schematically the crawling and loading process looks like this:
```
//====================\\
||  Compressed JSON:  ||  Specifications
||  ID, Domain, Urls[]||  File
||  ID, Domain, Urls[]||
||  ID, Domain, Urls[]||
||  ...               ||
\\====================//
          |
    +-----------+
    |  CRAWLING |  Fetch each URL and
    |    STEP   |  output to file
    +-----------+
          |
//========================\\
||  Parquet:              ||  Crawl
||  Status, HTML[], ...   ||  Files
||  Status, HTML[], ...   ||
||  Status, HTML[], ...   ||