(doc) Update the readmes for the crawler, as they've grown stale.
parent
d1e02569f4
commit
d60c6b18d4
@@ -4,11 +4,19 @@ The crawling process downloads HTML and saves them into per-domain snapshots. T
and ignores other types of documents, such as PDFs. Crawling is done on a domain-by-domain basis, and the crawler
does not follow links to other domains within a single job.

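As a sketch of the kind of filtering this implies, a fetched response might be kept only when its `Content-Type` header looks like HTML. The predicate below is illustrative only; the crawler's actual acceptance rules may be broader or stricter.

```java
import java.util.Optional;

public class ContentTypeFilter {
    /** Illustrative check: keep only documents served as HTML.
     *  Not the crawler's real rule set, just an example of the idea. */
    static boolean looksLikeHtml(Optional<String> contentType) {
        return contentType
                .map(String::toLowerCase)
                .map(ct -> ct.startsWith("text/html") || ct.startsWith("application/xhtml+xml"))
                .orElse(false);
    }
}
```
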
The crawler stores data from in-progress crawls in a WARC file. Once the crawl is complete, the WARC file is
converted to a parquet file, which is then used by the [converting process](../converting-process/). The intermediate
WARC file is not used by any other process, but is kept so that the state of a crawl can be recovered in case of a crash or
other failure.

If so configured, these crawl WARC files may be retained. This is not the default behavior, as the WARC format is not very dense,
and the parquet files are much more efficient. However, the WARC files are useful for debugging and for integration with
other tools.

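Because the intermediate files follow the standard WARC format, retained crawls can be inspected with ordinary WARC tooling. Below is a minimal sketch using the [jwarc](https://github.com/iipc/jwarc) library to list the captured responses in a crawl WARC; the library choice and the file path are assumptions for illustration, not necessarily what the crawler uses internally.

```java
import org.netpreserve.jwarc.WarcReader;
import org.netpreserve.jwarc.WarcRecord;
import org.netpreserve.jwarc.WarcResponse;

import java.io.IOException;
import java.nio.file.Path;

public class WarcDump {
    public static void main(String[] args) throws IOException {
        // Hypothetical path to a retained crawl WARC; adjust to your setup.
        Path warcFile = Path.of("crawl-data/example.com.warc.gz");

        try (WarcReader reader = new WarcReader(warcFile)) {
            for (WarcRecord record : reader) {
                // Only response records carry the fetched documents.
                if (record instanceof WarcResponse response) {
                    System.out.printf("%s %d%n",
                            response.target(),
                            response.http().status());
                }
            }
        }
    }
}
```
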
## Robots Rules

A significant part of the crawler is dealing with `robots.txt` and the like, as well as rate-limiting headers, especially when these
are not served in a standard way (which is very common). [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html) as well as Google's [Robots.txt Specifications](https://developers.google.com/search/docs/advanced/robots/robots_txt) are good references.

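As an illustration of what this involves, the sketch below parses a robots.txt with crawler-commons' `SimpleRobotRulesParser` and backs off when a server answers 429 with a `Retry-After` header. This is only an example of the technique, not the crawler's own code; the library choice, the user agent name, and the URLs are assumptions.

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class RobotsExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Fetch and parse robots.txt; "example-crawler" is an illustrative robot name.
        HttpResponse<byte[]> robots = client.send(
                HttpRequest.newBuilder(URI.create("https://example.com/robots.txt")).build(),
                HttpResponse.BodyHandlers.ofByteArray());

        BaseRobotRules rules = new SimpleRobotRulesParser()
                .parseContent("https://example.com/robots.txt",
                        robots.body(), "text/plain", "example-crawler");

        System.out.println("Allowed: " + rules.isAllowed("https://example.com/some/page"));
        System.out.println("Crawl-delay: " + rules.getCrawlDelay());

        // Rate limiting: honour Retry-After when the server responds with 429.
        HttpResponse<byte[]> page = client.send(
                HttpRequest.newBuilder(URI.create("https://example.com/some/page")).build(),
                HttpResponse.BodyHandlers.ofByteArray());
        if (page.statusCode() == 429) {
            long delaySeconds = page.headers().firstValue("Retry-After")
                    .map(Long::parseLong)   // in the wild this may also be an HTTP date
                    .orElse(5L);
            Thread.sleep(Duration.ofSeconds(delaySeconds).toMillis());
        }
    }
}
```
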
## Re-crawling

@@ -21,7 +29,6 @@ documents from each domain, to avoid wasting time and resources on domains that

On top of organic links, the crawler can use sitemaps and RSS feeds to discover new documents.

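As a rough illustration of sitemap-based discovery, the sketch below fetches a hypothetical `sitemap.xml` and collects its `<loc>` entries using only the JDK's XML parser. It glosses over sitemap index files and RSS feeds, and is not the crawler's actual implementation.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

public class SitemapExample {
    public static void main(String[] args) throws Exception {
        // Fetch the sitemap (URL is illustrative).
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<byte[]> response = client.send(
                HttpRequest.newBuilder(URI.create("https://example.com/sitemap.xml")).build(),
                HttpResponse.BodyHandlers.ofByteArray());

        // Parse the XML and collect all <loc> elements, which hold the document URLs.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(response.body()));

        NodeList locs = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }

        urls.forEach(System.out::println);
    }
}
```
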
## Central Classes

* [CrawlerMain](src/main/java/nu/marginalia/crawl/CrawlerMain.java) orchestrates the crawling.

@@ -2,10 +2,10 @@

## 1. Crawl Process

The [crawling-process](crawling-process/) fetches website contents, temporarily saving them as WARC files, and then
re-converts them into parquet models. Both are described in [crawling-model](../process-models/crawling-model/).

The operation is optionally defined by a [crawl specification](../process-models/crawl-spec), which can be created in the control GUI.

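Going by the ID/Domain/Urls[] fields in the diagram further down, a specification entry boils down to an identifier, a domain, and a list of known URLs. The record below is a purely hypothetical sketch of that shape; the actual model lives under [crawl-spec](../process-models/crawl-spec).

```java
import java.util.List;

/** Hypothetical sketch of a crawl specification entry, based on the
 *  ID, Domain, Urls[] fields shown in the process diagram; see
 *  ../process-models/crawl-spec for the real model. */
public record CrawlSpecEntry(String id, String domain, List<String> knownUrls) { }
```
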
## 2. Converting Process

@@ -32,21 +32,13 @@ the data generated by the loader.

Schematically the crawling and loading process looks like this:

```
//====================\\
|| Compressed JSON:   ||      Specifications
|| ID, Domain, Urls[] ||      File
|| ID, Domain, Urls[] ||
|| ID, Domain, Urls[] ||
||  ...               ||
\\====================//
           |
     +-----------+
     |  CRAWLING |      Fetch each URL and
     |    STEP   |      output to file
     +-----------+
           |
//========================\\
||  Parquet:              ||  Crawl
||  Status, HTML[], ...   ||  Files
||  Status, HTML[], ...   ||
||  Status, HTML[], ...   ||