(doc) Update the readmes for the crawler, as they've grown stale.
parent
d1e02569f4
commit
d60c6b18d4
@@ -4,11 +4,19 @@ The crawling process downloads HTML and saves them into per-domain snapshots. T
and ignores other types of documents, such as PDFs. Crawling is done on a domain-by-domain basis, and the crawler
does not follow links to other domains within a single job.

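As a sketch of the kind of filtering this implies, a fetched response might be kept only when its `Content-Type` header looks like HTML. The predicate below is illustrative only; the crawler's actual acceptance rules may be broader or stricter.

```java
import java.util.Optional;

public class ContentTypeFilter {
    /** Illustrative check: keep only documents served as HTML.
     *  Not the crawler's real rule set, just an example of the idea. */
    static boolean looksLikeHtml(Optional<String> contentType) {
        return contentType
                .map(String::toLowerCase)
                .map(ct -> ct.startsWith("text/html") || ct.startsWith("application/xhtml+xml"))
                .orElse(false);
    }
}
```
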
The crawler stores data from in-progress crawls in a WARC file. Once the crawl is complete, the WARC file is
converted to a parquet file, which is then used by the [converting process](../converting-process/). The intermediate
WARC file is not used by any other process, but is kept so that the state of a crawl can be recovered in case of a crash or
other failure.

If so configured, these crawl WARC files may be retained. This is not the default behavior, as the WARC format is not very dense,
and the parquet files are much more efficient. However, the WARC files are useful for debugging and for integration with
other tools.

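Because the intermediate files follow the standard WARC format, retained crawls can be inspected with ordinary WARC tooling. Below is a minimal sketch using the [jwarc](https://github.com/iipc/jwarc) library to list the captured responses in a crawl WARC; the library choice and the file path are assumptions for illustration, not necessarily what the crawler uses internally.

```java
import org.netpreserve.jwarc.WarcReader;
import org.netpreserve.jwarc.WarcRecord;
import org.netpreserve.jwarc.WarcResponse;

import java.io.IOException;
import java.nio.file.Path;

public class WarcDump {
    public static void main(String[] args) throws IOException {
        // Hypothetical path to a retained crawl WARC; adjust to your setup.
        Path warcFile = Path.of("crawl-data/example.com.warc.gz");

        try (WarcReader reader = new WarcReader(warcFile)) {
            for (WarcRecord record : reader) {
                // Only response records carry the fetched documents.
                if (record instanceof WarcResponse response) {
                    System.out.printf("%s %d%n",
                            response.target(),
                            response.http().status());
                }
            }
        }
    }
}
```
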
## Robots Rules

A significant part of the crawler is dealing with `robots.txt` and the like, as well as rate-limiting headers, especially when these
are not served in a standard way (which is very common). [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html) as well as Google's [Robots.txt Specifications](https://developers.google.com/search/docs/advanced/robots/robots_txt) are good references.

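As an illustration of what this involves, the sketch below parses a robots.txt with crawler-commons' `SimpleRobotRulesParser` and backs off when a server answers 429 with a `Retry-After` header. This is only an example of the technique, not the crawler's own code; the library choice, the user agent name, and the URLs are assumptions.

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class RobotsExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Fetch and parse robots.txt; "example-crawler" is an illustrative robot name.
        HttpResponse<byte[]> robots = client.send(
                HttpRequest.newBuilder(URI.create("https://example.com/robots.txt")).build(),
                HttpResponse.BodyHandlers.ofByteArray());

        BaseRobotRules rules = new SimpleRobotRulesParser()
                .parseContent("https://example.com/robots.txt",
                        robots.body(), "text/plain", "example-crawler");

        System.out.println("Allowed: " + rules.isAllowed("https://example.com/some/page"));
        System.out.println("Crawl-delay: " + rules.getCrawlDelay());

        // Rate limiting: honour Retry-After when the server responds with 429.
        HttpResponse<byte[]> page = client.send(
                HttpRequest.newBuilder(URI.create("https://example.com/some/page")).build(),
                HttpResponse.BodyHandlers.ofByteArray());
        if (page.statusCode() == 429) {
            long delaySeconds = page.headers().firstValue("Retry-After")
                    .map(Long::parseLong)   // in the wild this may also be an HTTP date
                    .orElse(5L);
            Thread.sleep(Duration.ofSeconds(delaySeconds).toMillis());
        }
    }
}
```
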
## Re-crawling

@@ -21,7 +29,6 @@ documents from each domain, to avoid wasting time and resources on domains that

On top of organic links, the crawler can use sitemaps and RSS feeds to discover new documents.

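As a rough illustration of sitemap-based discovery, the sketch below fetches a hypothetical `sitemap.xml` and collects its `<loc>` entries using only the JDK's XML parser. It glosses over sitemap index files and RSS feeds, and is not the crawler's actual implementation.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

public class SitemapExample {
    public static void main(String[] args) throws Exception {
        // Fetch the sitemap (URL is illustrative).
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<byte[]> response = client.send(
                HttpRequest.newBuilder(URI.create("https://example.com/sitemap.xml")).build(),
                HttpResponse.BodyHandlers.ofByteArray());

        // Parse the XML and collect all <loc> elements, which hold the document URLs.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(response.body()));

        NodeList locs = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }

        urls.forEach(System.out::println);
    }
}
```
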
## Central Classes

* [CrawlerMain](src/main/java/nu/marginalia/crawl/CrawlerMain.java) orchestrates the crawling.

@@ -2,10 +2,10 @@

## 1. Crawl Process

The [crawling-process](crawling-process/) fetches website contents, temporarily saving them as WARC files, and then
re-converts them into parquet models. Both are described in [crawling-model](../process-models/crawling-model/).

The operation is optionally defined by a [crawl specification](../process-models/crawl-spec), which can be created in the control GUI.

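Going by the ID/Domain/Urls[] fields in the diagram further down, a specification entry boils down to an identifier, a domain, and a list of known URLs. The record below is a purely hypothetical sketch of that shape; the actual model lives under [crawl-spec](../process-models/crawl-spec).

```java
import java.util.List;

/** Hypothetical sketch of a crawl specification entry, based on the
 *  ID, Domain, Urls[] fields shown in the process diagram; see
 *  ../process-models/crawl-spec for the real model. */
public record CrawlSpecEntry(String id, String domain, List<String> knownUrls) { }
```
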
## 2. Converting Process

@@ -32,21 +32,13 @@ the data generated by the loader.

Schematically the crawling and loading process looks like this:

```
//====================\\
|| Compressed JSON:   ||      Specifications
|| ID, Domain, Urls[] ||      File
|| ID, Domain, Urls[] ||
|| ID, Domain, Urls[] ||
||  ...               ||
\\====================//
           |
     +-----------+
     |  CRAWLING |      Fetch each URL and
     |    STEP   |      output to file
     +-----------+
           |
//========================\\
||  Parquet:              ||  Crawl
||  Status, HTML[], ...   ||  Files
||  Status, HTML[], ...   ||
||  Status, HTML[], ...   ||