# Crawling Process
The crawling process downloads HTML documents and saves them into per-domain snapshots. The crawler seeks out HTML documents
and ignores other types of documents, such as PDFs. Crawling is done on a domain-by-domain basis, and the crawler
does not follow links to other domains within a single job.
## Robots Rules
A significant part of the crawler deals with `robots.txt` and similar rate-limiting mechanisms, especially when these
are not served in a standard way (which is very common). [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html) as well
as Google's [Robots.txt Specifications](https://developers.google.com/search/docs/advanced/robots/robots_txt) are good references.
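
As an illustration, the sketch below fetches a domain's `robots.txt` and evaluates a URL against it using the
crawler-commons `SimpleRobotRulesParser`. The library choice, user agent token, and URLs are assumptions made for
the example, not a description of this crawler's actual wiring.

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class RobotsCheckExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/robots.txt")).build();
        HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());

        // Parse the rules for a particular user agent token ("my-crawler-bot" is a placeholder).
        BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                "https://example.com/robots.txt",
                response.body(),
                "text/plain",
                "my-crawler-bot");

        // Check whether a specific URL may be fetched, and what crawl delay the site requests.
        System.out.println("Allowed: " + rules.isAllowed("https://example.com/some/page"));
        System.out.println("Crawl-delay: " + rules.getCrawlDelay());
    }
}
```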
## Re-crawling
The crawler can use old crawl data to avoid re-downloading documents that have not changed. This is done by
issuing conditional requests with the HTTP `If-Modified-Since` and `If-None-Match` headers, based on the previous
crawl's data. If a large proportion of the documents have not changed, the crawler switches to a mode where it only
randomly samples a few documents from each domain, to avoid wasting time and resources on domains that have not changed.
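
A minimal sketch of such a conditional re-fetch is shown below, using Java's built-in `HttpClient`. The
`OldDocument` record and its fields are hypothetical stand-ins for whatever the crawler actually stores from a
previous crawl.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class ConditionalFetchExample {
    // Hypothetical representation of a document from an earlier crawl.
    record OldDocument(String url, String etag, String lastModified) {}

    static boolean hasChanged(HttpClient client, OldDocument old) throws Exception {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(old.url()));

        // Send back the validators we saved last time, if any.
        if (old.etag() != null)         builder.header("If-None-Match", old.etag());
        if (old.lastModified() != null) builder.header("If-Modified-Since", old.lastModified());

        HttpResponse<String> response = client.send(builder.build(), HttpResponse.BodyHandlers.ofString());

        // 304 Not Modified means the server confirms the stored copy is still current.
        return response.statusCode() != 304;
    }
}
```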
## Sitemaps and RSS Feeds
In addition to organic links, the crawler can use sitemaps and RSS feeds to discover new documents.
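
For illustration, one simple way to pull document URLs out of a sitemap is to read its `<loc>` elements, as in the
sketch below. Real-world handling (gzipped sitemaps, sitemap indexes, RSS/Atom feeds) is omitted, and the example
does not reflect this crawler's actual implementation.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.InputStream;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

class SitemapExample {
    // Collect the text content of every <loc> element in a sitemap document.
    static List<String> extractUrls(InputStream sitemapXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(sitemapXml);

        NodeList locs = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = URI.create("https://example.com/sitemap.xml").toURL().openStream()) {
            extractUrls(in).forEach(System.out::println);
        }
    }
}
```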
## Central Classes
* [CrawlerMain](src/main/java/nu/marginalia/crawl/CrawlerMain.java) orchestrates the crawling.
* [CrawlerRetreiver](src/main/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java)
visits known addresses from a domain and downloads each document.
* [HttpFetcher](src/main/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java)
fetches URLs.
## See Also
* [features-crawl](../../features-crawl/)