History

Viktor Lofgren 91c7960800 (crawler) Extract additional configuration properties This commit extracts several previously hardcoded configuration properties, and makes then available through system.properties. The documentation is updated to reflect the change. Dead code was also removed in the process. CrawlSpecGenerator is left feeling a bit over-engineered still, since it's built for a more general case, where all other implementations but the current one are removed, but we'll leave it like this for now as it's fairly readable still.		2024-01-20 10:36:04 +01:00
..
src	(crawler) Extract additional configuration properties	2024-01-20 10:36:04 +01:00
build.gradle	(crawler) Switch hash function in crawler	2023-12-27 13:29:00 +01:00
readme.md	(docs) Improve architectural documentation for the crawler.	2023-11-30 21:30:57 +01:00

readme.md

Crawling Process

The crawling process downloads HTML and saves them into per-domain snapshots. The crawler seeks out HTML documents, and ignores other types of documents, such as PDFs. Crawling is done on a domain-by-domain basis, and the crawler does not follow links to other domains within a single job.

Robots Rules

A significant part of the crawler is dealing with robots.txt and similar, rate limiting headers; especially when these are not served in a standard way (which is very common). RFC9390 as well as Google's Robots.txt Specifications are good references.

Re-crawling

The crawler can use old crawl data to avoid re-downloading documents that have not changed. This is done by comparing the old and new documents using the HTTP If-Modified-Since and If-None-Match headers. If a large proportion of the documents have not changed, the crawler falls into a mode where it only randomly samples a few documents from each domain, to avoid wasting time and resources on domains that have not changed.

Sitemaps and rss-feeds

On top of organic links, the crawler can use sitemaps and rss-feeds to discover new documents.

Central Classes

CrawlerMain orchestrates the crawling.
CrawlerRetreiver visits known addresses from a domain and downloads each document.
HttpFetcher fetches URLs.