Crawling Process
The crawling process downloads HTML documents and saves them into per-domain snapshots.
Central Classes
- CrawlerMain orchestrates the crawling.
- CrawlerRetreiver visits known addresses from a domain and downloads each document.
- HttpFetcher fetches a URL.
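
How the three classes relate can be sketched roughly as below. This is a minimal illustration, not the project's actual code: `CrawlSketch`, `Fetcher`, `Retriever`, and `crawlAll` are hypothetical stand-ins for HttpFetcher, CrawlerRetreiver, and CrawlerMain, and the in-memory map stands in for the on-disk per-domain snapshots.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CrawlSketch {
    /** Hypothetical stand-in for HttpFetcher: turns a URL into a document body. */
    interface Fetcher {
        String fetch(String url);
    }

    /** Hypothetical stand-in for CrawlerRetreiver: downloads each known address of one domain. */
    static class Retriever {
        private final Fetcher fetcher;

        Retriever(Fetcher fetcher) {
            this.fetcher = fetcher;
        }

        List<String> crawl(List<String> knownUrls) {
            List<String> snapshot = new ArrayList<>();
            for (String url : knownUrls) {
                snapshot.add(fetcher.fetch(url));
            }
            return snapshot;
        }
    }

    /** Hypothetical stand-in for CrawlerMain: produces one snapshot per domain. */
    static Map<String, List<String>> crawlAll(Map<String, List<String>> urlsByDomain, Fetcher fetcher) {
        Map<String, List<String>> snapshots = new LinkedHashMap<>();
        Retriever retriever = new Retriever(fetcher);
        for (Map.Entry<String, List<String>> entry : urlsByDomain.entrySet()) {
            snapshots.put(entry.getKey(), retriever.crawl(entry.getValue()));
        }
        return snapshots;
    }

    public static void main(String[] args) {
        // A fake fetcher, so the sketch runs without network access.
        Fetcher fake = url -> "<html>" + url + "</html>";

        Map<String, List<String>> urlsByDomain = new LinkedHashMap<>();
        urlsByDomain.put("example.com", List.of("https://example.com/", "https://example.com/about"));

        crawlAll(urlsByDomain, fake).forEach((domain, docs) ->
                System.out.println(domain + ": " + docs.size() + " documents"));
    }
}
```

Keeping the fetcher behind a small interface means the per-domain logic can be exercised with a fake, as in `main` above. Whether the real classes are split this way is an assumption.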