Crawl Spec

A crawl spec is a list of domains to be crawled, stored as a parquet file.

Crawl specs are used to define the scope of a crawl in the absence of known domains.

The CrawlSpecRecord class is used to represent a record in the crawl spec.

The CrawlSpecRecordParquetFileReader and CrawlSpecRecordParquetFileWriter classes are used to read and write the crawl spec parquet files.
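
To make the idea of "a list of domains to be crawled" concrete, here is a minimal in-memory sketch. It does not use the real classes: `SpecEntry`, `CrawlSpecSketch`, and `fromDomains` are hypothetical names invented for this illustration, and the single `domain` field is an assumption based only on the description above. The actual CrawlSpecRecord may carry other fields, and nothing here reads or writes parquet. Trimming, lowercasing, and dropping duplicates are design choices of the sketch, not documented behavior.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for one crawl spec record. This is NOT the real
// CrawlSpecRecord class; the single "domain" field is an assumption.
record SpecEntry(String domain) {
    SpecEntry {
        if (domain == null || domain.isBlank()) {
            throw new IllegalArgumentException("domain must not be blank");
        }
        // Normalize so that "Example.COM" and "example.com" compare equal.
        domain = domain.trim().toLowerCase(Locale.ROOT);
    }
}

class CrawlSpecSketch {
    // Builds an in-memory spec from raw domain names: blank entries are
    // skipped and duplicates are dropped, keeping first-seen order.
    static List<SpecEntry> fromDomains(List<String> domains) {
        return domains.stream()
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(SpecEntry::new)
                .distinct()
                .toList();
    }

    public static void main(String[] args) {
        List<SpecEntry> spec = fromDomains(
                List.of("www.marginalia.nu", "Example.com", "example.com"));
        // Print one domain per line, as a crawler might iterate over them.
        spec.forEach(entry -> System.out.println(entry.domain()));
    }
}
```

In a real setup the entries would come from a spec file via the reader class rather than from a hard-coded list; the sketch only shows the shape of the data.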