CatgirlIntelligenceAgency/code/process-models/crawl-spec/readme.md
2024-02-27 21:22:21 +01:00


Crawl Spec

A crawl spec is a list of domains to be crawled, stored as a Parquet file with the following columns:

  • domain: The domain to be crawled
  • crawlDepth: The depth to which the domain should be crawled
  • urls: A list of known URLs to be crawled
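
As an illustration of the row layout above, here is a minimal sketch of one row as a Java record. It is not the actual CrawlSpecRecord class: the names CrawlSpecSketch and SpecRow are hypothetical, and the field types (String, int, list of String) are assumptions inferred from the column descriptions, as are the validation rules.

```java
import java.util.List;

// Hypothetical sketch of one crawl spec row; the real CrawlSpecRecord
// class in the repository may use different types or validation.
public class CrawlSpecSketch {

    // Assumed types: domain as String, crawlDepth as int, urls as List<String>.
    public record SpecRow(String domain, int crawlDepth, List<String> urls) {
        public SpecRow {
            if (domain == null || domain.isBlank()) {
                throw new IllegalArgumentException("domain must not be blank");
            }
            if (crawlDepth < 0) {
                throw new IllegalArgumentException("crawlDepth must not be negative");
            }
            // Treat a missing URL list as empty and keep the list immutable.
            urls = (urls == null) ? List.of() : List.copyOf(urls);
        }
    }

    public static void main(String[] args) {
        SpecRow row = new SpecRow("www.example.com", 3,
                List.of("https://www.example.com/"));
        System.out.println(row.domain() + " depth=" + row.crawlDepth()
                + " urls=" + row.urls().size());
    }
}
```

Normalizing a null url list to an empty one is a design choice of this sketch, so that a domain with no known URLs can still be listed.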

Crawl specs define the scope of a crawl when no domains are known yet.

The CrawlSpecRecord class is used to represent a record in the crawl spec.

The CrawlSpecRecordParquetFileReader and CrawlSpecRecordParquetFileWriter classes read and write crawl spec Parquet files.