
Crawl Spec

A crawl spec is a list of domains to be crawled. It is stored as a Parquet file with the following columns:

  • domain: The domain to be crawled
  • crawlDepth: The depth to which the domain should be crawled
  • urls: A list of known URLs to be crawled
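
To make the column layout concrete, here is a minimal sketch of one row as a plain Java record. It is not the project's CrawlSpecRecord class: the names CrawlSpecSketch, SpecRow and exampleSpec are hypothetical, and the null handling (a required domain, a missing urls list treated as empty) is an assumption made for illustration.

```java
import java.util.List;

public class CrawlSpecSketch {

    /**
     * Hypothetical stand-in for one row of a crawl spec.
     * The three components mirror the columns listed above.
     */
    record SpecRow(String domain, int crawlDepth, List<String> urls) {
        SpecRow {
            // Assumption: every row must name a domain.
            if (domain == null || domain.isBlank()) {
                throw new IllegalArgumentException("domain is required");
            }
            // Assumption: a missing URL list means "no known URLs".
            urls = (urls == null) ? List.of() : List.copyOf(urls);
        }
    }

    /** Builds a small in-memory spec with two example rows. */
    static List<SpecRow> exampleSpec() {
        return List.of(
                new SpecRow("www.example.com", 100,
                        List.of("https://www.example.com/", "https://www.example.com/about")),
                new SpecRow("docs.example.org", 25, null)
        );
    }

    public static void main(String[] args) {
        // Print each row; a real spec would be written to a Parquet file instead.
        for (SpecRow row : exampleSpec()) {
            System.out.println(row.domain() + " depth=" + row.crawlDepth()
                    + " knownUrls=" + row.urls().size());
        }
    }
}
```

The sketch keeps the rows in memory only; persisting them is the job of the reader and writer classes described further down.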

Crawl specs define the scope of a crawl when there are no previously known domains to start from.

The CrawlSpecRecord class represents a single record in the crawl spec.

The CrawlSpecRecordParquetFileReader and CrawlSpecRecordParquetFileWriter classes read and write crawl spec Parquet files, respectively.