Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.
Crawl Spec
A crawl spec is a list of domains to be crawled. It is a parquet file with the following columns:
domain
: The domain to be crawled

crawlDepth
: The depth to which the domain should be crawled

urls
: A list of known URLs to be crawled
Crawl specs are used to define the scope of a crawl in the absence of known domains.
The CrawlSpecRecord class is used to represent a record in the crawl spec.
The CrawlSpecRecordParquetFileReader and CrawlSpecRecordParquetFileWriter classes are used to read and write the crawl spec parquet files.
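As a rough illustration of the three columns above, one row of a crawl spec could be modelled in Java along these lines. This is a minimal sketch, not the actual CrawlSpecRecord class: the names `CrawlSpecSketch`, `CrawlSpecEntry` and `totalKnownUrls` are hypothetical, and the column types (a string, an int and a list of strings) are assumptions based on the column descriptions. It does not touch parquet at all.

```java
import java.util.List;

public class CrawlSpecSketch {

    // Hypothetical stand-in for one row of the crawl spec.
    // Field names follow the documented columns; the types are assumed.
    public record CrawlSpecEntry(String domain, int crawlDepth, List<String> urls) {
        public CrawlSpecEntry {
            if (domain == null || domain.isBlank()) {
                throw new IllegalArgumentException("domain must not be blank");
            }
            if (crawlDepth < 0) {
                throw new IllegalArgumentException("crawlDepth must not be negative");
            }
            // A domain with no known URLs is represented as an empty list.
            urls = (urls == null) ? List.of() : List.copyOf(urls);
        }
    }

    // Counts the known URLs across all entries in a spec.
    public static int totalKnownUrls(List<CrawlSpecEntry> spec) {
        int total = 0;
        for (CrawlSpecEntry entry : spec) {
            total += entry.urls().size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<CrawlSpecEntry> spec = List.of(
                new CrawlSpecEntry("www.example.com", 100,
                        List.of("https://www.example.com/", "https://www.example.com/about")),
                new CrawlSpecEntry("blog.example.org", 20, null));

        for (CrawlSpecEntry entry : spec) {
            System.out.println(entry.domain()
                    + " depth=" + entry.crawlDepth()
                    + " knownUrls=" + entry.urls().size());
        }
        System.out.println("total known URLs: " + totalKnownUrls(spec));
    }
}
```

In the real code base, reading and writing these rows is the job of the reader and writer classes named above; the sketch only shows the shape of the data.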