CatgirlIntelligenceAgency/code/process-models/crawling-model
Viktor Lofgren cf935a5331 (converter) Read cookie information
Add an optional new field to CrawledDocument containing information about whether the domain has cookies. This was previously on the CrawledDomain object, but since the WarcFormat requires us to write a WarcInfo object at the start of a crawl rather than at the end, this information is not yet available when the CrawledDomain object is created.

Also fix a bug in the deduplication logic in the DomainProcessor class that caused a test to break.
2023-12-15 18:09:53 +01:00
src (converter) Read cookie information 2023-12-15 18:09:53 +01:00
build.gradle (crawling-model) Implement a parquet format for crawl data 2023-12-13 16:22:19 +01:00
readme.md (refactor) Remove features-search and update documentation 2023-10-09 15:12:30 +02:00

Crawling Models

Contains models shared by the crawling-process and converting-process.
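
As a rough illustration of the optional cookie field described in the commit message above, the sketch below shows one way a per-document record could carry a nullable flag, and how a consumer might derive a domain-level answer from it. The names `DocumentSketch`, `DomainCookieSummary`, `cookies()` and `domainHasCookies()` are hypothetical and do not reflect the actual classes in this module. The aggregation rule (any document reporting cookies marks the domain) is also an assumption.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for a per-document crawl record (not the real CrawledDocument). */
final class DocumentSketch {
    final String url;
    /** Nullable: null means no cookie information was recorded for this document. */
    final Boolean hasCookies;

    DocumentSketch(String url, Boolean hasCookies) {
        this.url = url;
        this.hasCookies = hasCookies;
    }

    /** Exposes the optional field without forcing callers to handle null. */
    Optional<Boolean> cookies() {
        return Optional.ofNullable(hasCookies);
    }
}

/** Hypothetical helper: derives a domain-level flag from document-level data. */
final class DomainCookieSummary {
    private DomainCookieSummary() {}

    /** Returns true if at least one document explicitly reports cookies (assumed rule). */
    static boolean domainHasCookies(List<DocumentSketch> docs) {
        for (DocumentSketch doc : docs) {
            if (Boolean.TRUE.equals(doc.hasCookies)) {
                return true;
            }
        }
        return false;
    }
}
```

Keeping the field nullable lets records that carry no cookie information, such as data written before the field existed, remain readable, since a missing value is simply treated as "unknown".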

Central Classes

Serialization