CatgirlIntelligenceAgency/code/process-models/crawling-model
Viktor Lofgren fa81e5b8ee (warc) Use a non-standard WARC header to convey information about whether a website uses cookies
This information is then propagated to the parquet file as a boolean.

For documents that are copied from the reference, use whatever value we last saw. This isn't 100% deterministic and may result in false negatives, but it lets websites that used cookies and have since stopped repent, and have the change reflected in the search engine more quickly.
2023-12-15 16:37:53 +01:00
src (warc) Use a non-standard WARC header to convey information about whether a website uses cookies 2023-12-15 16:37:53 +01:00
build.gradle (crawling-model) Implement a parquet format for crawl data 2023-12-13 16:22:19 +01:00
readme.md (refactor) Remove features-search and update documentation 2023-10-09 15:12:30 +02:00

Crawling Models

Contains models shared by the crawling-process and converting-process.

Central Classes

Serialization