CatgirlIntelligenceAgency/code/process-models/crawling-model/src
Viktor Lofgren 2e536e3141 (crawler) Add timestamp to CrawledDocument records
This update adds timestamps, as extracted from the WARC stream, to the parquet format for crawl data.

The parquet format stores the timestamp as a 64-bit long, in seconds since the Unix epoch, without a logical type. This avoids format conversions when writing and reading the data.

This parquet field populates the timestamp field in CrawledDocument.
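
As a rough illustration of the storage choice described above, the following sketch round-trips a timestamp through a plain 64-bit epoch-seconds value using only `java.time.Instant`. The class and method names (`EpochSecondsSketch`, `toEpochSeconds`, `fromEpochSeconds`) are hypothetical and are not taken from the crawler code; the actual parquet column name and the reader/writer wiring are not shown here.

```java
import java.time.Instant;

// Hypothetical helper, not part of the crawler code base: shows how a
// timestamp can be kept as a raw long (seconds since the Unix epoch)
// with no logical type attached.
public class EpochSecondsSketch {

    // Value that would be written to the 64-bit field.
    // Sub-second precision is dropped.
    public static long toEpochSeconds(Instant timestamp) {
        return timestamp.getEpochSecond();
    }

    // Value that would be reconstructed when the field is read back.
    public static Instant fromEpochSeconds(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds);
    }

    public static void main(String[] args) {
        // Example input only: the commit time of this change, in UTC.
        Instant original = Instant.parse("2023-12-15T19:23:27Z");

        long stored = toEpochSeconds(original);
        Instant restored = fromEpochSeconds(stored);

        System.out.println(stored);
        System.out.println(restored);
    }
}
```

Because the stored value is a bare long, a reader needs no parquet-side conversion, but it has to know by convention that the unit is seconds and not milliseconds.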
2023-12-15 20:23:27 +01:00
main/java (crawler) Add timestamp to CrawledDocument records 2023-12-15 20:23:27 +01:00
test/java/nu/marginalia/crawling/parquet (crawler) Add timestamp to CrawledDocument records 2023-12-15 20:23:27 +01:00