CatgirlIntelligenceAgency/code/process-models/crawling-model
Viktor Lofgren 1328bc4938 (warc) Clean up parquet conversion
This commit cleans up the warc->parquet conversion.  Records with a http status other than 200 are now included.

The commit also fixes a bug where the robots.txt parser would be fed the full HTTP response (and choke), instead of the body.

The DocumentBodyExtractor code has also been cleaned up, and now offers a way of just getting the byte[] representation for later processing, as conversion to and from strings is a bit wasteful.
2023-12-14 16:05:48 +01:00
..
src (warc) Clean up parquet conversion 2023-12-14 16:05:48 +01:00
build.gradle (crawling-model) Implement a parquet format for crawl data 2023-12-13 16:22:19 +01:00
readme.md (refactor) Remove features-search and update documentation 2023-10-09 15:12:30 +02:00

Crawling Models

Contains models shared by the crawling-process and converting-process.

Central Classes

Serialization