CatgirlIntelligenceAgency/code/process-models/crawling-model
Viktor Lofgren dec3b1092d (converter) Fix bugs in conversion
This commit adds a safety check that the URL of the document is from the correct domain.

It also adds a sizeHint() method to SerializableCrawlDataStream, which *may* indicate whether the stream is very large and would benefit from sideload-style processing (which is slow).

It furthermore addresses a bug where ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...
2023-12-29 13:58:08 +01:00

Crawling Models

Contains models shared by the crawling-process and converting-process.

Central Classes

Serialization