CatgirlIntelligenceAgency/code/features-crawl
Viktor Lofgren 0889b6d247 (warc) Clean up parquet conversion
This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder and adds support for information about redirects and errors due to probe failure.

It also refactors the fetch result, body extraction and content type abstractions.
2023-12-14 20:39:40 +01:00
content-type     (warc) Clean up parquet conversion                             2023-12-14 20:39:40 +01:00
crawl-blocklist  (*) Refactor GeoIP-related code                                2023-12-10 17:30:43 +01:00
link-parser      (build) Move unit test configuration to root build.gradle      2023-10-04 12:46:22 +02:00
readme.md        Yet more restructuring. Improved search result ranking.        2023-03-16 21:35:54 +01:00

Crawl Features

This directory holds search-engine-related code: relatively isolated pieces of business logic that are clearer when kept separate from the rest of the crawling code.