There really is no fantastic place to put this logic, but we need to remove entries with an X-Robots-Tags header where that header indicates it doesn't want to be crawled by Marginalia.
We want to mute some of these records so that they don't produce documents, but in some cases we want a document to be produced for accounting purposes.
Added improved tests that reach for known resources on www.marginalia.nu to test the behavior when encountering bad content type and 404s.
The commit also adds some safety try-catch:es around the charset handling, as it may sometimes explode when fed incorrect data, and we do be guessing...
This is not hooked into anything yet. The change also makes modifications to the parquet-floor library to support reading and writing of byte[] arrays. This is desirable since we may in the future want to support inputs that are not text-based, and codifying the assumption that each document is a string will definitely cause us grief down the line.
This commit is in a pretty rough state. It refactors the crawler fairly significantly to offer better separation of concerns. It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data. This works, -ish.
There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.
A problem is that the WARC files are a bit too large. It will probably be likely to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.
... also move some common configuration into the root build.gradle-file.
Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.