Adds experimental sideloading support for pusshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.
Since the median post quality on reddit is not very good, he sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more.
Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these can not be found, the tests will shortcircuit as OK. They're mostly there for debugging, and it's fine if they don't always run.
The change also refactors the sideloading a bit since it was a bit messy.
The WARC specification says the records should transparently remove compression. This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.
This commit extracts several previously hardcoded configuration properties, and makes then available through system.properties.
The documentation is updated to reflect the change.
Dead code was also removed in the process. CrawlSpecGenerator is left feeling a bit over-engineered still, since it's built for a more general case, where all other implementations but the current one are removed, but we'll leave it like this for now as it's fairly readable still.
The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records. This is true for the new parquet format, but not for the old zstd/gson format.
To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records; and presenting them in the new order.
This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be.
Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.
This commit adds a safety check that the URL of the document is from the correct domain.
It also adds a sizeHint() method to SerializableCrawlDataStream which *may* provide an indication if the stream is very large and benefits from sideload-style processing (which is slow).
It furthermore addresses a bug where the ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...
Make some temporary modifications to the CrawledDocument model to support both a "big string" style headers field like in the old formats, and explicit fields as in the new formats. This is a bit awkward to deal with, but it's a necessity until we migrate off the old formats entirely.
The commit also adds a few tests to this logic.
This had the knock-on effect of breaking the anchor tag loading in the processor for a lot of domains, since they'd grab domains for the wrong domain name.
There really is no fantastic place to put this logic, but we need to remove entries with an X-Robots-Tags header where that header indicates it doesn't want to be crawled by Marginalia.
We want to mute some of these records so that they don't produce documents, but in some cases we want a document to be produced for accounting purposes.
Added improved tests that reach for known resources on www.marginalia.nu to test the behavior when encountering bad content type and 404s.
The commit also adds some safety try-catch:es around the charset handling, as it may sometimes explode when fed incorrect data, and we do be guessing...
This commit updates CrawlingThenConvertingIntegrationTest with additional tests for invalid, redirecting, and blocked domains. Improvements have also been made to filter out irrelevant entries in ParquetSerializableCrawlDataStream.
This update includes the addition of timestamps to the parquet format for crawl data, as extracted from the Warc stream.
The parquet format stores the timestamp as a 64 bit long, seconds since unix epoch, without a logical type. This is to avoid having to do format conversions when writing and reading the data.
This parquet field populates the timestamp field in CrawledDocument.
Add an optional new field to CrawledDocument containing information about whether the domain has cookies. This was previously on the CrawledDomain object, but since the WarcFormat requires us to write a WarcInfo object at the start of a crawl rather than at the end, this information is unobtainable when creating the CrawledDomain object.
Also fix a bug in the deduplication logic in the DomainProcessor class that caused a test to break.
This information is then propagated to the parquet file as a boolean.
For documents that are copied from the reference, use whatever value we last saw. This isn't 100% deterministic and may result in false negatives, but permits websites that used cookies but have stopped to repent and have the change reflect in the search engine more quickly.
This commit includes mostly exception handling, error propagation, a few bug fixes and minor changes to log formatting. The CrawlDelayTimer, HTTP 429 responses and IOException responses are now more accurately handled.
A non-standard WarcXEntityRefused WARC record has also been introduced, essentially acting as a rejected 'response' with different semantics.
Besides these, several existing features have been refined, such as URL encoding, crawl depth incrementing and usage of Content-Length headers.
This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder, adds support information about redirects and errors due to probe failure.
It also refactors the fetch result, body extraction and content type abstractions.
This commit cleans up the warc->parquet conversion. Records with a http status other than 200 are now included.
The commit also fixes a bug where the robots.txt parser would be fed the full HTTP response (and choke), instead of the body.
The DocumentBodyExtractor code has also been cleaned up, and now offers a way of just getting the byte[] representation for later processing, as conversion to and from strings is a bit wasteful.
This is not hooked into anything yet. The change also makes modifications to the parquet-floor library to support reading and writing of byte[] arrays. This is desirable since we may in the future want to support inputs that are not text-based, and codifying the assumption that each document is a string will definitely cause us grief down the line.
This commit is in a pretty rough state. It refactors the crawler fairly significantly to offer better separation of concerns. It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data. This works, -ish.
There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.
A problem is that the WARC files are a bit too large. It will probably be likely to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.
At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly.
This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled.
The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.
In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services.
The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions.
The converter is modified ot query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server.
The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.
This commit also fixes a bug in the loader where the IP field wouldn't always populate as intended, and refactors the DomainInformationService to use significantly fewer SQL queries.
The converter was not properly initiating the external links for each domain, causing an NPE in conversion. This needs to be loaded later since we don't know the domain we're processing until we've seen it in the crawl data.
Also made some refactorings to make finding converter bugs easier, and finding the related domain less awkward from the SerializableCrawlData interface.
This functionality needs to be accessed by the WarcSideloader, which is in the converter. The resultant microlibrary is tiny, but I think in this case it's justifiable.
This turned out to be very difficult to do in small isolated steps.
* Design overhaul of the control gui using bootstrap
* Move the actors out of control-service into to a new executor-service, that can be run on multiple nodes
* Add node-affinity to message queue
... also move some common configuration into the root build.gradle-file.
Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal.
For now, we no longer store new URLs in different domains. We need to re-implement this somehow, probably in a different job or a as a different output.