Add an optional new field to CrawledDocument containing information about whether the domain has cookies. This was previously on the CrawledDomain object, but since the WarcFormat requires us to write a WarcInfo object at the start of a crawl rather than at the end, this information is unobtainable when creating the CrawledDomain object.
Also fix a bug in the deduplication logic in the DomainProcessor class that caused a test to break.
This information is then propagated to the parquet file as a boolean.
For documents that are copied from the reference, use whatever value we last saw. This isn't 100% deterministic and may result in false negatives, but permits websites that used cookies but have stopped to repent and have the change reflect in the search engine more quickly.
This commit includes mostly exception handling, error propagation, a few bug fixes and minor changes to log formatting. The CrawlDelayTimer, HTTP 429 responses and IOException responses are now more accurately handled.
A non-standard WarcXEntityRefused WARC record has also been introduced, essentially acting as a rejected 'response' with different semantics.
Besides these, several existing features have been refined, such as URL encoding, crawl depth incrementing and usage of Content-Length headers.
This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder, adds support information about redirects and errors due to probe failure.
It also refactors the fetch result, body extraction and content type abstractions.
This commit cleans up the warc->parquet conversion. Records with a http status other than 200 are now included.
The commit also fixes a bug where the robots.txt parser would be fed the full HTTP response (and choke), instead of the body.
The DocumentBodyExtractor code has also been cleaned up, and now offers a way of just getting the byte[] representation for later processing, as conversion to and from strings is a bit wasteful.
This is not hooked into anything yet. The change also makes modifications to the parquet-floor library to support reading and writing of byte[] arrays. This is desirable since we may in the future want to support inputs that are not text-based, and codifying the assumption that each document is a string will definitely cause us grief down the line.
This commit is in a pretty rough state. It refactors the crawler fairly significantly to offer better separation of concerns. It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data. This works, -ish.
There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.
A problem is that the WARC files are a bit too large. It will probably be likely to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.
At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly.
This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled.
The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.
This is the same as the prefix for the IP address, but I don't think that substantially matters, the as two have such different namespaces there can be no confusion.
In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services.
The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions.
The converter is modified ot query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server.
The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.
This commit also fixes a bug in the loader where the IP field wouldn't always populate as intended, and refactors the DomainInformationService to use significantly fewer SQL queries.
The previous version used a personalized pagerank centering on a few academic domains, but this didn't work very well and most results were not very academia-centric.
This reduces the impact of restarting the search service, as the site information takes a few minutes to load during which it's not available. It also permits exposing this information via API in the future if there is interest in this.
The assistant service was also modified to do a late load of the suggestions trie, as this is a major contributor to its start-up time.
Finally, some changes were made to the client library, a new get() method was added that takes a TypeToken to allow deserialization of generics such as List<Foo>, and the scheduler was also modified to use virtual threads.
Refactored the getDocumentsStream method in EncyclopediaMarginaliaNuSideloader to use the newly extracted ProcessingIterator class that encapsulates processing a stream of results from e.g a database query in parallel and returning the computed results as an iterator.
The iterator was also improved on to be more reliable, previous versions of the logic would sometimes deadlock due to false positives in hasMore().
The commit updates EncyclopediaMarginaliaNuSideloader to include the AnchorTextKeywords in processing documents, aiding search result relevance.
It also removes old test-related functionality and a large but fairly useless test previously used to debug a specific problem, to the detriment of the overall code quality.
A race condition was found where precession actors would sometimes skip a step, because when invoking ExecutorRemoteActor.getState(), it would get the last 'OK' actor state from a previous run of the actor!
To avoid this, the trigger method was changed from returning a boolean to the message ID, negative if an error occurred, to be passed to getState to select only messages that pertain to the present or future runs.
The converter was not properly initiating the external links for each domain, causing an NPE in conversion. This needs to be loaded later since we don't know the domain we're processing until we've seen it in the crawl data.
Also made some refactorings to make finding converter bugs easier, and finding the related domain less awkward from the SerializableCrawlData interface.
The code now includes an additional function in the DomainProcessor class that checks if a domain is associated with academia. An academic domain is identified by the ".edu" TLD, or fits a specific regex pattern matching domains like *.ac.ccTld or *.edu.ccTld.
If these conditions are met, the search term "special:academia" is added to the domain.
The existing academia search filter uses personalized pagerank to select academia-adjacent domains, but it isn't working very well. The hope is that filtering on domain names will be more effective, and that it can supplant the ranking-based approach.
Partially hook in the WarcRecorder into the crawler process. So far it's not read, but should record the crawled documents.
The WarcRecorder and HttpFetcher classes were also refactored and broken apart to be easier to reason about.
This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted. This component is currently not hooked into anything.
The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'.
The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.
This functionality needs to be accessed by the WarcSideloader, which is in the converter. The resultant microlibrary is tiny, but I think in this case it's justifiable.
This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.
In the future this logic probably needs to move into a separate
service, as it's still quite slow to load. But this fixes response
times and DOS potential of previous version.
If a process is violently terminated, the associated file storage may get stuck in the ephemeral 'NEW' state, preventing future operations on the associated data.
To remedy this without having to dig through the database, a button was added to reset the state. It's a band-aid, but the situation is rare enough that I think it's fine.
The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId. In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.
Wrapping these exceptions in a try-catch and logging them with slf4j will ensure they end up in the process logs.
The way it worked using the default exception handler, they'd print on console (which nothing captures!), leading to a very annoying debugging experience.
Tricky problem, creating a procedure apparently needs delimiter shenanigans in Flyway, otherwise it will truncate the END statement and mariadb will be sad.
This behavior is an old vestige from the days of only having a single loader process. We'd truncate the links table because doing inserts/updates was too slow. This was also important because we had 32 bit ID, and there's a lot of links between domains to go around...
Instead we delete the rows associated with the current node with a stored procedure PURGE_LINKS_TABLE.
We also update the PRIMARY KEY to a BIGINT. We'll need to load the data in excess of billion times to hit an ID rollover, so it'll be fine.
* Added new (optional) model file in $WMSA_HOME/data/atags.parquet
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking
Don't log the PROCESS stream to executor's logs, as it will also be logged in the spawned process' log files.
Also tell the spawned process which "service" it is so that it gets a log file with a name that makes sense.
This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss.
It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.
* Encyclopedia sideloader; permit providing base URL.
* Storage base shows node id in GUI
* ProcessLivenessMonitorActor restarts automatically
* Clean-up of outbox code