If a process is violently terminated, the associated file storage may get stuck in the ephemeral 'NEW' state, preventing future operations on the associated data.
To remedy this without having to dig through the database, a button was added to reset the state. It's a band-aid, but the situation is rare enough that I think it's fine.
The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId. In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.
Wrapping these exceptions in a try-catch and logging them with slf4j will ensure they end up in the process logs.
Previously, with the default exception handler, they'd only print to the console (which nothing captures!), leading to a very annoying debugging experience.
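A minimal sketch of the wrapping idea, not the actual change. LoggedTask, logged and the error sink are hypothetical names, and a BiConsumer stands in for the slf4j logger so the example runs without dependencies. With slf4j the sink would presumably be something like (msg, ex) -> logger.error(msg, ex).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical sketch; none of these names come from the codebase.
class LoggedTask {
    /**
     * Wraps a task so that any exception it throws is handed to errorSink
     * instead of reaching the thread's default uncaught-exception handler
     * (which would only print to the console).
     */
    static Runnable logged(String name, Runnable task,
                           BiConsumer<String, Throwable> errorSink) {
        return () -> {
            try {
                task.run();
            } catch (Exception ex) {
                // Assumption: with slf4j this would be logger.error(msg, ex)
                errorSink.accept("Uncaught exception in " + name, ex);
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> captured = new ArrayList<>();

        Thread t = new Thread(logged("demo-task",
                () -> { throw new IllegalStateException("boom"); },
                (msg, ex) -> captured.add(msg + ": " + ex.getMessage())));
        t.start();
        t.join(); // join() makes the worker's writes visible here

        System.out.println(captured); // [Uncaught exception in demo-task: boom]
    }
}
```

The point of the wrapper is that the failure ends up wherever the sink writes, rather than on a console nobody reads.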
Tricky problem: creating a procedure apparently needs delimiter shenanigans in Flyway; otherwise it will truncate the END statement and MariaDB will be sad.
This is to enable running an external repository for production and test.
Pass -Pdocker-registry=registry.foo.bar -Pdocker-tag=my-tag to ./gradle while building to accomplish this. By default, 'marginalia' is used as the repository and 'latest' as the tag.
This behavior is an old vestige from the days of only having a single loader process. We'd truncate the links table because doing inserts/updates was too slow. This was also important because we had 32-bit IDs, and there's a lot of links between domains to go around...
Instead we delete the rows associated with the current node with a stored procedure PURGE_LINKS_TABLE.
We also update the PRIMARY KEY to a BIGINT. We'll need to load the data in excess of a billion times to hit an ID rollover, so it'll be fine.
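A rough back-of-envelope check of the rollover claim, as a sketch only. It assumes a signed 64-bit BIGINT and IDs that are never reused between loads. The number of links per load is not stated here, so the code works backwards from a billion loads to the row count each load would need; IdRollover and rowsPerLoadToExhaust are hypothetical names.

```java
// Hypothetical sketch; assumes signed ID columns and no ID reuse.
class IdRollover {
    /**
     * Number of rows each full reload would have to insert for the ID
     * counter to reach maxId after the given number of reloads
     * (integer division, so the result is rounded down).
     */
    static long rowsPerLoadToExhaust(long maxId, long loads) {
        return maxId / loads;
    }

    public static void main(String[] args) {
        long loads = 1_000_000_000L;

        // Signed 64-bit BIGINT: roughly 9.2 billion rows per load
        System.out.println(rowsPerLoadToExhaust(Long.MAX_VALUE, loads));    // 9223372036

        // Signed 32-bit ID for comparison: exhausted almost immediately
        System.out.println(rowsPerLoadToExhaust(Integer.MAX_VALUE, loads)); // 2
    }
}
```

Under these assumptions a billion reloads only roll the counter over if each one inserts about 9.2 billion rows, whereas a 32-bit ID leaves no comparable headroom.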
Support for anchor tag keywords
* Added a new (optional) model file in $WMSA_HOME/data/atags.parquet. Due to size limitations on GitHub, this is available at https://downloads.marginalia.nu/exports
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* The ranking algorithm was tweaked to make better use of ngram information, and to adjust the weighting of the priority BM25 score
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high a topology ranking
* Crawler will also use the anchor tag file to prioritize crawling documents with external links.