This enables using an external Docker registry for production and test.
Pass -Pdocker-registry=registry.foo.bar -Pdocker-tag=my-tag to ./gradle while building to accomplish this. By default, the repository is 'marginalia' and the tag is 'latest'.
This behavior is an old vestige from the days of only having a single loader process. We'd truncate the links table because doing inserts/updates was too slow. This was also important because we had 32-bit IDs, and there's a lot of links between domains to go around...
Instead we delete the rows associated with the current node with a stored procedure PURGE_LINKS_TABLE.
We also update the PRIMARY KEY to a BIGINT. We'd need to load the data in excess of a billion times to hit an ID rollover, so it'll be fine.
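
The rollover estimate can be checked with a little arithmetic. The sketch below assumes signed key columns and a hypothetical one billion link rows per load; neither figure comes from the change itself.

```java
/** Rough headroom estimate for an auto-increment key. The rows-per-load figure is an assumption. */
final class IdRollover {
    /** Number of complete loads that fit before a key with the given maximum value is exhausted. */
    static long loadsUntilRollover(long maxId, long rowsPerLoad) {
        if (rowsPerLoad <= 0) {
            throw new IllegalArgumentException("rowsPerLoad must be positive");
        }
        return maxId / rowsPerLoad;
    }

    public static void main(String[] args) {
        long rowsPerLoad = 1_000_000_000L; // hypothetical: one billion link rows per load

        // Signed 32-bit key: 2^31 - 1 possible IDs
        System.out.println("32-bit: " + loadsUntilRollover(Integer.MAX_VALUE, rowsPerLoad)); // 2
        // Signed 64-bit key (BIGINT): 2^63 - 1 possible IDs
        System.out.println("64-bit: " + loadsUntilRollover(Long.MAX_VALUE, rowsPerLoad)); // 9223372036
    }
}
```

Under these assumptions a 32-bit key is exhausted after two loads, while a BIGINT key lasts for roughly nine billion.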
Support for anchor tag keywords
* Added a new (optional) model file in $WMSA_HOME/data/atags.parquet. Due to size limitations on GitHub, it is available at https://downloads.marginalia.nu/exports
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* The ranking algorithm was tweaked to make better use of ngram information, and the weighting of the priority BM25 score was adjusted
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high a topology ranking
* Crawler will also use the anchor tag file to prioritize crawling documents with external links.
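
As an illustration of the crawler prioritization mentioned in the list above, here is a minimal sketch of ordering crawl candidates so that documents with known external links come first. The class name, the method, and the use of plain URL strings are hypothetical and not taken from the actual crawler.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: move URLs that appear in the anchor tag data to the front of the queue. */
final class CrawlPriority {
    static List<String> prioritize(List<String> candidates, Set<String> externallyLinked) {
        List<String> ordered = new ArrayList<>(candidates);
        // false sorts before true, so externally linked URLs come first.
        // List.sort is stable, so the original order is kept within each group.
        ordered.sort(Comparator.comparing((String url) -> !externallyLinked.contains(url)));
        return ordered;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of(
                "https://example.com/a",
                "https://example.com/b",
                "https://example.com/c");
        // Assumed to be derived from the anchor tag file
        Set<String> externallyLinked = Set.of("https://example.com/c");

        // "https://example.com/c" is printed first, then a and b in their original order
        prioritize(candidates, externallyLinked).forEach(System.out::println);
    }
}
```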
Don't log the PROCESS stream to the executor's logs, as it will also be logged in the spawned process's log files.
Also tell the spawned process which "service" it is so that it gets a log file with a name that makes sense.
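
One way to get this behavior with the standard ProcessBuilder API is sketched below. The system property name `service-name` and the helper itself are assumptions for illustration; the actual mechanism used by the executor may differ.

```java
import java.lang.ProcessBuilder.Redirect;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of configuring a spawned process; it does not start the process. */
final class SpawnSketch {
    static ProcessBuilder configure(String serviceName, List<String> processArgs) {
        List<String> command = new ArrayList<>();
        command.add("java");
        // Assumed property name: tells the child which service it is,
        // so its logging setup can pick a sensible log file name
        command.add("-Dservice-name=" + serviceName);
        command.addAll(processArgs);

        ProcessBuilder pb = new ProcessBuilder(command);
        // The child writes its own log files, so its output is not copied into the parent's logs
        pb.redirectOutput(Redirect.DISCARD);
        pb.redirectError(Redirect.DISCARD);
        return pb;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = configure("converter", List.of("-cp", "app.jar", "com.example.Main"));
        System.out.println(String.join(" ", pb.command()));
    }
}
```

Discarding the streams is the simplest option; if the output is still needed for error detection, it could instead be read by the parent and simply not forwarded to its logger.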