The sideload instruction in the stackexchange template was updated. The instruction now states that stackexchange data will be loaded from a directory on the server and directs users to a new documentation url for more detailed information.
Previously, in order to load encyclopedia data into the search engine, it was necessary to use the encyclopedia.marginalia.nu converter to first create a .db-file. This isn't very ergonomic, so parts of that code-base was lifted in as a 3rd party library, and conversion from .zim to .db is now done automatically.
The output file name is based on the original filename, plus a crc32 hash and a .db-ending, to ensure we can recycle the data on repeat loads.
The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID. This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers.
The existing RELATED_ID field has too many semantics associated with them,
among other things the FSM code uses them this field in tracking state changes.
The change set also improves the consistency of inbox names. The IndexClient was buggy and populated its outbox with a UUID. This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.
In some scenarios, such as when restoring storage items from json-manifest on db failure, the file storage view would present the items in a non-chronological order. Added a sort() operation to mitigate this.
Use a system.properties file to configure the system. This is loaded statically by MainClass or ProcessMainClass. Update the property names to be more consistent, and update the documentations to reflect the changes.
The EC_DOMAIN_LINK table is deprecated and slated for removal, use QueryClient.getAllDomainLinks() instead.
The ExportDataActor now uses the QueryClient appropriately. The CSV format was also changed to quote the values, to prevent e.g. Excel from interpreting the comma as a decimal separator when previewing the file.
Finally the form for triggering an export was overhauled.
Several tests were manually running migrations in a large copy-paste blob of code. This makes the test less useful as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing.
A new test helper library is introduced with a TestMigrationLoader that can both run Flyway migrations, or load specific migrations in the cases a specific set of migrations need to be loaded. Existing tests are migrated to use the new code.
Add a toggle for saving the WARC data generated by the search engine's crawler. Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest.
The warc files are concatenated into larger archives, up to about 1 GB each.
An index is also created containing filenames, domain names, offsets and sizes
to help navigate these larger archives.
The warc data is saved in a directory warc/ under the crawl data storage.
This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views. It has many small tweaks to make the work flow better.
It also adds a new /uploads directory in each index node, from which sideloaded data can be selected. This is a bit of a breaking change, as this directory needs to exist in each index node.
The changeset also makes the control service responsible for flyway migrations. This helps reduce the number of places the database configuration needs to be spread out. These automatic migrations can be disabled with -DdisableFlyway=true.
The commit also adds curl to the docker container, to enable docker health checks and interdependencies.
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.
Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
A race condition was found where precession actors would sometimes skip a step, because when invoking ExecutorRemoteActor.getState(), it would get the last 'OK' actor state from a previous run of the actor!
To avoid this, the trigger method was changed from returning a boolean to the message ID, negative if an error occurred, to be passed to getState to select only messages that pertain to the present or future runs.
If a process is violently terminated, the associated file storage may get stuck in the ephemeral 'NEW' state, preventing future operations on the associated data.
To remedy this without having to dig through the database, a button was added to reset the state. It's a band-aid, but the situation is rare enough that I think it's fine.