* (executor-api) Make executor API talk GRPC
The executor's REST API was very fragile and annoying to work with, lacking even basic type safety. Migrate to use GRPC instead. GRPC is a bit of a pain with how verbose it is, but that is probably a lesser evil. This is a fairly straightforward change, but it's also large so a solid round of testing is needed...
The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients.
ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name().
The boilerplate needed for grpc was also extracted into a common gradle file for inclusion into the appropriate build.gradle-files.
This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets. Since only the first is necessary before the index construction, the rest can be delayed until after...
To avoid issues in handling the shotgun blast of MqNotifications, Service was switched over to use a synchronous message queue instead of an asynchronous one.
The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader.
Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data.
Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead.
To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.
This changeset adds an action for downloading a set of sample data from downloads.marginalia.nu.
It also refactors out some leaky abstractions out of FileStorageService. allocateTemporaryStorage has been renamed allocateStorage. The storage was never temporary in any scenario...
It also doesn't take a storage base, as there was always only one valid option for this input. The allocateStorage method finds the appropriate base itself.
This will avoid having to dig in the message queue to perform this relatively common task.
The control service was also refactored to extract common timestamp formatting logic out of the data objects and into the rendering.
The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID. This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers.
The existing RELATED_ID field has too many semantics associated with them,
among other things the FSM code uses them this field in tracking state changes.
The change set also improves the consistency of inbox names. The IndexClient was buggy and populated its outbox with a UUID. This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.
It's a confusing default behavior.
This was off for nodes n>1 before as a bandaid since querying indices with no data caused delays and errors. This has been fixed now, so there's no need to do this anymore!
Use a system.properties file to configure the system. This is loaded statically by MainClass or ProcessMainClass. Update the property names to be more consistent, and update the documentations to reflect the changes.
Several tests were manually running migrations in a large copy-paste blob of code. This makes the test less useful as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing.
A new test helper library is introduced with a TestMigrationLoader that can both run Flyway migrations, or load specific migrations in the cases a specific set of migrations need to be loaded. Existing tests are migrated to use the new code.
Add a toggle for saving the WARC data generated by the search engine's crawler. Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest.
The warc files are concatenated into larger archives, up to about 1 GB each.
An index is also created containing filenames, domain names, offsets and sizes
to help navigate these larger archives.
The warc data is saved in a directory warc/ under the crawl data storage.
This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views. It has many small tweaks to make the work flow better.
It also adds a new /uploads directory in each index node, from which sideloaded data can be selected. This is a bit of a breaking change, as this directory needs to exist in each index node.
The changeset also makes the control service responsible for flyway migrations. This helps reduce the number of places the database configuration needs to be spread out. These automatic migrations can be disabled with -DdisableFlyway=true.
The commit also adds curl to the docker container, to enable docker health checks and interdependencies.
This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.
The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB). This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need.
This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains. This file is loaded in memory in each node, and can be queried via the Query Service.
A migration step is needed before this file is created in each node. Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file.
The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.
Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
In encyclopedia, add a class "mw-content-text" that the WikiSpecialization class is looking for during pruning to give the articles a more fair treatment.
Also add generator keywords based on the generator type provided, to ensure that these documents show up in appropriate filters.
Further, add a new document flag value 'Sideloaded' to be able to distinguish these entries.
At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly.
This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled.
The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.
This reduces the impact of restarting the search service, as the site information takes a few minutes to load during which it's not available. It also permits exposing this information via API in the future if there is interest in this.
The assistant service was also modified to do a late load of the suggestions trie, as this is a major contributor to its start-up time.
Finally, some changes were made to the client library, a new get() method was added that takes a TypeToken to allow deserialization of generics such as List<Foo>, and the scheduler was also modified to use virtual threads.
This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted. This component is currently not hooked into anything.
The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'.
The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.
The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId. In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.
Tricky problem, creating a procedure apparently needs delimiter shenanigans in Flyway, otherwise it will truncate the END statement and mariadb will be sad.
This behavior is an old vestige from the days of only having a single loader process. We'd truncate the links table because doing inserts/updates was too slow. This was also important because we had 32 bit ID, and there's a lot of links between domains to go around...
Instead we delete the rows associated with the current node with a stored procedure PURGE_LINKS_TABLE.
We also update the PRIMARY KEY to a BIGINT. We'll need to load the data in excess of billion times to hit an ID rollover, so it'll be fine.
* Added new (optional) model file in $WMSA_HOME/data/atags.parquet
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking
Don't log the PROCESS stream to executor's logs, as it will also be logged in the spawned process' log files.
Also tell the spawned process which "service" it is so that it gets a log file with a name that makes sense.