Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.
While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.
Cleaning out a lot of old junk from the code, and one thing led to another...
* Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds.
* The ProcessService's spawning is smarter. It will now just spawn a Java process instead of relying on the application plugin's generated outputs.
* Project is migrated to GraalVM
* gRPC clients are re-written with a neat fluent/functional style. e.g.
```
channelPool.call(grpcStub::method)
           .async(executor) // <-- optional
           .run(argument);
```
This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall. (A fuller sketch of the shape follows after this list.)
* For now, the project is all in on Zookeeper
* Service discovery is now based on APIs and not services. In theory this means we could ship the same code as either a monolith or a service mesh.
* To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service. WIP!
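For the curious, a rough sketch of what the fluent gRPC wrapper could look like. Only call()/async()/run() are from the snippet above; the class names and the error handling here are assumptions:

```
import io.grpc.StatusRuntimeException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;

class ChannelPool {
    /** channelPool.call(grpcStub::method) binds a unary stub method. */
    <ARG, RET> CallBuilder<ARG, RET> call(Function<ARG, RET> method) {
        return new CallBuilder<>(method);
    }
}

class CallBuilder<ARG, RET> {
    private final Function<ARG, RET> method;

    CallBuilder(Function<ARG, RET> method) { this.method = method; }

    /** Synchronous execution; a single choke point for channel errors. */
    RET run(ARG argument) {
        try {
            return method.apply(argument);
        }
        catch (StatusRuntimeException e) {
            // react to ManagedChannel failures here: log, mark the channel
            // unhealthy in the pool, possibly retry on a fresh channel
            throw e;
        }
    }

    /** Optional asynchronous variant, as in the example above. */
    AsyncCall<ARG, RET> async(Executor executor) {
        return new AsyncCall<>(method, executor);
    }
}

record AsyncCall<ARG, RET>(Function<ARG, RET> method, Executor executor) {
    CompletableFuture<RET> run(ARG argument) {
        return CompletableFuture.supplyAsync(() -> method.apply(argument), executor);
    }
}
```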
Still missing: documentation, testing, and some more breaking apart of code.
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.
A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.
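In rough terms, the registry abstraction might look like the sketch below. The method names, and the use of Apache Curator for the Zookeeper plumbing, are assumptions for illustration:

```
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;
import java.nio.charset.StandardCharsets;

interface ServiceRegistry {
    void register(String api, String hostPort) throws Exception;
    String lookup(String api) throws Exception;
}

class ZkServiceRegistry implements ServiceRegistry {
    private final CuratorFramework client;

    ZkServiceRegistry(String connectString) {
        client = CuratorFrameworkFactory.newClient(connectString,
                new ExponentialBackoffRetry(1000, 3));
        client.start();
    }

    @Override
    public void register(String api, String hostPort) throws Exception {
        // ephemeral node: the registration evaporates if the service dies
        client.create()
              .creatingParentsIfNeeded()
              .withMode(CreateMode.EPHEMERAL)
              .forPath("/services/" + api,
                       hostPort.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public String lookup(String api) throws Exception {
        return new String(client.getData().forPath("/services/" + api),
                          StandardCharsets.UTF_8);
    }
}
```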
The last remaining REST service, the assistant-service, has been migrated to gRPC.
This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels.
Since it's no longer used by anything, RxJava has been removed as a dependency from the project.
Although the current state seems reasonably stable, this is a work-in-progress commit.
The readme for the array library was extremely out of date. Updated it with accurate information about how the library works, and a demo that should compile.
Also added a system property for disabling the use of sun.misc.Unsafe.
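The property name below is a placeholder (the actual name isn't spelled out here); the pattern is just to read the flag once and fall back to a non-Unsafe code path:

```
class UnsafeSupport {
    private static final boolean DISABLED =
            Boolean.getBoolean("system.noSunMiscUnsafe"); // hypothetical name

    static boolean useUnsafe() {
        return !DISABLED && unsafeAvailable();
    }

    private static boolean unsafeAvailable() {
        try {
            Class.forName("sun.misc.Unsafe");
            return true;
        }
        catch (ClassNotFoundException e) {
            return false; // fall back to ByteBuffer-based implementations
        }
    }
}
```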
The RandomFileAssembler implementations, introduced in commit 53c575db3f, were all acting subtly differently. The RWF implementation wrote big-endian longs instead of the native endianness used by the other implementations (and expected by the index construction code). Further, the mmap implementation exposed a bug in LongArray.write() that caused it to create a larger file than necessary.
A test was built to ensure the output of these implementations is equivalent.
To cope with writing large files out of order, the system needs some form of strategy to avoid writing them directly to disk, as this causes insane amounts of disk thrashing. By default, the data is just buffered in RAM. This works well on a large server, but smaller systems struggle.
To help systems with small RAM process large amounts of data, the old RandomWriteFunnel is brought back if the system property 'system.conserve-memory' is set to true. RandomWriteFunnel buffers the construction by creating a series of small files that pigeonhole the writes into rough neighborhoods, then goes over the files one by one to construct one area of the output at a time. This is relatively slow and uses more than twice the disk size.
A new interface, RandomFileAssembler, is introduced as an abstraction for this operation. A third strategy, direct mmaps, is also introduced if the file is very small (less than 1 GB). In this domain disk thrashing is unlikely, since the file will comfortably fit in RAM.
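The selection logic, roughly. The enum and method names are invented for illustration, but the three strategies, the 1 GB threshold, and the 'system.conserve-memory' property are as described above:

```
enum AssemblyStrategy {
    MMAP,          // small files (< 1 GB): write through a memory mapping
    IN_MEMORY,     // default: buffer the whole file in RAM, write once
    WRITE_FUNNEL;  // conserve-memory: pigeonhole writes into temp files

    static AssemblyStrategy choose(long fileSizeBytes) {
        if (fileSizeBytes < 1L << 30)  // < 1 GB fits comfortably in RAM
            return MMAP;
        if (Boolean.getBoolean("system.conserve-memory"))
            return WRITE_FUNNEL;       // slow, >2x disk, but gentle on RAM
        return IN_MEMORY;
    }
}
```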
The flag is `system.languageDetectionModelVersion`.
* If negative, no model is used.
* If 0, both models are used.
* If 1, the old crappy model is used.
* If 2, the new fasttext model is used.
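Interpreted as a gate, the flag could be read like this sketch (the default when the property is unset is an assumption):

```
class LanguageDetectorSelection {
    static final int VERSION =
            Integer.getInteger("system.languageDetectionModelVersion", 0);

    static boolean useNoModel()       { return VERSION < 0; }
    static boolean useOldModel()      { return VERSION == 0 || VERSION == 1; }
    static boolean useFasttextModel() { return VERSION == 0 || VERSION == 2; }
}
```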
The production configuration assumes all content of interest is 7 bit ASCII, and makes a series of optimizations based on this. This assumption holds poorly in the wild.
Adding an **experimental** system property, 'system.noFlattenUnicode', which when set to TRUE disables this behavior.
IMPORTANT!! The index needs to be re-constructed when this flag is changed, as different hash functions are selected for the keyword->identifier mappings.
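A sketch of why re-construction is needed: the flag picks the normalization applied before hashing, so the same keyword hashes differently under each setting. The helper names here are hypothetical:

```
class KeywordHasher {
    private static final boolean NO_FLATTEN =
            Boolean.getBoolean("system.noFlattenUnicode");

    long hashKeyword(String keyword) {
        // the flag decides the normalization, and hence the hash, so an
        // index built under one setting is unreadable under the other
        String normalized = NO_FLATTEN ? keyword : flattenToAscii(keyword);
        return normalized.hashCode(); // stand-in for the real hash function
    }

    private String flattenToAscii(String s) {
        var sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(c < 128 ? c : '?'); // crude stand-in for the real mapping
        }
        return sb.toString();
    }
}
```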
Instances of java.lang.Error were not handled properly, leading to mismatches in the bookkeeping of the FSMs. These are now caught, acted on, and re-thrown.
MqSynchronousInbox also no longer assumes all exceptions are InterruptedException.
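The pattern, roughly (method names are hypothetical):

```
class StateTransitionRunner {
    void runStateTransition(Runnable transition) {
        try {
            transition.run();
        }
        catch (Error e) {
            markTransitionFailed(); // keep the FSM bookkeeping consistent
            throw e;                // ...but let the Error propagate
        }
        catch (RuntimeException e) {
            markTransitionFailed(); // exceptions are handled, not re-thrown
        }
    }

    void markTransitionFailed() { /* update the FSM's state tables */ }
}
```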
The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID. This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers.
The existing RELATED_ID field has too many semantics associated with it; among other things, the FSM code uses this field in tracking state changes.
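A sketch of the thread-ID-to-message-ID dictionary described above (class and method names are assumptions):

```
import java.util.concurrent.ConcurrentHashMap;

class AuditContext {
    private static final ConcurrentHashMap<Long, Long> threadToMessage =
            new ConcurrentHashMap<>();

    /** Called by an inbox handler before dispatching a message. */
    static void enter(long messageId) {
        threadToMessage.put(Thread.currentThread().threadId(), messageId);
    }

    /** Consulted when inserting a message, to populate AUDIT_RELATED_ID. */
    static Long currentAuditRelation() {
        return threadToMessage.get(Thread.currentThread().threadId());
    }

    static void exit() {
        threadToMessage.remove(Thread.currentThread().threadId());
    }
}
```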
The change set also improves the consistency of inbox names. The IndexClient was buggy and populated its outbox with a UUID. This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.
Use a system.properties file to configure the system. This is loaded statically by MainClass or ProcessMainClass. Update the property names to be more consistent, and update the documentation to reflect the changes.
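Roughly what the static loading amounts to; the file location here is an assumption:

```
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

class SystemPropertiesLoader {
    static void load() {
        Path propsFile = Path.of("system.properties"); // assumed location
        if (!Files.exists(propsFile))
            return;

        try (InputStream is = Files.newInputStream(propsFile)) {
            var props = new Properties();
            props.load(is);
            // copy into System properties so Boolean.getBoolean() etc. work
            props.forEach((k, v) -> System.setProperty((String) k, (String) v));
        }
        catch (IOException e) {
            throw new IllegalStateException("Failed to read " + propsFile, e);
        }
    }
}
```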
Several tests were manually running migrations in a large copy-paste blob of code. This makes the tests less useful, as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing.
A new test helper library is introduced with a TestMigrationLoader that can either run Flyway migrations, or load specific migrations in the cases where a particular set of migrations needs to be loaded. Existing tests are migrated to use the new code.
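A sketch of what the helper might look like; the Flyway call is the standard API, but the class and method names here are guesses:

```
import org.flywaydb.core.Flyway;

import javax.sql.DataSource;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

class TestMigrationLoader {
    /** Run every migration, exactly as production would. */
    static void flywayMigration(DataSource dataSource) {
        Flyway.configure().dataSource(dataSource).load().migrate();
    }

    /** Load only the named migration scripts, for targeted tests. */
    static void loadMigrations(DataSource ds, String... resources) throws Exception {
        try (var conn = ds.getConnection(); var stmt = conn.createStatement()) {
            for (String resource : resources) {
                String sql = new String(Objects.requireNonNull(
                        TestMigrationLoader.class.getResourceAsStream(resource))
                        .readAllBytes(), StandardCharsets.UTF_8);
                for (String statement : sql.split(";")) {
                    if (!statement.isBlank())
                        stmt.execute(statement);
                }
            }
        }
    }
}
```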
Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.
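The mechanism is just a guard before sentence extraction; the constant's value below is illustrative, not the actual limit:

```
class SentenceExtractorLimits {
    static final int MAX_TEXT_LENGTH = 65_536; // illustrative value only

    static String truncate(String text) {
        if (text.length() > MAX_TEXT_LENGTH)
            return text.substring(0, MAX_TEXT_LENGTH);
        return text;
    }
}
```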
Modify ProcessingIterator to be constructed via a factory, to enable re-use of its backing executor service.
This reduces thread churn in the converter sideloader style processing of regular crawl data.
Updated ProcessingIterator's queue polling interval from one second to 50 milliseconds for improved performance. This facilitates faster document processing across more cores, reducing bottlenecks from slow single-threaded processing.
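Sketched out, the factory arrangement looks something like this (names beyond ProcessingIterator are assumptions):

```
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ProcessingIteratorFactory {
    // one executor shared by every iterator the factory creates, instead
    // of a fresh pool per iterator, which churned threads in the converter
    private final ExecutorService executor;
    static final int POLL_INTERVAL_MS = 50; // was 1000 ms

    ProcessingIteratorFactory(int threads) {
        executor = Executors.newFixedThreadPool(threads);
    }

    // a create() method would hand the shared executor and poll interval
    // to each new ProcessingIterator
}
```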
We now use both ip2location and ASN data.
The change also adds some keywords based on autonomous system information, on a somewhat experimental basis. It would be neat to be able to exclude e.g. cloud services, or just Cloudflare, from the search results.
This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder, and adds support for information about redirects and errors due to probe failure.
It also refactors the fetch result, body extraction and content type abstractions.
In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services.
The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions.
The converter is modified to query this data to add a geoip:-keyword to documents, permitting a search to be limited to the country of the hosting server.
The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.
Refactored the getDocumentsStream method in EncyclopediaMarginaliaNuSideloader to use the newly extracted ProcessingIterator class, which encapsulates processing a stream of results from e.g. a database query in parallel and returning the computed results as an iterator.
The iterator was also improved to be more reliable; previous versions of the logic would sometimes deadlock due to false positives in hasMore().
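The gist of the fix, as a sketch (field and class names invented): hasNext() only reports exhaustion once the task counter has hit zero and the queue has been re-checked, instead of guessing from a possibly stale snapshot:

```
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class DrainingIterator<T> {
    final ConcurrentLinkedQueue<T> results = new ConcurrentLinkedQueue<>();
    final AtomicInteger tasksRemaining; // workers enqueue, *then* decrement

    DrainingIterator(int taskCount) {
        tasksRemaining = new AtomicInteger(taskCount);
    }

    boolean hasNext() throws InterruptedException {
        for (;;) {
            if (!results.isEmpty())
                return true;
            // since workers enqueue before decrementing, a zero counter
            // means every result is already visible in the queue
            if (tasksRemaining.get() == 0)
                return !results.isEmpty();
            TimeUnit.MILLISECONDS.sleep(50);
        }
    }
}
```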
A race condition was found where precession actors would sometimes skip a step, because when invoking ExecutorRemoteActor.getState(), it would get the last 'OK' actor state from a previous run of the actor!
To avoid this, the trigger method was changed from returning a boolean to returning the message ID (negative if an error occurred), which is passed to getState to select only messages that pertain to the present or future runs.
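In sketch form (ExecutorRemoteActor's real signatures may differ):

```
class PrecessionStep {
    interface ExecutorRemoteActor { // sketch of the real class's surface
        long trigger(String step) throws Exception;
        String getState(long fromMessageId);
    }

    static String triggerAndTrack(ExecutorRemoteActor actor, String step)
            throws Exception
    {
        long msgId = actor.trigger(step); // was boolean; now the message ID
        if (msgId < 0)
            throw new IllegalStateException("Failed to trigger " + step);

        // getState(msgId) only matches messages from this run or later, so a
        // stale 'OK' from a previous run can't be mistaken for fresh progress
        return actor.getState(msgId);
    }
}
```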
The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId. In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.