CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	e696fd9e92	(docs) Begin un-fucking the docs after refactoring	2024-02-27 21:22:21 +01:00
Viktor Lofgren	1d34224416	(refac) Remove src/main from all source code paths. Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's modular. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.	2024-02-23 16:13:40 +01:00
Viktor	f85ec28a16	Merge branch 'master' into service-discovery	2024-02-20 11:44:12 +01:00
Viktor Lofgren	0307c55f9f	(refac) Zookeeper for service-discovery, kill service-client lib (WIP) To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added. A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything. The last remaining REST service, the assistant-service, has been migrated to gRPC. This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels. Since it's no longer used by anything, RxJava has been removed as a dependency from the project. Although the current state seems reasonably stable, this is a work-in-progress commit.	2024-02-20 11:41:14 +01:00
Viktor	d05c916491	Merge pull request #80 from MarginaliaSearch/ranking-algorithms Clean up domain ranking code	2024-02-18 09:52:34 +01:00
Viktor Lofgren	e61e7f44b9	(blacklist) Delay startup of blacklist To help services start faster, the blacklist will no longer block until it's loaded. If such a behavior is desirable, a method was added to explicitly wait for the data.	2024-02-18 09:23:20 +01:00
Viktor Lofgren	296ccc5f8e	(blacklist) Clean up blacklist impl The domain blacklist blocked the start-up of each process that injected it, adding like 30 seconds to the start-up time in prod. This change moves the loading to a separate thread entirely. For threads or processes that require the blacklist to be definitely loaded, a helper method was added that blocks until that time.	2024-02-18 08:16:48 +01:00
Viktor Lofgren	9ec262ae00	(domain-ranking) Integrate new ranking logic The change deprecates the 'algorithm' field from the domain ranking set configuration. Instead, the algorithm will be chosen based on whether influence domains are provided, and whether similarity data is present.	2024-02-16 20:22:01 +01:00
Viktor Lofgren	b15f47d80e	(db) Retire the EC_DOMAIN_LINK table Retire the EC_DOMAIN_LINK table as the data has been migrated off into a file instead.	2024-02-08 15:52:30 +01:00
Viktor Lofgren	c088c25b09	(*) Fix broken test, clean up code	2024-01-24 12:50:41 +01:00
Viktor Lofgren	c5760cd535	(test) Fix broken test	2024-01-20 13:39:40 +01:00
Viktor Lofgren	6271d5d544	(mq) Add relation tracking between MQ messages for easier tracking and debugging. The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID. This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers. The existing RELATED_ID field has too many semantics associated with it; among other things, the FSM code uses this field in tracking state changes. The change set also improves the consistency of inbox names. The IndexClient was buggy and populated its outbox with a UUID. This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.	2024-01-18 15:08:27 +01:00
Viktor Lofgren	2fe5705542	(control) GUI for ranking sets Still missing is some polish, forms don't have proper labels, validation is inconsistent, no error messages, etc.	2024-01-16 17:10:09 +01:00
Viktor Lofgren	36ad4c7466	(db) Add a new configuration object 'domain ranking set' for storing ranking parameters	2024-01-16 12:34:00 +01:00
Viktor Lofgren	7c6e18f7a7	(*) Overhaul settings and properties Use a system.properties file to configure the system. This is loaded statically by MainClass or ProcessMainClass. Update the property names to be more consistent, and update the documentation to reflect the changes.	2024-01-13 17:12:18 +01:00
Viktor Lofgren	708a741960	(test) Clean up test usage of migrations Several tests were manually running migrations in a large copy-paste blob of code. This makes the tests less useful as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing. A new test helper library is introduced with a TestMigrationLoader that can either run Flyway migrations or load specific migrations in cases where a specific set of migrations needs to be loaded. Existing tests are migrated to use the new code.	2024-01-12 15:55:50 +01:00
Viktor Lofgren	0caef1b307	(warc) Toggle for saving WARC data Add a toggle for saving the WARC data generated by the search engine's crawler. Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest. The warc files are concatenated into larger archives, up to about 1 GB each. An index is also created containing filenames, domain names, offsets and sizes to help navigate these larger archives. The warc data is saved in a directory warc/ under the crawl data storage.	2024-01-12 13:45:14 +01:00
Viktor Lofgren	dd507a3808	(db) Fix migrations, bump flyway to 10.0.1 Tricky problem, creating a procedure apparently needs delimiter shenanigans in Flyway, otherwise it will truncate the END statement and mariadb will be sad.	2023-11-21 20:04:35 +01:00
Viktor Lofgren	f58a9f46be	(loader) Don't truncate the entire links table on load This behavior is an old vestige from the days of only having a single loader process. We'd truncate the links table because doing inserts/updates was too slow. This was also important because we had 32 bit IDs, and there's a lot of links between domains to go around... Instead we delete the rows associated with the current node with a stored procedure PURGE_LINKS_TABLE. We also update the PRIMARY KEY to a BIGINT. We'll need to load the data in excess of a billion times to hit an ID rollover, so it'll be fine.	2023-11-16 10:30:12 +01:00
Viktor Lofgren	7aa2f80117	(domain) id.au should be treated as a TLD	2023-11-06 19:07:47 +01:00
Viktor Lofgren	2b3c167845	(controller) Additional configuration options for node	2023-10-20 13:13:36 +02:00
Viktor Lofgren	23526f6d1a	(executor) Executor service now pulls DomainType list for CRAWL on "recrawl" This is an automatic integration with the submit-site repo on github and also crawl-queue.	2023-10-19 17:48:34 +02:00
Viktor Lofgren	23f0c79fba	(control) GUI for data sets/domain types.	2023-10-19 17:48:34 +02:00
Viktor Lofgren	81dd3809e9	(*) WIP Add node affinity to EC_DOMAIN Very messy commit due to fractalline yak shaving	2023-10-19 17:48:34 +02:00
Viktor Lofgren	2df3e0f881	(node) Nodes auto-configure on start-up instead of requiring manual configuration.	2023-10-16 14:46:35 +02:00
Viktor Lofgren	16e0738731	(*) Get multi-node routing working.	2023-10-15 18:38:30 +02:00
Viktor Lofgren	a9dff407a1	(config/db) Clean up migrations	2023-10-14 20:34:03 +02:00
Viktor Lofgren	4baf9527d7	(*) WIP Control GUI redesign, executor-service, multi-node mq This turned out to be very difficult to do in small isolated steps. * Design overhaul of the control gui using bootstrap * Move the actors out of control-service into a new executor-service, that can be run on multiple nodes * Add node-affinity to message queue	2023-10-14 12:08:43 +02:00
Viktor Lofgren	199c459697	(*) Add node-affinity to services, processes and file storage.	2023-10-10 12:32:22 +02:00
Viktor Lofgren	c51159672e	(build) Move unit test configuration to root build.gradle	2023-10-04 12:46:22 +02:00
Viktor Lofgren	233b51e29e	(test) flag DomainTypesTest as Slow to exclude from regular CI	2023-10-04 12:23:10 +02:00
Viktor Lofgren	13ee31770a	(file storage) Make it possible to override the value returned by getFileStorage(type) with a JVM property.	2023-10-01 12:57:53 +02:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	9338f35cd8	(doc) Remove confusingly outdated ER-diagrams	2023-09-21 15:08:27 +02:00
Viktor Lofgren	75f8ae2815	(file-storage) Use human-readable timestamps in the names of file storage directories	2023-09-21 13:22:53 +02:00
Viktor Lofgren	5c040f7a46	(crawl-spec) Parquetify crawl spec * Crawl-specs are now parquet files * Deprecate the crawl-job-extractor tool	2023-09-17 09:41:34 +02:00
Viktor Lofgren	9e185e80ce	(control-service) Add timestamp to file storages.	2023-09-02 14:01:04 +02:00
Viktor Lofgren	3101b74580	(index) Move to a lexicon-free index design This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it also added a fairly significant RAM penalty to both the index service and the loader. The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices. It also became necessary half-way through to upgrade guice as its error reporting wasn't quite compatible with JDK20.	2023-08-28 14:02:23 +02:00
Viktor Lofgren	194a6057dd	(index,control) Recoverable index backups	2023-08-25 14:57:43 +02:00
Viktor Lofgren	e710e057e2	(db) Remove EC_URL and EC_PAGE_DATA from mariadb database	2023-08-25 13:45:03 +02:00
Viktor Lofgren	1e6800565a	(system) Remove EdgeId<T> and similar objects They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.	2023-08-24 17:46:02 +02:00
Viktor Lofgren	b958acb76a	(file-storage) New File Storage type for linkdb	2023-08-24 09:06:13 +02:00
Viktor Lofgren	ebc84c22fb	Upgrade antique lombok plugin This permits tests to run on JDK20 environments.	2023-08-23 14:34:32 +00:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00
Viktor Lofgren	d6b8b38955	(db) Add indices on SERVICE_EVENTLOG	2023-08-12 15:00:15 +02:00
Viktor Lofgren	6483308bb0	(sql) Update default value for DOMAIN_SELECTION_TYPE	2023-08-11 14:01:15 +02:00
Viktor Lofgren	7440da240d	(blacklist) Fix broken SQL migration	2023-08-11 13:33:35 +02:00
Viktor Lofgren	4f8048be31	(blacklist) Blacklist management	2023-08-10 15:40:07 +02:00
Viktor Lofgren	251fc63b42	(*) Fix merge gore	2023-08-09 13:33:28 +02:00
Viktor Lofgren	afad4f5ebb	(*) last touches	2023-08-07 12:59:33 +02:00

78 commits