CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	36ad4c7466	(db) Add a new configuration object 'domain ranking set' for storing ranking parameters	2024-01-16 12:34:00 +01:00
Viktor Lofgren	7c6e18f7a7	(*) Overhaul settings and properties Use a system.properties file to configure the system. This is loaded statically by MainClass or ProcessMainClass. Update the property names to be more consistent, and update the documentation to reflect the changes.	2024-01-13 17:12:18 +01:00
Viktor Lofgren	708a741960	(test) Clean up test usage of migrations Several tests were manually running migrations in a large copy-paste blob of code. This makes the tests less useful as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing. A new test helper library is introduced with a TestMigrationLoader that can either run Flyway migrations or load specific migrations in cases where a specific set of migrations needs to be loaded. Existing tests are migrated to use the new code.	2024-01-12 15:55:50 +01:00
Viktor Lofgren	0caef1b307	(warc) Toggle for saving WARC data Add a toggle for saving the WARC data generated by the search engine's crawler. Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest. The warc files are concatenated into larger archives, up to about 1 GB each. An index is also created containing filenames, domain names, offsets and sizes to help navigate these larger archives. The warc data is saved in a directory warc/ under the crawl data storage.	2024-01-12 13:45:14 +01:00
Viktor Lofgren	dd507a3808	(db) Fix migrations, bump flyway to 10.0.1 Tricky problem, creating a procedure apparently needs delimiter shenanigans in Flyway, otherwise it will truncate the END statement and mariadb will be sad.	2023-11-21 20:04:35 +01:00
Viktor Lofgren	f58a9f46be	(loader) Don't truncate the entire links table on load This behavior is an old vestige from the days of only having a single loader process. We'd truncate the links table because doing inserts/updates was too slow. This was also important because we had 32-bit IDs, and there's a lot of links between domains to go around... Instead we delete the rows associated with the current node with a stored procedure PURGE_LINKS_TABLE. We also update the PRIMARY KEY to a BIGINT. We'll need to load the data in excess of a billion times to hit an ID rollover, so it'll be fine.	2023-11-16 10:30:12 +01:00
Viktor Lofgren	7aa2f80117	(domain) id.au should be treated as a TLD	2023-11-06 19:07:47 +01:00
Viktor Lofgren	2b3c167845	(controller) Additional configuration options for node	2023-10-20 13:13:36 +02:00
Viktor Lofgren	23526f6d1a	(executor) Executor service now pulls DomainType list for CRAWL on "recrawl" This is an automatic integration with the submit-site repo on github and also crawl-queue.	2023-10-19 17:48:34 +02:00
Viktor Lofgren	23f0c79fba	(control) GUI for data sets/domain types.	2023-10-19 17:48:34 +02:00
Viktor Lofgren	81dd3809e9	(*) WIP Add node affinity to EC_DOMAIN Very messy commit due to fractalline yak shaving	2023-10-19 17:48:34 +02:00
Viktor Lofgren	2df3e0f881	(node) Nodes auto-configure on start-up instead of requiring manual configuration.	2023-10-16 14:46:35 +02:00
Viktor Lofgren	16e0738731	(*) Get multi-node routing working.	2023-10-15 18:38:30 +02:00
Viktor Lofgren	a9dff407a1	(config/db) Clean up migrations	2023-10-14 20:34:03 +02:00
Viktor Lofgren	4baf9527d7	(*) WIP Control GUI redesign, executor-service, multi-node mq This turned out to be very difficult to do in small isolated steps. * Design overhaul of the control gui using bootstrap * Move the actors out of control-service into a new executor-service, that can be run on multiple nodes * Add node-affinity to message queue	2023-10-14 12:08:43 +02:00
Viktor Lofgren	199c459697	(*) Add node-affinity to services, processes and file storage.	2023-10-10 12:32:22 +02:00
Viktor Lofgren	c51159672e	(build) Move unit test configuration to root build.gradle	2023-10-04 12:46:22 +02:00
Viktor Lofgren	233b51e29e	(test) flag DomainTypesTest as Slow to exclude from regular CI	2023-10-04 12:23:10 +02:00
Viktor Lofgren	13ee31770a	(file storage) Make it possible to override the value returned by getFileStorage(type) with a JVM property.	2023-10-01 12:57:53 +02:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	9338f35cd8	(doc) Remove confusingly outdated ER-diagrams	2023-09-21 15:08:27 +02:00
Viktor Lofgren	75f8ae2815	(file-storage) Use human-readable timestamps in the names of file storage directories	2023-09-21 13:22:53 +02:00
Viktor Lofgren	5c040f7a46	(crawl-spec) Parquetify crawl spec * Crawl-specs are now parquet files * Deprecate the crawl-job-extractor tool	2023-09-17 09:41:34 +02:00
Viktor Lofgren	9e185e80ce	(control-service) Add timestamp to file storages.	2023-09-02 14:01:04 +02:00
Viktor Lofgren	3101b74580	(index) Move to a lexicon-free index design This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it also added a fairly significant RAM penalty to both the index service and the loader. The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices. It also became necessary half-way through to upgrade guice as its error reporting wasn't quite compatible with JDK20.	2023-08-28 14:02:23 +02:00
Viktor Lofgren	194a6057dd	(index,control) Recoverable index backups	2023-08-25 14:57:43 +02:00
Viktor Lofgren	e710e057e2	(db) Remove EC_URL and EC_PAGE_DATA from mariadb database	2023-08-25 13:45:03 +02:00
Viktor Lofgren	1e6800565a	(system) Remove EdgeId<T> and similar objects They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.	2023-08-24 17:46:02 +02:00
Viktor Lofgren	b958acb76a	(file-storage) New File Storage type for linkdb	2023-08-24 09:06:13 +02:00
Viktor Lofgren	ebc84c22fb	Upgrade antique lombok plugin This permits tests to run on JDK20 environments.	2023-08-23 14:34:32 +00:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00
Viktor Lofgren	d6b8b38955	(db) Add indices on SERVICE_EVENTLOG	2023-08-12 15:00:15 +02:00
Viktor Lofgren	6483308bb0	(sql) Update default value for DOMAIN_SELECTION_TYPE	2023-08-11 14:01:15 +02:00
Viktor Lofgren	7440da240d	(blacklist) Fix broken SQL migration	2023-08-11 13:33:35 +02:00
Viktor Lofgren	4f8048be31	(blacklist) Blacklist management	2023-08-10 15:40:07 +02:00
Viktor Lofgren	251fc63b42	(*) Fix merge gore	2023-08-09 13:33:28 +02:00
Viktor Lofgren	afad4f5ebb	(*) last touches	2023-08-07 12:59:33 +02:00
Viktor	52e2ab45bf	Merge branch 'master' into master-control-program	2023-08-07 12:53:43 +02:00
Viktor Lofgren	cdfe284f9a	(file storage) File Storage Type for EXPORT data	2023-08-05 14:45:03 +02:00
Viktor Lofgren	624b78ec3a	(heartbeat) Task heartbeats	2023-08-04 14:40:06 +02:00
Viktor Lofgren	f01f608474	(blacklist) Support blacklists with subdomain	2023-08-03 17:58:52 +02:00
Viktor Lofgren	659d2134ba	(file-storage) Deprecate mustClean flag	2023-08-01 22:32:30 +02:00
Viktor Lofgren	867410c66b	(file-storage) Automatic file storage discovery via manifest file	2023-08-01 18:05:43 +02:00
Viktor Lofgren	58556af6c7	(db) Use Flyway for database migrations.	2023-08-01 17:08:42 +02:00
Viktor Lofgren	2e29038ecd	(db) Fix broken insert statement, move file storage defaults to a separate file.	2023-08-01 15:50:08 +02:00
Viktor Lofgren	c1ea60b399	(db) Default values for storage base	2023-08-01 15:05:04 +02:00
Viktor Lofgren	92cac52813	(mq) Add indexes to MESSAGE_QUEUE	2023-07-28 12:03:51 +02:00
Viktor Lofgren	09fd0a1d0e	(converter) Automatically clean stale file storage records if they disappear on disk	2023-07-24 17:04:42 +02:00
Viktor Lofgren	f91d92cccb	(crawler) WIP	2023-07-20 21:05:16 +02:00
Viktor Lofgren	d7ab21fe34	(*) Refactor Control Service and processes	2023-07-17 21:20:31 +02:00

65 Commits