CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	4c62065e74	(install) Add two separate templates for the install script One template is for the full Marginalia Search style install, and the other is for a barebones install with no Marginalia-related fluff.	2024-01-13 18:27:42 +01:00
Viktor Lofgren	d28fc99119	(MainClass) ensure logging isn't loaded before service name is known This causes logs all to have names like ${sys:service-name}, instead of the service name...	2024-01-13 18:19:50 +01:00
Viktor Lofgren	7c6e18f7a7	(*) Overhaul settings and properties Use a system.properties file to configure the system. This is loaded statically by MainClass or ProcessMainClass. Update the property names to be more consistent, and update the documentation to reflect the changes.	2024-01-13 17:12:18 +01:00
Viktor Lofgren	ecd9c35233	(control) Clean up the event log * Generate fewer uninteresting event messages. * Display fewer irrelevant fields in the overview table.	2024-01-13 13:28:02 +01:00
Viktor Lofgren	8dea7217a6	(control) UX fixes, node GUI doesn't break when an executor service goes offline.	2024-01-13 12:17:30 +01:00
Viktor Lofgren	708a741960	(test) Clean up test usage of migrations Several tests were manually running migrations in a large copy-paste blob of code. This makes the tests less useful, as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing. A new test helper library is introduced with a TestMigrationLoader that can either run the Flyway migrations or load specific migrations in cases where only a specific set of migrations needs to be loaded. Existing tests are migrated to use the new code.	2024-01-12 15:55:50 +01:00
Viktor Lofgren	0caef1b307	(warc) Toggle for saving WARC data Add a toggle for saving the WARC data generated by the search engine's crawler. Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest. The warc files are concatenated into larger archives, up to about 1 GB each. An index is also created containing filenames, domain names, offsets and sizes to help navigate these larger archives. The warc data is saved in a directory warc/ under the crawl data storage.	2024-01-12 13:45:14 +01:00
Viktor Lofgren	264e2db539	(control) UX-improvements for control service This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views. It has many small tweaks to make the work flow better. It also adds a new /uploads directory in each index node, from which sideloaded data can be selected. This is a bit of a breaking change, as this directory needs to exist in each index node.	2024-01-12 12:33:05 +01:00
Viktor Lofgren	734996002c	(*) install script for deploying Marginalia outside the codebase The changeset also makes the control service responsible for flyway migrations. This helps reduce the number of places the database configuration needs to be spread out. These automatic migrations can be disabled with -DdisableFlyway=true. The commit also adds curl to the docker container, to enable docker health checks and interdependencies.	2024-01-11 12:40:03 +01:00
Viktor Lofgren	55c9501e57	(search) Serve proper content type for static resources	2024-01-10 10:46:51 +01:00
Viktor Lofgren	f592c9f04d	(search) Fix acknowledgement page for domain complaints rendering as plain text This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.	2024-01-10 09:26:34 +01:00
Viktor Lofgren	fbad625126	(linkdb) Add delegating implementation of DomainLinkDb This facilitates switching between SQL and File-backed implementations on the fly while migrating from one to the other.	2024-01-08 19:56:33 +01:00
Viktor Lofgren	edc1acbb7e	(*) Replace EC_DOMAIN_LINK table with files and in-memory caching The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB). This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need. This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains. This file is loaded in memory in each node, and can be queried via the Query Service. A migration step is needed before this file is created in each node. Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file. The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.	2024-01-08 15:53:13 +01:00
Viktor Lofgren	6aee27a3f1	(*) Fix bug in EdgeDomain where it would permit domains with a trailing period, DNS style.	2023-12-29 16:36:01 +01:00
Viktor Lofgren	a7cd490593	(minor) Remove dead code.	2023-12-19 18:58:33 +01:00
Viktor Lofgren	bde68ba48b	Merge branch 'master' into asn-info	2023-12-17 14:00:23 +01:00
Viktor Lofgren	bf44805e69	(*) Rename EdgeDomain$domain into topDomain This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time. Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.	2023-12-17 14:00:07 +01:00
Viktor Lofgren	bcad6492d6	(sideloader) Fix integration problems with sideloaders In encyclopedia, add a class "mw-content-text" that the WikiSpecialization class is looking for during pruning to give the articles a more fair treatment. Also add generator keywords based on the generator type provided, to ensure that these documents show up in appropriate filters. Further, add a new document flag value 'Sideloaded' to be able to distinguish these entries.	2023-12-17 13:28:17 +01:00
Viktor Lofgren	d7bd540683	(*) Replace the ip2location IP geolocation data with ASN information from apnic.net. Doesn't really make sense to use ip2location as a middle man for information that is already freely available...	2023-12-16 21:55:04 +01:00
Viktor Lofgren	b74a3ebd85	(crawler) WIP integration of WARC files into the crawler process. At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly. This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled. The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.	2023-12-11 19:32:58 +01:00
Viktor Lofgren	45987a1d98	Merge branch 'master' into warc	2023-12-11 14:32:35 +01:00
Viktor Lofgren	8ef34883a8	(search) Move site information out of the search service and into assistant. This reduces the impact of restarting the search service, as the site information takes a few minutes to load during which it's not available. It also permits exposing this information via API in the future if there is interest in this. The assistant service was also modified to do a late load of the suggestions trie, as this is a major contributor to its start-up time. Finally, some changes were made to the client library, a new get() method was added that takes a TypeToken to allow deserialization of generics such as List<Foo>, and the scheduler was also modified to use virtual threads.	2023-12-09 16:30:06 +01:00
Viktor Lofgren	072b5fcd12	Implement Warc-recording wrapper for OkHttp3 client This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted. This component is currently not hooked into anything. The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'. The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.	2023-12-08 13:49:16 +01:00
Viktor Lofgren	280132dad0	(search) Fix script loading for mobile support	2023-12-02 17:06:40 +01:00
Viktor Lofgren	7c8a60b8cf	(search) Site info view is mostly done Also optimize the rendering a bit to avoid having to allocate huge string buffers, writing directly to Spark's response instead.	2023-12-02 17:06:40 +01:00
Viktor Lofgren	a258f0af7a	(search) Refactor search parameters to include query	2023-12-02 17:06:40 +01:00
Viktor Lofgren	01621c6344	(renderer) Make helpers configurable on a by-service basis.	2023-12-02 17:06:40 +01:00
Viktor Lofgren	1dafa0c74d	(mqapi/control) Repair repartition endpoint, deprecate notify endpoints. The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId. In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.	2023-11-27 16:01:12 +01:00
Viktor Lofgren	dd507a3808	(db) Fix migrations, bump flyway to 10.0.1 Tricky problem, creating a procedure apparently needs delimiter shenanigans in Flyway, otherwise it will truncate the END statement and mariadb will be sad.	2023-11-21 20:04:35 +01:00
Viktor Lofgren	f58a9f46be	(loader) Don't truncate the entire links table on load This behavior is an old vestige from the days of only having a single loader process. We'd truncate the links table because doing inserts/updates was too slow. This was also important because we had 32 bit IDs, and there's a lot of links between domains to go around... Instead we delete the rows associated with the current node with a stored procedure PURGE_LINKS_TABLE. We also update the PRIMARY KEY to a BIGINT. We'll need to load the data in excess of a billion times to hit an ID rollover, so it'll be fine.	2023-11-16 10:30:12 +01:00
Viktor Lofgren	858357a246	(metrics) Get prometheus up out of disrepair * Fix bad labels * Add nodeId where appropriate * Hopefully fix histogram buckets for index query times	2023-11-08 14:01:28 +01:00
Viktor Lofgren	7aa2f80117	(domain) id.au should be treated as a TLD	2023-11-06 19:07:47 +01:00
Viktor Lofgren	2b77184281	(converter) Integrate atags with the topology field	2023-11-06 13:46:44 +01:00
Viktor Lofgren	0152004c42	Initial Commit Anchor Tags * Added new (optional) model file in $WMSA_HOME/data/atags.parquet * Converter gets a component for creating a projection of its domains onto the full atags parquet file * New WordFlag ExternalLink * These terms are also for now flagged as title words * Fixed a bug where Title words aliased with UrlDomain words * Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking	2023-11-04 14:24:17 +01:00
Viktor Lofgren	659743b39c	(executor) Export Data actor allocates its own storage	2023-10-31 17:04:07 +01:00
Viktor Lofgren	5d6e0e3790	(log) Clean up logging Don't log the PROCESS stream to executor's logs, as it will also be logged in the spawned process' log files. Also tell the spawned process which "service" it is so that it gets a log file with a name that makes sense.	2023-10-29 15:52:17 +01:00
Viktor Lofgren	0f637fb722	(logging) Better logging configurations	2023-10-26 12:48:10 +02:00
Viktor Lofgren	97fcbdd6d9	(control) Move storage actions into the actions tab * Also disable annoying CSS animations	2023-10-25 21:23:56 +02:00
Viktor Lofgren	d7686b665e	Refactoring * Encyclopedia sideloader; permit providing base URL. * Storage base shows node id in GUI * ProcessLivenessMonitorActor restarts automatically * Clean-up of outbox code	2023-10-25 18:51:02 +02:00
Viktor Lofgren	436a55ee1e	(control) Render UUID tooltip with dashes.	2023-10-24 16:37:40 +02:00
Viktor Lofgren	e4bddb4993	(control) Better UUID accessibility	2023-10-23 12:53:53 +02:00
Viktor Lofgren	758f9b5aa5	(converter) Get UUID pips out of the models Rendering concerns shouldn't be in the models, it's poor separation of concerns and very difficult to follow.	2023-10-22 14:24:52 +02:00
Viktor Lofgren	29ce8ca0cf	(db) Reduce db pool size This is a temporary thing	2023-10-22 14:03:09 +02:00
Viktor Lofgren	12fda1a36b	(control) Temporarily re-writing the data balancer to get it to work in prod Need to clean this up later.	2023-10-22 14:03:09 +02:00
Viktor Lofgren	c6abcd91fa	(control) Better use of FS states, fix bug with start/stop actors	2023-10-20 16:37:49 +02:00
Viktor Lofgren	d76d926c38	(control/executor) Add new configuration options for node It's now possible to configure prod instance to not retain processed data.	2023-10-20 14:05:19 +02:00
Viktor Lofgren	2b3c167845	(controller) Additional configuration options for node	2023-10-20 13:13:36 +02:00
Viktor Lofgren	584bb3a648	(fs) interface cleanup	2023-10-20 12:24:18 +02:00
Viktor Lofgren	23526f6d1a	(executor) Executor service now pulls DomainType list for CRAWL on "recrawl" This is an automatic integration with the submit-site repo on github and also crawl-queue.	2023-10-19 17:48:34 +02:00
Viktor Lofgren	23f0c79fba	(control) GUI for data sets/domain types.	2023-10-19 17:48:34 +02:00
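
Commit edc1acbb7e above describes replacing the EC_DOMAIN_LINK table with a file of 32 bit integer pairs that each index node loads into memory. The following is a minimal sketch of that idea only, not the project's actual DomainLinkDb code. The class name DomainLinkFileSketch, the method names, the big-endian byte order and the linear-scan lookup are all assumptions made for illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the names and the file layout are assumptions,
// not taken from the Marginalia codebase.
class DomainLinkFileSketch {

    /** Pack a (source, dest) pair of 32 bit domain ids into one long. */
    static long pack(int source, int dest) {
        return ((long) source << 32) | (dest & 0xFFFF_FFFFL);
    }

    static int source(long packed) { return (int) (packed >>> 32); }

    static int dest(long packed) { return (int) packed; }

    /** Write each link as two consecutive big-endian 32 bit integers. */
    static void write(Path file, long[] links) throws IOException {
        try (var out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (long link : links) {
                out.writeInt(source(link));
                out.writeInt(dest(link));
            }
        }
    }

    /** Load the whole file into memory; each link occupies 8 bytes. */
    static long[] load(Path file) throws IOException {
        int count = (int) (Files.size(file) / 8);
        long[] links = new long[count];
        try (var in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            for (int i = 0; i < count; i++) {
                int src = in.readInt();
                int dst = in.readInt();
                links[i] = pack(src, dst);
            }
        }
        return links;
    }

    /** Linear scan for every destination linked from the given source. */
    static List<Integer> linksFrom(long[] links, int sourceId) {
        List<Integer> result = new ArrayList<>();
        for (long link : links) {
            if (source(link) == sourceId) {
                result.add(dest(link));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("domain-links", ".dat");
        write(file, new long[] { pack(1, 2), pack(1, 3), pack(2, 1) });
        System.out.println(linksFrom(load(file), 1));
        Files.delete(file);
    }
}
```

A flat file like this gives up the transactional guarantees of a database table, which the commit message says are not particularly needed here. A real implementation would likely sort or index the pairs rather than scan them linearly.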

215 Commits