Commit Graph

100 Commits

Author SHA1 Message Date
Viktor Lofgren
cae1bad274 (*) Add download-sample action, refactor file storage
This changeset adds an action for downloading a set of sample data from downloads.marginalia.nu.

It also refactors out some leaky abstractions out of FileStorageService.  allocateTemporaryStorage has been renamed allocateStorage.  The storage was never temporary in any scenario...

It also doesn't take a storage base, as there was always only one valid option for this input.  The allocateStorage method finds the appropriate base itself.
2024-01-25 13:36:30 +01:00
Viktor Lofgren
805afad4fe (control) New GUI for exporting crawl data samples
Not going to win any beauty pageants, but this is pretty peripheral functionality.
2024-01-23 17:08:21 +01:00
Viktor Lofgren
6271d5d544 (mq) Add relation tracking between MQ messages for easier tracking and debugging.
The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID.  This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers.

The existing RELATED_ID field has too many semantics associated with them,
among other things the FSM code uses them this field in tracking state changes.

The change set also improves the consistency of inbox names.  The IndexClient was buggy and populated its outbox with a UUID.  This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.
2024-01-18 15:08:27 +01:00
Viktor Lofgren
19e781b104 (control) Add basic input validation to node actions
Will present a simple error message when required fields aren't populated, instead of a cryptic HTTP status error.
2024-01-18 11:52:49 +01:00
Viktor Lofgren
5a62b3058f (query-api) Make the search set identifier a string value in the API
This will free the core marginalia search engine to use arbitrary search set definitions, while the app can use its hardcoded defaults.
2024-01-16 10:55:24 +01:00
Viktor Lofgren
c41e68aaab (control) New export actions for RSS/Atom feeds and term frequency data
This commit also refactors the executor a bit, and introduces a new converter-feature called data-extractors for this class of jobs.
2024-01-15 14:54:26 +01:00
Viktor Lofgren
5134044530 (assistant) Make assistant client more robust to the service going down
This is especially important for the non-essential functions, like website similarities...
2024-01-13 18:29:30 +01:00
Viktor Lofgren
7c6e18f7a7 (*) Overhaul settings and properties
Use a system.properties file to configure the system.  This is loaded statically by MainClass or ProcessMainClass.  Update the property names to be more consistent, and update the documentations to reflect the changes.
2024-01-13 17:12:18 +01:00
Viktor Lofgren
8dea7217a6 (control) UX fixes, node GUI doesn't break when an executor service goes offline. 2024-01-13 12:17:30 +01:00
Viktor Lofgren
98c0972619 (control) Add a summary table for Actors in the Node overview 2024-01-12 16:32:15 +01:00
Viktor Lofgren
264e2db539 (control) UX-improvements for control service
This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views.  It has many small tweaks to make the work flow better.

It also adds a new /uploads directory in each index node, from which sideloaded data can be selected.  This is a bit of a breaking change, as this directory needs to exist in each index node.
2024-01-12 12:33:05 +01:00
Viktor Lofgren
f44222ce53 (control) Add a 'cancel' button to the process list
This is a very nice QoL improvement, since it means you don't have to dig in the Actors view to terminate processes.
2024-01-10 15:02:42 +01:00
Viktor Lofgren
f310ad8d98 (control) Actor terminations work better
Improves jank in the abort actor action, which would sometimes cause actors to hang or restart.
2024-01-10 14:18:49 +01:00
Viktor Lofgren
d56b394bcc (control) GUI for loading external WARC files 2024-01-10 12:13:30 +01:00
Viktor Lofgren
edc1acbb7e (*) Replace EC_DOMAIN_LINK table with files and in-memory caching
The EC_DOMAIN_LINK MariaDB table stores links between domains.  This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB).  This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need.

This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains.  This file is loaded in memory in each node, and can be queried via the Query Service.

A migration step is needed before this file is created in each node.   Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file.

The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
2024-01-08 15:53:13 +01:00
Viktor Lofgren
4763077b76 (search/index) Add a new keyword "count"
This is for filtering results on how many times the term appears on the domain.  The intent is to be beneficial in creating e.g. a domain search feature.   It's also very helpful when tracking down spammy domains.
2023-12-25 20:38:29 +01:00
Viktor
5bd3934d22
Merge pull request #64 from dreimolo/macos_AS_fix
Macos apple silicon fix, and slight improvements to sample downloader
2023-12-18 18:29:14 +01:00
Viktor Lofgren
a742503508 (search) Add view for showing mutual links between two websites 2023-12-17 17:50:44 +01:00
Viktor Lofgren
c92f1b8df8 (geo-ip) Revert removal of ip2location logic
We do both ip2location and ASN data.

The change also adds some keywords based on autonomous system information, on a somewhat experimental basis.  It would be neat to be able to e.g. exclude cloud services or just e.g. cloudflare from the search results.
2023-12-17 15:03:00 +01:00
Viktor Lofgren
d7bd540683 (*) Replace the ip2location IP geolocation data with ASN information from apnic.net.
Doesn't really make sense to use ip2location as a middle man for information that is already freely available...
2023-12-16 21:55:04 +01:00
Viktor Lofgren
117ddd17d7 (assistant) Fix bugs in IP flag emoji generation 2023-12-16 17:07:17 +01:00
dreimolo
c0cc05177f corrects protobuf.plugins.grpc 2023-12-16 14:24:41 +01:00
dreimolo
0b34d43804 workaround for failing mac on apple silicon deps 2023-12-16 14:22:11 +01:00
Viktor Lofgren
45987a1d98 Merge branch 'master' into warc 2023-12-11 14:32:35 +01:00
Viktor Lofgren
91dd45cf64 (search) IP and IP geolocation in site info view
This commit also fixes a bug in the loader where the IP field wouldn't always populate as intended, and refactors the DomainInformationService to use significantly fewer SQL queries.
2023-12-09 20:06:55 +01:00
Viktor Lofgren
e3ebb0c5bb (*) Rename the search filter 'RETRO' into 'POPULAR'
This will make the terminology more consistent between the GUI and the code.  The rankings yaml still uses 'retro' though, for to retain compatibility.
2023-12-09 20:06:54 +01:00
Viktor Lofgren
8ef34883a8 (search) Move site information out of the search service and into assistant.
This reduces the impact of restarting the search service, as the site information takes a few minutes to load during which it's not available.  It also permits exposing this information via API in the future if there is interest in this.

The assistant service was also modified to do a late load of the suggestions trie, as this is a major contributor to its start-up time.

Finally, some changes were made to the client library, a new get() method was added that takes a TypeToken to allow deserialization of generics such as List<Foo>, and the scheduler was also modified to use virtual threads.
2023-12-09 16:30:06 +01:00
Viktor Lofgren
eccb12b366 (control) Fix spurious state detection in control-side actors
A race condition was found where precession actors would sometimes skip a step, because when invoking ExecutorRemoteActor.getState(), it would get the last 'OK' actor state from a previous run of the actor!

To avoid this, the trigger method was changed from returning a boolean to the message ID, negative if an error occurred, to be passed to getState to select only messages that pertain to the present or future runs.
2023-12-09 12:50:05 +01:00
Viktor Lofgren
cc813a5624 (convert) Add basic support for Warc file sideloading
This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.
2023-12-06 18:43:55 +01:00
Viktor Lofgren
c7934342a6 (control) Automatic recrawl 2023-12-02 17:06:24 +01:00
Viktor Lofgren
4155fbe94c (control) Reprocess-all actor 2023-11-28 17:58:48 +01:00
Viktor Lofgren
347fe6b7be (control) Reindex-all actor 2023-11-28 16:41:09 +01:00
Viktor Lofgren
1dafa0c74d (mqapi/control) Repair repartition endpoint, deprecate notify endpoints.
The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId.  In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.
2023-11-27 16:01:12 +01:00
Viktor Lofgren
858357a246 (metrics) Get prometheus up out of disrepair
* Fix bad labels
* Add nodeId where appropriate
* Hopefully fix histogram buckets for index query times
2023-11-08 14:01:28 +01:00
Viktor Lofgren
c77a5b7cb6 (control) GUI for atags export 2023-10-31 17:55:47 +01:00
Viktor Lofgren
6bac3c75cb (api) API documentation 2023-10-29 16:13:21 +01:00
Viktor Lofgren
d7686b665e Refactoring
* Encyclopedia sideloader; permit providing base URL.
* Storage base shows node id in GUI
* ProcessLivenessMonitorActor restarts automatically
* Clean-up of outbox code
2023-10-25 18:51:02 +02:00
Viktor Lofgren
ebd365a128 Fix exception 2023-10-24 15:04:12 +02:00
Viktor Lofgren
c2b28c0f8d (api) Trial streaming API 2023-10-24 13:26:46 +02:00
Viktor Lofgren
a860f8f1a8 (index/qs) GRPC API for better query peformance 2023-10-24 11:38:07 +02:00
Viktor Lofgren
731afcb864 (qs) Parallel execution 2023-10-23 12:06:03 +02:00
Viktor Lofgren
efb73ff4e7 (qs) Don't blow up if an index node isn't responsive 2023-10-23 11:53:18 +02:00
Viktor Lofgren
d76d926c38 (control/executor) Add new configuration options for node
It's now possible to configure prod instance to not retain processed data.
2023-10-20 14:05:19 +02:00
Viktor Lofgren
81dd3809e9 (*) WIP Add node affinity to EC_DOMAIN
Very messy commit due to fractalline yak shaving
2023-10-19 17:48:34 +02:00
Viktor Lofgren
84fea0fd05 (node) Nodes auto-start their monitor actors. 2023-10-16 15:33:22 +02:00
Viktor Lofgren
9a38a455c9 (control/exec) File listings in control GUI 2023-10-15 19:15:44 +02:00
Viktor Lofgren
16e0738731 (*) Get multi-node routing working. 2023-10-15 18:38:30 +02:00
Viktor Lofgren
4baf9527d7 (*) WIP Control GUI redesign, executor-service, multi-node mq
This turned out to be very difficult to do in small isolated steps.

* Design overhaul of the control gui using bootstrap
* Move the actors out of control-service into to a new executor-service, that can be run on multiple nodes
* Add node-affinity to message queue
2023-10-14 12:08:43 +02:00
Viktor Lofgren
61288c5e68 (service, client) First steps towards multiple nodedness 2023-10-09 22:13:27 +02:00
Viktor Lofgren
3889c4bdd9 (refactor) Remove features-search and update documentation 2023-10-09 15:12:30 +02:00