Commit Graph

65 Commits

Author SHA1 Message Date
Viktor Lofgren
dcc5cfb7c0 (index-journal) Improve documentation and code quality 2024-02-15 10:51:49 +01:00
Viktor Lofgren
8021bd0aae (control) Sort upload listing results
Improve the UX of the sideload GUI by sorting the results in a sensible fashion, first by whether it's a directory, then by its filename.

The change also changes the timestamp rendering to a more human-readable format than full ISO-8601.
2024-02-15 09:13:40 +01:00
Viktor Lofgren
fab36d6e63 (converter) Loader for reddit data
Adds experimental sideloading support for pusshift.io style reddit data.  This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.

Since the median post quality on reddit is not very good, he sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes.  Empirically this appears to mostly return good matches, even if it probably could index more.

Tests were written for this, but all require local reddit data which can't be distributed with the source code.  If these can not be found, the tests will shortcircuit as OK.  They're mostly there for debugging, and it's fine if they don't always run.

The change also refactors the sideloading a bit since it was a bit messy.
2024-02-14 17:35:44 +01:00
Viktor
e8de468b0b
Make executor API talk GRPC (#75)
* (executor-api) Make executor API talk GRPC

The executor's REST API was very fragile and annoying to work with, lacking even basic type safety.  Migrate to use GRPC instead.  GRPC is a bit of a pain with how verbose it is, but that is probably a lesser evil.  This is a fairly straightforward change, but it's also large so a solid round of testing is needed...

The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients.

ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name().

The boilerplate needed for grpc was also extracted into a common gradle file for inclusion into the appropriate build.gradle-files.
2024-02-08 13:01:12 +01:00
Viktor Lofgren
467ba5be20 (index-construction) Split repartition into two actions
This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets.  Since only the first is necessary before the index construction, the rest can be delayed until after...

To avoid issues in handling the shotgun blast of MqNotifications, Service was switched over to use a synchronous message queue instead of an asynchronous one.

The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader.

Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data.

Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead.

To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.
2024-02-06 17:20:07 +01:00
Viktor Lofgren
92e119cab3 (doc) Update docs 2024-02-06 12:43:42 +01:00
Viktor Lofgren
2161799cc3 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:18:00 +01:00
Viktor Lofgren
c88f132057 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:10:03 +01:00
Viktor Lofgren
c6313a5906 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:06:36 +01:00
Viktor Lofgren
cae1bad274 (*) Add download-sample action, refactor file storage
This changeset adds an action for downloading a set of sample data from downloads.marginalia.nu.

It also refactors out some leaky abstractions out of FileStorageService.  allocateTemporaryStorage has been renamed allocateStorage.  The storage was never temporary in any scenario...

It also doesn't take a storage base, as there was always only one valid option for this input.  The allocateStorage method finds the appropriate base itself.
2024-01-25 13:36:30 +01:00
Viktor Lofgren
805afad4fe (control) New GUI for exporting crawl data samples
Not going to win any beauty pageants, but this is pretty peripheral functionality.
2024-01-23 17:08:21 +01:00
Viktor Lofgren
ee7792596d (*) Fix broken test
Probably shouldn't have tests depending on external data like this...
2024-01-23 12:03:47 +01:00
Viktor Lofgren
dd26819d66 (actor) Try to rare data race where a finished job is considered dead. 2024-01-22 21:22:38 +01:00
Viktor Lofgren
40c9d2050f (control) Fully automatic conversion
Removed the need to have to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine.

Removed the tool itself.

This stirred up some issues with the dependencies, that were due to both third-party:ing xz and importing it as a dependency.  This has been fixed, and :third-party:xz was removed.
2024-01-22 13:03:24 +01:00
Viktor Lofgren
91c7960800 (crawler) Extract additional configuration properties
This commit extracts several previously hardcoded configuration properties, and makes then available through system.properties.

The documentation is updated to reflect the change.

Dead code was also removed in the process. CrawlSpecGenerator is left feeling a bit over-engineered still, since it's built for a more general case, where all other implementations but the current one are removed, but we'll leave it like this for now as it's fairly readable still.
2024-01-20 10:36:04 +01:00
Viktor Lofgren
27ffb8fa8a (converter) Integrate zim->db conversion into automatic encyclopedia processing workflow
Previously, in order to load encyclopedia data into the search engine, it was necessary to use the encyclopedia.marginalia.nu converter to first create a .db-file.  This isn't very ergonomic, so parts of that code-base was lifted in as a 3rd party library, and conversion from .zim to .db is now done automatically.

The output file name is based on the original filename, plus a crc32 hash and a .db-ending, to ensure we can recycle the data on repeat loads.
2024-01-19 13:59:03 +01:00
Viktor Lofgren
6271d5d544 (mq) Add relation tracking between MQ messages for easier tracking and debugging.
The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID.  This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers.

The existing RELATED_ID field has too many semantics associated with them,
among other things the FSM code uses them this field in tracking state changes.

The change set also improves the consistency of inbox names.  The IndexClient was buggy and populated its outbox with a UUID.  This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.
2024-01-18 15:08:27 +01:00
Viktor Lofgren
c41e68aaab (control) New export actions for RSS/Atom feeds and term frequency data
This commit also refactors the executor a bit, and introduces a new converter-feature called data-extractors for this class of jobs.
2024-01-15 14:54:26 +01:00
Viktor Lofgren
ecd9c35233 (control) Clean up the event log
* Generate fewer uninteresting event messages.
* Display fewer irrelevant fields in the overview table.
2024-01-13 13:28:02 +01:00
Viktor Lofgren
83776a8dce (control) Wean the ExportDataActor off EC_DOMAIN_LINK
The EC_DOMAIN_LINK table is deprecated and slated for removal, use QueryClient.getAllDomainLinks() instead.

The ExportDataActor now uses the QueryClient appropriately.  The CSV format was also changed to quote the values, to prevent e.g. Excel from interpreting the comma as a decimal separator when previewing the file.

Finally the form for triggering an export was overhauled.
2024-01-12 17:09:11 +01:00
Viktor Lofgren
708a741960 (test) Clean up test usage of migrations
Several tests were manually running migrations in a large copy-paste blob of code.  This makes the test less useful as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing.

A new test helper library is introduced with a TestMigrationLoader that can both run Flyway migrations, or load specific migrations in the cases a specific set of migrations need to be loaded.   Existing tests are migrated to use the new code.
2024-01-12 15:55:50 +01:00
Viktor Lofgren
264e2db539 (control) UX-improvements for control service
This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views.  It has many small tweaks to make the work flow better.

It also adds a new /uploads directory in each index node, from which sideloaded data can be selected.  This is a bit of a breaking change, as this directory needs to exist in each index node.
2024-01-12 12:33:05 +01:00
Viktor Lofgren
734996002c (*) install script for deploying Marginalia outside the codebase
The changeset also makes the control service responsible for flyway migrations.  This helps reduce the number of places the database configuration needs to be spread out.  These automatic migrations can be disabled with -DdisableFlyway=true.

The commit also adds curl to the docker container, to enable docker health checks and interdependencies.
2024-01-11 12:40:03 +01:00
Viktor Lofgren
f44222ce53 (control) Add a 'cancel' button to the process list
This is a very nice QoL improvement, since it means you don't have to dig in the Actors view to terminate processes.
2024-01-10 15:02:42 +01:00
Viktor Lofgren
f310ad8d98 (control) Actor terminations work better
Improves jank in the abort actor action, which would sometimes cause actors to hang or restart.
2024-01-10 14:18:49 +01:00
Viktor Lofgren
d56b394bcc (control) GUI for loading external WARC files 2024-01-10 12:13:30 +01:00
Viktor Lofgren
e49ba887e9 (crawl data) Add compatibility layer for old crawl data format
The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records.  This is true for the new parquet format, but not for the old zstd/gson format.

To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records; and presenting them in the new order.

This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be.

Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.
2024-01-08 19:16:49 +01:00
Viktor Lofgren
edc1acbb7e (*) Replace EC_DOMAIN_LINK table with files and in-memory caching
The EC_DOMAIN_LINK MariaDB table stores links between domains.  This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB).  This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need.

This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains.  This file is loaded in memory in each node, and can be queried via the Query Service.

A migration step is needed before this file is created in each node.   Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file.

The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
2024-01-08 15:53:13 +01:00
Viktor Lofgren
7f3f3f577c (backup) Add task heartbeats to the backup service 2024-01-01 15:20:57 +01:00
Viktor Lofgren
85f906ea53 (executor) Fix removal of stale process heartbeats 2023-12-23 13:49:24 +01:00
Viktor Lofgren
0454447e41 (executor) Implement process removal for long-absent heartbeats
Added functionality to remove processes from listing that have not checked in for over a day. A 'removeProcessHeartbeat' function was created to delete the respective entry from the PROCESS_HEARTBEAT table in case heartbeats are absent for more than one day.
2023-12-23 13:18:21 +01:00
Viktor Lofgren
bf44805e69 (*) Rename EdgeDomain$domain into topDomain
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.

Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
2023-12-17 14:00:07 +01:00
Viktor Lofgren
440e097d78 (crawler) WIP integration of WARC files into the crawler and converter process.
This commit is in a pretty rough state.  It refactors the crawler fairly significantly to offer better separation of concerns.  It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data.  This works, -ish.

There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.

A problem is that the WARC files are a bit too large.  It will probably be likely to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.
2023-12-13 15:33:42 +01:00
Viktor Lofgren
cc813a5624 (convert) Add basic support for Warc file sideloading
This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.
2023-12-06 18:43:55 +01:00
Viktor Lofgren
c7934342a6 (control) Automatic recrawl 2023-12-02 17:06:24 +01:00
Viktor Lofgren
f5c324c06b (minor) Fix broken test 2023-12-01 17:44:39 +01:00
Viktor Lofgren
67a1e1c874 (control) GUI for triggering control-side actors 2023-11-29 15:31:14 +01:00
Viktor Lofgren
4155fbe94c (control) Reprocess-all actor 2023-11-28 17:58:48 +01:00
Viktor Lofgren
1dafa0c74d (mqapi/control) Repair repartition endpoint, deprecate notify endpoints.
The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId.  In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.
2023-11-27 16:01:12 +01:00
Viktor Lofgren
3047e2dd7c (screenshot-capture-tool) Make screenshot-capture-tool cooperate with docker 2023-11-01 16:38:55 +01:00
Viktor Lofgren
a8b9d21f2d (executor) Refine atag export logic
* Remove obviously uninteresting tags
* Omit URL schema for more sensible sorting
* Change the column order to put the source domain last
2023-11-01 13:23:14 +01:00
Viktor Lofgren
c77a5b7cb6 (control) GUI for atags export 2023-10-31 17:55:47 +01:00
Viktor Lofgren
23f2068e33 (executor) Actor for exporting anchor tag data from crawl data 2023-10-31 17:32:34 +01:00
Viktor Lofgren
659743b39c (executor) Export Data actor allocates its own storage 2023-10-31 17:04:07 +01:00
Viktor Lofgren
2871a326e6 (ctrl/exe) Clean up UX and code 2023-10-29 14:09:39 +01:00
Viktor Lofgren
abb42f0f36 (crawler) Fix bug in SQL statement
Arguments were in the wrong order in inserting fetching sites submitted to be crawled
2023-10-29 13:19:17 +01:00
Viktor Lofgren
88f49834fd (docs) Update documentation 2023-10-27 12:45:39 +02:00
Viktor Lofgren
abbadc92a0 (exdecutor) Prevent TriggerAdjacencyCalculationActor from showing up in the actions tab when it isn't running 2023-10-25 21:25:07 +02:00
Viktor Lofgren
d7686b665e Refactoring
* Encyclopedia sideloader; permit providing base URL.
* Storage base shows node id in GUI
* ProcessLivenessMonitorActor restarts automatically
* Clean-up of outbox code
2023-10-25 18:51:02 +02:00
Viktor Lofgren
84cdac83d6 (control) Move message queue monitor to control 2023-10-24 16:44:28 +02:00