Commit Graph

331 Commits

Author SHA1 Message Date
Viktor Lofgren
0caef1b307 (warc) Toggle for saving WARC data
Add a toggle for saving the WARC data generated by the search engine's crawler.  Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest.

The warc files are concatenated into larger archives, up to about 1 GB each.
An index is also created containing filenames, domain names, offsets and sizes
to help navigate these larger archives.

The warc data is saved in a directory warc/ under the crawl data storage.
2024-01-12 13:45:14 +01:00
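The archive layout described in 0caef1b307 (small WARC files concatenated into size-capped archives, plus an index of filenames, domain names, offsets and sizes) might be sketched roughly as follows. This is an in-memory illustration only: the class name, the `archive-N.warc` naming scheme and the method names are all assumptions, not the crawler's actual code.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the crawler's actual archiving code.
class WarcArchiveSketch {
    // One index row per appended WARC file: which archive it is in, and where.
    record IndexEntry(String archiveName, String domain, long offset, long size) {}

    private final long maxArchiveSize; // the commit mentions roughly 1 GB
    private final Map<String, ByteArrayOutputStream> archives = new LinkedHashMap<>();
    private final List<IndexEntry> index = new ArrayList<>();
    private String currentName = null;

    WarcArchiveSketch(long maxArchiveSize) {
        this.maxArchiveSize = maxArchiveSize;
    }

    // Append one domain's WARC bytes, rolling over to a new archive when the
    // current (non-empty) one would grow past the size cap.
    IndexEntry append(String domain, byte[] warcData) {
        ByteArrayOutputStream current = currentName == null ? null : archives.get(currentName);
        if (current == null
                || (current.size() > 0 && current.size() + warcData.length > maxArchiveSize)) {
            currentName = "archive-" + archives.size() + ".warc"; // made-up naming scheme
            current = new ByteArrayOutputStream();
            archives.put(currentName, current);
        }
        IndexEntry entry = new IndexEntry(currentName, domain, current.size(), warcData.length);
        current.writeBytes(warcData);
        index.add(entry);
        return entry;
    }

    // Use an index entry to slice one domain's data back out of its archive.
    byte[] read(IndexEntry entry) {
        byte[] all = archives.get(entry.archiveName()).toByteArray();
        int from = (int) entry.offset();
        return Arrays.copyOfRange(all, from, from + (int) entry.size());
    }

    List<IndexEntry> index() {
        return List.copyOf(index);
    }

    public static void main(String[] args) {
        WarcArchiveSketch sketch = new WarcArchiveSketch(10);
        sketch.append("a.example", new byte[6]);
        sketch.append("b.example", new byte[6]);
        sketch.index().forEach(System.out::println);
    }
}
```

The point of the index is that a single domain's data can be located without scanning a whole archive; a real implementation would write to files under warc/ rather than to memory.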
Viktor Lofgren
264e2db539 (control) UX-improvements for control service
This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views.  It has many small tweaks to make the workflow better.

It also adds a new /uploads directory in each index node, from which sideloaded data can be selected.  This is a bit of a breaking change, as this directory needs to exist in each index node.
2024-01-12 12:33:05 +01:00
Viktor Lofgren
734996002c (*) install script for deploying Marginalia outside the codebase
The changeset also makes the control service responsible for flyway migrations.  This helps reduce the number of places the database configuration needs to be spread out.  These automatic migrations can be disabled with -DdisableFlyway=true.

The commit also adds curl to the docker container, to enable docker health checks and interdependencies.
2024-01-11 12:40:03 +01:00
Viktor Lofgren
a0f28a7f9b (*) Add a barebones configuration
This adds a docker-compose file 'docker-compose-barebones.yml' which will only start the minimal number of services needed to run a whitelabel Marginalia Search-style search engine, with none of the surrounding frills.

The change also adds a minimal search GUI to the query service, which is also available with JSON results if the appropriate Accept header is provided.
2024-01-10 20:23:51 +01:00
Viktor Lofgren
f44222ce53 (control) Add a 'cancel' button to the process list
This is a very nice QoL improvement, since it means you don't have to dig in the Actors view to terminate processes.
2024-01-10 15:02:42 +01:00
Viktor Lofgren
f310ad8d98 (control) Actor terminations work better
Improves jank in the abort actor action, which would sometimes cause actors to hang or restart.
2024-01-10 14:18:49 +01:00
Viktor Lofgren
d56b394bcc (control) GUI for loading external WARC files 2024-01-10 12:13:30 +01:00
Viktor Lofgren
fbad625126 (linkdb) Add delegating implementation of DomainLinkDb
This facilitates switching between SQL and File-backed implementations on the fly while migrating from one to the other.
2024-01-08 19:56:33 +01:00
Viktor Lofgren
e49ba887e9 (crawl data) Add compatibility layer for old crawl data format
The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records.  This is true for the new parquet format, but not for the old zstd/gson format.

To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records, and presents them in the new order.

This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be.

Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.
2024-01-08 19:16:49 +01:00
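The reordering described in e49ba887e9 can be illustrated with a small sketch. Everything here is an assumption made for illustration: the record types, the `FAST`/`ORDERED` enum values and the `read` method are invented, and only the idea (scan for the domain record first, then replay the documents, unless the caller opts out) comes from the commit message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the record types and enum values are made up.
class CompatReaderSketch {
    interface CrawlRecord {}
    record DomainRecord(String domain) implements CrawlRecord {}
    record DocumentRecord(String url) implements CrawlRecord {}

    // Lets the caller choose between speed and guaranteed ordering.
    enum CompatibilityLevel { FAST, ORDERED }

    static List<CrawlRecord> read(List<CrawlRecord> stored, CompatibilityLevel level) {
        if (level == CompatibilityLevel.FAST) {
            return stored; // single pass, records in whatever order they were stored
        }
        List<CrawlRecord> ordered = new ArrayList<>(stored.size());
        // First pass: scan for the domain record and emit it first.
        for (CrawlRecord r : stored) {
            if (r instanceof DomainRecord) {
                ordered.add(r);
                break;
            }
        }
        // Second pass: emit the document records in their stored order.
        for (CrawlRecord r : stored) {
            if (r instanceof DocumentRecord) {
                ordered.add(r);
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<CrawlRecord> oldFormat = List.of(
                new DocumentRecord("https://example.com/a"),
                new DomainRecord("example.com"),
                new DocumentRecord("https://example.com/b"));
        System.out.println(read(oldFormat, CompatibilityLevel.ORDERED));
    }
}
```

The extra pass is the cost the commit message refers to, which is why the ordering is opt-in in this sketch.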
Viktor Lofgren
edc1acbb7e (*) Replace EC_DOMAIN_LINK table with files and in-memory caching
The EC_DOMAIN_LINK MariaDB table stores links between domains.  This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB).  This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need.

This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains.  This file is loaded in memory in each node, and can be queried via the Query Service.

A migration step is needed before this file is created in each node.   Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file.

The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
2024-01-08 15:53:13 +01:00
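As a rough illustration of the file format described in edc1acbb7e, domain links can be stored as consecutive pairs of 32-bit integers and loaded wholesale into memory. The sketch below assumes big-endian encoding and a (source, destination) ordering within each pair; neither detail is stated in the commit, and the class and method names are invented.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the project's DomainLinkDb implementation.
class DomainLinksFileSketch {
    // One link between two domains, each identified by a 32-bit id.
    record Link(int source, int dest) {}

    // Encode the links as consecutive int pairs, 8 bytes per link
    // (big-endian, which is ByteBuffer's default byte order).
    static byte[] encode(List<Link> links) {
        ByteBuffer buf = ByteBuffer.allocate(links.size() * 2 * Integer.BYTES);
        for (Link link : links) {
            buf.putInt(link.source());
            buf.putInt(link.dest());
        }
        return buf.array();
    }

    // Decode the whole byte array into memory; trailing partial pairs are ignored.
    static List<Link> decode(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        List<Link> links = new ArrayList<>();
        while (buf.remaining() >= 2 * Integer.BYTES) {
            int source = buf.getInt();
            int dest = buf.getInt();
            links.add(new Link(source, dest));
        }
        return links;
    }

    // Linear scan for the destinations of one source domain; a real
    // implementation would more likely use a sorted array or a hash map.
    static List<Integer> destinationsOf(List<Link> links, int source) {
        List<Integer> result = new ArrayList<>();
        for (Link link : links) {
            if (link.source() == source) {
                result.add(link.dest());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = encode(List.of(new Link(1, 2), new Link(1, 3), new Link(2, 3)));
        System.out.println(destinationsOf(decode(data), 1));
    }
}
```

Writing the result of `encode` to disk with `Files.write` and reading it back with `Files.readAllBytes` would complete the round trip for small data sets; larger files would need a streaming or memory-mapped approach.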
Viktor Lofgren
4078708aea (qs) Better metrics for QS 2024-01-04 13:27:14 +01:00
Viktor Lofgren
7bbaedef97 (search) Add query strategy requiring link 2024-01-03 16:23:00 +01:00
Viktor Lofgren
3caa4eed75 Merge branch 'master' into converter-optimizations 2024-01-02 17:13:25 +01:00
Viktor Lofgren
c70f508ae8 (prometheus) Saner histogram buckets 2024-01-02 17:13:14 +01:00
Viktor Lofgren
87351e89ca Merge branch 'master' into converter-optimizations 2024-01-02 15:17:02 +01:00
Viktor Lofgren
31232e49fb (prometheus) Add instrumentation to the search, qs and index services. 2024-01-02 15:02:29 +01:00
Viktor Lofgren
d2418521a7 (index) Further ranking adjustments 2024-01-02 12:35:59 +01:00
Viktor Lofgren
9330b5b1d9 (index) Adjust rank weightings to fix bad wikipedia results
There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero.  This meant that "bad" results always ranked the same.  The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization.

Some of the weights were also re-adjusted based on what appears to produce better results.  Needs evaluation.
2024-01-02 12:35:44 +01:00
Viktor Lofgren
310a880fa8 (index) Further ranking adjustments 2024-01-02 12:24:52 +01:00
Viktor Lofgren
8f522470ed (index) Adjust rank weightings to fix bad wikipedia results
There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero.  This meant that "bad" results always ranked the same.  The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization.

Some of the weights were also re-adjusted based on what appears to produce better results.  Needs evaluation.
2024-01-01 17:16:29 +01:00
Viktor Lofgren
7f3f3f577c (backup) Add task heartbeats to the backup service 2024-01-01 15:20:57 +01:00
Viktor Lofgren
9707366348 (test) Fix a few slow tests that broke due to domainCount 2023-12-27 13:29:59 +01:00
Viktor Lofgren
4763077b76 (search/index) Add a new keyword "count"
This is for filtering results on how many times the term appears on the domain.  The intent is to be beneficial in creating e.g. a domain search feature.   It's also very helpful when tracking down spammy domains.
2023-12-25 20:38:29 +01:00
Viktor Lofgren
85f906ea53 (executor) Fix removal of stale process heartbeats 2023-12-23 13:49:24 +01:00
Viktor Lofgren
0454447e41 (executor) Implement process removal for long-absent heartbeats
Added functionality to remove processes from listing that have not checked in for over a day. A 'removeProcessHeartbeat' function was created to delete the respective entry from the PROCESS_HEARTBEAT table in case heartbeats are absent for more than one day.
2023-12-23 13:18:21 +01:00
Viktor Lofgren
7b40c0bbee (assistant) Clean up similar websites' results 2023-12-22 14:07:01 +01:00
Viktor Lofgren
c92f1b8df8 (geo-ip) Revert removal of ip2location logic
We do both ip2location and ASN data.

The change also adds some keywords based on autonomous system information, on a somewhat experimental basis.  It would be neat to be able to e.g. exclude cloud services or just e.g. cloudflare from the search results.
2023-12-17 15:03:00 +01:00
Viktor Lofgren
bde68ba48b Merge branch 'master' into asn-info 2023-12-17 14:00:23 +01:00
Viktor Lofgren
bf44805e69 (*) Rename EdgeDomain$domain into topDomain
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.

Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
2023-12-17 14:00:07 +01:00
Viktor Lofgren
d7bd540683 (*) Replace the ip2location IP geolocation data with ASN information from apnic.net.
Doesn't really make sense to use ip2location as a middle man for information that is already freely available...
2023-12-16 21:55:04 +01:00
Viktor Lofgren
722b56c8ca (index) Fix rare bug in the index-switching logic
This is caused by a resource contention with the query code.  The proper way to fix this is to use some form of synchronization, but that will slow the code down.  So we just hammer it a few times and let the GC deal with the problem if it fails.  Not optimal, but fast.
2023-12-16 18:57:35 +01:00
Viktor Lofgren
f3f12058dc (assistant) Fix logic error in filtering related domains 2023-12-16 18:45:53 +01:00
Viktor Lofgren
3da38d0483 (assistant) Fix logic error in filtering related domains 2023-12-16 18:44:25 +01:00
Viktor Lofgren
e13fa25e11 (assistant) Clean up the site info related domains view by filtering viable domains 2023-12-16 18:37:09 +01:00
Viktor Lofgren
34d4834ff6 (assistant) Clean up the site info related domains view by filtering viable domains 2023-12-16 18:27:24 +01:00
Viktor Lofgren
440e097d78 (crawler) WIP integration of WARC files into the crawler and converter process.
This commit is in a pretty rough state.  It refactors the crawler fairly significantly to offer better separation of concerns.  It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data.  This works, -ish.

There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.

A problem is that the WARC files are a bit too large.  It will probably be necessary to introduce a new format to store the crawl data long term, something like parquet, and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.
2023-12-13 15:33:42 +01:00
Viktor Lofgren
45987a1d98 Merge branch 'master' into warc 2023-12-11 14:32:35 +01:00
Viktor Lofgren
f655ec5a5c (*) Refactor GeoIP-related code
In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services.

The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions.

The converter is modified to query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server.

The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.
2023-12-10 17:30:43 +01:00
Viktor Lofgren
91dd45cf64 (search) IP and IP geolocation in site info view
This commit also fixes a bug in the loader where the IP field wouldn't always populate as intended, and refactors the DomainInformationService to use significantly fewer SQL queries.
2023-12-09 20:06:55 +01:00
Viktor Lofgren
e3ebb0c5bb (*) Rename the search filter 'RETRO' into 'POPULAR'
This will make the terminology more consistent between the GUI and the code.  The rankings yaml still uses 'retro' though, to retain compatibility.
2023-12-09 20:06:54 +01:00
Viktor Lofgren
8ef34883a8 (search) Move site information out of the search service and into assistant.
This reduces the impact of restarting the search service, as the site information takes a few minutes to load during which it's not available.  It also permits exposing this information via API in the future if there is interest in this.

The assistant service was also modified to do a late load of the suggestions trie, as this is a major contributor to its start-up time.

Finally, some changes were made to the client library: a new get() method was added that takes a TypeToken to allow deserialization of generics such as List<Foo>, and the scheduler was modified to use virtual threads.
2023-12-09 16:30:06 +01:00
Viktor Lofgren
eccb12b366 (control) Fix spurious state detection in control-side actors
A race condition was found where precession actors would sometimes skip a step, because when invoking ExecutorRemoteActor.getState(), it would get the last 'OK' actor state from a previous run of the actor!

To avoid this, the trigger method was changed from returning a boolean to the message ID, negative if an error occurred, to be passed to getState to select only messages that pertain to the present or future runs.
2023-12-09 12:50:05 +01:00
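The fix in eccb12b366 (returning a message ID from the trigger and using it to ignore states from earlier runs) can be sketched as below. This is a simplified, single-threaded stand-in: the class, the `report` helper and the `reachable` flag are invented for illustration and do not reflect ExecutorRemoteActor's real interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; not the real ExecutorRemoteActor.
class ActorStateSketch {
    // A state update with a monotonically increasing message id.
    record StateMessage(long id, String state) {}

    private final List<StateMessage> inbox = new ArrayList<>();
    private long nextId = 1;
    boolean reachable = true; // made-up switch to simulate a failed trigger

    // Stand-in for the remote actor posting a state update.
    void report(String state) {
        inbox.add(new StateMessage(nextId++, state));
    }

    // Returns the id of the trigger message, or a negative value on error,
    // instead of a bare boolean.
    long trigger() {
        if (!reachable) {
            return -1;
        }
        return nextId++;
    }

    // Latest state reported after the given trigger. States from earlier
    // runs have smaller ids and are never returned.
    Optional<String> getState(long triggerId) {
        for (int i = inbox.size() - 1; i >= 0; i--) {
            StateMessage message = inbox.get(i);
            if (message.id() > triggerId) {
                return Optional.of(message.state());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        ActorStateSketch actor = new ActorStateSketch();
        actor.report("OK");                        // left over from a previous run
        long id = actor.trigger();
        System.out.println(actor.getState(id));    // stale "OK" is not visible
        actor.report("OK");
        System.out.println(actor.getState(id));
    }
}
```

With a boolean return value the caller has no way to tell the stale 'OK' from a fresh one, which is the race the commit message describes.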
Viktor Lofgren
cc813a5624 (convert) Add basic support for Warc file sideloading
This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.
2023-12-06 18:43:55 +01:00
Viktor Lofgren
01621c6344 (renderer) Make helpers configurable on a by-service basis. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
c7934342a6 (control) Automatic recrawl 2023-12-02 17:06:24 +01:00
Viktor Lofgren
f5c324c06b (minor) Fix broken test 2023-12-01 17:44:39 +01:00
Viktor Lofgren
67a1e1c874 (control) GUI for triggering control-side actors 2023-11-29 15:31:14 +01:00
Viktor Lofgren
4155fbe94c (control) Reprocess-all actor 2023-11-28 17:58:48 +01:00
Viktor Lofgren
347fe6b7be (control) Reindex-all actor 2023-11-28 16:41:09 +01:00
Viktor Lofgren
ff3ceb981e (control) Button for removing a stale 'NEW' status
If a process is violently terminated, the associated file storage may get stuck in the ephemeral 'NEW' state, preventing future operations on the associated data.

To remedy this without having to dig through the database, a button was added to reset the state.  It's a band-aid, but the situation is rare enough that I think it's fine.
2023-11-28 15:18:24 +01:00
Viktor Lofgren
1dafa0c74d (mqapi/control) Repair repartition endpoint, deprecate notify endpoints.
The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId.  In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.
2023-11-27 16:01:12 +01:00
Viktor Lofgren
dd9406d0ac (control) Make storage type tabs consistent
This had fallen off in the Create New Specification view, it lacked Exports.
2023-11-17 11:26:45 +01:00
Viktor Lofgren
e9a01caa5c (index) Fix broken metrics 2023-11-11 12:53:47 +01:00
Viktor Lofgren
858357a246 (metrics) Get prometheus up out of disrepair
* Fix bad labels
* Add nodeId where appropriate
* Hopefully fix histogram buckets for index query times
2023-11-08 14:01:28 +01:00
Viktor Lofgren
0152004c42 Initial Commit Anchor Tags
* Added new (optional) model file in $WMSA_HOME/data/atags.parquet
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking
2023-11-04 14:24:17 +01:00
Viktor Lofgren
8e9698c9a0 (control/search) Add ability to suggest removing a site from random exploration
This is what most complaints have been about.
2023-11-02 15:29:49 +01:00
Viktor Lofgren
3047e2dd7c (screenshot-capture-tool) Make screenshot-capture-tool cooperate with docker 2023-11-01 16:38:55 +01:00
Viktor Lofgren
a8b9d21f2d (executor) Refine atag export logic
* Remove obviously uninteresting tags
* Omit URL schema for more sensible sorting
* Change the column order to put the source domain last
2023-11-01 13:23:14 +01:00
Viktor Lofgren
c77a5b7cb6 (control) GUI for atags export 2023-10-31 17:55:47 +01:00
Viktor Lofgren
23f2068e33 (executor) Actor for exporting anchor tag data from crawl data 2023-10-31 17:32:34 +01:00
Viktor Lofgren
ffadfb4149 (control) Use a partial template for the storage types tabs. 2023-10-31 17:12:14 +01:00
Viktor Lofgren
b7e38cfbae (control) Add exports view 2023-10-31 17:08:48 +01:00
Viktor Lofgren
659743b39c (executor) Export Data actor allocates its own storage 2023-10-31 17:04:07 +01:00
Viktor Lofgren
69758c5859 (control) Nicer redirects acknowledging actions 2023-10-31 16:26:29 +01:00
Viktor Lofgren
2871a326e6 (ctrl/exe) Clean up UX and code 2023-10-29 14:09:39 +01:00
Viktor Lofgren
abb42f0f36 (crawler) Fix bug in SQL statement
Arguments were in the wrong order in the statement inserting sites submitted to be crawled.
2023-10-29 13:19:17 +01:00
Viktor Lofgren
88f49834fd (docs) Update documentation 2023-10-27 12:45:39 +02:00
Viktor Lofgren
c7cb6664b4 (control) Indicate missing services with danger-color instead of having a distracting and constantly updating last-seen number 2023-10-26 18:05:22 +02:00
Viktor Lofgren
79adba9284 (index) Fix bug in dealing with quoted search terms 2023-10-26 16:28:23 +02:00
Viktor Lofgren
f613f4f2df (array) Fix spurious search results
This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss.

It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.
2023-10-26 15:27:02 +02:00
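For context on f613f4f2df: a common convention, used for example by `java.util.Arrays.binarySearch`, is to encode a miss as `-(insertionPoint) - 1`, so that every miss is strictly negative and can never be confused with a valid index. The sketch below shows that convention; whether the project's array library uses exactly this encoding is an assumption.

```java
// Hypothetical sketch; not the project's array library.
class BinarySearchSketch {
    // Returns the index of key in the sorted array, or -(insertionPoint) - 1
    // on a miss, so that a miss is always strictly negative.
    static int search(long[] sorted, long key) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1; // unsigned shift avoids int overflow
            long value = sorted[mid];
            if (value < key) {
                low = mid + 1;
            } else if (value > key) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -(low + 1);
    }

    // A result is a hit only if it is a valid (non-negative) index.
    static boolean isHit(int result) {
        return result >= 0;
    }

    public static void main(String[] args) {
        long[] data = {1, 3, 5, 7};
        System.out.println(search(data, 5));
        System.out.println(search(data, 4));
    }
}
```

A caller that only checks the sign of the result will treat any non-negative miss encoding as a hit, which is one way the spurious search results described above could arise.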
Viktor Lofgren
abbadc92a0 (executor) Prevent TriggerAdjacencyCalculationActor from showing up in the actions tab when it isn't running 2023-10-25 21:25:07 +02:00
Viktor Lofgren
97fcbdd6d9 (control) Move storage actions into the actions tab
* Also disable annoying CSS animations
2023-10-25 21:23:56 +02:00
Viktor Lofgren
d7686b665e Refactoring
* Encyclopedia sideloader; permit providing base URL.
* Storage base shows node id in GUI
* ProcessLivenessMonitorActor restarts automatically
* Clean-up of outbox code
2023-10-25 18:51:02 +02:00
Viktor Lofgren
84cdac83d6 (control) Move message queue monitor to control 2023-10-24 16:44:28 +02:00
Viktor Lofgren
313cc2965c (index-creation) Print whether full or prio is created
Previous state of saying reverse index for both was pretty confusing.
2023-10-24 16:23:10 +02:00
Viktor Lofgren
95f74c5ea7 (control) Filter out heartbeats that are stopped 2023-10-24 16:09:28 +02:00
Viktor Lofgren
0406e76889 (api) Remove logging cruft 2023-10-24 13:39:05 +02:00
Viktor Lofgren
c2b28c0f8d (api) Trial streaming API 2023-10-24 13:26:46 +02:00
Viktor Lofgren
a860f8f1a8 (index/qs) GRPC API for better query performance 2023-10-24 11:38:07 +02:00
Viktor Lofgren
2ed2f35a9b (actor) Rewrite of the actor prototype class using record pattern matching 2023-10-23 10:18:20 +02:00
Viktor Lofgren
119151cad3 (converter) Separation of concerns 2023-10-22 14:35:33 +02:00
Viktor Lofgren
758f9b5aa5 (converter) Get UUID pips out of the models
Rendering concerns shouldn't be in the models, it's poor separation of concerns and very difficult to follow.
2023-10-22 14:24:52 +02:00
Viktor Lofgren
eb4158df0b (control) Fix start/stop FSM endpoints 2023-10-22 14:03:09 +02:00
Viktor Lofgren
12fda1a36b (control) Temporarily re-writing the data balancer to get it to work in prod
Need to clean this up later.
2023-10-22 14:03:09 +02:00
Viktor Lofgren
e927f99777 (control) JSON serializes Map<Integer> to Map<Double> and Java gets confused 2023-10-21 16:24:20 +02:00
Viktor Lofgren
044bcf55bd (control) Fix SQL in rebalance actor 2023-10-21 16:13:37 +02:00
Viktor Lofgren
e475af9f49 (control) Initialize controlActorService 2023-10-21 16:06:53 +02:00
Viktor Lofgren
c6abcd91fa (control) Better use of FS states, fix bug with start/stop actors 2023-10-20 16:37:49 +02:00
Viktor Lofgren
d76d926c38 (control/executor) Add new configuration options for node
It's now possible to configure prod instance to not retain processed data.
2023-10-20 14:05:19 +02:00
Viktor Lofgren
2b3c167845 (controller) Additional configuration options for node 2023-10-20 13:13:36 +02:00
Viktor Lofgren
584bb3a648 (fs) interface cleanup 2023-10-20 12:24:18 +02:00
Viktor Lofgren
7b5ec6b98f (executor-service) Embed dist/ in executor-service's docker image 2023-10-19 17:48:34 +02:00
Viktor Lofgren
23526f6d1a (executor) Executor service now pulls DomainType list for CRAWL on "recrawl"
This is an automatic integration with the submit-site repo on github and also
crawl-queue.
2023-10-19 17:48:34 +02:00
Viktor Lofgren
809b3ee023 (control) Update GUI for crawl specs. They are now less important than they were before. 2023-10-19 17:48:34 +02:00
Viktor Lofgren
23f0c79fba (control) GUI for data sets/domain types. 2023-10-19 17:48:34 +02:00
Viktor Lofgren
81dd3809e9 (*) WIP Add node affinity to EC_DOMAIN
Very messy commit due to fractalline yak shaving
2023-10-19 17:48:34 +02:00
Viktor Lofgren
978550f809 (executor-service) Retire features-convert and move the corresponding packages into the executor service. 2023-10-16 15:43:46 +02:00
Viktor Lofgren
84fea0fd05 (node) Nodes auto-start their monitor actors. 2023-10-16 15:33:22 +02:00
Viktor Lofgren
2df3e0f881 (node) Nodes auto-configure on start-up instead of requiring manual configuration. 2023-10-16 14:46:35 +02:00
Viktor Lofgren
ede5d1f890 (actor) Give process spawners more easily recognizable names. 2023-10-16 14:19:00 +02:00