Commit Graph

1592 Commits

Author SHA1 Message Date
Viktor Lofgren
708a741960 (test) Clean up test usage of migrations
Several tests were manually running migrations in a large copy-paste blob of code. This makes the tests less useful, as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests. It's also difficult to reason about what the tests are doing.

A new test helper library is introduced with a TestMigrationLoader that can either run all Flyway migrations or load specific migrations in cases where only a particular set needs to be loaded. Existing tests are migrated to use the new code.
2024-01-12 15:55:50 +01:00
Viktor Lofgren
0caef1b307 (warc) Toggle for saving WARC data
Add a toggle for saving the WARC data generated by the search engine's crawler.  Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest.

The warc files are concatenated into larger archives, up to about 1 GB each.
An index is also created containing filenames, domain names, offsets and sizes
to help navigate these larger archives.

The warc data is saved in a directory warc/ under the crawl data storage.
2024-01-12 13:45:14 +01:00
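The archive-plus-index layout described in commit 0caef1b307 can be illustrated with a minimal sketch. It only models the bookkeeping (which archive a domain's data lands in, at what offset, and how large it is); it does not write any WARC data. The class names, the archive file naming scheme and the rollover rule are assumptions made for illustration, not taken from the actual change.

```java
import java.util.ArrayList;
import java.util.List;

/** One index row: which archive a domain's WARC data went into, and where. */
record WarcIndexEntry(String archiveFile, String domain, long offset, long size) {}

/** Hypothetical planner that assigns per-domain WARC blobs to size-capped archives. */
final class WarcArchivePlanner {
    private final long maxArchiveBytes;
    private final List<WarcIndexEntry> index = new ArrayList<>();
    private int archiveNumber = 0;
    private long currentOffset = 0;

    WarcArchivePlanner(long maxArchiveBytes) {
        this.maxArchiveBytes = maxArchiveBytes;
    }

    /** Records where a blob of the given size would be appended. */
    WarcIndexEntry append(String domain, long size) {
        // Roll over to a new archive if this blob would exceed the cap,
        // unless the current archive is still empty.
        if (currentOffset > 0 && currentOffset + size > maxArchiveBytes) {
            archiveNumber++;
            currentOffset = 0;
        }
        // The file naming scheme here is an assumption for illustration only.
        var entry = new WarcIndexEntry(
                "archive-" + archiveNumber + ".warc.gz", domain, currentOffset, size);
        index.add(entry);
        currentOffset += size;
        return entry;
    }

    /** Returns an immutable snapshot of the index built so far. */
    List<WarcIndexEntry> index() {
        return List.copyOf(index);
    }
}
```

With such an index, a reader can seek directly to a domain's offset in the right archive instead of scanning an archive of up to about 1 GB from the start.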
Viktor Lofgren
264e2db539 (control) UX-improvements for control service
This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views. It has many small tweaks to make the workflow better.

It also adds a new /uploads directory in each index node, from which sideloaded data can be selected.  This is a bit of a breaking change, as this directory needs to exist in each index node.
2024-01-12 12:33:05 +01:00
Viktor Lofgren
734996002c (*) install script for deploying Marginalia outside the codebase
The changeset also makes the control service responsible for flyway migrations.  This helps reduce the number of places the database configuration needs to be spread out.  These automatic migrations can be disabled with -DdisableFlyway=true.

The commit also adds curl to the docker container, to enable docker health checks and interdependencies.
2024-01-11 12:40:03 +01:00
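The -DdisableFlyway=true switch mentioned in commit 734996002c is an ordinary JVM system property. How the control service actually reads it is not shown here; the sketch below is one plausible way to check such a flag, and the class and method names are hypothetical.

```java
/** Hypothetical helper; the real control service may check the flag differently. */
final class MigrationToggle {
    /** Returns true unless the JVM was started with -DdisableFlyway=true. */
    static boolean migrationsEnabled() {
        // Boolean.getBoolean reads the named system property and parses it;
        // a missing property counts as false, so migrations default to enabled.
        return !Boolean.getBoolean("disableFlyway");
    }
}
```

A caller would guard the migration step with `if (MigrationToggle.migrationsEnabled()) { ... }`.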
Viktor Lofgren
205e5016e8 (docs) Document barebones config 2024-01-11 09:43:08 +01:00
Viktor Lofgren
a0f28a7f9b (*) Add a barebones configuration
This adds a docker-compose file 'docker-compose-barebones.yml' which will only start the minimal number of services needed to run a whitelabel Marginalia Search-style search engine, with none of the surrounding frills.

The change also adds a minimal search GUI to the query service, which is also available with JSON results if the appropriate Accept header is provided.
2024-01-10 20:23:51 +01:00
Viktor Lofgren
14b7680328 (loader) Update the size of the keyword files created by the loader
Previously these ended up being about 200 MB each, which is wastefully small. Increasing the size of these files makes the index construction faster.
2024-01-10 17:09:19 +01:00
Viktor Lofgren
f44222ce53 (control) Add a 'cancel' button to the process list
This is a very nice QoL improvement, since it means you don't have to dig in the Actors view to terminate processes.
2024-01-10 15:02:42 +01:00
Viktor Lofgren
f310ad8d98 (control) Actor terminations work better
Reduces jank in the abort actor action, which would sometimes cause actors to hang or restart.
2024-01-10 14:18:49 +01:00
Viktor Lofgren
d56b394bcc (control) GUI for loading external WARC files 2024-01-10 12:13:30 +01:00
Viktor Lofgren
55c9501e57 (search) Serve proper content type for static resources 2024-01-10 10:46:51 +01:00
Viktor
fad9575154
Merge pull request #69 from MarginaliaSearch/converter-optimizations
Refactor the DomainProcessor to take advantage of the new crawl data format
2024-01-10 09:46:54 +01:00
Viktor Lofgren
97e11e1ac9 (search) Fix acknowledgement page for domain complaints rendering as plain text
This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used.  This method is removed with this change.
2024-01-10 09:37:40 +01:00
Viktor Lofgren
e6a1e164b2 (search) Swap swipe direction for more consistent experience 2024-01-10 09:37:40 +01:00
Viktor Lofgren
e4f8f81e89 (search) Mobile UX improvements.
Swipe right to show filter menu.

Fix CSS bug that caused parts of the menu to not have a background.
2024-01-10 09:37:39 +01:00
Viktor Lofgren
176b3bb526 (search) Toggle for showing recent results
Actually persist the value of the toggle between searches too...
2024-01-10 09:37:39 +01:00
Viktor Lofgren
b07752fa9b (search) Toggle for showing recent results
Will by default show results from the last 2 years.  May need to tune this later.
2024-01-10 09:37:39 +01:00
Viktor Lofgren
68fd0efbde (search) Clean up search results template
Rendering is very slow. Let's see if this has a measurable effect on latency.
2024-01-10 09:37:39 +01:00
Viktor Lofgren
c80d3eb812 (search) Remove dead code 2024-01-10 09:37:35 +01:00
Viktor Lofgren
f9320995d6 (search) When clicking asn-links, show results from the unfiltered view... 2024-01-10 09:37:13 +01:00
Viktor Lofgren
f592c9f04d (search) Fix acknowledgement page for domain complaints rendering as plain text
This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used.  This method is removed with this change.
2024-01-10 09:26:34 +01:00
Viktor Lofgren
bd7970fb1f (search) Swap swipe direction for more consistent experience 2024-01-09 13:38:40 +01:00
Viktor Lofgren
c47730f2cc (search) Mobile UX improvements.
Swipe right to show filter menu.

Fix CSS bug that caused parts of the menu to not have a background.
2024-01-09 13:30:30 +01:00
Viktor Lofgren
41cccfd2aa (search) Toggle for showing recent results
Actually persist the value of the toggle between searches too...
2024-01-09 11:36:49 +01:00
Viktor Lofgren
aff690f7d6 (search) Toggle for showing recent results
Will by default show results from the last 2 years.  May need to tune this later.
2024-01-09 11:28:36 +01:00
Viktor Lofgren
d4b0539d39 (search) Clean up search results template
Rendering is very slow. Let's see if this has a measurable effect on latency.
2024-01-08 20:57:40 +01:00
Viktor Lofgren
cb55273769 (search) When clicking asn-links, show results from the unfiltered view... 2024-01-08 20:02:19 +01:00
Viktor Lofgren
fbad625126 (linkdb) Add delegating implementation of DomainLinkDb
This facilitates switching between SQL and File-backed implementations on the fly while migrating from one to the other.
2024-01-08 19:56:33 +01:00
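The delegating approach from commit fbad625126 can be sketched roughly as follows. Only the name DomainLinkDb comes from the commit message; the single method on the interface, the class name and the switchTo method are assumptions chosen to keep the example small.

```java
import java.util.List;

/** Hypothetical minimal view of a domain link store; the real interface may differ. */
interface DomainLinkDb {
    List<Integer> findDestinations(int sourceDomainId);
}

/** Forwards every call to a delegate that can be swapped at runtime. */
final class DelegatingDomainLinkDb implements DomainLinkDb {
    // volatile so a switch made on one thread is visible to readers on others
    private volatile DomainLinkDb delegate;

    DelegatingDomainLinkDb(DomainLinkDb initial) {
        this.delegate = initial;
    }

    /** Replaces the backing implementation, e.g. SQL-backed with file-backed. */
    void switchTo(DomainLinkDb next) {
        this.delegate = next;
    }

    @Override
    public List<Integer> findDestinations(int sourceDomainId) {
        return delegate.findDestinations(sourceDomainId);
    }
}
```

Callers hold a reference to the delegating object only, so the backing store can change during a migration without those callers being rewired.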
Viktor Lofgren
e49ba887e9 (crawl data) Add compatibility layer for old crawl data format
The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records.  This is true for the new parquet format, but not for the old zstd/gson format.

To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records, presenting them in the new order.

This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be.

Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.
2024-01-08 19:16:49 +01:00
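The reordering described in commit e49ba887e9 can be illustrated with a simplified sketch. The record types and the class name below are hypothetical stand-ins, and this version buffers all records in memory for brevity, whereas the commit describes a reader that scans the file for the domain record first.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-ins for the two kinds of crawl data records. */
interface CrawlRecord {}
record DomainRecord(String domain) implements CrawlRecord {}
record DocumentRecord(String url) implements CrawlRecord {}

final class DomainFirstReorderer {
    /** Returns the records with the domain record first, documents in original order. */
    static List<CrawlRecord> domainFirst(Iterator<CrawlRecord> source) {
        DomainRecord domain = null;
        List<CrawlRecord> documents = new ArrayList<>();
        while (source.hasNext()) {
            CrawlRecord rec = source.next();
            if (rec instanceof DomainRecord d && domain == null) {
                domain = d; // remember the first domain record, wherever it appears
            } else {
                documents.add(rec);
            }
        }
        List<CrawlRecord> ordered = new ArrayList<>(documents.size() + 1);
        if (domain != null) {
            ordered.add(domain);
        }
        ordered.addAll(documents);
        return ordered;
    }
}
```

The extra pass is the cost the commit message refers to: when the caller does not need the domain record first, reading the file straight through avoids it.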
Viktor Lofgren
edc1acbb7e (*) Replace EC_DOMAIN_LINK table with files and in-memory caching
The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying this table are very slow in relation to how small the data is (~10 GB). This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need.

This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains.  This file is loaded in memory in each node, and can be queried via the Query Service.

A migration step is needed before this file is created in each node.   Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file.

The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
2024-01-08 15:53:13 +01:00
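A file of 32 bit integer pairs, as described in commit edc1acbb7e, is simple to encode and to load into memory. The sketch below works on byte arrays rather than files to stay self-contained. The class name, the big-endian byte order and the map-based in-memory structure are assumptions; the actual on-disk layout and lookup structure may differ.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical codec for (source domain id, destination domain id) pairs. */
final class DomainLinkCodec {
    /** Encodes each link as two consecutive 32 bit integers (8 bytes per link). */
    static byte[] encode(int[][] links) {
        // ByteBuffer is big-endian by default; the real file's byte order is an assumption.
        ByteBuffer buf = ByteBuffer.allocate(links.length * 2 * Integer.BYTES);
        for (int[] link : links) {
            buf.putInt(link[0]).putInt(link[1]);
        }
        return buf.array();
    }

    /** Decodes the pairs into a source -> destinations map held in memory. */
    static Map<Integer, List<Integer>> decode(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        Map<Integer, List<Integer>> bySource = new HashMap<>();
        while (buf.remaining() >= 2 * Integer.BYTES) {
            int source = buf.getInt();
            int dest = buf.getInt();
            bySource.computeIfAbsent(source, k -> new ArrayList<>()).add(dest);
        }
        return bySource;
    }
}
```

At 8 bytes per link the format has no per-row overhead, and a lookup becomes a plain in-memory map access rather than a database query.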
Viktor Lofgren
d304c10641 Merge branch 'master' into converter-optimizations 2024-01-05 13:22:46 +01:00
Viktor Lofgren
302c53a8e7 (build) Enable reproducible builds in build.gradle
Settings for enabling reproducible builds for all subprojects were added to improve build consistency. This includes preserving file timestamps and ordering files reproducibly.

This is primarily of help for docker, since it uses hashes to determine if a file or image layer has changed.
2024-01-05 13:22:13 +01:00
Viktor Lofgren
ef02b712ad (build) Remove false dependency between icp and index-service
This dependency causes the executor service docker image to change when the index service docker image changes.
2024-01-05 13:22:13 +01:00
Viktor Lofgren
aca217cf9a (qs) Better metrics for QS 2024-01-05 13:22:13 +01:00
Viktor Lofgren
9e3386dbbb (search) Fetch fewer results per page
This is a test to evaluate how this impacts load times.
2024-01-05 13:22:13 +01:00
Viktor Lofgren
fdec565b34 (converter) Add upper 128KB limit to how much HTML we'll parse 2024-01-05 13:22:13 +01:00
Viktor Lofgren
33c2188c87 (feature) More trackers 2024-01-05 13:22:13 +01:00
Viktor Lofgren
b3c8fa74cc (feature) Add another doubleclick variant to the adtech trackers 2024-01-05 13:22:13 +01:00
Viktor Lofgren
e53bb70bef (converter) Penalize chatgpt content farm spam 2024-01-05 13:22:13 +01:00
Viktor Lofgren
109bec372c (index) Adjust BM25 parameters 2024-01-05 13:21:52 +01:00
Viktor Lofgren
5c2561d05d (search) Add query strategy requiring link 2024-01-05 13:21:52 +01:00
Viktor Lofgren
0e970b8037 (valuation) Tweaking penalties a bit 2024-01-05 13:21:52 +01:00
Viktor Lofgren
1694b4d6ef (valuation) Increase the penalty for adtech a bit 2024-01-05 13:21:34 +01:00
Viktor Lofgren
396299c1db (index) Reduce the value of site and site-adjacent in BM25P calculations 2024-01-05 13:21:33 +01:00
Viktor Lofgren
71d789aab0 (index) Tweak result valuation renormalization 2024-01-05 13:21:33 +01:00
Viktor Lofgren
41ca50ff0e (build) Enable reproducible builds in build.gradle
Settings for enabling reproducible builds for all subprojects were added to improve build consistency. This includes preserving file timestamps and ordering files reproducibly.

This is primarily of help for docker, since it uses hashes to determine if a file or image layer has changed.
2024-01-05 13:19:59 +01:00
Viktor Lofgren
6d2e14a656 (build) Remove false dependency between icp and index-service
This dependency causes the executor service docker image to change when the index service docker image changes.
2024-01-05 13:17:29 +01:00
Viktor Lofgren
4078708aea (qs) Better metrics for QS 2024-01-04 13:27:14 +01:00
Viktor Lofgren
343ea9c6d8 (search) Fetch fewer results per page
This is a test to evaluate how this impacts load times.
2024-01-04 13:18:07 +01:00
Viktor Lofgren
60361f88ed (converter) Add upper 128KB limit to how much HTML we'll parse 2024-01-03 23:14:03 +01:00