1036 Commits

Author SHA1 Message Date
Viktor
e8de468b0b
Make executor API talk GRPC (#75)
* (executor-api) Make executor API talk GRPC

The executor's REST API was very fragile and annoying to work with, lacking even basic type safety.  Migrate to use GRPC instead.  GRPC is a bit of a pain with how verbose it is, but that is probably a lesser evil.  This is a fairly straightforward change, but it's also large so a solid round of testing is needed...

The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients.

ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name().

The boilerplate needed for grpc was also extracted into a common gradle file for inclusion into the appropriate build.gradle-files.
2024-02-08 13:01:12 +01:00
Viktor Lofgren
d83a3bf4e2 (search) Fix broken !w handling
Printf format error derp.
2024-02-08 12:11:33 +01:00
Viktor Lofgren
f2b39ad055 (search) Fix broken !bang handling
!bang query handling seems to have fallen victim to an overzealous refactoring effort and was broken.

It's now repaired, and a test is in place to ensure we know if it breaks again.
2024-02-08 12:05:09 +01:00
Viktor Lofgren
95d1bd98e4 (array) Update documentation, make unsafe configurable
The readme for the array library was extremely out of date.  Updating it with accurate information about how the library works, and a demo that should compile.

Also added a system property for disabling the use of sun.misc.Unsafe.
2024-02-07 12:26:47 +01:00
Viktor Lofgren
8acbc6a6b4 (index-construction) Split repartition into two actions cont'd
Continues 467ba5be20 by breaking out a constant with the name of the primary ranking set.  Also ensures it doesn't get spuriously logged as updated during the secondary updating pass.
2024-02-06 19:54:17 +01:00
Viktor Lofgren
467ba5be20 (index-construction) Split repartition into two actions
This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets.  Since only the first is necessary before the index construction, the rest can be delayed until after...

To avoid issues in handling the shotgun blast of MqNotifications, Service was switched over to use a synchronous message queue instead of an asynchronous one.

The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader.

Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data.

Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead.

To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.
2024-02-06 17:20:07 +01:00
Viktor Lofgren
29ddf9e61d (doc) Update docs 2024-02-06 16:29:55 +01:00
Viktor Lofgren
92e119cab3 (doc) Update docs 2024-02-06 12:43:42 +01:00
Viktor Lofgren
92049ba8e4 (doc) Update docs 2024-02-06 12:41:28 +01:00
Viktor Lofgren
54330b9921 (*) Remove dead code 2024-02-06 12:41:13 +01:00
Viktor Lofgren
d1aeb030f2 (doc) Update RandomWriteFunnel documentation 2024-02-06 12:35:24 +01:00
Viktor Lofgren
f89274d1ea (minor) Fix broken test
Fallout from changes in endianness made in d986f90074
2024-02-06 12:12:26 +01:00
Viktor Lofgren
7286596fb4 (deps) Remove monkey patched GSON
The codebase used to have a monkey patched version of gson that made special optimizations for the unusually large JSON files that used to store e.g. crawl data.

Since JSON is no longer used in this fashion, the GSON fork is not needed anymore.
2024-02-06 12:11:39 +01:00
Viktor Lofgren
a2fc83d94e (control) Add configurable border styling
To help distinguish between environments, a system property 'control.appBorder' is added that is injected as a body element border property in the control GUI stylesheets.
2024-02-06 12:05:02 +01:00
Viktor Lofgren
2161799cc3 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:18:00 +01:00
Viktor Lofgren
c88f132057 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:10:03 +01:00
Viktor Lofgren
c6313a5906 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:06:36 +01:00
Viktor Lofgren
eadcdb5bed (minor) Improve error handling, naming, and logging in IndexResultDecorator 2024-02-05 21:05:44 +01:00
Viktor Lofgren
6e7649b5f7 (loader) Mitigate fragile paging behavior
IndexJournalWriterPagingImpl was modified to not page on number of entries written, but number of (equivalent uncompressed) bytes written.

Since the failure mode if too much data is written per file is quiet corruption of the index, the former behavior was extremely fragile.  The new behavior should consistently ensure that the data is sufficiently small to not cause any integer rollovers.

The change in 6dcc20038c was reverted, as there is really no sane reason to have this configurable in software.
2024-02-05 21:05:03 +01:00
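The idea of paging on bytes rather than entries can be sketched roughly as below. This is only an illustration, not the actual IndexJournalWriterPagingImpl; the class name `BytePagingCounter`, the method `pageFor` and the limit passed to the constructor are hypothetical.

```java
/** Hypothetical sketch: decides which page (file) a record goes to, based on bytes written. */
final class BytePagingCounter {
    private final long maxBytesPerPage; // assumed limit, chosen small enough to avoid integer rollover
    private long bytesInPage = 0;
    private int page = 0;

    BytePagingCounter(long maxBytesPerPage) {
        this.maxBytesPerPage = maxBytesPerPage;
    }

    /** Records a write of the given uncompressed size and returns the page index it belongs to. */
    int pageFor(long uncompressedSize) {
        // Switch to a new page before the current one would exceed the byte budget.
        // A single record larger than the budget still gets a page of its own.
        if (bytesInPage > 0 && bytesInPage + uncompressedSize > maxBytesPerPage) {
            page++;
            bytesInPage = 0;
        }
        bytesInPage += uncompressedSize;
        return page;
    }
}
```

Counting bytes instead of entries means a run of unusually large records triggers a page switch early, instead of silently overshooting the size limit.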
Viktor Lofgren
d986f90074 (index) Fix consistency between RandomFileAssembler implementations
The RandomFileAssembler implementations, introduced in commit 53c575db3f, were all acting subtly differently.  The RWF implementation wrote big-endian longs instead of the native endianness used by the other implementations (and expected by the index construction code).  Further, the mmap implementation exposed a bug in LongArray.write() that caused it to create a larger file than necessary.

A test was built to ensure the output of these implementations is equivalent.
2024-02-05 21:01:32 +01:00
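To illustrate why the byte order mismatch matters, here is a small standalone sketch using java.nio.ByteBuffer. The class `EndianDemo` is hypothetical and not part of the codebase; it only shows that a long written with one byte order reads back as a different value with the other.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical demo of how byte order changes the on-disk representation of a long. */
final class EndianDemo {
    /** Encodes a long into 8 bytes using the given byte order. */
    static byte[] encode(long value, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES).order(order);
        buf.putLong(value);
        return buf.array();
    }

    /** Decodes 8 bytes as a long using the given byte order. */
    static long decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getLong();
    }
}
```

Writing the value 1 as big-endian and reading it back as little-endian yields 2^56 rather than 1, which is the kind of silent mismatch an equivalence test between implementations would catch.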
Viktor Lofgren
53c575db3f (index-construction) Make random-write file strategy configurable
To cope with writing large files out of order, the system needs some form of strategy to avoid writing them directly to disk, as this causes insane amounts of disk thrashing.  By default, the data is just buffered in RAM.  This works well on a large server, but smaller systems struggle.

To help systems with small RAM process large amounts of data, the old RandomWriteFunnel is brought back if the system property 'system.conserve-memory' is set to true.  RandomWriteFunnel buffers the construction by creating a series of small files that pigeonhole the writes into rough neighborhoods, and then goes over the files one by one to construct one area of the file at a time.  This is relatively slow and uses more than twice the disk size.

A new interface RandomFileAssembler is introduced as an abstraction for this operation.  A third strategy, direct mmaps, is also introduced if the file is very small (less than 1 GB).  In this domain, disk thrashing is unlikely since it will comfortably fit in RAM.
2024-02-05 12:31:15 +01:00
Viktor Lofgren
6dcc20038c (index-journal) Make index journal page size configurable
Adds a new system property loader.journal-page-size to configure this setting.
2024-02-05 11:26:05 +01:00
Viktor Lofgren
fa145f632b (sideload) Add special handling for sideloaded wiki documents
This update enhances the SideloaderProcessing and DocumentClass modules to specially handle sideloaded wiki documents. Wiki content is generally truncated to the first paragraph, which generally tends to be too short to be included independently. An additional DocumentClass (SIDELOAD) has been introduced to suppress the length check in this case.
2024-02-02 21:22:07 +01:00
Viktor Lofgren
785d8deadd (crawler) Improve meta-tag redirect handling, add tests for redirects.
Wrote a new test to examine the redirect behavior of the crawler, ensuring that the redirect URL is the URL that is reported in the parquet file.  This works as intended.

Noticed in the course of this that the crawler doesn't add links from meta-tag redirects to the crawl frontier.  Added logic to handle this case, amended the test case to verify the new behavior.  Added the meta-redirect case to the HtmlDocumentProcessorPlugin as well, so that we consider it a link between documents in the unlikely case that a meta redirect is to another domain.
2024-02-01 20:30:43 +01:00
Viktor Lofgren
93a2d5afbf (*) Fix poorly named test
Likely old refactoring gore.
2024-02-01 20:08:15 +01:00
Viktor Lofgren
d60c6b18d4 (doc) Update the readmes for the crawler, as they've grown stale. 2024-02-01 18:10:55 +01:00
Viktor Lofgren
d1e02569f4 (language-processing) Add a system property for configuring which language detection model to use
The flag is `system.languageDetectionModelVersion`.

* If negative, no model is used.
* If 0, both models are used.
* If 1, the old crappy model is used.
* If 2, the new fasttext model is used.
2024-01-31 13:02:33 +01:00
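A minimal sketch of how such a flag could be interpreted is shown below. Only the property name and the four documented cases come from the commit message; the class `LanguageModelConfig`, the enum `Mode`, the default value of 0 and the rejection of values above 2 are assumptions made for illustration.

```java
/** Hypothetical sketch of mapping the documented flag values to a model selection. */
final class LanguageModelConfig {
    enum Mode { NONE, BOTH, OLD_ONLY, FASTTEXT_ONLY }

    /** Maps a version number to a mode, following the list in the commit message. */
    static Mode fromVersion(int version) {
        if (version < 0) return Mode.NONE;
        if (version == 0) return Mode.BOTH;
        if (version == 1) return Mode.OLD_ONLY;
        if (version == 2) return Mode.FASTTEXT_ONLY;
        // Assumption: values above 2 are not documented, so this sketch rejects them.
        throw new IllegalArgumentException("Unknown model version: " + version);
    }

    /** Reads the system property; the default of 0 is an assumption, not taken from the source. */
    static Mode fromSystemProperty() {
        return fromVersion(Integer.getInteger("system.languageDetectionModelVersion", 0));
    }
}
```

With this sketch, starting the JVM with `-Dsystem.languageDetectionModelVersion=2` would select the fasttext model only.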
Viktor Lofgren
9ce67029ca (language-processing) Add a system property for configuring which language detection model to use
The flag is `system.languageDetectionModelVersion`.

* If negative, no model is used.
* If 0, both models are used.
* If 1, the old crappy model is used.
* If 2, the new fasttext model is used.
2024-01-31 13:02:16 +01:00
Viktor Lofgren
98f3382cea (minor) Fix test and improve error message 2024-01-31 11:53:41 +01:00
Viktor Lofgren
52a0255814 (*) Add flag for disabling ASCII flattening
The production configuration assumes all content of interest is 7 bit ASCII, and makes a series of optimizations based on this.  This assumption holds poorly in the wild.

Adding an **experimental** system property 'system.noFlattenUnicode', that when set to TRUE, will disable this behavior.

IMPORTANT!! The index needs to be re-constructed when this flag is changed, as different hash functions are selected for the keyword->identifier mappings.
2024-01-31 11:50:59 +01:00
Viktor Lofgren
eb59ac8535 (index-ranking) Adjust the BM25P factors a bit
Since the bleed-flags set by the anchor tag logic have been changed to Site and SiteAdjacent, give them a bit more importance when set together with ExternalLink.

UrlDomain and UrlPath are also more consistently rewarded only once.
2024-01-30 21:27:29 +01:00
Viktor Lofgren
6edc318597 (control) Fix typo in URL linking to new-crawl-specs 2024-01-26 10:43:10 +01:00
Viktor Lofgren
182c0cf28e (control) Add warnings about domain data contamination 2024-01-25 18:26:15 +01:00
Viktor Lofgren
0b105b5986 (converter) Update hyperlink text for new crawl spec creation.
Fix minor typo.
2024-01-25 18:05:11 +01:00
Viktor Lofgren
cae1bad274 (*) Add download-sample action, refactor file storage
This changeset adds an action for downloading a set of sample data from downloads.marginalia.nu.

It also refactors out some leaky abstractions out of FileStorageService.  allocateTemporaryStorage has been renamed allocateStorage.  The storage was never temporary in any scenario...

It also doesn't take a storage base, as there was always only one valid option for this input.  The allocateStorage method finds the appropriate base itself.
2024-01-25 13:36:30 +01:00
Viktor Lofgren
1b8b97b8ec (sample-exporter) Add some limits on sizes and lengths
Tar files will reject entries with filenames over 100 bytes, so we need a limit there.  Also added a maximum size limit to keep the file sizes reasonable.
2024-01-25 11:51:53 +01:00
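A check along these lines could look like the sketch below. It assumes the limit applies to the UTF-8 encoded length of the entry name, which matches the 100-byte name field of the classic tar header; the class `TarNameLimit` and its method are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a filename length check for tar entries. */
final class TarNameLimit {
    static final int MAX_NAME_BYTES = 100;

    /** Returns true if the entry name fits within the 100-byte limit when encoded as UTF-8. */
    static boolean fits(String entryName) {
        // Count bytes, not characters: non-ASCII characters take more than one byte each.
        return entryName.getBytes(StandardCharsets.UTF_8).length <= MAX_NAME_BYTES;
    }
}
```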
Viktor Lofgren
c088c25b09 (*) Fix broken test, clean up code 2024-01-24 12:50:41 +01:00
Viktor Lofgren
958d64720e (control) Add a view for restarting aborted processes
This will avoid having to dig in the message queue to perform this relatively common task.

The control service was also refactored to extract common timestamp formatting logic out of the data objects and into the rendering.
2024-01-24 12:47:10 +01:00
Viktor Lofgren
805afad4fe (control) New GUI for exporting crawl data samples
Not going to win any beauty pageants, but this is pretty peripheral functionality.
2024-01-23 17:08:21 +01:00
Viktor Lofgren
400f4840ad (*) Fix broken code in jmh 2024-01-23 17:08:21 +01:00
Viktor Lofgren
ee7792596d (*) Fix broken test
Probably shouldn't have tests depending on external data like this...
2024-01-23 12:03:47 +01:00
Viktor Lofgren
0081328aca (converter) Adjust which flags are set by anchor text keywords
It's a mistake to let it bleed into Title, as this is a high quality signal.  We'll co-opt Site and SiteAdjacent instead to reinforce the ExternalLink when count is high.
2024-01-23 11:54:00 +01:00
Viktor Lofgren
3fff7f6878 (converter) Fix issue where quality limits were no longer enforced 2024-01-23 11:42:17 +01:00
Viktor Lofgren
f15dd06473 (index) Delayed close() of SearchIndexReader
This avoids concurrent access errors.  This is especially important when using Unsafe-based LongArrays, since we have concurrent access to the underlying memory-mapped file.  If we pull the rug from under the caller by closing the file, we'll get a SIGSEGV.  Even with a "safe" MemorySegment, we'll get ugly stacktraces if we close the file while a thread is still accessing it.

So we spin up a thread that sleeps for a minute before actually unmapping the file, allowing any ongoing requests to wrap up.  This is 100% a hack, but it lets us get away with doing this without adding locks to the index readers.

Since this is "just" mmapped data, and this operation happens optimistically once a month, it should be safe if the call gets lost.
2024-01-23 11:08:41 +01:00
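The delayed-close idea can be sketched as follows. Note that this is not the implementation described above: instead of a dedicated thread that sleeps, the sketch uses `CompletableFuture.delayedExecutor` from the standard library, and the class `DelayedCloser` is hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: close a resource only after a grace period has passed. */
final class DelayedCloser {
    /**
     * Schedules resource.close() to run after the given delay, so that
     * requests already in flight have time to finish with the resource.
     */
    static CompletableFuture<Void> closeLater(AutoCloseable resource, long delay, TimeUnit unit) {
        return CompletableFuture.runAsync(() -> {
            try {
                resource.close();
            } catch (Exception e) {
                throw new IllegalStateException("Delayed close failed", e);
            }
        }, CompletableFuture.delayedExecutor(delay, unit));
    }
}
```

As in the commit message, there is no locking here: the approach relies on the delay being much longer than any request, and the close may be lost if the JVM exits first.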
Viktor Lofgren
dd26819d66 (actor) Try to fix a rare data race where a finished job is considered dead. 2024-01-22 21:22:38 +01:00
Viktor Lofgren
a6d257df5b (converter) Update Stackexchange sideload instruction
The sideload instruction in the stackexchange template was updated. The instruction now states that stackexchange data will be loaded from a directory on the server and directs users to a new documentation url for more detailed information.
2024-01-22 18:29:20 +01:00
Viktor Lofgren
41d896ba3e (converter) Refactor content type check in PlainTextDocumentProcessorPlugin
The method `isApplicable` in the `PlainTextDocumentProcessorPlugin` was refactored to handle a wider range of content types beyond merely "text/plain". It now also handles any content type that starts with "text/plain;", to accommodate content types that append a charset as well.
2024-01-22 17:52:14 +01:00
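The described check amounts to something like the sketch below. The class `PlainTextContentType` and the method name are hypothetical, and the null handling and case-sensitive comparison are assumptions; the real `isApplicable` may differ.

```java
/** Hypothetical sketch of the content type check described above. */
final class PlainTextContentType {
    /** Accepts "text/plain" exactly, or "text/plain;" followed by parameters such as a charset. */
    static boolean isPlainText(String contentType) {
        if (contentType == null) return false; // assumption: a missing content type is not applicable
        return contentType.equals("text/plain") || contentType.startsWith("text/plain;");
    }
}
```

Matching on the `text/plain;` prefix rather than just `text/plain` avoids accepting unrelated types that merely share the first ten characters.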
Viktor Lofgren
51cdf46645 (control) Improve accessibility in search-to-ban template
This update enhances accessibility by associating labels with the corresponding checkboxes in the search-to-ban template.
2024-01-22 15:01:00 +01:00
Viktor Lofgren
1eb0adf6d3 (array) Add sun.misc.Unsafe variant of LongArray 2024-01-22 13:38:42 +01:00
Viktor Lofgren
40c9d2050f (control) Fully automatic conversion
Removed the need to have to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine.

Removed the tool itself.

This stirred up some issues with the dependencies, that were due to both third-party:ing xz and importing it as a dependency.  This has been fixed, and :third-party:xz was removed.
2024-01-22 13:03:24 +01:00
Viktor Lofgren
3a325845c7 (mq) Add better error handling in fsm and mq
java.lang.Error:s were not handled properly, leading to mismatch in the bookkeeping of the FSMs.  These are now caught, acted on, and re-thrown.

MqSynchronousInbox also no longer assumes all exceptions are InterruptedException.
2024-01-22 13:03:24 +01:00
Viktor Lofgren
6a1bfd6270 (array) Remove unused 'madvise' code and 3rd party dependency on 'uppend'
This wasn't actually hooked in anywhere.  Removing the dependency and code.  If it turns out we need madvise in the future, we'll re-introduce it.
2024-01-22 13:01:57 +01:00
Viktor Lofgren
b91ea1d7ca (control) Re-add gui for sideloading dirtrees 2024-01-20 18:09:40 +01:00
Viktor Lofgren
c5760cd535 (test) Fix broken test 2024-01-20 13:39:40 +01:00
Viktor Lofgren
91c7960800 (crawler) Extract additional configuration properties
This commit extracts several previously hardcoded configuration properties, and makes them available through system.properties.

The documentation is updated to reflect the change.

Dead code was also removed in the process. CrawlSpecGenerator still feels a bit over-engineered, since it's built for a more general case where every implementation but the current one has been removed, but we'll leave it like this for now as it's still fairly readable.
2024-01-20 10:36:04 +01:00
Viktor Lofgren
2079a5574b (control) Update heading in restore backup template
Changed the heading in the partial restore backup page from "Load" to "Restore Backup".
2024-01-19 21:46:53 +01:00
Viktor Lofgren
27ffb8fa8a (converter) Integrate zim->db conversion into automatic encyclopedia processing workflow
Previously, in order to load encyclopedia data into the search engine, it was necessary to use the encyclopedia.marginalia.nu converter to first create a .db-file.  This isn't very ergonomic, so parts of that code-base were lifted in as a 3rd party library, and conversion from .zim to .db is now done automatically.

The output file name is based on the original filename, plus a crc32 hash and a .db-ending, to ensure we can recycle the data on repeat loads.
2024-01-19 13:59:03 +01:00
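A naming scheme of this kind could be sketched as below. The commit message does not say what the crc32 is computed over or how the parts are joined, so hashing the full original path and using dots as separators are assumptions; the class `DbFileNames` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical sketch of deriving a stable .db file name from an input file. */
final class DbFileNames {
    /** Returns "<original file name>.<crc32 hex>.db"; the hash input (the path string) is an assumption. */
    static String dbNameFor(String originalPath) {
        CRC32 crc = new CRC32();
        crc.update(originalPath.getBytes(StandardCharsets.UTF_8));
        // Keep only the last path segment as the readable part of the name.
        String fileName = originalPath.substring(originalPath.lastIndexOf('/') + 1);
        return fileName + "." + Long.toHexString(crc.getValue()) + ".db";
    }
}
```

Because the name is a pure function of the input, a repeat load of the same file maps to the same .db file, which is what makes recycling the converted data possible.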
Viktor Lofgren
22c8fb3f59 (crawler) Fix a bug where reference copies of crawl data were written without etag and last-modified
This commit also adds a band-aid to ParquetSerializableCrawlDataStream to fetch this from the 304-entity.  This can be removed in a few months.
2024-01-18 16:02:27 +01:00
Viktor Lofgren
964419803a Fix broken test 2024-01-18 15:42:01 +01:00
Viktor Lofgren
6271d5d544 (mq) Add relation tracking between MQ messages for easier tracking and debugging.
The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID.  This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers.

The existing RELATED_ID field has too many semantics associated with it; among other things, the FSM code uses this field in tracking state changes.

The change set also improves the consistency of inbox names.  The IndexClient was buggy and populated its outbox with a UUID.  This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.
2024-01-18 15:08:27 +01:00
Viktor Lofgren
175bd310f5 (control) Message queue UX improvements 2024-01-18 13:05:50 +01:00
Viktor Lofgren
67ee6f4126 (control) Clean up filtering UX in Events table 2024-01-18 12:35:39 +01:00
Viktor Lofgren
01b312f14c (*) Make new index nodes accept queries by default
It's a confusing default behavior.

This was off for nodes n>1 before as a bandaid since querying indices with no data caused delays and errors.  This has been fixed now, so there's no need to do this anymore!
2024-01-18 12:05:37 +01:00
Viktor Lofgren
18638c62de (control) Rephrase text 2024-01-18 11:53:10 +01:00
Viktor Lofgren
753d000788 (control) Add toggle for automatic loading of processed data 2024-01-18 11:52:58 +01:00
Viktor Lofgren
19e781b104 (control) Add basic input validation to node actions
Will present a simple error message when required fields aren't populated, instead of a cryptic HTTP status error.
2024-01-18 11:52:49 +01:00
Viktor Lofgren
aa2df327db (index) Prevent index from attempting to answer queries when no index data is loaded
This improves query times, and gets rid of exceptions in the logs when one of the index nodes doesn't have any data loaded, yet is configured to answer queries.
2024-01-18 11:05:45 +01:00
Viktor Lofgren
321fa94b8f (crawler) Fix rare exception in content type handling due to improper length checking of a split() array 2024-01-18 11:05:45 +01:00
Viktor Lofgren
41cdb8f71b (control) Fix broken update button in the update-domain-ranking-set form
id property was on the wrong element.
2024-01-17 18:21:09 +01:00
Viktor Lofgren
304d4c9acf (control) Fix result ordering in the file storage listing view
In some scenarios, such as when restoring storage items from json-manifest on db failure, the file storage view would present the items in a non-chronological order.  Added a sort() operation to mitigate this.
2024-01-17 10:56:30 +01:00
Viktor Lofgren
7fd4c092e3 (control) Clean up UX and accessibility for new domain ranking sets.
The change also adds basic support for error messages in the GUI.
2024-01-17 10:47:14 +01:00
Viktor Lofgren
2fe5705542 (control) GUI for ranking sets
Still missing is some polish, forms don't have proper labels, validation is inconsistent, no error messages, etc.
2024-01-16 17:10:09 +01:00
Viktor Lofgren
e968365858 (index) Use new DomainRankingSets to configure ranking algos in index svc 2024-01-16 12:43:32 +01:00
Viktor Lofgren
36ad4c7466 (db) Add a new configuration object 'domain ranking set' for storing ranking parameters 2024-01-16 12:34:00 +01:00
Viktor Lofgren
5a62b3058f (query-api) Make the search set identifier a string value in the API
This will free the core marginalia search engine to use arbitrary search set definitions, while the app can use its hardcoded defaults.
2024-01-16 10:55:24 +01:00
Viktor Lofgren
a1df9e886a (control) Also clean up stale 'NEW' messages 2024-01-15 16:14:02 +01:00
Viktor Lofgren
fd1eec99b5 (cleanup) Fix broken tests 2024-01-15 15:44:33 +01:00
Viktor Lofgren
e162406d40 (control) New control-side actors for cleaning up stale service heartbeats and message queue entries 2024-01-15 15:44:23 +01:00
Viktor Lofgren
c41e68aaab (control) New export actions for RSS/Atom feeds and term frequency data
This commit also refactors the executor a bit, and introduces a new converter-feature called data-extractors for this class of jobs.
2024-01-15 14:54:26 +01:00
Viktor Lofgren
4665af6c42 (control) Move export data endpoint to actions controller 2024-01-15 11:06:22 +01:00
Viktor Lofgren
c0b15427fe (control) New crawl view should use radio buttons as multiple specs aren't supported 2024-01-15 11:03:47 +01:00
Viktor Lofgren
f29a9d972d (control) Move 'new crawl spec' to /node/:id/actions, out of /node/:id/storage 2024-01-15 11:02:00 +01:00
Viktor Lofgren
b192373ae7 (control) Highlight unavailable items (creating, deleting) in node actions views 2024-01-15 10:47:54 +01:00
Viktor Lofgren
c042650382 (docs) Improve query service documentation 2024-01-13 21:16:45 +01:00
Viktor Lofgren
07a916a720 (search) Give the swipe hint on mobile a nicer finish 2024-01-13 18:51:54 +01:00
Viktor Lofgren
5134044530 (assistant) Make assistant client more robust to the service going down
This is especially important for the non-essential functions, like website similarities...
2024-01-13 18:29:30 +01:00
Viktor Lofgren
4c62065e74 (install) Add two separate templates for the install script
One template is for the full Marginalia Search style install, and the other is for a barebones install with no Marginalia-related fluff.
2024-01-13 18:27:42 +01:00
Viktor Lofgren
d28fc99119 (MainClass) ensure logging isn't loaded before service name is known
Loading it too early causes all logs to have names like ${sys:service-name}, instead of the service name...
2024-01-13 18:19:50 +01:00
Viktor Lofgren
c9fb45c85f (search) Fix control.hideMarginaliaApp handling 2024-01-13 17:24:15 +01:00
Viktor Lofgren
7c6e18f7a7 (*) Overhaul settings and properties
Use a system.properties file to configure the system.  This is loaded statically by MainClass or ProcessMainClass.  Update the property names to be more consistent, and update the documentation to reflect the changes.
2024-01-13 17:12:18 +01:00
Viktor Lofgren
176b9c9666 (convert) Add sizeHints to legacy serializable crawl data stream
This reduces the maximum memory usage when processing legacy crawl data
2024-01-13 15:50:36 +01:00
Viktor Lofgren
ecd9c35233 (control) Clean up the event log
* Generate fewer uninteresting event messages.
* Display fewer irrelevant fields in the overview table.
2024-01-13 13:28:02 +01:00
Viktor Lofgren
71e32c57d9 (control) Add better timestamps for the events and message queue views
Adjust display precision based on distance into the past, full ms-accurate timestamps available via hover-action.
2024-01-13 13:04:56 +01:00
Viktor Lofgren
2fefd0e4e3 (control) Add better timestamps for the events and message queue views
Adjust display precision based on distance into the past, full ms-accurate timestamps available via hover-action.
2024-01-13 13:03:52 +01:00
Viktor Lofgren
81eaf79a25 (control) UX polish 2024-01-13 12:31:13 +01:00
Viktor Lofgren
8dea7217a6 (control) UX fixes, node GUI doesn't break when an executor service goes offline. 2024-01-13 12:17:30 +01:00
Viktor Lofgren
c0fb9e17e8 (control) Add filter dropdown to message queue table
This makes inspecting the queues for processes much easier, as otherwise
these important messages are often drowned out by FSM chatter.
2024-01-12 18:46:17 +01:00
Viktor Lofgren
83776a8dce (control) Wean the ExportDataActor off EC_DOMAIN_LINK
The EC_DOMAIN_LINK table is deprecated and slated for removal, use QueryClient.getAllDomainLinks() instead.

The ExportDataActor now uses the QueryClient appropriately.  The CSV format was also changed to quote the values, to prevent e.g. Excel from interpreting the comma as a decimal separator when previewing the file.

Finally the form for triggering an export was overhauled.
2024-01-12 17:09:11 +01:00
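Quoting CSV values can be sketched as below. This follows the common convention of wrapping each field in double quotes and doubling any embedded quotes; whether the ExportDataActor does exactly this is not stated, and the class `CsvQuoting` is hypothetical.

```java
/** Hypothetical sketch of writing a CSV row with every value quoted. */
final class CsvQuoting {
    /** Wraps a value in double quotes, doubling any quotes it already contains. */
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /** Joins the quoted values with commas into a single CSV row (no line terminator). */
    static String row(String... values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(quote(values[i]));
        }
        return sb.toString();
    }
}
```

Quoting every field means the separating commas always sit outside quotes, so a spreadsheet application has less room to misread them as part of a number.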
Viktor Lofgren
98c0972619 (control) Add a summary table for Actors in the Node overview 2024-01-12 16:32:15 +01:00
Viktor Lofgren
56d832d661 (control) Adjust the margins of the headings to be consistent 2024-01-12 16:16:57 +01:00