Commit Graph

1090 Commits

Author SHA1 Message Date
Viktor Lofgren
2ee492fb74 (gRPC) Bind gRPC services to an interface
By default gRPC it magically decides on an interface.  The change will explicitly tell it what to use.
2024-02-20 14:22:47 +01:00
Viktor Lofgren
36a5c8b44c (cleanup) Clean up code 2024-02-20 14:22:47 +01:00
Viktor Lofgren
07b625c58d (query-client) Add support for fault-tolerant requests to single node services
Adding a method importantCall that will retry a failing request on each route until it succeeds or the routes run out.
2024-02-20 14:16:05 +01:00
Viktor Lofgren
746a865106 (client) Fix handling of channel refreshes
The previous code made an incorrect assumption that all routes refer to the same node, and would overwrite the route list on each update.  This lead to storms of closing and opening channels whenever an update was received.

The new code is correctly aware that we may talk to multiple nodes.
2024-02-20 14:14:09 +01:00
Viktor
f85ec28a16
Merge branch 'master' into service-discovery 2024-02-20 11:44:12 +01:00
Viktor Lofgren
0307c55f9f (refac) Zookeeper for service-discovery, kill service-client lib (WIP)
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.

A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.

The last remaining REST service, the assistant-service, has been migrated to gRPC.

This also proved a good time to clear out primordial technical debt from the root of the codebase.  The 'service-client' library has been taken behind the barn and given a last farewell.  It's replaced by a small library for managing gRPC channels.

Since it's no longer used by anything, RxJava has been removed as a dependency from the project.

Although the current state seems reasonably stable, this is a work-in-progress commit.
2024-02-20 11:41:14 +01:00
Viktor
d05c916491
Merge pull request #80 from MarginaliaSearch/ranking-algorithms
Clean up domain ranking code
2024-02-18 09:52:34 +01:00
Viktor Lofgren
c73e43f5c9 (recrawl) Mitigate recrawl-before-load footgun
In the scenario where an operator

* Performs a new crawl from spec
* Doesn't load the data into the index
* Recrawls the data

The recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file,
irrecoverably losing the crawl log making it impossible to load!

To mitigate the impact similar problems, the change saves a backup of the old crawl log, as well as complains about this happening.

More specifically to this exact scenario however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl.  This should help the DbCrawlSpecProvider to find them regardless of loaded state.

This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so it's arguably merited.
2024-02-18 09:23:20 +01:00
Viktor Lofgren
e61e7f44b9 (blacklist) Delay startup of blacklist
To help services start faster, the blacklist will no longer block until it's loaded.  If such a behavior is desirable, a method was added to explicitly wait for the data.
2024-02-18 09:23:20 +01:00
Viktor Lofgren
f9b6ac03c6 (api) Clean up incorrect error handling in GrpcChannelPool 2024-02-18 08:45:35 +01:00
Viktor Lofgren
296ccc5f8e (blacklist) Clean up blacklist impl
The domain blacklist blocked the start-up of each process that injected it, adding like 30 seconds to the start-up time in prod.

This change moves the loading to a separate thread entirely.  For threads or processes that require the blacklist to be definitely loaded, a helper method was added that blocks until that time.
2024-02-18 08:16:48 +01:00
Viktor Lofgren
8cb5825617 (search) Temporarily disable the Popular filter
This filter currently does not distinguish itself very much from the unfiltered results, and lends the impression that the filters don't "do anything".

It may come back in some shape or form in the future, with some additional tweaking of the rankings...
2024-02-18 08:02:01 +01:00
Viktor Lofgren
cee707abd8 (crawler) Implement domain shuffling in DbCrawlSpecProvider
Modified the DbCrawlSpecProvider to shuffle domains after loading to ensure a good mix for each crawl. This change prevents overload of crawling the same server in parallel from different subdomains or crawling big domains all at once.
2024-02-17 17:47:38 +01:00
Viktor Lofgren
92717a4832 (client) Refactor GrpcStubPool to handle error states
Refactored the GRPC Stub Pool for better handling of channel SHUTDOWN state. Any disconnected channels are now re-created before returning the stub.

The class was also renamed to GrpcChannelPool, as we no longer pool the stubs.
2024-02-17 14:42:26 +01:00
Viktor Lofgren
37a7296759 (sideload) Clean up the sideloading code
Clean up the sideloading code a bit, making the Reddit sideloader use the more sophisticated SideloaderProcessing approach to sideloading, instead of mimicing StackexchangeSideloader's cruder approach.

The reddit sideloader now uses the SideloaderProcessing class.  It also properly sets js-attributes for the sideloaded documents.

The control GUI now also filters the upload directory items based on name, and disables the items that do not have appropriate filenames.
2024-02-17 14:32:36 +01:00
Viktor Lofgren
ebbe49d17b (sideload) Fix sideloading of explicitly selected stackexchange files
Fix a bug where sideloading stackexchange files by explicitly selecting the 7z file would fail, since the 7z file would be passed along to the converter rather than the path to the pre-converted .db file.
2024-02-17 13:24:04 +01:00
Viktor Lofgren
b7e330855f (control) Update descriptive text in the control GUI 2024-02-16 20:32:31 +01:00
Viktor Lofgren
ac89224fb0 (domain-ranking) Remove lingering mentions of the algorithms field from the GUI 2024-02-16 20:28:37 +01:00
Viktor Lofgren
9ec262ae00 (domain-ranking) Integrate new ranking logic
The change deprecates the 'algorithm' field from the domain ranking set configuration.  Instead, the algorithm will be chosen based on whether influence domains are provided, and whether similarity data is present.
2024-02-16 20:22:01 +01:00
Viktor Lofgren
64acdb5f2a (domain-ranking) Clean up domain ranking
The domain ranking code was admittedly a bit of a clown fiesta; at the same time buggy, fragile and inscrutable.

Migrating over to use JGraphT to store the link graph
when doing rankings, and using their PageRank implementation.  Also added a modified version that does PersonalizedPageRank.
2024-02-16 18:04:58 +01:00
Viktor Lofgren
a175b36382 (search) Correct accidental regression of the SmallWeb filter 2024-02-15 18:16:56 +01:00
Viktor Lofgren
16526d283c (search) Correct accidental regression of the Vintage filter 2024-02-15 18:13:34 +01:00
Viktor Lofgren
752e677555 (search) Expose getSearchTitle in DecoratedSearchResults 2024-02-15 13:56:44 +01:00
Viktor Lofgren
f796af1ae8 (search) Fix failed refactoring 2024-02-15 13:53:19 +01:00
Viktor Lofgren
2515993536 (search) Fix issue where searchTitle setting gets lost when searching again
It's important that the field names in SearchParameters matches the fields referenced in search-form.hdb, otherwise they will get lost in transit.
2024-02-15 13:52:11 +01:00
Viktor Lofgren
66b3e71e56 (search) Expose more search options
This change set updates the query APIs to enable the search service to add additional criteria, such as QueryStrategy and TemporalBias.

The QueryStrategy makes it possible to e.g. require a match is in the title of a result, and TemporalBias enables penalizing results that are not within a particular time period.

These options are added to the search interface.  The old 'recent results' is modified to use TemporalBias, and a new filter 'Search In Title' is added as well.

The vintage filter is modified to add a temporal bias for the past.
2024-02-15 13:39:51 +01:00
Viktor Lofgren
652d151373 (process-models) Improve documentation 2024-02-15 12:21:12 +01:00
Viktor Lofgren
300b1a1b84 (index-query) Add some tests for the QueryFilter code 2024-02-15 12:03:30 +01:00
Viktor Lofgren
6c3b49417f (index-query) Improve documentation and code quality 2024-02-15 11:33:50 +01:00
Viktor Lofgren
dcc5cfb7c0 (index-journal) Improve documentation and code quality 2024-02-15 10:51:49 +01:00
Viktor
d970836605
Merge pull request #79 from MarginaliaSearch/reddit
(converter) Loader for reddit data

Adds experimental sideloading support for pusshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.

Since the median post quality on reddit is not very good, he sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more.

Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these can not be found, the tests will shortcircuit as OK. They're mostly there for debugging, and it's fine if they don't always run.

The change also refactors the sideloading a bit since it was a bit messy, and improves the sideload UX a tiny bit.
2024-02-15 09:17:56 +01:00
Viktor Lofgren
8021bd0aae (control) Sort upload listing results
Improve the UX of the sideload GUI by sorting the results in a sensible fashion, first by whether it's a directory, then by its filename.

The change also changes the timestamp rendering to a more human-readable format than full ISO-8601.
2024-02-15 09:13:40 +01:00
Viktor Lofgren
8f91156d80 (control) Improve sideload UX
The sideload forms didn't properly set the label 'for' property, meaning that while label tags existed, they weren't appropriately clickable.

Also removed unnecessary limits on the sideload target being a directory for stackexchange and warc.  It's been possible to directly load a particular file for a while, but not allowed due to GUI limits.
2024-02-14 18:38:20 +01:00
Viktor Lofgren
fab36d6e63 (converter) Loader for reddit data
Adds experimental sideloading support for pusshift.io style reddit data.  This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.

Since the median post quality on reddit is not very good, he sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes.  Empirically this appears to mostly return good matches, even if it probably could index more.

Tests were written for this, but all require local reddit data which can't be distributed with the source code.  If these can not be found, the tests will shortcircuit as OK.  They're mostly there for debugging, and it's fine if they don't always run.

The change also refactors the sideloading a bit since it was a bit messy.
2024-02-14 17:35:44 +01:00
Viktor Lofgren
3d54879c14 (API, minor) Clean up comments. 2024-02-14 12:09:16 +01:00
Viktor Lofgren
e17fcde865 (API, minor) Remove unnecessary inject. 2024-02-14 12:05:50 +01:00
Viktor Lofgren
6950dffcb4 (API) Fix result order in API results
These results should be presented in the same order as their ranking score.
2024-02-14 11:47:14 +01:00
Viktor Lofgren
02dd5c5853 (converter) Look at properties when deciding pool size
Look at whether the property 'system.conserveProperty' is enabled when deciding he default pool size for the converter.

If true, a much more conservative default is used, limiting the risk of running out of memory.
2024-02-12 16:24:19 +01:00
Viktor Lofgren
5a1087dbf9 (qs-gui) Update documentation, add param for domain limit 2024-02-12 16:13:48 +01:00
Viktor Lofgren
7564dfeb7a (minor) Correct link in documentation for app services 2024-02-12 15:55:06 +01:00
Viktor Lofgren
10bad635a8 (search) Experimental support for clustering search results
Improves clustering of results.
2024-02-11 20:00:11 +01:00
Viktor Lofgren
7cc8b0fed5 (search) Experimental support for clustering search results
Improves clustering of results.
2024-02-11 19:58:55 +01:00
Viktor Lofgren
a77846373b (search) Experimental support for clustering search results
Improves clustering of results.
2024-02-11 19:48:55 +01:00
Viktor Lofgren
bcd0dabb92 (search) Experimental support for clustering search results
Adds experimental support for clustering search results by e.g. domain.  At a first stage, this is only enabled for the wiki and forum filters.

The commit also cleans up the UrlDetails class, which contained a number of vestigial entries.
2024-02-11 17:31:38 +01:00
Viktor Lofgren
9d68062553 (converter) Make processing pool size configurable 2024-02-10 20:59:08 +01:00
Viktor Lofgren
e66d0b7431 (warc) Minor code clean-up.
Remove redundant String$getBytes().  This is mainly an improvement in code consistency.
2024-02-10 18:30:33 +01:00
Viktor Lofgren
ba26f6ce84 (doc) Documentation corrections 2024-02-10 14:16:01 +01:00
Viktor Lofgren
929caed0b9 (warc) Improve WARC standard adherence
The WARC specification says the records should transparently remove compression.  This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.
2024-02-09 20:07:01 +01:00
Viktor Lofgren
8340aa2b6c (warc) Improve WARC standard adherence
The WARC specification says the records should transparently remove compression.  This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.
2024-02-09 17:29:21 +01:00
Viktor Lofgren
1188fe3bf0 (conf) Improve naming consistency
Rename the property system.conserve-memory to system.conserveMemory in order to be consistent with other properties in the system.
2024-02-09 14:43:08 +01:00
Viktor Lofgren
b15f47d80e (db) Retire the EC_DOMAIN_LINK table
Retire the EC_DOMAIN_LINK table as the data has been migrated off into a file instead.
2024-02-08 15:52:30 +01:00
Viktor Lofgren
ef261cbbd7 (search) Remove stray spaces in bang commands 2024-02-08 14:46:18 +01:00
Conor Flynn
9d7df87886
(search) Fix broken !ddg handling
https://duckduckgo.com/search?q=asdf leads to running a search for the term "search" instead of "asdf".

Both https://duckduckgo.com/<query> and https://duckduckgo.com/?q=<query> are accepted, but using GET vars seemed more in-keeping with the code.
2024-02-08 13:28:02 +01:00
Viktor Lofgren
a4b2323ca3 (search) Change default search profile to No Filter
Recent changes to the result ranking mean the no filter mode returns sufficiently good results for most queries that filtering by default just makes the search results more restricted.
2024-02-08 13:04:05 +01:00
Viktor
e8de468b0b
Make executor API talk GRPC (#75)
* (executor-api) Make executor API talk GRPC

The executor's REST API was very fragile and annoying to work with, lacking even basic type safety.  Migrate to use GRPC instead.  GRPC is a bit of a pain with how verbose it is, but that is probably a lesser evil.  This is a fairly straightforward change, but it's also large so a solid round of testing is needed...

The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients.

ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name().

The boilerplate needed for grpc was also extracted into a common gradle file for inclusion into the appropriate build.gradle-files.
2024-02-08 13:01:12 +01:00
Viktor Lofgren
d83a3bf4e2 (search) Fix broken !w handling
Printf format error derp.
2024-02-08 12:11:33 +01:00
Viktor Lofgren
f2b39ad055 (search) Fix broken !bang handling
!bang query handling seems to have fallen victim to an overzealous refactoring effort, and broken.

It's now repaired, and a test is in place to ensure we know if it breaks again.
2024-02-08 12:05:09 +01:00
Viktor Lofgren
95d1bd98e4 (array) Update documentation, make unsafe configurable
The readme for the array library was extremely out of date.  Updating it with accurate information about how the library works, and a demo that should compile.

Also added a system property for disabling the use of sun.misc.Unsafe.
2024-02-07 12:26:47 +01:00
Viktor Lofgren
8acbc6a6b4 (index-construction) Split repartition into two actions cont'd
Continues 467ba5be20 by breaking out a constant with the name of the primary ranking set.  Also ensures it doesn't get spuriously logged as updated during the secondary updating pass.
2024-02-06 19:54:17 +01:00
Viktor Lofgren
467ba5be20 (index-construction) Split repartition into two actions
This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets.  Since only the first is necessary before the index construction, the rest can be delayed until after...

To avoid issues in handling the shotgun blast of MqNotifications, Service was switched over to use a synchronous message queue instead of an asynchronous one.

The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader.

Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data.

Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead.

To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.
2024-02-06 17:20:07 +01:00
Viktor Lofgren
29ddf9e61d (doc) Update docs 2024-02-06 16:29:55 +01:00
Viktor Lofgren
92e119cab3 (doc) Update docs 2024-02-06 12:43:42 +01:00
Viktor Lofgren
92049ba8e4 (doc) Update docs 2024-02-06 12:41:28 +01:00
Viktor Lofgren
54330b9921 (*) Remove dead code 2024-02-06 12:41:13 +01:00
Viktor Lofgren
d1aeb030f2 (doc) Update RandomWriteFunnel documentation 2024-02-06 12:35:24 +01:00
Viktor Lofgren
f89274d1ea (minor) Fix broken test
Fallout from changes in endianness made in d986f90074
2024-02-06 12:12:26 +01:00
Viktor Lofgren
7286596fb4 (deps) Remove monkey patched GSON
The codebase used to have a monkey patched version of gson that made special optimizations for the unusually large JSON files that used to store e.g. crawl data.

Since JSON is no longer used in this fashion, the GSON fork is not needed anymore.
2024-02-06 12:11:39 +01:00
Viktor Lofgren
a2fc83d94e (control) Add configurable border styling
To help distinguish between environments, a system property 'control.appBorder' is added that is injected as a body element border property in the control GUI stylesheets.
2024-02-06 12:05:02 +01:00
Viktor Lofgren
2161799cc3 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:18:00 +01:00
Viktor Lofgren
c88f132057 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:10:03 +01:00
Viktor Lofgren
c6313a5906 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:06:36 +01:00
Viktor Lofgren
eadcdb5bed (minor) Improve error handling, naming logging in IndexResultDecorator 2024-02-05 21:05:44 +01:00
Viktor Lofgren
6e7649b5f7 (loader) Mitigate fragile paging behavior
IndexJournalWriterPagingImpl was modified to not page on number of entries written, but number of (equivalent uncompressed) bytes written.

Since the failure mode if too much data is written per file is quiet corruption of the index, the former behavior was extremely fragile.  The new behavior should consistently ensure that the data is sufficiently small to not cause any integer rollovers.

The change in 6dcc20038c was reverted, as there is really no sane reason to have this configurable in software.
2024-02-05 21:05:03 +01:00
Viktor Lofgren
d986f90074 (index) Fix consistency between RandomFileAssembler implementations
The RandomFileAssembler implementations, introduced in commit 53c575db3f were all acting subtly differently.  The RWF implementation wrote BigEndian longs instead of the native endianness used by the other implementations (and expected by the index construction code), further the mmap implementation exposed a bug in LongArray.write() that caused it to create a larger file than necessary.

A test was built to ensure the output of these implementations is equivalent.
2024-02-05 21:01:32 +01:00
Viktor Lofgren
53c575db3f (index-construction) Make random-write file strategy configurable
To cope with writing large files out of order, the system needs some form of strategy to avoid writing them directly to disk, as this causes insane amounts of disk thrashing.  By default, the data is just buffered in RAM.  This works well on a large server, but smaller systems struggle.

To help systems with small RAM process large amounts of data, the old RandomWriteFunnel is brought back if the system property 'system.conserve-memory' is set to true.  RandomWriteFunnel is buffering the construction by creating a series of small files that pigeonhole the writes into rough neighborhoods, and then it goes over the files one by one to construct one area of the file at a time.  This is relatively slow and uses more than twice the disk size.

A new interface RandomFileAssembler is introduced as an abstraction for this operation.  A third strategy, direct mmaps, is also introduced if the file is very small (less than 1 GB).  In this domain, disk thrashing is unlikely since it will comfortably fit in RAM.
2024-02-05 12:31:15 +01:00
Viktor Lofgren
6dcc20038c (index-journal) Make index journal page size configurable
Adds a new system property loader.journal-page-size to configure this setting.
2024-02-05 11:26:05 +01:00
Viktor Lofgren
fa145f632b (sideload) Add special handling for sideloaded wiki documents
This update enhances the SideloaderProcessing and DocumentClass modules to specially handle sideloaded wiki documents. Wiki content is generally truncated to the first paragraph, which generally tends to be too short to be included independently. An additional DocumentClass (SIDELOAD) has been introduced to suppress the length check in this case.
2024-02-02 21:22:07 +01:00
Viktor Lofgren
785d8deadd (crawler) Improve meta-tag redirect handling, add tests for redirects.
Wrote a new test to examine the redirect behavior of the crawler, ensuring that the redirect URL is the URL that is reported in the parquet file.  This works as intended.

Noticed in the course of this that the crawler doesn't add links from meta-tag redirects to the crawl frontier.  Added logic to handle this case, amended the test case to verify the new behavior.  Added the meta-redirect case to the HtmlDocumentProcessorPlugin as well, so that we consider it a link between documents in the unlikely case that a meta redirect is to another domain.
2024-02-01 20:30:43 +01:00
Viktor Lofgren
93a2d5afbf (*) Fix poorly named test
Likely old refactoring gore.
2024-02-01 20:08:15 +01:00
Viktor Lofgren
d60c6b18d4 (doc) Update the readme's the crawler, as they've grown stale. 2024-02-01 18:10:55 +01:00
Viktor Lofgren
d1e02569f4 (language-processing) Add a system property for configuring which language detection model to use
The flag is `system.languageDetectionModelVersion`.

* If negative, no model is used.
* If 0, both models are used.
* If 1, the old crappy model is used.
* If 2, the new fasttext model is used.
2024-01-31 13:02:33 +01:00
Viktor Lofgren
9ce67029ca (language-processing) Add a system property for configuring which language detection model to use
The flag is `system.languageDetectionModelVersion`.

* If negative, no model is used.
* If 0, both models are used.
* If 1, the old crappy model is used.
* If 2, the new fasttext model is used.
2024-01-31 13:02:16 +01:00
Viktor Lofgren
98f3382cea (minor) Fix test and improve error message 2024-01-31 11:53:41 +01:00
Viktor Lofgren
52a0255814 (*) Add flag for disabling ASCII flattening
The production configuration assumes all content of interest is 7 bit ASCII, and makes a series of optimizations based on this.  This assumption holds poorly in the wild.

Adding an **experimental** system property 'system.noFlattenUnicode', that when set to TRUE, will disable this behavior.

IMPORTANT!! The index needs to be re-constructed when this flag is changed, as different hash functions are selected for the keyword->identifier mappings.
2024-01-31 11:50:59 +01:00
Viktor Lofgren
eb59ac8535 (index-ranking) Adjust the BM25P factors a bit
Since the bleed-flags set by the anchor tags logics have been changed to Site and SiteAdjacent, give them a bit of more importance when set together with ExternalLink.

UrlDomain and UrlPath are also only more consistently only rewarded once.
2024-01-30 21:27:29 +01:00
Viktor Lofgren
6edc318597 (control) Fix typo in URL linking to new-crawl-specs 2024-01-26 10:43:10 +01:00
Viktor Lofgren
182c0cf28e (control) Add warnings about domain data contamination 2024-01-25 18:26:15 +01:00
Viktor Lofgren
0b105b5986 (converter) Update hyperlink text for new crawl spec creation.
Fix minor typo.
2024-01-25 18:05:11 +01:00
Viktor Lofgren
cae1bad274 (*) Add download-sample action, refactor file storage
This changeset adds an action for downloading a set of sample data from downloads.marginalia.nu.

It also refactors out some leaky abstractions out of FileStorageService.  allocateTemporaryStorage has been renamed allocateStorage.  The storage was never temporary in any scenario...

It also doesn't take a storage base, as there was always only one valid option for this input.  The allocateStorage method finds the appropriate base itself.
2024-01-25 13:36:30 +01:00
Viktor Lofgren
1b8b97b8ec (sample-exporter) Add some limits on sizes and lengths
Tar files will reject entries with filenames over 100b, so we need a limit there.  Also added a maximum size limit to keep the file sizes reasonable.
2024-01-25 11:51:53 +01:00
Viktor Lofgren
c088c25b09 (*) Fix broken test, clean up code 2024-01-24 12:50:41 +01:00
Viktor Lofgren
958d64720e (control) Add a view for restarting aborted processes
This will avoid having to dig in the message queue to perform this relatively common task.

The control service was also refactored to extract common timestamp formatting logic out of the data objects and into the rendering.
2024-01-24 12:47:10 +01:00
Viktor Lofgren
805afad4fe (control) New GUI for exporting crawl data samples
Not going to win any beauty pageants, but this is pretty peripheral functionality.
2024-01-23 17:08:21 +01:00
Viktor Lofgren
400f4840ad (*) Fix broken code in jmh 2024-01-23 17:08:21 +01:00
Viktor Lofgren
ee7792596d (*) Fix broken test
Probably shouldn't have tests depending on external data like this...
2024-01-23 12:03:47 +01:00
Viktor Lofgren
0081328aca (converter) Adjust which flags are set by anchor text keywords
It's a mistake to let it bleed into Title, as this is a high quality signal.  We'll co-opt Site and SiteAdjacent instead to reinforce the ExternalLink when count is high.
2024-01-23 11:54:00 +01:00
Viktor Lofgren
3fff7f6878 (converter) Fix issue where quality limits were no longer enforced 2024-01-23 11:42:17 +01:00
Viktor Lofgren
f15dd06473 (index) Delayed close() of SearchIndexReader
This avoids concurrent access errors.  This is especially important when using Unsafe-based LongArrays, since we have concurrent access to the underlying memory-mapped file.  If pull the rug from under the caller by closing the file, we'll get a SIGSEGV.  Even with a "safe" MemorySegment, we'll get ugly stacktraces if we close the file while a thread is still accessing it.

So we spin up a thread that sleeps for a minute before actually unmapping the file, allowing any ongoing requests to wrap up.  This is 100% a hack, but it lets us get away with doing this without adding locks to the index readers.

Since this is "just" mmapped data, and this operation happens optimistically once a month, it should be safe if the call gets lost.
2024-01-23 11:08:41 +01:00
Viktor Lofgren
dd26819d66 (actor) Try to rare data race where a finished job is considered dead. 2024-01-22 21:22:38 +01:00
Viktor Lofgren
a6d257df5b (converter) Update Stackexchange sideload instruction
The sideload instruction in the stackexchange template was updated. The instruction now states that stackexchange data will be loaded from a directory on the server and directs users to a new documentation url for more detailed information.
2024-01-22 18:29:20 +01:00