Commit Graph

1754 Commits

Author SHA1 Message Date
Viktor
d05c916491
Merge pull request #80 from MarginaliaSearch/ranking-algorithms
Clean up domain ranking code
2024-02-18 09:52:34 +01:00
Viktor Lofgren
c73e43f5c9 (recrawl) Mitigate recrawl-before-load footgun
In the scenario where an operator

* Performs a new crawl from spec
* Doesn't load the data into the index
* Recrawls the data

The recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file, irrecoverably losing the crawl log and making the data impossible to load!

To mitigate the impact of similar problems, the change saves a backup of the old crawl log and logs a complaint when this happens.

More specifically to this exact scenario however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl.  This should help the DbCrawlSpecProvider to find them regardless of loaded state.

This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so the extra safeguard is merited.
2024-02-18 09:23:20 +01:00
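As a rough illustration of the backup step described in c73e43f5c9, the sketch below copies an existing crawl log aside before it gets overwritten. The class name, the method name and the `.bak` suffix are assumptions made for this example and are not taken from the actual change.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; the names and the ".bak" suffix are illustrative only.
final class CrawlLogBackup {
    private CrawlLogBackup() {}

    /**
     * Copies an existing, non-empty log to a sibling ".bak" file before the
     * caller overwrites it. Returns the backup path, or null if there was
     * nothing worth backing up.
     */
    static Path backupBeforeOverwrite(Path log) throws IOException {
        if (!Files.isRegularFile(log) || Files.size(log) == 0) {
            return null;
        }
        Path backup = log.resolveSibling(log.getFileName() + ".bak");
        Files.copy(log, backup, StandardCopyOption.REPLACE_EXISTING);
        return backup;
    }
}
```

Skipping empty files is a deliberate choice in this sketch: it keeps a second, accidental overwrite from replacing a good backup with an empty one.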
Viktor Lofgren
e61e7f44b9 (blacklist) Delay startup of blacklist
To help services start faster, the blacklist will no longer block until it's loaded.  If blocking is desired, a method was added to explicitly wait for the data.
2024-02-18 09:23:20 +01:00
Viktor Lofgren
f9b6ac03c6 (api) Clean up incorrect error handling in GrpcChannelPool 2024-02-18 08:45:35 +01:00
Viktor Lofgren
296ccc5f8e (blacklist) Clean up blacklist impl
The domain blacklist blocked the start-up of each process that injected it, adding roughly 30 seconds to the start-up time in prod.

This change moves the loading to a separate thread entirely.  For threads or processes that require the blacklist to be definitely loaded, a helper method was added that blocks until that time.
2024-02-18 08:16:48 +01:00
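The pattern described in 296ccc5f8e (load in the background, block only on request) could look something like the sketch below. `LazyBlacklist`, `isBlacklisted` and `waitUntilLoaded` are hypothetical names, and treating every domain as allowed until loading finishes is an assumption of this example, not a statement about the real class.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical sketch; the real blacklist class and its method names may differ.
final class LazyBlacklist {
    private final CompletableFuture<Set<String>> loaded;

    LazyBlacklist(Supplier<Set<String>> loader) {
        // Start loading on a background thread so the constructor returns immediately.
        this.loaded = CompletableFuture.supplyAsync(loader);
    }

    /** Non-blocking check: nothing counts as blacklisted until loading has finished. */
    boolean isBlacklisted(String domain) {
        return loaded.getNow(Set.of()).contains(domain);
    }

    /** Blocks until the data is loaded, for callers that need a definite answer. */
    void waitUntilLoaded() {
        loaded.join();
    }
}
```

A caller that must not let blacklisted domains through would call `waitUntilLoaded()` once before its first lookup; everything else can start right away.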
Viktor Lofgren
8cb5825617 (search) Temporarily disable the Popular filter
This filter currently does not distinguish itself very much from the unfiltered results, and gives the impression that the filters don't "do anything".

It may come back in some shape or form in the future, with some additional tweaking of the rankings...
2024-02-18 08:02:01 +01:00
Viktor Lofgren
cee707abd8 (crawler) Implement domain shuffling in DbCrawlSpecProvider
Modified the DbCrawlSpecProvider to shuffle domains after loading to ensure a good mix for each crawl. This avoids overloading a single server by crawling several of its subdomains in parallel, and avoids crawling all the big domains at once.
2024-02-17 17:47:38 +01:00
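The shuffling in cee707abd8 amounts to randomizing the domain list once it is loaded. A minimal sketch, assuming the domains are plain strings (`DomainShuffle` and `shuffled` are made-up names, not the provider's real API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper; not the actual DbCrawlSpecProvider code.
final class DomainShuffle {
    private DomainShuffle() {}

    /** Returns a shuffled copy, so neighbouring subdomains in the input get spread out. */
    static List<String> shuffled(List<String> domains, Random random) {
        List<String> copy = new ArrayList<>(domains);
        Collections.shuffle(copy, random);
        return copy;
    }
}
```

Passing the `Random` in keeps the order reproducible in tests while still allowing a fresh seed per crawl.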
Viktor Lofgren
92717a4832 (client) Refactor GrpcStubPool to handle error states
Refactored the GRPC Stub Pool for better handling of channel SHUTDOWN state. Any disconnected channels are now re-created before returning the stub.

The class was also renamed to GrpcChannelPool, as we no longer pool the stubs.
2024-02-17 14:42:26 +01:00
Viktor Lofgren
37a7296759 (sideload) Clean up the sideloading code
Clean up the sideloading code a bit, making the Reddit sideloader use the more sophisticated SideloaderProcessing approach to sideloading, instead of mimicking StackexchangeSideloader's cruder approach.

The reddit sideloader now uses the SideloaderProcessing class.  It also properly sets js-attributes for the sideloaded documents.

The control GUI now also filters the upload directory items based on name, and disables the items that do not have appropriate filenames.
2024-02-17 14:32:36 +01:00
Viktor Lofgren
ebbe49d17b (sideload) Fix sideloading of explicitly selected stackexchange files
Fix a bug where sideloading stackexchange files by explicitly selecting the 7z file would fail, since the 7z file would be passed along to the converter rather than the path to the pre-converted .db file.
2024-02-17 13:24:04 +01:00
Viktor Lofgren
b7e330855f (control) Update descriptive text in the control GUI 2024-02-16 20:32:31 +01:00
Viktor Lofgren
ac89224fb0 (domain-ranking) Remove lingering mentions of the algorithms field from the GUI 2024-02-16 20:28:37 +01:00
Viktor Lofgren
9ec262ae00 (domain-ranking) Integrate new ranking logic
The change deprecates the 'algorithm' field from the domain ranking set configuration.  Instead, the algorithm will be chosen based on whether influence domains are provided, and whether similarity data is present.
2024-02-16 20:22:01 +01:00
Viktor Lofgren
64acdb5f2a (domain-ranking) Clean up domain ranking
The domain ranking code was admittedly a bit of a clown fiesta; at the same time buggy, fragile and inscrutable.

This change migrates to JGraphT for storing the link graph when doing rankings, and uses its PageRank implementation.  Also added a modified version that does PersonalizedPageRank.
2024-02-16 18:04:58 +01:00
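For readers unfamiliar with the difference between the two algorithms mentioned in 64acdb5f2a, here is a plain-Java power-iteration sketch. It is not the JGraphT implementation and not the project's code; the adjacency-array representation, the class name and the parameters are all assumptions chosen to keep the example dependency-free. The only difference between the two variants is where the random surfer teleports: to any node (PageRank) or only to a chosen influence set (personalized PageRank).

```java
import java.util.Arrays;
import java.util.Set;

// Plain-Java illustration only; the commit itself uses JGraphT, not this code.
final class PersonalizedPageRankSketch {
    private PersonalizedPageRankSketch() {}

    /**
     * @param outLinks  outLinks[i] lists the nodes that node i links to
     * @param influence nodes the surfer teleports to; empty means all nodes (plain PageRank)
     */
    static double[] rank(int[][] outLinks, Set<Integer> influence, double damping, int iterations) {
        int n = outLinks.length;
        double[] teleport = new double[n];
        if (influence.isEmpty()) {
            Arrays.fill(teleport, 1.0 / n);
        } else {
            for (int node : influence) teleport[node] = 1.0 / influence.size();
        }

        double[] rank = teleport.clone();
        for (int iter = 0; iter < iterations; iter++) {
            double[] next = new double[n];
            double dangling = 0.0;
            for (int i = 0; i < n; i++) {
                if (outLinks[i].length == 0) {
                    dangling += rank[i]; // no out-links: mass is redistributed via teleport
                } else {
                    double share = rank[i] / outLinks[i].length;
                    for (int target : outLinks[i]) next[target] += share;
                }
            }
            for (int i = 0; i < n; i++) {
                next[i] = damping * (next[i] + dangling * teleport[i]) + (1 - damping) * teleport[i];
            }
            rank = next;
        }
        return rank;
    }
}
```

With an empty influence set every node receives the same teleport share; with a non-empty set, rank flows outward from the chosen nodes only.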
Viktor Lofgren
a175b36382 (search) Correct accidental regression of the SmallWeb filter 2024-02-15 18:16:56 +01:00
Viktor Lofgren
16526d283c (search) Correct accidental regression of the Vintage filter 2024-02-15 18:13:34 +01:00
Viktor Lofgren
752e677555 (search) Expose getSearchTitle in DecoratedSearchResults 2024-02-15 13:56:44 +01:00
Viktor Lofgren
f796af1ae8 (search) Fix failed refactoring 2024-02-15 13:53:19 +01:00
Viktor Lofgren
2515993536 (search) Fix issue where searchTitle setting gets lost when searching again
It's important that the field names in SearchParameters match the fields referenced in search-form.hdb, otherwise they will get lost in transit.
2024-02-15 13:52:11 +01:00
Viktor Lofgren
66b3e71e56 (search) Expose more search options
This change set updates the query APIs to enable the search service to add additional criteria, such as QueryStrategy and TemporalBias.

The QueryStrategy makes it possible to e.g. require that a match be in the title of a result, and TemporalBias enables penalizing results that are not within a particular time period.

These options are added to the search interface.  The old 'recent results' is modified to use TemporalBias, and a new filter 'Search In Title' is added as well.

The vintage filter is modified to add a temporal bias for the past.
2024-02-15 13:39:51 +01:00
Viktor Lofgren
652d151373 (process-models) Improve documentation 2024-02-15 12:21:12 +01:00
Viktor Lofgren
300b1a1b84 (index-query) Add some tests for the QueryFilter code 2024-02-15 12:03:30 +01:00
Viktor Lofgren
6c3b49417f (index-query) Improve documentation and code quality 2024-02-15 11:33:50 +01:00
Viktor Lofgren
dcc5cfb7c0 (index-journal) Improve documentation and code quality 2024-02-15 10:51:49 +01:00
Viktor
d970836605
Merge pull request #79 from MarginaliaSearch/reddit
(converter) Loader for reddit data

Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.

Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more.

Tests were written for this, but all require local reddit data which can't be distributed with the source code. If the data cannot be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run.

The change also refactors the sideloading a bit since it was a bit messy, and improves the sideload UX a tiny bit.
2024-02-15 09:17:56 +01:00
Viktor Lofgren
8021bd0aae (control) Sort upload listing results
Improve the UX of the sideload GUI by sorting the results in a sensible fashion, first by whether it's a directory, then by its filename.

The change also renders timestamps in a more human-readable format than full ISO-8601.
2024-02-15 09:13:40 +01:00
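The two-level ordering described in 8021bd0aae maps naturally onto a chained comparator. In the sketch below the `Item` record is a hypothetical stand-in for a listing entry, and putting directories before files is an assumption; the commit message only says the directory flag is the first sort key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class UploadListingSort {
    // Hypothetical stand-in for an entry in the upload directory listing.
    record Item(String name, boolean isDirectory) {}

    // Directories first (false sorts before true, hence the negation), then by file name.
    static final Comparator<Item> ORDER =
            Comparator.comparing((Item item) -> !item.isDirectory())
                      .thenComparing(Item::name);

    static List<Item> sorted(List<Item> items) {
        List<Item> copy = new ArrayList<>(items);
        copy.sort(ORDER);
        return copy;
    }
}
```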
Viktor Lofgren
8f91156d80 (control) Improve sideload UX
The sideload forms didn't properly set the label 'for' property, meaning that while label tags existed, they weren't appropriately clickable.

Also removed unnecessary limits on the sideload target being a directory for stackexchange and warc.  It's been possible to directly load a particular file for a while, but not allowed due to GUI limits.
2024-02-14 18:38:20 +01:00
Viktor Lofgren
fab36d6e63 (converter) Loader for reddit data
Adds experimental sideloading support for pushshift.io style reddit data.  This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.

Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes.  Empirically this appears to mostly return good matches, even if it probably could index more.

Tests were written for this, but all require local reddit data which can't be distributed with the source code.  If the data cannot be found, the tests will short-circuit as OK.  They're mostly there for debugging, and it's fine if they don't always run.

The change also refactors the sideloading a bit since it was a bit messy.
2024-02-14 17:35:44 +01:00
Viktor Lofgren
3d54879c14 (API, minor) Clean up comments. 2024-02-14 12:09:16 +01:00
Viktor Lofgren
e17fcde865 (API, minor) Remove unnecessary inject. 2024-02-14 12:05:50 +01:00
Viktor Lofgren
6950dffcb4 (API) Fix result order in API results
These results should be presented in the same order as their ranking score.
2024-02-14 11:47:14 +01:00
Viktor Lofgren
02dd5c5853 (converter) Look at properties when deciding pool size
Look at whether the property 'system.conserveProperty' is enabled when deciding the default pool size for the converter.

If true, a much more conservative default is used, limiting the risk of running out of memory.
2024-02-12 16:24:19 +01:00
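A sketch of the decision described in 02dd5c5853 follows. The property name is passed in as a parameter rather than hard-coded, and both default sizes (one worker per core, or a quarter of the cores when conserving memory) are assumptions for illustration; the commit message does not state the actual numbers.

```java
// Hypothetical helper; the real defaults are not given in the commit message.
final class ConverterPoolSize {
    private ConverterPoolSize() {}

    /**
     * @param conservePropertyName name of the boolean system property to consult
     * @param availableCores       number of cores the converter may use
     */
    static int defaultPoolSize(String conservePropertyName, int availableCores) {
        // Boolean.getBoolean returns true only if the system property is set to "true".
        boolean conserve = Boolean.getBoolean(conservePropertyName);
        if (conserve) {
            // Assumed conservative default: a quarter of the cores, at least one.
            return Math.max(1, availableCores / 4);
        }
        // Assumed regular default: one worker per core.
        return Math.max(1, availableCores);
    }
}
```

In a real process, `availableCores` would typically come from `Runtime.getRuntime().availableProcessors()`.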
Viktor Lofgren
5a1087dbf9 (qs-gui) Update documentation, add param for domain limit 2024-02-12 16:13:48 +01:00
Viktor Lofgren
7564dfeb7a (minor) Correct link in documentation for app services 2024-02-12 15:55:06 +01:00
Viktor Lofgren
10bad635a8 (search) Experimental support for clustering search results
Improves clustering of results.
2024-02-11 20:00:11 +01:00
Viktor Lofgren
7cc8b0fed5 (search) Experimental support for clustering search results
Improves clustering of results.
2024-02-11 19:58:55 +01:00
Viktor Lofgren
a77846373b (search) Experimental support for clustering search results
Improves clustering of results.
2024-02-11 19:48:55 +01:00
Viktor Lofgren
bcd0dabb92 (search) Experimental support for clustering search results
Adds experimental support for clustering search results by e.g. domain.  At a first stage, this is only enabled for the wiki and forum filters.

The commit also cleans up the UrlDetails class, which contained a number of vestigial entries.
2024-02-11 17:31:38 +01:00
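Clustering by domain, as introduced in bcd0dabb92, can be pictured as grouping an already ranked result list while preserving rank order. The `Result` record and the method below are hypothetical and much simpler than the real result classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ResultClustering {
    // Hypothetical, minimal search result.
    record Result(String domain, String url) {}

    /** Groups results by domain, keeping clusters in order of each domain's first appearance. */
    static Map<String, List<Result>> clusterByDomain(List<Result> rankedResults) {
        Map<String, List<Result>> clusters = new LinkedHashMap<>();
        for (Result r : rankedResults) {
            clusters.computeIfAbsent(r.domain(), d -> new ArrayList<>()).add(r);
        }
        return clusters;
    }
}
```

Using a `LinkedHashMap` means the best-ranked hit of each domain decides where its cluster appears, so clustering does not reorder domains relative to each other.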
Viktor Lofgren
9d68062553 (converter) Make processing pool size configurable 2024-02-10 20:59:08 +01:00
Viktor Lofgren
e66d0b7431 (warc) Minor code clean-up.
Remove redundant String$getBytes().  This is mainly an improvement in code consistency.
2024-02-10 18:30:33 +01:00
Viktor Lofgren
ba26f6ce84 (doc) Documentation corrections 2024-02-10 14:16:01 +01:00
Viktor Lofgren
929caed0b9 (warc) Improve WARC standard adherence
The WARC specification says the records should transparently remove compression.  This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.
2024-02-09 20:07:01 +01:00
Viktor Lofgren
8340aa2b6c (warc) Improve WARC standard adherence
The WARC specification says the records should transparently remove compression.  This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.
2024-02-09 17:29:21 +01:00
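One reading of the "gzip-Matryoshka" problem in 8340aa2b6c is that gzip-encoded response bodies were stored as-is inside an already compressed WARC file. The sketch below shows the general idea of unwrapping such a body before it is written. It sniffs the gzip magic bytes for simplicity, which is an assumption of this example; real code would more likely consult the response's Content-Encoding header, and `GzipUnwrap` is a made-up name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

// Hypothetical helper; not the project's actual WARC-writing code.
final class GzipUnwrap {
    private GzipUnwrap() {}

    /** Returns the decompressed body if it starts with the gzip magic bytes, else the body unchanged. */
    static byte[] decodeIfGzip(byte[] body) throws IOException {
        boolean gzip = body.length >= 2
                && (body[0] & 0xFF) == 0x1F
                && (body[1] & 0xFF) == 0x8B;
        if (!gzip) {
            return body;
        }
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            return in.readAllBytes();
        }
    }
}
```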
Viktor Lofgren
1188fe3bf0 (conf) Improve naming consistency
Rename the property system.conserve-memory to system.conserveMemory in order to be consistent with other properties in the system.
2024-02-09 14:43:08 +01:00
Viktor Lofgren
b15f47d80e (db) Retire the EC_DOMAIN_LINK table
Retire the EC_DOMAIN_LINK table as the data has been migrated off into a file instead.
2024-02-08 15:52:30 +01:00
Viktor Lofgren
ef261cbbd7 (search) Remove stray spaces in bang commands 2024-02-08 14:46:18 +01:00
Viktor
06997ff255
Merge pull request #78 from conor-f/patch-1
(search) Fix broken !ddg handling
2024-02-08 13:45:38 +01:00
Conor Flynn
9d7df87886
(search) Fix broken !ddg handling
https://duckduckgo.com/search?q=asdf leads to running a search for the term "search" instead of "asdf".

Both https://duckduckgo.com/<query> and https://duckduckgo.com/?q=<query> are accepted, but using GET vars seemed more in keeping with the code.
2024-02-08 13:28:02 +01:00
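The GET-variable form chosen in 9d7df87886 can be built as in the sketch below. `DdgBang` and `redirectUrl` are hypothetical names; only the URL shape comes from the commit message.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; only the URL format is taken from the commit message.
final class DdgBang {
    private DdgBang() {}

    /** Builds the redirect target for a !ddg query, URL-encoding the search terms. */
    static String redirectUrl(String query) {
        return "https://duckduckgo.com/?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }
}
```

Encoding the query matters here: without it, a query containing `&` or `#` would be cut short in the same way the old path-based URL lost the actual search term.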
Viktor Lofgren
a4b2323ca3 (search) Change default search profile to No Filter
Recent changes to the result ranking mean the no filter mode returns sufficiently good results for most queries that filtering by default just makes the search results more restricted.
2024-02-08 13:04:05 +01:00
Viktor
e8de468b0b
Make executor API talk GRPC (#75)
* (executor-api) Make executor API talk GRPC

The executor's REST API was very fragile and annoying to work with, lacking even basic type safety.  Migrate to use GRPC instead.  GRPC is a bit of a pain with how verbose it is, but that is probably a lesser evil.  This is a fairly straightforward change, but it's also large so a solid round of testing is needed...

The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients.

ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name().

The boilerplate needed for grpc was also extracted into a common gradle file for inclusion into the appropriate build.gradle-files.
2024-02-08 13:01:12 +01:00