CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	6c3b49417f	(index-query) Improve documentation and code quality	2024-02-15 11:33:50 +01:00
Viktor Lofgren	dcc5cfb7c0	(index-journal) Improve documentation and code quality	2024-02-15 10:51:49 +01:00
Viktor	d970836605	Merge pull request #79 from MarginaliaSearch/reddit (converter) Loader for reddit data Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult. Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more. Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these cannot be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run. The change also refactors the sideloading a bit since it was a bit messy, and improves the sideload UX a tiny bit.	2024-02-15 09:17:56 +01:00
Viktor Lofgren	8021bd0aae	(control) Sort upload listing results Improve the UX of the sideload GUI by sorting the results in a sensible fashion, first by whether it's a directory, then by its filename. The change also changes the timestamp rendering to a more human-readable format than full ISO-8601.	2024-02-15 09:13:40 +01:00
Viktor Lofgren	8f91156d80	(control) Improve sideload UX The sideload forms didn't properly set the label 'for' property, meaning that while label tags existed, they weren't appropriately clickable. Also removed unnecessary limits on the sideload target being a directory for stackexchange and warc. It's been possible to directly load a particular file for a while, but not allowed due to GUI limits.	2024-02-14 18:38:20 +01:00
Viktor Lofgren	fab36d6e63	(converter) Loader for reddit data Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult. Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more. Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these cannot be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run. The change also refactors the sideloading a bit since it was a bit messy.	2024-02-14 17:35:44 +01:00
Viktor Lofgren	3d54879c14	(API, minor) Clean up comments.	2024-02-14 12:09:16 +01:00
Viktor Lofgren	e17fcde865	(API, minor) Remove unnecessary inject.	2024-02-14 12:05:50 +01:00
Viktor Lofgren	6950dffcb4	(API) Fix result order in API results These results should be presented in the same order as their ranking score.	2024-02-14 11:47:14 +01:00
Viktor Lofgren	02dd5c5853	(converter) Look at properties when deciding pool size Look at whether the property 'system.conserveProperty' is enabled when deciding the default pool size for the converter. If true, a much more conservative default is used, limiting the risk of running out of memory.	2024-02-12 16:24:19 +01:00
Viktor Lofgren	5a1087dbf9	(qs-gui) Update documentation, add param for domain limit	2024-02-12 16:13:48 +01:00
Viktor Lofgren	7564dfeb7a	(minor) Correct link in documentation for app services	2024-02-12 15:55:06 +01:00
Viktor Lofgren	10bad635a8	(search) Experimental support for clustering search results Improves clustering of results.	2024-02-11 20:00:11 +01:00
Viktor Lofgren	7cc8b0fed5	(search) Experimental support for clustering search results Improves clustering of results.	2024-02-11 19:58:55 +01:00
Viktor Lofgren	a77846373b	(search) Experimental support for clustering search results Improves clustering of results.	2024-02-11 19:48:55 +01:00
Viktor Lofgren	bcd0dabb92	(search) Experimental support for clustering search results Adds experimental support for clustering search results by e.g. domain. At a first stage, this is only enabled for the wiki and forum filters. The commit also cleans up the UrlDetails class, which contained a number of vestigial entries.	2024-02-11 17:31:38 +01:00
Viktor Lofgren	9d68062553	(converter) Make processing pool size configurable	2024-02-10 20:59:08 +01:00
Viktor Lofgren	e66d0b7431	(warc) Minor code clean-up. Remove redundant String$getBytes(). This is mainly an improvement in code consistency.	2024-02-10 18:30:33 +01:00
Viktor Lofgren	ba26f6ce84	(doc) Documentation corrections	2024-02-10 14:16:01 +01:00
Viktor Lofgren	929caed0b9	(warc) Improve WARC standard adherence The WARC specification says the records should transparently remove compression. This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.	2024-02-09 20:07:01 +01:00
Viktor Lofgren	8340aa2b6c	(warc) Improve WARC standard adherence The WARC specification says the records should transparently remove compression. This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.	2024-02-09 17:29:21 +01:00
Viktor Lofgren	1188fe3bf0	(conf) Improve naming consistency Rename the property system.conserve-memory to system.conserveMemory in order to be consistent with other properties in the system.	2024-02-09 14:43:08 +01:00
Viktor Lofgren	b15f47d80e	(db) Retire the EC_DOMAIN_LINK table Retire the EC_DOMAIN_LINK table as the data has been migrated off into a file instead.	2024-02-08 15:52:30 +01:00
Viktor Lofgren	ef261cbbd7	(search) Remove stray spaces in bang commands	2024-02-08 14:46:18 +01:00
Viktor	06997ff255	Merge pull request #78 from conor-f/patch-1 (search) Fix broken !ddg handling	2024-02-08 13:45:38 +01:00
Conor Flynn	9d7df87886	(search) Fix broken !ddg handling https://duckduckgo.com/search?q=asdf leads to running a search for the term "search" instead of "asdf". Both https://duckduckgo.com/<query> and https://duckduckgo.com/?q=<query> are accepted, but using GET vars seemed more in keeping with the code.	2024-02-08 13:28:02 +01:00
Viktor Lofgren	a4b2323ca3	(search) Change default search profile to No Filter Recent changes to the result ranking mean the no filter mode returns sufficiently good results for most queries that filtering by default just makes the search results more restricted.	2024-02-08 13:04:05 +01:00
Viktor	e8de468b0b	Make executor API talk GRPC (#75) (executor-api) Make executor API talk GRPC The executor's REST API was very fragile and annoying to work with, lacking even basic type safety. Migrate to use GRPC instead. GRPC is a bit of a pain with how verbose it is, but that is probably a lesser evil. This is a fairly straightforward change, but it's also large so a solid round of testing is needed... The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients. ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name(). The boilerplate needed for grpc was also extracted into a common gradle file for inclusion into the appropriate build.gradle-files.	2024-02-08 13:01:12 +01:00
Viktor Lofgren	d83a3bf4e2	(search) Fix broken !w handling Printf format error derp.	2024-02-08 12:11:33 +01:00
Viktor Lofgren	f2b39ad055	(search) Fix broken !bang handling !bang query handling seems to have fallen victim to an overzealous refactoring effort, and broken. It's now repaired, and a test is in place to ensure we know if it breaks again.	2024-02-08 12:05:09 +01:00
Viktor Lofgren	95d1bd98e4	(array) Update documentation, make unsafe configurable The readme for the array library was extremely out of date. Updating it with accurate information about how the library works, and a demo that should compile. Also added a system property for disabling the use of sun.misc.Unsafe.	2024-02-07 12:26:47 +01:00
Viktor Lofgren	8acbc6a6b4	(index-construction) Split repartition into two actions cont'd Continues `467ba5be20` by breaking out a constant with the name of the primary ranking set. Also ensures it doesn't get spuriously logged as updated during the secondary updating pass.	2024-02-06 19:54:17 +01:00
Viktor Lofgren	467ba5be20	(index-construction) Split repartition into two actions This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets. Since only the first is necessary before the index construction, the rest can be delayed until after... To avoid issues in handling the shotgun blast of MqNotifications, Service was switched over to use a synchronous message queue instead of an asynchronous one. The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader. Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data. Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead. To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.	2024-02-06 17:20:07 +01:00
Viktor Lofgren	29ddf9e61d	(doc) Update docs	2024-02-06 16:29:55 +01:00
Viktor Lofgren	92e119cab3	(doc) Update docs	2024-02-06 12:43:42 +01:00
Viktor Lofgren	92049ba8e4	(doc) Update docs	2024-02-06 12:41:28 +01:00
Viktor Lofgren	54330b9921	(*) Remove dead code	2024-02-06 12:41:13 +01:00
Viktor Lofgren	d1aeb030f2	(doc) Update RandomWriteFunnel documentation	2024-02-06 12:35:24 +01:00
Viktor Lofgren	f89274d1ea	(minor) Fix broken test Fallout from changes in endianness made in `d986f90074`	2024-02-06 12:12:26 +01:00
Viktor Lofgren	7286596fb4	(deps) Remove monkey patched GSON The codebase used to have a monkey patched version of gson that made special optimizations for the unusually large JSON files that used to store e.g. crawl data. Since JSON is no longer used in this fashion, the GSON fork is not needed anymore.	2024-02-06 12:11:39 +01:00
Viktor Lofgren	a2fc83d94e	(control) Add configurable border styling To help distinguish between environments, a system property 'control.appBorder' is added that is injected as a body element border property in the control GUI stylesheets.	2024-02-06 12:05:02 +01:00
Viktor Lofgren	2161799cc3	(sideload) Fix filename error in dealing with stackoverflow files	2024-02-06 11:18:00 +01:00
Viktor Lofgren	c88f132057	(sideload) Fix filename error in dealing with stackoverflow files	2024-02-06 11:10:03 +01:00
Viktor Lofgren	c6313a5906	(sideload) Fix filename error in dealing with stackoverflow files	2024-02-06 11:06:36 +01:00
Viktor Lofgren	eadcdb5bed	(minor) Improve error handling, naming, and logging in IndexResultDecorator	2024-02-05 21:05:44 +01:00
Viktor Lofgren	6e7649b5f7	(loader) Mitigate fragile paging behavior IndexJournalWriterPagingImpl was modified to not page on number of entries written, but number of (equivalent uncompressed) bytes written. Since the failure mode if too much data is written per file is quiet corruption of the index, the former behavior was extremely fragile. The new behavior should consistently ensure that the data is sufficiently small to not cause any integer rollovers. The change in `6dcc20038c` was reverted, as there is really no sane reason to have this configurable in software.	2024-02-05 21:05:03 +01:00
Viktor Lofgren	d986f90074	(index) Fix consistency between RandomFileAssembler implementations The RandomFileAssembler implementations, introduced in commit `53c575db3f`, were all acting subtly differently. The RWF implementation wrote BigEndian longs instead of the native endianness used by the other implementations (and expected by the index construction code). Further, the mmap implementation exposed a bug in LongArray.write() that caused it to create a larger file than necessary. A test was built to ensure the output of these implementations is equivalent.	2024-02-05 21:01:32 +01:00
Viktor Lofgren	53c575db3f	(index-construction) Make random-write file strategy configurable To cope with writing large files out of order, the system needs some form of strategy to avoid writing them directly to disk, as this causes insane amounts of disk thrashing. By default, the data is just buffered in RAM. This works well on a large server, but smaller systems struggle. To help systems with small RAM process large amounts of data, the old RandomWriteFunnel is brought back if the system property 'system.conserve-memory' is set to true. RandomWriteFunnel buffers the construction by creating a series of small files that pigeonhole the writes into rough neighborhoods, and then it goes over the files one by one to construct one area of the file at a time. This is relatively slow and uses more than twice the disk size. A new interface RandomFileAssembler is introduced as an abstraction for this operation. A third strategy, direct mmaps, is also introduced for when the file is very small (less than 1 GB). In this domain, disk thrashing is unlikely since it will comfortably fit in RAM.	2024-02-05 12:31:15 +01:00
Viktor Lofgren	6dcc20038c	(index-journal) Make index journal page size configurable Adds a new system property loader.journal-page-size to configure this setting.	2024-02-05 11:26:05 +01:00
Viktor Lofgren	fa145f632b	(sideload) Add special handling for sideloaded wiki documents This update enhances the SideloaderProcessing and DocumentClass modules to specially handle sideloaded wiki documents. Wiki content is generally truncated to the first paragraph, which tends to be too short to be included independently. An additional DocumentClass (SIDELOAD) has been introduced to suppress the length check in this case.	2024-02-02 21:22:07 +01:00


1732 Commits