CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	9ec262ae00	(domain-ranking) Integrate new ranking logic The change deprecates the 'algorithm' field from the domain ranking set configuration. Instead, the algorithm will be chosen based on whether influence domains are provided, and whether similarity data is present.	2024-02-16 20:22:01 +01:00
Viktor Lofgren	64acdb5f2a	(domain-ranking) Clean up domain ranking The domain ranking code was admittedly a bit of a clown fiesta; at the same time buggy, fragile and inscrutable. Migrating over to use JGraphT to store the link graph when doing rankings, and using their PageRank implementation. Also added a modified version that does PersonalizedPageRank.	2024-02-16 18:04:58 +01:00
Viktor Lofgren	66b3e71e56	(search) Expose more search options This change set updates the query APIs to enable the search service to add additional criteria, such as QueryStrategy and TemporalBias. The QueryStrategy makes it possible to e.g. require a match is in the title of a result, and TemporalBias enables penalizing results that are not within a particular time period. These options are added to the search interface. The old 'recent results' is modified to use TemporalBias, and a new filter 'Search In Title' is added as well. The vintage filter is modified to add a temporal bias for the past.	2024-02-15 13:39:51 +01:00
Viktor Lofgren	dcc5cfb7c0	(index-journal) Improve documentation and code quality	2024-02-15 10:51:49 +01:00
Viktor Lofgren	8021bd0aae	(control) Sort upload listing results Improve the UX of the sideload GUI by sorting the results in a sensible fashion, first by whether it's a directory, then by its filename. The change also changes the timestamp rendering to a more human-readable format than full ISO-8601.	2024-02-15 09:13:40 +01:00
Viktor Lofgren	8f91156d80	(control) Improve sideload UX The sideload forms didn't properly set the label 'for' property, meaning that while label tags existed, they weren't appropriately clickable. Also removed unnecessary limits on the sideload target being a directory for stackexchange and warc. It's been possible to directly load a particular file for a while, but not allowed due to GUI limits.	2024-02-14 18:38:20 +01:00
Viktor Lofgren	fab36d6e63	(converter) Loader for reddit data Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult. Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more. Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these cannot be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run. The change also refactors the sideloading a bit since it was a bit messy.	2024-02-14 17:35:44 +01:00
Viktor Lofgren	5a1087dbf9	(qs-gui) Update documentation, add param for domain limit	2024-02-12 16:13:48 +01:00
Viktor Lofgren	b15f47d80e	(db) Retire the EC_DOMAIN_LINK table Retire the EC_DOMAIN_LINK table as the data has been migrated off into a file instead.	2024-02-08 15:52:30 +01:00
Viktor	e8de468b0b	Make executor API talk GRPC (#75) * (executor-api) Make executor API talk GRPC The executor's REST API was very fragile and annoying to work with, lacking even basic type safety. Migrate to use GRPC instead. GRPC is a bit of a pain with how verbose it is, but that is probably a lesser evil. This is a fairly straightforward change, but it's also large so a solid round of testing is needed... The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients. ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name(). The boilerplate needed for grpc was also extracted into a common gradle file for inclusion into the appropriate build.gradle-files.	2024-02-08 13:01:12 +01:00
Viktor Lofgren	8acbc6a6b4	(index-construction) Split repartition into two actions cont'd Continues `467ba5be20` by breaking out a constant with the name of the primary ranking set. Also ensures it doesn't get spuriously logged as updated during the secondary updating pass.	2024-02-06 19:54:17 +01:00
Viktor Lofgren	467ba5be20	(index-construction) Split repartition into two actions This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets. Since only the first is necessary before the index construction, the rest can be delayed until after... To avoid issues in handling the shotgun blast of MqNotifications, Service was switched over to use a synchronous message queue instead of an asynchronous one. The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader. Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data. Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead. To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.	2024-02-06 17:20:07 +01:00
Viktor Lofgren	92e119cab3	(doc) Update docs	2024-02-06 12:43:42 +01:00
Viktor Lofgren	92049ba8e4	(doc) Update docs	2024-02-06 12:41:28 +01:00
Viktor Lofgren	a2fc83d94e	(control) Add configurable border styling To help distinguish between environments, a system property 'control.appBorder' is added that is injected as a body element border property in the control GUI stylesheets.	2024-02-06 12:05:02 +01:00
Viktor Lofgren	2161799cc3	(sideload) Fix filename error in dealing with stackoverflow files	2024-02-06 11:18:00 +01:00
Viktor Lofgren	c88f132057	(sideload) Fix filename error in dealing with stackoverflow files	2024-02-06 11:10:03 +01:00
Viktor Lofgren	c6313a5906	(sideload) Fix filename error in dealing with stackoverflow files	2024-02-06 11:06:36 +01:00
Viktor Lofgren	eadcdb5bed	(minor) Improve error handling, naming, logging in IndexResultDecorator	2024-02-05 21:05:44 +01:00
Viktor Lofgren	52a0255814	() Add flag for disabling ASCII flattening The production configuration assumes all content of interest is 7 bit ASCII, and makes a series of optimizations based on this. This assumption holds poorly in the wild. Adding an experimental* system property 'system.noFlattenUnicode', that when set to TRUE, will disable this behavior. IMPORTANT!! The index needs to be re-constructed when this flag is changed, as different hash functions are selected for the keyword->identifier mappings.	2024-01-31 11:50:59 +01:00
Viktor Lofgren	6edc318597	(control) Fix typo in URL linking to new-crawl-specs	2024-01-26 10:43:10 +01:00
Viktor Lofgren	182c0cf28e	(control) Add warnings about domain data contamination	2024-01-25 18:26:15 +01:00
Viktor Lofgren	0b105b5986	(converter) Update hyperlink text for new crawl spec creation. Fix minor typo.	2024-01-25 18:05:11 +01:00
Viktor Lofgren	cae1bad274	(*) Add download-sample action, refactor file storage This changeset adds an action for downloading a set of sample data from downloads.marginalia.nu. It also refactors some leaky abstractions out of FileStorageService. allocateTemporaryStorage has been renamed allocateStorage. The storage was never temporary in any scenario... It also doesn't take a storage base, as there was always only one valid option for this input. The allocateStorage method finds the appropriate base itself.	2024-01-25 13:36:30 +01:00
Viktor Lofgren	958d64720e	(control) Add a view for restarting aborted processes This will avoid having to dig in the message queue to perform this relatively common task. The control service was also refactored to extract common timestamp formatting logic out of the data objects and into the rendering.	2024-01-24 12:47:10 +01:00
Viktor Lofgren	805afad4fe	(control) New GUI for exporting crawl data samples Not going to win any beauty pageants, but this is pretty peripheral functionality.	2024-01-23 17:08:21 +01:00
Viktor Lofgren	ee7792596d	(*) Fix broken test Probably shouldn't have tests depending on external data like this...	2024-01-23 12:03:47 +01:00
Viktor Lofgren	f15dd06473	(index) Delayed close() of SearchIndexReader This avoids concurrent access errors. This is especially important when using Unsafe-based LongArrays, since we have concurrent access to the underlying memory-mapped file. If we pull the rug from under the caller by closing the file, we'll get a SIGSEGV. Even with a "safe" MemorySegment, we'll get ugly stacktraces if we close the file while a thread is still accessing it. So we spin up a thread that sleeps for a minute before actually unmapping the file, allowing any ongoing requests to wrap up. This is 100% a hack, but it lets us get away with doing this without adding locks to the index readers. Since this is "just" mmapped data, and this operation happens optimistically once a month, it should be safe if the call gets lost.	2024-01-23 11:08:41 +01:00
Viktor Lofgren	dd26819d66	(actor) Try to fix rare data race where a finished job is considered dead.	2024-01-22 21:22:38 +01:00
Viktor Lofgren	a6d257df5b	(converter) Update Stackexchange sideload instruction The sideload instruction in the stackexchange template was updated. The instruction now states that stackexchange data will be loaded from a directory on the server and directs users to a new documentation url for more detailed information.	2024-01-22 18:29:20 +01:00
Viktor Lofgren	51cdf46645	(control) Improve accessibility in search-to-ban template This update enhances accessibility by associating labels with the corresponding checkboxes in the search-to-ban template.	2024-01-22 15:01:00 +01:00
Viktor Lofgren	40c9d2050f	(control) Fully automatic conversion Removed the need to have to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine. Removed the tool itself. This stirred up some issues with the dependencies, that were due to both third-party:ing xz and importing it as a dependency. This has been fixed, and :third-party:xz was removed.	2024-01-22 13:03:24 +01:00
Viktor Lofgren	b91ea1d7ca	(control) Re-add gui for sideloading dirtrees	2024-01-20 18:09:40 +01:00
Viktor Lofgren	91c7960800	(crawler) Extract additional configuration properties This commit extracts several previously hardcoded configuration properties, and makes them available through system.properties. The documentation is updated to reflect the change. Dead code was also removed in the process. CrawlSpecGenerator is left feeling a bit over-engineered still, since it's built for a more general case, where all other implementations but the current one are removed, but we'll leave it like this for now as it's fairly readable still.	2024-01-20 10:36:04 +01:00
Viktor Lofgren	2079a5574b	(control) Update heading in restore backup template Changed the heading in the partial restore backup page from "Load" to "Restore Backup".	2024-01-19 21:46:53 +01:00
Viktor Lofgren	27ffb8fa8a	(converter) Integrate zim->db conversion into automatic encyclopedia processing workflow Previously, in order to load encyclopedia data into the search engine, it was necessary to use the encyclopedia.marginalia.nu converter to first create a .db-file. This isn't very ergonomic, so parts of that code-base were lifted in as a 3rd party library, and conversion from .zim to .db is now done automatically. The output file name is based on the original filename, plus a crc32 hash and a .db-ending, to ensure we can recycle the data on repeat loads.	2024-01-19 13:59:03 +01:00
Viktor Lofgren	964419803a	Fix broken test	2024-01-18 15:42:01 +01:00
Viktor Lofgren	6271d5d544	(mq) Add relation tracking between MQ messages for easier tracking and debugging. The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID. This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers. The existing RELATED_ID field has too many semantics associated with it; among other things, the FSM code uses this field in tracking state changes. The change set also improves the consistency of inbox names. The IndexClient was buggy and populated its outbox with a UUID. This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.	2024-01-18 15:08:27 +01:00
Viktor Lofgren	175bd310f5	(control) Message queue UX improvements	2024-01-18 13:05:50 +01:00
Viktor Lofgren	67ee6f4126	(control) Clean up filtering UX in Events table	2024-01-18 12:35:39 +01:00
Viktor Lofgren	18638c62de	(control) Rephrase text	2024-01-18 11:53:10 +01:00
Viktor Lofgren	753d000788	(control) Add toggle for automatic loading of processed data	2024-01-18 11:52:58 +01:00
Viktor Lofgren	19e781b104	(control) Add basic input validation to node actions Will present a simple error message when required fields aren't populated, instead of a cryptic HTTP status error.	2024-01-18 11:52:49 +01:00
Viktor Lofgren	aa2df327db	(index) Prevent index from attempting to answer queries when no index data is loaded This improves query times, and gets rid of exceptions in the logs when one of the index nodes doesn't have any data loaded, yet is configured to answer queries.	2024-01-18 11:05:45 +01:00
Viktor Lofgren	41cdb8f71b	(control) Fix broken update button in the update-domain-ranking-set form id property was on the wrong element.	2024-01-17 18:21:09 +01:00
Viktor Lofgren	304d4c9acf	(control) Fix result ordering in the file storage listing view In some scenarios, such as when restoring storage items from json-manifest on db failure, the file storage view would present the items in a non-chronological order. Added a sort() operation to mitigate this.	2024-01-17 10:56:30 +01:00
Viktor Lofgren	7fd4c092e3	(control) Clean up UX and accessibility for new domain ranking sets. The change also adds basic support for error messages in the GUI.	2024-01-17 10:47:14 +01:00
Viktor Lofgren	2fe5705542	(control) GUI for ranking sets Still missing is some polish, forms don't have proper labels, validation is inconsistent, no error messages, etc.	2024-01-16 17:10:09 +01:00
Viktor Lofgren	e968365858	(index) Use new DomainRankingSets to configure ranking algos in index svc	2024-01-16 12:43:32 +01:00
Viktor Lofgren	5a62b3058f	(query-api) Make the search set identifier a string value in the API This will free the core marginalia search engine to use arbitrary search set definitions, while the app can use its hardcoded defaults.	2024-01-16 10:55:24 +01:00

352 Commits