CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	0307c55f9f	(refac) Zookeeper for service-discovery, kill service-client lib (WIP) To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added. A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything. The last remaining REST service, the assistant-service, has been migrated to gRPC. This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels. Since it's no longer used by anything, RxJava has been removed as a dependency from the project. Although the current state seems reasonably stable, this is a work-in-progress commit.	2024-02-20 11:41:14 +01:00
Viktor Lofgren	edc1acbb7e	(*) Replace EC_DOMAIN_LINK table with files and in-memory caching The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying this table are very slow in relation to how small the data is (~10 GB). This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need. This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32-bit integer pairs corresponding to links between two domains. This file is loaded in memory in each node, and can be queried via the Query Service. A migration step is needed before this file is created in each node. Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file. The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.	2024-01-08 15:53:13 +01:00
Viktor Lofgren	5d1b7da728	Updated site info feed and search service Modified site info feed template to secure the description field against injected code. Also adjusted search service by extracting samples within the correct scope and including them in the returned site info. This improves the quality and security of the displayed information.	2023-12-26 22:06:01 +01:00
Viktor Lofgren	1694e9c78c	(search) Add RSS Feeds to site info This change integrates the Feedlot RSS Bot with Marginalia's site info view to offer a preview of the latest updates. The change introduces a new tiny feature that is a feedlot-client based on Java's HttpClient.	2023-12-26 16:21:40 +01:00
Viktor Lofgren	bf44805e69	(*) Rename EdgeDomain$domain into topDomain This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time. Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.	2023-12-17 14:00:07 +01:00
Viktor Lofgren	8a1934008c	(search) Merge similar sites results with the info view. WIP: This commit needs to be cleaned up.	2023-12-04 22:10:24 +01:00
Viktor Lofgren	97d43a6fa2	(search) Revamp browse results with new look.	2023-12-02 17:06:40 +01:00
Viktor Lofgren	3889c4bdd9	(refactor) Remove features-search and update documentation	2023-10-09 15:12:30 +02:00
Viktor Lofgren	97e17282ab	(query-service) Move query parsing from search-service to the new query service.	2023-10-09 13:27:44 +02:00
Viktor Lofgren	c51159672e	(build) Move unit test configuration to root build.gradle	2023-10-04 12:46:22 +02:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	3101b74580	(index) Move to a lexicon-free index design This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it also added a fairly significant RAM penalty to both the index service and the loader. The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices. It also became necessary half-way through to upgrade guice as its error reporting wasn't quite compatible with JDK20.	2023-08-28 14:02:23 +02:00
Viktor Lofgren	1e6800565a	(system) Remove EdgeId<T> and similar objects They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.	2023-08-24 17:46:02 +02:00
Viktor Lofgren	ebc84c22fb	Upgrade antique lombok plugin This permits tests to run on JDK20 environments.	2023-08-23 14:34:32 +00:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00
Viktor Lofgren	6f222b9800	(search) Add refresh link to explore mode. This is a QOL improvement for mobile users, who otherwise would have to scroll all the way up to refresh. Also removed the confusing "this is a random set of domains"-message when viewing adjacent websites, as it's not random.	2023-08-22 12:43:44 +02:00
Viktor Lofgren	704de50a9b	(forward-index, valuator) HTML features in valuator Put it in the forward index for easy access during index-side valuation.	2023-08-18 11:54:56 +02:00
Viktor Lofgren	46d761f34f	(language) fasttext based language filter	2023-08-16 15:48:12 +02:00
Viktor Lofgren	4598c7f40f	(valuation) Penalize wordpress style kebab case urls	2023-08-16 13:11:24 +02:00
Viktor Lofgren	ae9537b68e	(search) Fix a bug where space-like characters weren't normalized in query processing.	2023-07-07 20:02:05 +02:00
Viktor Lofgren	f12c6fd57e	Add a ranking parameter for biasing toward recent or old content.	2023-04-20 16:00:59 +02:00
Viktor Lofgren	4d298cd5fa	Improving screenshots capture bot.	2023-04-17 18:04:22 +02:00
Viktor Lofgren	2ab26f37b8	Bug fix for document metadata encoding that breaks year based queries.	2023-04-14 16:56:49 +02:00
Viktor Lofgren	3e9b37c264	Refactor website screenshot tool and website adjacencies calculator into code/tools.	2023-04-11 16:20:27 +02:00
Viktor	a278fc6296	Increase search result relevance (#8) * Increase accuracy of the position bits. * Increase their width to 56. * Use a rolling position scheme for bits 16-56 to increase the average accuracy. * Result ranking overhaul * Optimized queries * BM25 in the index service's ranking * Make gui less jank * Javadocs for ranking parameters.	2023-04-07 20:18:08 +02:00
Viktor Lofgren	cc4e089a5d	Consider average sentence length when selecting search results. This promotes prose over code listings, tabular data, etc.	2023-03-30 15:46:15 +02:00
Viktor	ac1ac3ea57	Move database to a separate module * Move database to a separate project, break apart sql file into separate entities. * Fix front page news listing.	2023-03-25 15:26:17 +01:00
Viktor Lofgren	3464ca514b	Fix typeahead suggestions	2023-03-25 10:20:52 +01:00
Viktor	ad2e939018	Update readme.md	2023-03-21 17:30:44 +01:00
Viktor Lofgren	1bb1248ab0	Optimize array library, jmh benchmarks.	2023-03-21 16:02:31 +01:00
Viktor Lofgren	2eb972dea1	Remove unrelated code, break tools into their own directory.	2023-03-17 16:03:11 +01:00
Viktor Lofgren	449471a076	Yet more restructuring. Improved search result ranking.	2023-03-16 21:35:54 +01:00
Viktor Lofgren	5ef17a2a20	Yet more restructuring.	2023-03-13 23:43:09 +01:00
Viktor Lofgren	0ecab53635	Yet more restructuring.	2023-03-13 23:40:26 +01:00
Viktor Lofgren	d82532b7f1	More restructuring, big bug fixes in keyword extraction.	2023-03-13 17:39:53 +01:00
Viktor Lofgren	73eaa0865d	The refactoring will continue until morale improves.	2023-03-12 10:50:31 +01:00

36 Commits
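
Commit edc1acbb7e describes replacing a database table with a file of 32-bit integer pairs that is loaded into memory and queried there. The following is a minimal sketch of that general idea, not the project's actual implementation: the class name `DomainLinksSketch`, the method names, the big-endian byte order, and the use of a `HashMap` as the in-memory structure are all assumptions made for illustration.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: domain links kept in a flat file of 32-bit (source, destination) pairs. */
final class DomainLinksSketch {
    // Assumed in-memory layout: source domain id -> destination domain ids.
    private final Map<Integer, List<Integer>> outgoing = new HashMap<>();

    /** Writes each link as two consecutive big-endian 32-bit integers (byte order is an assumption). */
    static void write(Path file, int[][] links) throws IOException {
        try (var out = new DataOutputStream(Files.newOutputStream(file))) {
            for (int[] link : links) {
                out.writeInt(link[0]); // source domain id
                out.writeInt(link[1]); // destination domain id
            }
        }
    }

    /** Reads the whole file into memory; each pair occupies 8 bytes. */
    static DomainLinksSketch load(Path file) throws IOException {
        var result = new DomainLinksSketch();
        long pairs = Files.size(file) / 8;
        try (var in = new DataInputStream(new BufferedInputStream(Files.newInputStream(file)))) {
            for (long i = 0; i < pairs; i++) {
                int source = in.readInt();
                int dest = in.readInt();
                result.outgoing.computeIfAbsent(source, k -> new ArrayList<>()).add(dest);
            }
        }
        return result;
    }

    /** Returns the destination ids linked from the given source id, in file order. */
    List<Integer> linksFrom(int sourceId) {
        return outgoing.getOrDefault(sourceId, List.of());
    }
}
```

A lookup such as `DomainLinksSketch.load(path).linksFrom(id)` then needs no database round trip. The trade-off is that the file is read in full at startup and offers none of the transactional guarantees the commit message says are not needed here.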