CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	88ac72c8eb	(journal/reverse index) Working WIP fix over-allocation of documents	2023-08-31 20:16:02 +02:00
Viktor Lofgren	a6f1335375	(loader) Fix bugfix where the loader would omit some meta and words.	2023-08-31 17:48:43 +02:00
Viktor Lofgren	764e7d1315	(index) Add more comprehensive integration tests for the index service.	2023-08-30 10:37:24 +02:00
Viktor Lofgren	dd593c292c	(loader) Minor optimizations and bugfixes. * Reduce memory churn in LoaderIndexJournalWriter, fix bug with keyword mappings as well * Remove remains of OldDomains * Ensure LOADER_PROCESS_OPTS gets fed to the processes * LinkdbStatusWriter won't execute batch after each added item post 100 items	2023-08-29 15:37:52 +02:00
Viktor Lofgren	39c1857c61	(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.	2023-08-29 13:07:55 +02:00
Viktor Lofgren	a2e6616100	(index-reverse) Add documentation and clean up code.	2023-08-29 11:35:54 +02:00
Viktor Lofgren	6525b16e1f	(minor) Improved logging and error messages	2023-08-28 19:53:55 +02:00
Viktor Lofgren	b6a92506d1	(index) Hook in missing DocIdRewriter This enables documents to be ranked properly.	2023-08-28 19:53:43 +02:00
Viktor Lofgren	00c4686ef0	(reverse-index) Fix over-allocation of the count array in merging	2023-08-28 14:36:28 +02:00
Viktor Lofgren	3101b74580	(index) Move to a lexicon-free index design This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it also added a fairly significant RAM penalty to both the index service and the loader. The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices. It also became necessary half-way through to upgrade guice as its error reporting wasn't quite compatible with JDK20.	2023-08-28 14:02:23 +02:00
Viktor Lofgren	460998d512	(index) Move index construction to separate process. This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service. It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D	2023-08-25 12:52:54 +02:00
Viktor Lofgren	1e6800565a	(system) Remove EdgeId<T> and similar objects They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.	2023-08-24 17:46:02 +02:00
Viktor Lofgren	9894f37412	(index) Implement new URL ID coding scheme. Also refactor along the way. Really needs an additional pass, these tests are very hairy.	2023-08-24 16:44:27 +02:00
Viktor Lofgren	6a04cdfddf	(loader) Implement new linkdb in loader Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal. For now, we no longer store new URLs in different domains. We need to re-implement this somehow, probably in a different job or as a different output.	2023-08-24 13:07:54 +02:00
Viktor Lofgren	ebc84c22fb	Upgrade antique lombok plugin This permits tests to run on JDK20 environments.	2023-08-23 14:34:32 +00:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00
Viktor Lofgren	1a05cba60a	(keyword lexicon) Use three hash tables to increase the possible number of keywords to 2^31 from 0.75 x 2^30.	2023-08-23 11:25:20 +02:00
Viktor Lofgren	704de50a9b	(forward-index, valuator) HTML features in valuator Put it in the forward index for easy access during index-side valuation.	2023-08-18 11:54:56 +02:00
Viktor Lofgren	251fc63b42	(*) Fix merge gore	2023-08-09 13:33:28 +02:00
Viktor Lofgren	624b78ec3a	(heartbeat) Task heartbeats	2023-08-04 14:40:06 +02:00
Viktor Lofgren	b08e302dd5	(lexicon) Optimize lexicon by using Murmur3_128's hash function	2023-08-01 15:02:13 +02:00
Viktor Lofgren	ea66195b97	(loader) Optimize loader by using zstd's direct streaming writer and the Murmur3_128 string hash	2023-08-01 15:02:13 +02:00
Viktor Lofgren	9288d311d4	Add buffering to index journal writer	2023-07-28 18:11:19 +02:00
Viktor Lofgren	d7ab21fe34	(*) Refactor Control Service and processes	2023-07-17 21:20:31 +02:00
Viktor Lofgren	88b9ec70c6	(control, WIP) Run reconvert-load from converter :D	2023-07-11 18:05:37 +02:00
Viktor Lofgren	55c65f0935	Use document generator to complement the document selection. Will let through e.g. a modern SSG in the small web filter.	2023-06-22 17:21:33 +02:00
Viktor Lofgren	ccc41d1717	Clean up of the index query handling related code.	2023-04-10 14:50:57 +02:00
Viktor Lofgren	e49b1dd155	Better handling of quote terms, fix bug in handling of longer queries. ... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java	2023-04-10 13:20:40 +02:00
Viktor	a278fc6296	Increase search result relevance (#8) * Increase accuracy of the position bits. * Increase their width to 56. * Use a rolling position scheme for bits 16-56 to increase the average accuracy. * Result ranking overhaul * Optimized queries * BM25 in the index service's ranking * Make gui less jank * Javadocs for ranking parameters.	2023-04-07 20:18:08 +02:00
Viktor Lofgren	105d93cd85	Index query builder automatically ignores redundant predicates.	2023-04-02 12:04:26 +02:00
Viktor Lofgren	1e4157017d	More helpful descriptions of index queries.	2023-04-02 12:03:58 +02:00
Viktor Lofgren	dcf6218cdb	Fix bugs related to search result selection in the case with multiple search terms. * A deduplication filter step ran too early, and removed many good results on the basis that they partially, but did not fully fit another set of search terms. * Altered the query creation process to prefer documents where multiple terms appear in the priority index.	2023-03-29 15:18:52 +02:00
Viktor Lofgren	30584887f9	DictionaryMap changes. Add new flag to change the default size to make prod index boot faster. Remove option to select OffHeapDictionaryHashMap.	2023-03-27 17:28:39 +02:00
Viktor	ac1ac3ea57	Move database to a separate module * Move database to a separate project, break apart sql file into separate entities. * Fix front page news listing.	2023-03-25 15:26:17 +01:00
Viktor	45dd9fea25	Update readme.md	2023-03-22 17:15:36 +01:00
Viktor	c974d72e7e	Update readme.md	2023-03-22 17:09:48 +01:00
Viktor	ecd6ed186f	Update readme.md	2023-03-21 17:33:02 +01:00
Viktor	b07f84bc01	Update readme.md	2023-03-21 17:32:09 +01:00
Viktor Lofgren	46f81aca2f	Break apart reverse index into a separate full index and priority index. It did this before using the same code. This will make the priority index about half as big since it no longer needs to keep metadata.	2023-03-21 16:12:31 +01:00
vlofgren	29c76fcdce	Add page&brin to domain-ranking readme.md	2023-03-20 16:41:34 +01:00
vlofgren	554a7fde80	Update readme.md	2023-03-20 16:27:37 +01:00
Viktor Lofgren	2eb972dea1	Remove unrelated code, break tools into their own directory.	2023-03-17 16:03:11 +01:00
Viktor Lofgren	449471a076	Yet more restructuring. Improved search result ranking.	2023-03-16 21:35:54 +01:00
Viktor Lofgren	0ecab53635	Yet more restructuring.	2023-03-13 23:40:26 +01:00
Viktor Lofgren	d82532b7f1	More restructuring, big bug fixes in keyword extraction.	2023-03-13 17:39:53 +01:00
Viktor Lofgren	8b8fc49901	The refactoring will continue until morale improves.	2023-03-12 11:42:07 +01:00
Viktor Lofgren	73eaa0865d	The refactoring will continue until morale improves.	2023-03-12 10:50:31 +01:00
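
Commits 6a04cdfddf and 9894f37412 above mention document IDs being generated from the domain id and the document ordinal. The log does not spell out the actual coding scheme, so the following is only a minimal sketch of the general idea: packing both values into a single 64-bit long. The class name `DocIdSketch`, the method names, and the 26-bit ordinal width are all assumptions made for illustration and are not taken from the repository.

```java
// Hypothetical sketch: NOT the project's actual URL ID coding scheme.
// Packs a domain id and a per-domain document ordinal into one long.
public final class DocIdSketch {
    // Assumed width of the ordinal field; the real layout may differ.
    private static final int ORDINAL_BITS = 26;
    private static final long ORDINAL_MASK = (1L << ORDINAL_BITS) - 1;

    private DocIdSketch() {}

    /** Combine domain id (high bits) and document ordinal (low bits). */
    public static long encode(int domainId, int ordinal) {
        if (domainId < 0) {
            throw new IllegalArgumentException("domainId must be non-negative");
        }
        if (ordinal < 0 || ordinal > ORDINAL_MASK) {
            throw new IllegalArgumentException("ordinal out of range");
        }
        return ((long) domainId << ORDINAL_BITS) | ordinal;
    }

    /** Recover the domain id from the high bits. */
    public static int domainId(long docId) {
        return (int) (docId >>> ORDINAL_BITS);
    }

    /** Recover the document ordinal from the low bits. */
    public static int ordinal(long docId) {
        return (int) (docId & ORDINAL_MASK);
    }

    public static void main(String[] args) {
        long id = encode(1234, 56);
        // prints "1234 56"
        System.out.println(domainId(id) + " " + ordinal(id));
    }
}
```

With a layout like this, an ID can be computed as soon as the domain and the document's position within it are known, which fits the commit's remark that the loader no longer needs to be told upfront which URLs to expect.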


97 Commits