CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	1694b4d6ef	(valuation) Increase the penalty for adtech a bit	2024-01-05 13:21:34 +01:00
Viktor Lofgren	396299c1db	(index) Reduce the value of site and site-adjacent in BM25P calculations	2024-01-05 13:21:33 +01:00
Viktor Lofgren	71d789aab0	(index) Tweak result valuation renormalization	2024-01-05 13:21:33 +01:00
Viktor Lofgren	d2418521a7	(index) Further ranking adjustments	2024-01-02 12:35:59 +01:00
Viktor Lofgren	9330b5b1d9	(index) Adjust rank weightings to fix bad wikipedia results There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always rank the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization. Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.	2024-01-02 12:35:44 +01:00
Viktor Lofgren	4763077b76	(search/index) Add a new keyword "count" This is for filtering results on how many times the term appears on the domain. The intent is to be beneficial in creating e.g. a domain search feature. It's also very helpful when tracking down spammy domains.	2023-12-25 20:38:29 +01:00
Viktor Lofgren	0b8dc02eba	(result-ranking) Nudge up results with ngram matches a tiny bit	2023-11-06 13:14:22 +01:00
Viktor Lofgren	48986574ae	(result-ranking) Use a weighted calculation of priority term importance	2023-11-06 12:56:21 +01:00
Viktor Lofgren	c7a6a71d07	(result-ranking) Use a weighted calculation of priority term importance	2023-11-06 12:48:23 +01:00
Viktor Lofgren	0152004c42	Initial Commit Anchor Tags * Added new (optional) model file in $WMSA_HOME/data/atags.parquet * Converter gets a component for creating a projection of its domains onto the full atags parquet file * New WordFlag ExternalLink * These terms are also for now flagged as title words * Fixed a bug where Title words aliased with UrlDomain words * Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking	2023-11-04 14:24:17 +01:00
Viktor Lofgren	37b7f52f2c	(minor) Reduce log severity for getTermMeta miss	2023-10-26 15:41:52 +02:00
Viktor Lofgren	c89e0ab255	(minor) Disable ~vlofgren specific debug test	2023-10-26 15:27:59 +02:00
Viktor Lofgren	f613f4f2df	(array) Fix spurious search results This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss. It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.	2023-10-26 15:27:02 +02:00
Viktor Lofgren	313cc2965c	(index-creation) Print whether full or prio is created Previous state of saying reverse index for both was pretty confusing.	2023-10-24 16:23:10 +02:00
Viktor Lofgren	9e26109e36	(reverse-index) Don't always POST	2023-10-14 16:48:29 +02:00
Viktor Lofgren	4baf9527d7	() WIP Control GUI redesign, executor-service, multi-node mq This turned out to be very difficult to do in small isolated steps. Design overhaul of the control gui using bootstrap * Move the actors out of control-service into a new executor-service, that can be run on multiple nodes * Add node-affinity to message queue	2023-10-14 12:08:43 +02:00
Viktor Lofgren	3889c4bdd9	(refactor) Remove features-search and update documentation	2023-10-09 15:12:30 +02:00
Viktor Lofgren	97e17282ab	(query-service) Move query parsing from search-service to the new query service.	2023-10-09 13:27:44 +02:00
Viktor	8e1abc3f10	(index-reverse) Parallel construction of the reverse indexes. (#52 ) * (index-reverse) Parallel construction of the reverse indexes. * (array) Remove wasteful calculation of numDistinct before merging two sorted arrays. * (index-reverse) Force changes to disk on close, reduce logging. * (index-reverse) Clean up merging process and add back logging * (run) Add a conservative default for INDEX_CONSTRUCTION_PROCESS_OPTS's parallelism as it eats a lot of RAM * (index-reverse) Better logging during processing * (array) 2GB+ compatible write() function * (array) 2GB+ compatible write() function * (index-reverse) We are logging like Bolsonaro and I will not have it. * (reverse-index) Self-diagnostics * (btree) Fix bug in btree reader to do with large data sizes	2023-10-07 10:00:00 +02:00
Viktor Lofgren	e498c6907a	(forward-index) Don't leak off heap memory	2023-10-05 21:22:13 +02:00
Viktor Lofgren	08e8fc6736	(index-journal) Thread safe IndexJournalReadEntry	2023-10-05 19:39:09 +02:00
Viktor Lofgren	c51159672e	(build) Move unit test configuration to root build.gradle	2023-10-04 12:46:22 +02:00
Viktor Lofgren	40768e935b	(test) Removing /tmp-guardrails as it doesn't hold in CI	2023-10-02 16:52:59 +02:00
Viktor Lofgren	cd12f49fc0	(long-array) Return slices SegmentLongArray of itself for range() &c	2023-09-24 11:31:54 +02:00
Viktor Lofgren	5f6c3da7a4	(index) Add close methods on the index readers so they clean up their mmaps	2023-09-24 10:54:23 +02:00
Viktor Lofgren	d0aa754252	(long-array) Implement java.lang.foreign.Arena based lifecycle control for LongArray. Further de-ByteBuffer:ing of these classes is to be done, but this is the smallest most urgently needed benefit. This commit is a WIP but in a fully working state, pushing due to the importance of the changes to offer lifecycle control over mmaps.	2023-09-24 10:40:06 +02:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	bafc2a1f30	(reverse-index) Force() final docs after being written Unlikely to be a problem, but we want to ensure it's on disk before we go read it later.	2023-09-01 15:43:53 +02:00
Viktor Lofgren	563e388a45	(reverse-index) Fix parallel documents sorting bug Bug was caused by parallel sorting capturing the iterator rather than the offsets to sort.	2023-09-01 15:42:45 +02:00
Viktor Lofgren	d31d8ec5b0	(index) Log keyword ids in hex format	2023-09-01 15:40:24 +02:00
Viktor Lofgren	10a74f45ea	(index journal; minor) Even cleaner separation of concerns.	2023-09-01 11:28:02 +02:00
Viktor Lofgren	320dad7f1a	(index journal) Fix leaky abstraction in IndexJournalReader. The caller shouldn't be required to know the on-disk layout of the file to make use of the data in a performant way.	2023-09-01 11:18:13 +02:00
Viktor Lofgren	88ac72c8eb	(journal/reverse index) Working WIP fix over-allocation of documents	2023-08-31 20:16:02 +02:00
Viktor Lofgren	a6f1335375	(loader) Fix bugfix where the loader would omit some meta and words.	2023-08-31 17:48:43 +02:00
Viktor Lofgren	764e7d1315	(index) Add more comprehensive integration tests for the index service.	2023-08-30 10:37:24 +02:00
Viktor Lofgren	dd593c292c	(loader) Minor optimizations and bugfixes. * Reduce memory churn in LoaderIndexJournalWriter, fix bug with keyword mappings as well * Remove remains of OldDomains * Ensure LOADER_PROCESS_OPTS gets fed to the processes * LinkdbStatusWriter won't execute batch after each added item post 100 items	2023-08-29 15:37:52 +02:00
Viktor Lofgren	39c1857c61	(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.	2023-08-29 13:07:55 +02:00
Viktor Lofgren	a2e6616100	(index-reverse) Add documentation and clean up code.	2023-08-29 11:35:54 +02:00
Viktor Lofgren	6525b16e1f	(minor) Improved logging and error messages	2023-08-28 19:53:55 +02:00
Viktor Lofgren	b6a92506d1	(index) Hook in missing DocIdRewriter This enables documents to be ranked properly.	2023-08-28 19:53:43 +02:00
Viktor Lofgren	00c4686ef0	(reverse-index) Fix over-allocation of the count array in merging	2023-08-28 14:36:28 +02:00
Viktor Lofgren	3101b74580	(index) Move to a lexicon-free index design This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it also added a fairly significant RAM penalty to both the index service and the loader. The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices. It also became necessary half-way through to upgrade guice as its error reporting wasn't quite compatible with JDK20.	2023-08-28 14:02:23 +02:00
Viktor Lofgren	460998d512	(index) Move index construction to separate process. This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service. It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D	2023-08-25 12:52:54 +02:00
Viktor Lofgren	1e6800565a	(system) Remove EdgeId<T> and similar objects They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.	2023-08-24 17:46:02 +02:00
Viktor Lofgren	9894f37412	(index) Implement new URL ID coding scheme. Also refactor along the way. Really needs an additional pass, these tests are very hairy.	2023-08-24 16:44:27 +02:00
Viktor Lofgren	6a04cdfddf	(loader) Implement new linkdb in loader Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal. For now, we no longer store new URLs in different domains. We need to re-implement this somehow, probably in a different job or as a different output.	2023-08-24 13:07:54 +02:00
Viktor Lofgren	ebc84c22fb	Upgrade antique lombok plugin This permits tests to run on JDK20 environments.	2023-08-23 14:34:32 +00:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00
Viktor Lofgren	1a05cba60a	(keyword lexicon) Use three hash tables to increase the possible number of keywords to 2^31 from 0.75 x 2^30.	2023-08-23 11:25:20 +02:00
Viktor Lofgren	704de50a9b	(forward-index, valuator) HTML features in valuator Put it in the forward index for easy access during index-side valuation.	2023-08-18 11:54:56 +02:00
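
Commit 6a04cdfddf notes that IDs are generated from the domain id and document ordinal. The sketch below shows one way such an ID could be packed into a single 64-bit value. The class name DocIdCodingSketch, the method names, and the 26-bit ordinal width are all assumptions made for illustration; the log does not describe the actual bit layout.

```java
// Hypothetical sketch: pack (domainId, ordinal) into one long.
// The 26-bit ordinal width is an assumption, not the project's layout.
final class DocIdCodingSketch {
    static final int ORDINAL_BITS = 26;
    static final long ORDINAL_MASK = (1L << ORDINAL_BITS) - 1;

    // Domain id goes in the high bits, document ordinal in the low bits.
    static long encode(int domainId, int ordinal) {
        if (domainId < 0) {
            throw new IllegalArgumentException("negative domain id");
        }
        if (ordinal < 0 || ordinal > ORDINAL_MASK) {
            throw new IllegalArgumentException("ordinal out of range");
        }
        return ((long) domainId << ORDINAL_BITS) | ordinal;
    }

    static int domainId(long docId) {
        return (int) (docId >>> ORDINAL_BITS);
    }

    static int ordinal(long docId) {
        return (int) (docId & ORDINAL_MASK);
    }

    public static void main(String[] args) {
        long id = encode(5, 7);
        System.out.println(id + " -> domain " + domainId(id) + ", ordinal " + ordinal(id));
    }
}
```

With a layout like this, no URL has to be registered up front: an ID follows from the two inputs alone, and all documents of one domain occupy a contiguous range of IDs.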


79 Commits
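
Commit f613f4f2df describes a binary search that sometimes returned positive values when encoding a search miss. The sketch below illustrates the general convention involved, not the project's code: a hit returns a non-negative index, and a miss returns -(insertionPoint) - 1, which is always negative. The class and method names are hypothetical; the encoding is the one used by java.util.Arrays.binarySearch.

```java
// Hypothetical sketch of miss encoding in a binary search over a sorted range.
// A hit is >= 0; a miss is always < 0, even when the insertion point is 0.
final class SearchMissEncodingSketch {
    static long binarySearch(long[] sorted, int fromIndex, int toIndex, long key) {
        int low = fromIndex;
        int high = toIndex - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            long value = sorted[mid];
            if (value < key) {
                low = mid + 1;
            } else if (value > key) {
                high = mid - 1;
            } else {
                return mid; // hit: non-negative index
            }
        }
        return encodeMiss(low); // low is where the key would be inserted
    }

    // Encoding a miss as plain -insertionPoint would yield 0 for insertion
    // point 0, which is indistinguishable from a hit at index 0.
    static long encodeMiss(long insertionPoint) {
        return -insertionPoint - 1;
    }

    static boolean isHit(long result) {
        return result >= 0;
    }

    static long decodeInsertionPoint(long result) {
        return -result - 1;
    }

    public static void main(String[] args) {
        long[] data = {10, 20, 30};
        long result = binarySearch(data, 0, data.length, 5);
        System.out.println("hit=" + isHit(result) + ", insertion point=" + decodeInsertionPoint(result));
    }
}
```

A caller that treats every non-negative return value as a match will report spurious results if a miss can ever be encoded as zero or a positive number, which is the class of failure the commit message describes.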