Commit Graph

40 Commits

Author SHA1 Message Date
Viktor Lofgren
ffa0366deb (minor) Fix typo in ActorStateMachine's logging 2023-08-28 16:11:52 +02:00
Viktor Lofgren
3101b74580 (index) Move to a lexicon-free index design
This is a system-wide change.  The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table.   This made index-construction easier, but it
also added a fairly significant RAM penalty to both the index service and the loader.

The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.

It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
2023-08-28 14:02:23 +02:00
Viktor Lofgren
ebc84c22fb Upgrade antique lombok plugin
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
fca62f261e (mq) Down-tune polling intervals in MQ
Polling 10 times a second across dozens of queues is a bit too aggressive and wasteful.
2023-08-22 11:49:30 +02:00
Viktor Lofgren
46d761f34f (language) fasttext based language filter 2023-08-16 15:48:12 +02:00
Viktor Lofgren
4404ad98ae (mq) Fix missing @Inject that broke everything in control-service 2023-08-15 11:22:12 +02:00
Viktor Lofgren
e7192a9cad (mq) Refactor mq and actor library and move it to libraries out of common 2023-08-15 10:53:23 +02:00
Viktor Lofgren
d6b07e4d01 (controller) Improve the storage interface 2023-07-21 19:56:16 +02:00
Viktor Lofgren
995657c6ce (big-string) Make big-string disable:able 2023-07-21 19:50:35 +02:00
Viktor
cbbf60a599 Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
77f2ca51af Optimize SentenceExtractor.
Remove String pool because it's not doing much.
Break out constant.
Use a shared RdrPosTagger.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
ffcbc6c1c9 Reduce the odds of re-allocation by AsciiFlattener 2023-06-19 17:58:19 +02:00
Viktor Lofgren
e4372289a5 Use fixed buffers for BigString compression and decompression to reduce GC churn.
fixup! Use fixed buffers for BigString compression and decompression to reduce GC churn.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
d82a858491 Don't consider slash to be a sentence separator. 2023-05-31 16:54:30 +02:00
Viktor Lofgren
4e9e79454f Fix broken transformation functions in the PagingArray classes. 2023-05-28 13:31:05 +02:00
Viktor Lofgren
b0bc07b4e7 Insertion sort was *super* busted I don't even know how it worked 2023-05-28 12:17:50 +02:00
Viktor Lofgren
6814c90625 Fix N-width sorting bug 2023-05-28 11:57:06 +02:00
Viktor
96bac70b85
Tools for merging sorted lists, and merging btrees. (#14)
* Utilities for merging BTrees of entity size 1 and 2.
* Isolate and clean up sorting algorithms. 
* Functions for keeping distinct items in a LongArray
2023-04-20 15:28:09 +02:00
Viktor
a278fc6296
Increase search result relevance (#8)
* Increase accuracy of the position bits.
* Increase their width to 56.
* Use a rolling position scheme for bits 16-56 to increase the average accuracy.
* Result ranking overhaul
* Optimized queries
* BM25 in the index service's ranking
* Make gui less jank
* Javadocs for ranking parameters.
2023-04-07 20:18:08 +02:00
Viktor Lofgren
32b9c2e671 Fix SentenceExtractor jank 2023-03-30 15:45:04 +02:00
Viktor Lofgren
0fcb2b534c Polish Names 2023-03-29 16:51:47 +02:00
Viktor Lofgren
3464ca514b Fix typeahead suggestions 2023-03-25 10:20:52 +01:00
Viktor Lofgren
611ba2d35a Break apart WordPatterns class 2023-03-22 15:10:17 +01:00
Viktor
1b9ae7b42d
Update readme.md 2023-03-21 16:38:39 +01:00
Viktor Lofgren
1bb1248ab0 Optimize array library, jmh benchmarks. 2023-03-21 16:02:31 +01:00
vlofgren
55d0fa61d7
Update readme.md 2023-03-20 16:39:15 +01:00
Viktor Lofgren
2eb972dea1 Remove unrelated code, break tools into their own directory. 2023-03-17 16:03:11 +01:00
Viktor Lofgren
449471a076 Yet more restructuring. Improved search result ranking. 2023-03-16 21:35:54 +01:00
Viktor Lofgren
0ecab53635 Yet more restructuring. 2023-03-13 23:40:26 +01:00
Viktor Lofgren
d82532b7f1 More restructuring, big bug fixes in keyword extraction. 2023-03-13 17:39:53 +01:00
Viktor Lofgren
281f1322a9 Clean up BTreeWriter 2023-03-12 12:49:49 +01:00
Viktor Lofgren
73eaa0865d The refactoring will continue until morale improves. 2023-03-12 10:50:31 +01:00
Viktor Lofgren
616effdb3c The refactoring will continue until morale improves. 2023-03-12 10:04:48 +01:00
Viktor Lofgren
6d939175b1 Additional code restructuring to get rid of util and misc-style packages. 2023-03-11 13:48:40 +01:00
Viktor Lofgren
722ff3bffb Word feature bit for words that appear in the URL, new search profile for plain text files, better plain text titles. 2023-03-10 16:46:56 +01:00
Viktor Lofgren
2bc212d65c Refactor DocumentKeyword-related classes 2023-03-09 20:41:38 +01:00
Viktor Lofgren
efb46cc703 Remove count from WordMetadata entirely. 2023-03-09 18:14:14 +01:00
Viktor Lofgren
9ece07d559 Chasing a result ranking bug 2023-03-09 17:52:35 +01:00
Viktor Lofgren
ad1be7c835 Move all code to a code directory. 2023-03-07 17:14:32 +01:00