Commit Graph

24 Commits

Author SHA1 Message Date
Viktor Lofgren
3101b74580 (index) Move to a lexicon-free index design
This is a system-wide change.  The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table.   This made index-construction easier, but it
also added a fairly significant RAM penalty to both the index service and the loader.

The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.

It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
2023-08-28 14:02:23 +02:00
Viktor Lofgren
ebc84c22fb Upgrade antique lombok plugin
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
46d761f34f (language) fasttext based language filter 2023-08-16 15:48:12 +02:00
Viktor
0f9b90eb1c
Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 17:36:12 +02:00
Viktor Lofgren
77f2ca51af Optimize SentenceExtractor.
Remove String pool because it's not doing much.
Break out constant.
Use a shared RdrPosTagger.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
ffcbc6c1c9 Reduce the odds of re-allocation by AsciiFlattener 2023-06-19 17:58:19 +02:00
Viktor Lofgren
d82a858491 Don't consider slash to be a sentence separator. 2023-05-31 16:54:30 +02:00
Viktor
a278fc6296
Increase search result relevance (#8)
* Increase accuracy of the position bits.
* Increase their width to 56.
* Use a rolling position scheme for bits 16-56 to increase the average accuracy.
* Result ranking overhaul
* Optimized queries
* BM25 in the index service's ranking
* Make gui less jank
* Javadocs for ranking parameters.
2023-04-07 20:18:08 +02:00
Viktor Lofgren
32b9c2e671 Fix SentenceExtractor jank 2023-03-30 15:45:04 +02:00
Viktor Lofgren
0fcb2b534c Polish Names 2023-03-29 16:51:47 +02:00
Viktor Lofgren
611ba2d35a Break apart WordPatterns class 2023-03-22 15:10:17 +01:00
Viktor Lofgren
1bb1248ab0 Optimize array library, jmh benchmarks. 2023-03-21 16:02:31 +01:00
Viktor Lofgren
2eb972dea1 Remove unrelated code, break tools into their own directory. 2023-03-17 16:03:11 +01:00
Viktor Lofgren
449471a076 Yet more restructuring. Improved search result ranking. 2023-03-16 21:35:54 +01:00
Viktor Lofgren
0ecab53635 Yet more restructuring. 2023-03-13 23:40:26 +01:00
Viktor Lofgren
d82532b7f1 More restructuring, big bug fixes in keyword extraction. 2023-03-13 17:39:53 +01:00
Viktor Lofgren
616effdb3c The refactoring will continue until morale improves. 2023-03-12 10:04:48 +01:00
Viktor Lofgren
6d939175b1 Additional code restructuring to get rid of util and misc-style packages. 2023-03-11 13:48:40 +01:00
Viktor Lofgren
722ff3bffb Word feature bit for words that appear in the URL, new search profile for plain text files, better plain text titles. 2023-03-10 16:46:56 +01:00
Viktor Lofgren
2bc212d65c Refactor DocumentKeyword-related classes 2023-03-09 20:41:38 +01:00
Viktor Lofgren
efb46cc703 Remove count from WordMetadata entirely. 2023-03-09 18:14:14 +01:00
Viktor Lofgren
9ece07d559 Chasing a result ranking bug 2023-03-09 17:52:35 +01:00
Viktor Lofgren
ad1be7c835 Move all code to a code directory. 2023-03-07 17:14:32 +01:00