611 Commits

Author SHA1 Message Date
Viktor Lofgren
fa9b4e4352 A tiny release between crawls (#138)
Bringing online new ranking changes

Co-authored-by: Viktor Lofgren <vlofgren@marginalia.nu>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/138
2023-02-12 10:57:07 +01:00
Viktor Lofgren
6ef9f13c68 merge release into master 2023-02-12 10:53:51 +01:00
Viktor Lofgren
db50ca2231 Tidy up RankingSearchSet 2023-02-12 10:47:46 +01:00
Viktor Lofgren
4d9dce5733 Tidy up RankingSearchSet 2023-02-12 10:45:35 +01:00
Viktor Lofgren
bcadfc965d Use new cosine-similarity ranking algorithm 2023-02-12 10:28:53 +01:00
Viktor Lofgren
3e1297064c Tidy up code 2023-02-11 13:06:40 +01:00
Viktor Lofgren
06df8e9a28 Sort the index on rank to, like the previous design, prioritize the discovery of high ranking items. 2023-02-11 12:17:30 +01:00
Viktor Lofgren
e963ecb4ae Modified the ranking algorithm to be able to pagerank with similarity data instead of the link graph. 2023-02-07 22:13:25 +01:00
Viktor Lofgren
04f905f3a1 Reintroduce the ability to filter search results by their ranking. 2023-02-04 12:59:24 +01:00
Viktor Lofgren
4a07eda61c Debug query strategy options 2023-02-02 10:35:55 +01:00
Viktor Lofgren
b18cd0bc36 Improvements to array library and conversion 2023-02-02 10:35:14 +01:00
Viktor Lofgren
cdaeb7724a Clean up braille punch cards 2023-02-02 10:34:17 +01:00
Viktor Lofgren
e3bea19d4d Improvements to array library 2023-02-02 10:33:16 +01:00
Viktor Lofgren
8168d512b8 Retire defunct SMHI weather forecast integration. 2023-01-30 13:25:41 +01:00
Viktor Lofgren
4c2f54593e Use on-heap dictionary for small data. 2023-01-30 13:10:56 +01:00
Viktor Lofgren
4a6a1308b0 Remove min length regex, the guard is too weak to be meaningful 2023-01-30 10:43:53 +01:00
Viktor Lofgren
2e4532ca90 Clean up KeywordMetadata 2023-01-30 10:22:43 +01:00
Viktor Lofgren
d5df3268b3 Update 3rd party readme 2023-01-30 10:22:28 +01:00
Viktor Lofgren
9320a457a5 Misc tweaks and cleanups 2023-01-30 09:44:09 +01:00
Viktor Lofgren
1dac4e7e67 Override defaults in GSON 2023-01-30 09:43:21 +01:00
Viktor Lofgren
65b0ff26fc Better SiteWords extraction 2023-01-30 09:42:46 +01:00
Viktor Lofgren
5558af148e Reduce memory churn in KeywordCounter 2023-01-30 09:42:27 +01:00
Viktor Lofgren
8349435ef4 Better subject extraction and remove unnecessary calculation from DocumentKeywordExtractor 2023-01-30 09:41:54 +01:00
Viktor Lofgren
4d0b444703 String deduplication 2023-01-30 09:40:29 +01:00
Viktor Lofgren
0fd21b9cbf Reduce memory churn through BufferedReader via CrawledDomainReader 2023-01-30 09:39:16 +01:00
Viktor Lofgren
1b53a5389d Remove poorly guarded regex in UrlBlocklist 2023-01-30 09:37:37 +01:00
Viktor Lofgren
28214ad770 Remove unnecessary toLowerCase in isStopWord 2023-01-30 09:37:15 +01:00
Viktor Lofgren
dfd652a8d5 Make WordRep behave consistently across compareTo/equals 2023-01-30 09:36:47 +01:00
Viktor Lofgren
50862a2081 Refactor sentence extractor to break it apart into more readable chunks 2023-01-30 09:36:11 +01:00
Viktor Lofgren
ed728b2680 Compressed string component 2023-01-30 09:33:04 +01:00
Viktor Lofgren
728931c135 Compressed string component 2023-01-30 09:29:14 +01:00
Viktor Lofgren
1f646e4f68 Reduce memory churn in RDRPOSTagger 2023-01-30 09:25:57 +01:00
Viktor Lofgren
618582dc74 Performance optimizations in EdgeDomain's parsing, reduce the number of unguarded regular expressions 2023-01-30 09:23:11 +01:00
Viktor Lofgren
4854f40447 Array library optimizations for sortLargeSpan 2023-01-30 09:22:10 +01:00
Viktor Lofgren
c8f7a8cb69 Fix bug in dealing with scheme-relative URLs 2023-01-19 15:46:32 +01:00
Viktor Lofgren
467bf566a9 Hotfixes for 2023-01 release (#137)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Co-authored-by: vlofgren <vlofgren@marginalia.nu>
Co-authored-by: Viktor Lofgren <vlofgren@marginalia.nu>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/137
2023-01-11 19:48:03 +01:00
Viktor Lofgren
321a9028c7 Merge branch 'release' 2023-01-11 19:46:33 +01:00
Viktor Lofgren
5851e91424 Clean-up and fix for feature regression in site:-terms 2023-01-11 19:33:32 +01:00
Viktor Lofgren
fb2797a8ef Tweaking search result valuation 2023-01-11 19:33:05 +01:00
Viktor Lofgren
085d985e61 Result selection algorithm tweaks 2023-01-11 17:19:57 +01:00
Viktor Lofgren
69ccf143ac New search profile for hardcore web 1.0 content. 2023-01-11 16:11:51 +01:00
Viktor Lofgren
4d3ef0e3b3 Tool for cleaning raw index files based on a predicate. 2023-01-11 16:11:29 +01:00
Viktor Lofgren
4928b2e00e Use a mapped file instead of allocating to save memory (#136)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Co-authored-by: vlofgren <vlofgren@marginalia.nu>
Co-authored-by: Viktor Lofgren <vlofgren@marginalia.nu>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/136
2023-01-09 22:06:58 +01:00
Viktor Lofgren
cb408dd737 Fixes 2023-01-09 22:06:15 +01:00
Viktor Lofgren
4ec338d218 Merge branch 'release' 2023-01-09 20:21:33 +01:00
Viktor Lofgren
a9ddc328a6 Fixes from master (#135)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Co-authored-by: vlofgren <vlofgren@marginalia.nu>
Co-authored-by: Viktor Lofgren <vlofgren@marginalia.nu>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/135
2023-01-09 18:47:04 +01:00
Viktor Lofgren
4ff68e8807 Merge branch 'release'
# Conflicts:
#	marginalia_nu/src/main/java/nu/marginalia/wmsa/edge/index/postings/forward/ForwardIndexConverter.java
2023-01-09 18:46:48 +01:00
Viktor Lofgren
11b0d61efc Fixes 2023-01-09 18:45:04 +01:00
Viktor Lofgren
998ebc80a1 Hotfixes (#134)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Co-authored-by: vlofgren <vlofgren@marginalia.nu>
Co-authored-by: Viktor Lofgren <vlofgren@marginalia.nu>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/134
2023-01-09 18:23:19 +01:00
Viktor Lofgren
0fa1c8c16a Merge branch 'release'
# Conflicts:
#	marginalia_nu/src/main/java/nu/marginalia/wmsa/configuration/ServiceDescriptor.java
#	marginalia_nu/src/main/java/nu/marginalia/wmsa/edge/index/postings/forward/ForwardIndexConverter.java
2023-01-09 18:22:32 +01:00