282 Commits

Author SHA1 Message Date
Viktor Lofgren
4a6a1308b0 Remove min length regex, the guard is too weak to be meaningful 2023-01-30 10:43:53 +01:00
Viktor Lofgren
2e4532ca90 Clean up KeywordMetadata 2023-01-30 10:22:43 +01:00
Viktor Lofgren
9320a457a5 Misc tweaks and cleanups 2023-01-30 09:44:09 +01:00
Viktor Lofgren
65b0ff26fc Better SiteWords extraction 2023-01-30 09:42:46 +01:00
Viktor Lofgren
5558af148e Reduce memory churn in KeywordCounter 2023-01-30 09:42:27 +01:00
Viktor Lofgren
8349435ef4 Better subject extraction and remove unnecessary calculation from DocumentKeywordExtractor 2023-01-30 09:41:54 +01:00
Viktor Lofgren
4d0b444703 String deduplication 2023-01-30 09:40:29 +01:00
Viktor Lofgren
0fd21b9cbf Reduce memory churn through BufferedReader via CrawledDomainReader 2023-01-30 09:39:16 +01:00
Viktor Lofgren
1b53a5389d Remove poorly guarded regex in UrlBlocklist 2023-01-30 09:37:37 +01:00
Viktor Lofgren
28214ad770 Remove unnecessary toLowerCase in isStopWord 2023-01-30 09:37:15 +01:00
Viktor Lofgren
dfd652a8d5 Make WordRep behave consistently across compareTo/equals 2023-01-30 09:36:47 +01:00
Viktor Lofgren
50862a2081 Refactor sentence extractor to break it apart into more readable chunks 2023-01-30 09:36:11 +01:00
Viktor Lofgren
ed728b2680 Compressed string component 2023-01-30 09:33:04 +01:00
Viktor Lofgren
728931c135 Compressed string component 2023-01-30 09:29:14 +01:00
Viktor Lofgren
618582dc74 Performance optimizations in EdgeDomain's parsing, reduce the number of unguarded regular expressions 2023-01-30 09:23:11 +01:00
Viktor Lofgren
4854f40447 Array library optimizations for sortLargeSpan 2023-01-30 09:22:10 +01:00
Viktor Lofgren
c8f7a8cb69 Fix bug in dealing with scheme-relative URLs 2023-01-19 15:46:32 +01:00
Viktor Lofgren
5851e91424 Clean-up and fix for feature regression in site:-terms 2023-01-11 19:33:32 +01:00
Viktor Lofgren
fb2797a8ef Tweaking search result valuation 2023-01-11 19:33:05 +01:00
Viktor Lofgren
085d985e61 Result selection algorithm tweaks 2023-01-11 17:19:57 +01:00
Viktor Lofgren
69ccf143ac New search profile for hardcore web 1.0 content. 2023-01-11 16:11:51 +01:00
Viktor Lofgren
4d3ef0e3b3 Tool for cleaning raw index files based on a predicate. 2023-01-11 16:11:29 +01:00
Viktor Lofgren
cb408dd737 Fixes 2023-01-09 22:06:15 +01:00
Viktor Lofgren
11b0d61efc Fixes 2023-01-09 18:45:04 +01:00
Viktor Lofgren
0b6200705e Bugfix in forward converter, should force both files before exiting. Also don't need to create an intermediate file. 2023-01-09 16:57:58 +01:00
Viktor Lofgren
58cae7d963 Bugfix for logs. 2023-01-09 15:46:11 +01:00
Viktor Lofgren
6d33c386fc Merge changes from experimental branch (#132)
Co-authored-by: Viktor Lofgren <vlofgren@marginalia.nu>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/132
2023-01-08 11:11:44 +01:00
Viktor Lofgren
c057ce74a8 Bugfix for rare bug where some queries may miss hits due to BTreeReader's retain function giving up too fast. 2022-11-22 16:33:29 +01:00
Viktor Lofgren
baaf21911a Reduce resource usage waste in edge-search by recycling QueryVariants 2022-11-18 17:12:34 +01:00
Viktor Lofgren
e86f52d7d8 Reduce resource usage waste in edge-search by recycling QueryVariants 2022-11-18 17:09:07 +01:00
Viktor Lofgren
655504c1f0 Hotfix for NaN-serialization bug in API service. 2022-11-06 12:12:10 +01:00
vlofgren
27893b414b Merge branch 'release'
# Conflicts:
#	marginalia_nu/src/main/java/nu/marginalia/wmsa/edge/search/command/commands/BrowseCommand.java
2022-10-30 11:33:06 +01:00
vlofgren
e7623010db Fetch more browse:domain-results. 2022-10-30 11:30:11 +01:00
Viktor Lofgren
395da07abe Sort browse:-results by relatedness if possible (#125)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Co-authored-by: vlofgren <vlofgren@marginalia.nu>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/125
2022-10-30 10:56:01 +01:00
vlofgren
b97f425f7e Sort results by relatedness where possible. 2022-10-30 10:49:41 +01:00
Viktor Lofgren
c559611185 Prefer cosine similarity relatedness for browse:-queries. (#123)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Co-authored-by: vlofgren <vlofgren@marginalia.nu>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/123
2022-10-30 10:32:33 +01:00
vlofgren
6231f525fd Prefer cosine similarity relatedness for browse:-queries. 2022-10-30 10:31:37 +01:00
Viktor Lofgren
e676d8729e GUI fixes and cleanups (#122)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Co-authored-by: vlofgren <vlofgren@marginalia.nu>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/122
2022-10-30 10:08:19 +01:00
vlofgren
61a80b417b Fix for explore2.marginalia.nu where it wouldn't find some websites that were flagged as redirects. 2022-10-30 10:05:52 +01:00
vlofgren
cc5b425661 Add another w3m-helper bar to make the UI cleaner on terminal. 2022-10-30 09:56:37 +01:00
vlofgren
217584126c Improved publishing date heuristics 2022-10-29 11:20:01 +02:00
vlofgren
68ec3304a3 Update index 2022-10-27 19:16:35 +02:00
vlofgren
af8001d41e Less janky summary extraction 2022-10-27 19:16:35 +02:00
vlofgren
94c157c5c3 Publish-date guesser 2022-10-27 19:16:35 +02:00
Viktor Lofgren
c6abbc12f6 fix serialization issue (#121)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Co-authored-by: vlofgren <vlofgren@marginalia.nu>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/121
2022-10-22 15:01:41 +02:00
vlofgren
8f8e6e147f Fix JSON serialization error 2022-10-22 14:42:37 +02:00
vlofgren
e6da7c1a29 Tweaks for new release. 2022-10-21 17:44:29 +02:00
Viktor Lofgren
0a35a7c1d0 master (#119)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/119
2022-10-20 21:57:08 +02:00
vlofgren
5393167bf8 Fixes in sorting logic, and optimized update domain statistics to not take 4+ hours. 2022-10-20 21:55:51 +02:00
vlofgren
05762fe200 Index update. 2022-10-19 16:35:50 +02:00