Commit Graph

576 Commits

Author SHA1 Message Date
vlofgren
daec6d9fc0 Fix overflow error 2022-07-25 12:43:03 +02:00
Viktor Lofgren
8c6a8fb7aa Merge pull request 'master' (#35) from master into release (Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/35) 2022-07-22 00:34:56 +02:00
Viktor Lofgren
6a5d8c25f5 Merge branch 'release' into master 2022-07-22 00:34:43 +02:00
Viktor Lofgren
1a926f359c Merge pull request 'experimental' (#34) from experimental into master (Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/34) 2022-07-22 00:04:35 +02:00
Viktor Lofgren
a4c7b8918d Merge branch 'master' into experimental 2022-07-22 00:04:08 +02:00
vlofgren
48812d8a4f Store screenshots in database instead of in the filesystem. 2022-07-20 12:02:26 +02:00
vlofgren
6d1e2442b6 Store wiki articles in database instead of in the filesystem. 2022-07-20 11:16:21 +02:00
vlofgren
51d273e39d Store wiki articles in database instead of in the filesystem. 2022-07-20 11:06:06 +02:00
vlofgren
fb91ce84f5 Reduce log spam during conversion 2022-07-19 05:08:06 +02:00
vlofgren
ba375ef769 Tweaks to keyword extraction 2022-07-19 05:02:44 +02:00
vlofgren
825dea839d Tweaks to keyword extraction 2022-07-19 04:50:19 +02:00
vlofgren
64844e1db2 While some might ask, why would the server host IP be available as a search keyword? I only ask you hold my beer as I make it a reality. 2022-07-19 03:01:23 +02:00
vlofgren
e83a7435c6 Raise min document length a tad, we've been getting a bit too much almost empty documents in the index. 2022-07-19 01:42:17 +02:00
vlofgren
9ae76a9264 Retire old and broken gemini support, needs to be re-implemented by having Memex talk to the API service rather than going directly to Search. 2022-07-18 18:36:39 +02:00
vlofgren
15bd54ef9f Tidy up LoaderMain a bit 2022-07-18 17:22:22 +02:00
vlofgren
3d1031f8e4 Add lexicon dumping utility 2022-07-18 17:13:47 +02:00
vlofgren
9f7a28cbdb Made search service more robust toward the case where Encyclopedia or Assistant is down 2022-07-17 22:21:41 +02:00
vlofgren
e22748e990 Better error logging for IO errors during conversion from configuration issues. 2022-07-17 22:08:06 +02:00
vlofgren
e30a20bb74 Fix bug in keyword loading when keywords have non-ASCII symbols, cleaner solution 2022-07-17 19:31:49 +02:00
vlofgren
f4966cf1f9 Fix bug in keyword loading when keywords have non-ASCII symbols 2022-07-17 15:18:16 +02:00
vlofgren
c5dbe269f7 Better logging for URL errors 2022-07-17 15:17:39 +02:00
vlofgren
89cca4dbff Better logging for rare parsing exception 2022-07-16 21:27:04 +02:00
vlofgren
80b3ac3dd8 Tweaking the URL block list to exclude git noise better 2022-07-16 21:19:13 +02:00
vlofgren
c71cc3d43a Fix overflow bugs in DictionaryHashMap that only surfaced without small RAM 2022-07-16 18:58:19 +02:00
vlofgren
661577b456 Add Fossil SCM commits to URL blocklist 2022-07-14 14:45:31 +02:00
vlofgren
20970a6161 Make processor more lenient toward quality, accept content-types which specify charset 2022-07-14 12:37:06 +02:00
vlofgren
e9a270c015 Merge branch 'master' into experimental 2022-07-14 10:28:01 +02:00
Viktor Lofgren
3197023834 Merge pull request 'Fix Memex Update Form Jank' (#33) from master into release (Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/33) 2022-07-14 10:23:38 +02:00
Viktor Lofgren
ac9a3b6a2a Merge branch 'release' into master 2022-07-14 10:23:17 +02:00
vlofgren
63d9c70667 Fix Memex Update Form Jank 2022-07-14 10:22:38 +02:00
vlofgren
fed2fa9397 Fix tiny NPE in converting 2022-07-11 23:25:03 +02:00
vlofgren
b0c40136ca Cleaned up HTML features code a bit. 2022-07-08 19:52:12 +02:00
vlofgren
7dea94d36d Cleaned up HTML features code a bit. 2022-07-08 17:25:16 +02:00
vlofgren
2b83e0d754 Block websites with "acceptable ads", as this seems a strong indicator the domain is either parked or spam. 2022-07-08 16:50:00 +02:00
vlofgren
7a4f5c27a6 Merge branch 'master' into experimental (Conflicts: marginalia_nu/src/e2e/resources/init.sh) 2022-07-08 16:37:37 +02:00
vlofgren
f3be865293 Allow query params for *some* path,param combinations, targeted at allowing the crawl of forums. 2022-07-08 16:36:09 +02:00
Viktor Lofgren
e219bd83f3 Merge pull request 'Memex refactored' (#32) from master into release (Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/32) 2022-07-08 12:38:30 +02:00
Viktor Lofgren
978311327e Merge branch 'release' into master 2022-07-08 12:36:18 +02:00
vlofgren
93c274f1d4 E2E-test for memex 2022-07-08 12:34:31 +02:00
vlofgren
853108028e WIP: Selective URL param strings 2022-07-04 14:47:16 +02:00
vlofgren
ee07c4d94a Refactored s/DictionaryWriter/KeywordLexicon/g to use significantly less memory and (potentially) support UTF-8. 2022-06-26 16:44:08 +02:00
vlofgren
e1b3477115 Experiments in keyword extraction 2022-06-23 17:02:28 +02:00
vlofgren
4516b23f90 Also grab alt text for images in a-tags in anchor text extractor 2022-06-22 13:12:44 +02:00
vlofgren
48e4aa3ee8 Clean up old junk from the WordPatterns class 2022-06-22 13:01:46 +02:00
vlofgren
35878c5102 Anchor text capture work-in-progress 2022-06-22 12:57:58 +02:00
vlofgren
1068694db6 Refactoring BTreeReader and binary search code 2022-06-20 12:35:58 +02:00
vlofgren
8139ab0d1d Refactoring BTreeReader and binary search code 2022-06-20 12:28:15 +02:00
vlofgren
b1eff0107c Refactoring BTreeReader and binary search code 2022-06-20 12:25:34 +02:00
vlofgren
c324c80efc Refactoring BTreeReader and binary search code 2022-06-20 12:04:06 +02:00
vlofgren
420b9bb7e0 Refactoring BTreeReader and binary search code 2022-06-20 12:02:01 +02:00