Commit Graph

101 Commits

Author SHA1 Message Date
vlofgren
80b3ac3dd8 Tweaking the URL block list to exclude git noise better 2022-07-16 21:19:13 +02:00
vlofgren
c71cc3d43a Fix overflow bugs in DictionaryHashMap that only surfaced without small RAM 2022-07-16 18:58:19 +02:00
vlofgren
661577b456 Add Fossil SCM commits to URL blocklist 2022-07-14 14:45:31 +02:00
vlofgren
20970a6161 Make processor more lenient toward quality, accept content-types which specify charset 2022-07-14 12:37:06 +02:00
vlofgren
e9a270c015 Merge branch 'master' into experimental 2022-07-14 10:28:01 +02:00
vlofgren
63d9c70667 Fix Memex Update Form Jank 2022-07-14 10:22:38 +02:00
vlofgren
fed2fa9397 Fix tiny NPE in converting 2022-07-11 23:25:03 +02:00
vlofgren
b0c40136ca Cleaned up HTML features code a bit. 2022-07-08 19:52:12 +02:00
vlofgren
7dea94d36d Cleaned up HTML features code a bit. 2022-07-08 17:25:16 +02:00
vlofgren
2b83e0d754 Block websites with "acceptable ads", as this seems a strong indicator the domain is either parked or spam. 2022-07-08 16:50:00 +02:00
vlofgren
7a4f5c27a6 Merge branch 'master' into experimental
# Conflicts:
#	marginalia_nu/src/e2e/resources/init.sh
2022-07-08 16:37:37 +02:00
vlofgren
f3be865293 Allow query params for *some* path,param combinations, targeted at allowing the crawl of forums. 2022-07-08 16:36:09 +02:00
Viktor Lofgren
978311327e Merge branch 'release' into master 2022-07-08 12:36:18 +02:00
vlofgren
93c274f1d4 E2E-test for memex 2022-07-08 12:34:31 +02:00
vlofgren
853108028e WIP: Selective URL param strings 2022-07-04 14:47:16 +02:00
vlofgren
ee07c4d94a Refactored s/DictionaryWriter/KeywordLexicon/g to use significantly less memory and (potentially) support UTF-8. 2022-06-26 16:44:08 +02:00
vlofgren
e1b3477115 Experiments in keyword extraction 2022-06-23 17:02:28 +02:00
vlofgren
4516b23f90 Also grab alt text for images in a-tags in anchor text extractor 2022-06-22 13:12:44 +02:00
vlofgren
48e4aa3ee8 Clean up old junk from the WordPatterns class 2022-06-22 13:01:46 +02:00
vlofgren
35878c5102 Anchor text capture work-in-progress 2022-06-22 12:57:58 +02:00
vlofgren
1068694db6 Refactoring BTreeReader and binary search code 2022-06-20 12:35:58 +02:00
vlofgren
8139ab0d1d Refactoring BTreeReader and binary search code 2022-06-20 12:28:15 +02:00
vlofgren
b1eff0107c Refactoring BTreeReader and binary search code 2022-06-20 12:25:34 +02:00
vlofgren
c324c80efc Refactoring BTreeReader and binary search code 2022-06-20 12:04:06 +02:00
vlofgren
420b9bb7e0 Refactoring BTreeReader and binary search code 2022-06-20 12:02:01 +02:00
vlofgren
f76af4ca79 Refactoring conversion 2022-06-18 15:54:58 +02:00
Viktor Lofgren
8df48d1c6d Fix front page typo (#29)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/29
2022-06-16 14:15:54 +02:00
Viktor Lofgren
b86ca895b0 Merge branch 'release' into master 2022-06-16 14:14:18 +02:00
vlofgren
63bdc28f79 Merge branch 'experimental' into experimental-new 2022-06-16 14:10:08 +02:00
vlofgren
2e55599850 Revert "Revert "Merge branch 'experimental' into master""
This reverts commit 81c77e7fcb.
2022-06-16 14:09:57 +02:00
vlofgren
082c9cc308 Fixing typo on front page.
(cherry picked from commit 5ef953ae3d)
2022-06-16 14:06:48 +02:00
vlofgren
5ef953ae3d Fixing typo on front page. 2022-06-16 14:01:49 +02:00
Viktor Lofgren
a3a6b40cc3 Changes to crawler (#28)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/28
2022-06-15 16:54:27 +02:00
vlofgren
8100bd4879 conflict 2022-06-15 16:53:19 +02:00
vlofgren
81c77e7fcb Revert "Merge branch 'experimental' into master"
This reverts commit c3a432fdd4, reversing
changes made to 1de63f225d.
2022-06-15 16:49:18 +02:00
Viktor Lofgren
c3a432fdd4 Merge branch 'experimental' into master 2022-06-15 16:44:23 +02:00
vlofgren
88908c203d Refactoring conversion 2022-06-15 16:34:03 +02:00
vlofgren
8ba80931a9 Restructuring index code: Move dictionary 2022-06-15 12:59:56 +02:00
vlofgren
89f894eae2 Merge branch 'master' into experimental 2022-06-14 17:55:36 +02:00
vlofgren
1de63f225d Added support for <base href>-style tags. 2022-06-14 17:55:14 +02:00
vlofgren
3e64003252 Re-add quality property to URLs 2022-06-09 22:19:29 +02:00
vlofgren
1ee0c2b572 Merge branch 'master' into experimental 2022-06-09 21:48:19 +02:00
vlofgren
389818c6c3 Make website url configurable for search engine redirects 2022-06-09 21:47:59 +02:00
vlofgren
65aee9419d Tidy up 2022-06-09 21:25:31 +02:00
vlofgren
495e6a1639 Use 64 bit path hash for EC_URL 2022-06-08 16:52:46 +02:00
vlofgren
2faaed3393 Fixed conversion bug SQL->EdgeDomainIndexingState 2022-06-08 16:52:33 +02:00
vlofgren
5e472fe121 WIP: Refactored ranking algorithms to separate database code from ranking code 2022-06-08 16:18:00 +02:00
vlofgren
026ba714b5 WIP: Database refactoring 2022-06-08 15:32:03 +02:00
vlofgren
c915664fcc WIP: Database refactoring 2022-06-07 22:34:53 +02:00
vlofgren
0e65384781 Make WMSA_HOME configurable through an environment variable. 2022-06-03 13:32:08 +02:00