vlofgren
|
930719583f
|
Experimental domain-searching feature
|
2022-07-28 18:18:35 +02:00 |
|
vlofgren
|
e0e9f7481e
|
Experimental domain-searching feature
|
2022-07-28 18:13:31 +02:00 |
|
vlofgren
|
43c7a6790a
|
Experimental domain-searching feature
|
2022-07-28 18:06:08 +02:00 |
|
vlofgren
|
3b3cca211d
|
Experimental domain-searching feature
|
2022-07-28 18:03:18 +02:00 |
|
vlofgren
|
bf328a0597
|
Experimental domain-searching feature
|
2022-07-28 17:58:45 +02:00 |
|
vlofgren
|
23b7a5fc22
|
NPE fix for index buckets that aren't loaded, experimental new query mode for domains.
|
2022-07-28 17:16:23 +02:00 |
|
vlofgren
|
793e917fe4
|
Fix exclude term duplication from js flag.
|
2022-07-28 14:57:09 +02:00 |
|
vlofgren
|
fd1f3f796e
|
Fix exclude term duplication from js flag.
|
2022-07-28 14:51:55 +02:00 |
|
vlofgren
|
667a80a3a0
|
Deduplicate domains in explore mode
|
2022-07-27 13:56:08 +02:00 |
|
vlofgren
|
c5c73610df
|
Tweak screenshot service
|
2022-07-26 17:10:14 +02:00 |
|
vlofgren
|
e4457de606
|
Update peruse algorithm, make resource store disk configurable.
|
2022-07-26 16:34:18 +02:00 |
|
vlofgren
|
f4bd754e37
|
Fix buggy madvise code, clean up preconverter
|
2022-07-26 13:51:55 +02:00 |
|
vlofgren
|
191b426797
|
Fix madvise code
|
2022-07-25 15:20:50 +02:00 |
|
vlofgren
|
da40172c68
|
Fix madvise code
|
2022-07-25 15:05:48 +02:00 |
|
vlofgren
|
daec6d9fc0
|
Fix overflow error
|
2022-07-25 12:43:03 +02:00 |
|
vlofgren
|
48812d8a4f
|
Store screenshots in database instead of in the filesystem.
|
2022-07-20 12:02:26 +02:00 |
|
vlofgren
|
6d1e2442b6
|
Store wiki articles in database instead of in the filesystem.
|
2022-07-20 11:16:21 +02:00 |
|
vlofgren
|
51d273e39d
|
Store wiki articles in database instead of in the filesystem.
|
2022-07-20 11:06:06 +02:00 |
|
vlofgren
|
fb91ce84f5
|
Reduce log spam during conversion
|
2022-07-19 05:08:06 +02:00 |
|
vlofgren
|
ba375ef769
|
Tweaks to keyword extraction
|
2022-07-19 05:02:44 +02:00 |
|
vlofgren
|
825dea839d
|
Tweaks to keyword extraction
|
2022-07-19 04:50:19 +02:00 |
|
vlofgren
|
64844e1db2
|
While some might ask, why would the server host IP be available as a search keyword? I only ask you hold my beer as I make it a reality.
|
2022-07-19 03:01:23 +02:00 |
|
vlofgren
|
e83a7435c6
|
Raise min document length a tad, we've been getting a bit too many almost-empty documents in the index.
|
2022-07-19 01:42:17 +02:00 |
|
vlofgren
|
9ae76a9264
|
Retire old and broken gemini support, needs to be re-implemented by having Memex talk to the API service rather than going directly to Search.
|
2022-07-18 18:36:39 +02:00 |
|
vlofgren
|
15bd54ef9f
|
Tidy up LoaderMain a bit
|
2022-07-18 17:22:22 +02:00 |
|
vlofgren
|
3d1031f8e4
|
Add lexicon dumping utility
|
2022-07-18 17:13:47 +02:00 |
|
vlofgren
|
9f7a28cbdb
|
Made search service more robust toward the case where Encyclopedia or Assistant is down
|
2022-07-17 22:21:41 +02:00 |
|
vlofgren
|
e22748e990
|
Better error logging for IO errors during conversion from configuration issues.
|
2022-07-17 22:08:06 +02:00 |
|
vlofgren
|
e30a20bb74
|
Fix bug in keyword loading when keywords have non-ASCII symbols, cleaner solution
|
2022-07-17 19:31:49 +02:00 |
|
vlofgren
|
f4966cf1f9
|
Fix bug in keyword loading when keywords have non-ASCII symbols
|
2022-07-17 15:18:16 +02:00 |
|
vlofgren
|
c5dbe269f7
|
Better logging for URL errors
|
2022-07-17 15:17:39 +02:00 |
|
vlofgren
|
89cca4dbff
|
Better logging for rare parsing exception
|
2022-07-16 21:27:04 +02:00 |
|
vlofgren
|
80b3ac3dd8
|
Tweaking the URL block list to exclude git noise better
|
2022-07-16 21:19:13 +02:00 |
|
vlofgren
|
c71cc3d43a
|
Fix overflow bugs in DictionaryHashMap that only surfaced without small RAM
|
2022-07-16 18:58:19 +02:00 |
|
vlofgren
|
661577b456
|
Add Fossil SCM commits to URL blocklist
|
2022-07-14 14:45:31 +02:00 |
|
vlofgren
|
20970a6161
|
Make processor more lenient toward quality, accept content-types which specify charset
|
2022-07-14 12:37:06 +02:00 |
|
vlofgren
|
e9a270c015
|
Merge branch 'master' into experimental
|
2022-07-14 10:28:01 +02:00 |
|
vlofgren
|
63d9c70667
|
Fix Memex Update Form Jank
|
2022-07-14 10:22:38 +02:00 |
|
vlofgren
|
fed2fa9397
|
Fix tiny NPE in converting
|
2022-07-11 23:25:03 +02:00 |
|
vlofgren
|
b0c40136ca
|
Cleaned up HTML features code a bit.
|
2022-07-08 19:52:12 +02:00 |
|
vlofgren
|
7dea94d36d
|
Cleaned up HTML features code a bit.
|
2022-07-08 17:25:16 +02:00 |
|
vlofgren
|
2b83e0d754
|
Block websites with "acceptable ads", as this seems a strong indicator the domain is either parked or spam.
|
2022-07-08 16:50:00 +02:00 |
|
vlofgren
|
7a4f5c27a6
|
Merge branch 'master' into experimental
# Conflicts:
# marginalia_nu/src/e2e/resources/init.sh
|
2022-07-08 16:37:37 +02:00 |
|
vlofgren
|
f3be865293
|
Allow query params for *some* path,param combinations, targeted at allowing the crawl of forums.
|
2022-07-08 16:36:09 +02:00 |
|
vlofgren
|
93c274f1d4
|
E2E-test for memex
|
2022-07-08 12:34:31 +02:00 |
|
vlofgren
|
853108028e
|
WIP: Selective URL param strings
|
2022-07-04 14:47:16 +02:00 |
|
vlofgren
|
ee07c4d94a
|
Refactored s/DictionaryWriter/KeywordLexicon/g to use significantly less memory and (potentially) support UTF-8.
|
2022-06-26 16:44:08 +02:00 |
|
vlofgren
|
e1b3477115
|
Experiments in keyword extraction
|
2022-06-23 17:02:28 +02:00 |
|
vlofgren
|
4516b23f90
|
Also grab alt text for images in a-tags in anchor text extractor
|
2022-06-22 13:12:44 +02:00 |
|
vlofgren
|
48e4aa3ee8
|
Clean up old junk from the WordPatterns class
|
2022-06-22 13:01:46 +02:00 |
|
vlofgren
|
35878c5102
|
Anchor text capture work-in-progress
|
2022-06-22 12:57:58 +02:00 |
|
vlofgren
|
1068694db6
|
Refactoring BTreeReader and binary search code
|
2022-06-20 12:35:58 +02:00 |
|
vlofgren
|
8139ab0d1d
|
Refactoring BTreeReader and binary search code
|
2022-06-20 12:28:15 +02:00 |
|
vlofgren
|
b1eff0107c
|
Refactoring BTreeReader and binary search code
|
2022-06-20 12:25:34 +02:00 |
|
vlofgren
|
c324c80efc
|
Refactoring BTreeReader and binary search code
|
2022-06-20 12:04:06 +02:00 |
|
vlofgren
|
420b9bb7e0
|
Refactoring BTreeReader and binary search code
|
2022-06-20 12:02:01 +02:00 |
|
vlofgren
|
f76af4ca79
|
Refactoring conversion
|
2022-06-18 15:54:58 +02:00 |
|
vlofgren
|
2e55599850
|
Revert "Revert "Merge branch 'experimental' into master""
This reverts commit 81c77e7fcb.
|
2022-06-16 14:09:57 +02:00 |
|
vlofgren
|
5ef953ae3d
|
Fixing typo on front page.
|
2022-06-16 14:01:49 +02:00 |
|
vlofgren
|
81c77e7fcb
|
Revert "Merge branch 'experimental' into master"
This reverts commit c3a432fdd4, reversing
changes made to 1de63f225d.
|
2022-06-15 16:49:18 +02:00 |
|
vlofgren
|
88908c203d
|
Refactoring conversion
|
2022-06-15 16:34:03 +02:00 |
|
vlofgren
|
8ba80931a9
|
Restructuring index code: Move dictionary
|
2022-06-15 12:59:56 +02:00 |
|
vlofgren
|
89f894eae2
|
Merge branch 'master' into experimental
|
2022-06-14 17:55:36 +02:00 |
|
vlofgren
|
1de63f225d
|
Added support for <base href>-style tags.
|
2022-06-14 17:55:14 +02:00 |
|
vlofgren
|
3e64003252
|
Re-add quality property to URLs
|
2022-06-09 22:19:29 +02:00 |
|
vlofgren
|
1ee0c2b572
|
Merge branch 'master' into experimental
|
2022-06-09 21:48:19 +02:00 |
|
vlofgren
|
389818c6c3
|
Make website url configurable for search engine redirects
|
2022-06-09 21:47:59 +02:00 |
|
vlofgren
|
65aee9419d
|
Tidy up
|
2022-06-09 21:25:31 +02:00 |
|
vlofgren
|
495e6a1639
|
Use 64 bit path hash for EC_URL
|
2022-06-08 16:52:46 +02:00 |
|
vlofgren
|
2faaed3393
|
Fixed conversion bug SQL->EdgeDomainIndexingState
|
2022-06-08 16:52:33 +02:00 |
|
vlofgren
|
5e472fe121
|
WIP: Refactored ranking algorithms to separate database code from ranking code
|
2022-06-08 16:18:00 +02:00 |
|
vlofgren
|
026ba714b5
|
WIP: Database refactoring
|
2022-06-08 15:32:03 +02:00 |
|
vlofgren
|
c915664fcc
|
WIP: Database refactoring
|
2022-06-07 22:34:53 +02:00 |
|
vlofgren
|
0e65384781
|
Make WMSA_HOME configurable through an environment variable.
|
2022-06-03 13:32:08 +02:00 |
|
vlofgren
|
d8d0c0e5b2
|
Make User-agent configurable.
|
2022-06-01 14:46:51 +02:00 |
|
vlofgren
|
80dad31753
|
Merge branch 'release'
# Conflicts:
# marginalia_nu/src/main/java/nu/marginalia/wmsa/edge/index/service/query/IndexSearchBudget.java
|
2022-05-31 14:37:49 +02:00 |
|
vlofgren
|
c0e0579c8e
|
Updated index.html for search engine to reflect changes in project status.
|
2022-05-31 14:35:05 +02:00 |
|
vlofgren
|
046b92e0bb
|
Cleaning up index code
|
2022-05-31 14:35:05 +02:00 |
|
vlofgren
|
ab97044302
|
Fix deprecation warning for Bucket4J
|
2022-05-31 13:40:21 +02:00 |
|
Viktor Lofgren
|
9474f39225
|
Add time-based timeout to queries (#24)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/24
|
2022-05-31 13:38:26 +02:00 |
|
vlofgren
|
ec87c0689f
|
Added timeout to queries
|
2022-05-31 13:37:24 +02:00 |
|
Viktor Lofgren
|
fcd2708fe3
|
Memory alignment tweaks for better performance (#22)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/22
|
2022-05-30 23:42:40 +02:00 |
|
vlofgren
|
fc070f2e0e
|
Fixed memory alignment for MMFL
|
2022-05-30 23:41:16 +02:00 |
|
Viktor Lofgren
|
c7a095e497
|
Madvise tweaks (#21)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/21
|
2022-05-30 23:22:05 +02:00 |
|
vlofgren
|
6894121859
|
Tweaked madvise for index to be faster
|
2022-05-30 23:19:55 +02:00 |
|
Viktor Lofgren
|
44bee371e6
|
Actually add the commit with the previously mentioned instrumentation (#18)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/18
|
2022-05-30 21:12:15 +02:00 |
|
vlofgren
|
dc963d3e44
|
Added instrumentation for search queries
|
2022-05-30 21:11:19 +02:00 |
|
Viktor Lofgren
|
c201201c2d
|
Instrumentation for search + index madvise tweaks (#17)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/17
|
2022-05-30 21:02:53 +02:00 |
|
vlofgren
|
730e964475
|
Tweaked madvise for index to be faster
|
2022-05-30 21:01:58 +02:00 |
|
vlofgren
|
275e42197c
|
Added rudimentary !bang-support
|
2022-05-30 17:26:51 +02:00 |
|
vlofgren
|
25776a9718
|
Refactored EdgeSearchService and broke functions like define:, browse:, site: etc. into separate classes.
|
2022-05-30 16:40:59 +02:00 |
|
vlofgren
|
41b686955f
|
API-service was accidentally moved into a subdirectory of Auth
|
2022-05-30 12:46:30 +02:00 |
|
vlofgren
|
5a1ec53a84
|
WIP: Encyclopedia service
|
2022-05-28 14:35:32 +02:00 |
|
vlofgren
|
0acdd5b660
|
Switch to beefier docker image to fix 'Could not initialize class sun.awt.X11FontManager' for math rendering in Encyclopedia test.
|
2022-05-28 13:59:50 +02:00 |
|
vlofgren
|
ac9064096d
|
Rewrote Encyclopedia loader, added functioning E2E test for new encyclopedia service
|
2022-05-28 13:51:29 +02:00 |
|
vlofgren
|
ad4521da9e
|
WIP: Killing off Archive service, adding new Encyclopedia service consisting largely of what Archive was and a few features from Assistant.
|
2022-05-28 00:16:31 +02:00 |
|
vlofgren
|
e7b4ac0d34
|
WIP: Killing off Archive service, adding new Encyclopedia service consisting largely of what Archive was and a few features from Assistant.
|
2022-05-27 23:45:29 +02:00 |
|
vlofgren
|
61ef2b06b0
|
Move IP Location database out of classpath and into WMSA_HOME/data
|
2022-05-27 14:27:44 +02:00 |
|
vlofgren
|
056dec5506
|
Fixings
|
2022-05-27 13:50:06 +02:00 |
|
vlofgren
|
014a4c8076
|
Deleted old JMH benchmarks that weren't used for anything useful, fixed tests
|
2022-05-25 20:43:30 +02:00 |
|
vlofgren
|
ee6471dcaf
|
Retired the Data-Store, Director and Crawler services
|
2022-05-25 20:38:57 +02:00 |
|
Viktor Lofgren
|
cd3cae0ad5
|
Create first E2E-test with TestContainers
|
2022-05-25 18:02:19 +02:00 |
|
vlofgren
|
b45f68fedd
|
Fix build problems on jdk-18 machines
|
2022-05-21 15:14:47 +02:00 |
|
vlofgren
|
c4017c23ed
|
Fixing some compiler warnings + a rotten test
|
2022-05-19 22:02:01 +02:00 |
|
vlofgren
|
ccc5a07081
|
Extracted ranking algorithms to separate directory and made them configurable
|
2022-05-19 19:13:41 +02:00 |
|
vlofgren
|
1012de3135
|
Moving tool class to tools dir
|
2022-05-19 18:30:57 +02:00 |
|
vlofgren
|
5c04c0843b
|
Removing unnecessary files
|
2022-05-19 18:30:41 +02:00 |
|
vlofgren
|
74ae97f8f4
|
Added test util for the tests to remove hard coding of LanguageModels.
|
2022-05-19 18:05:10 +02:00 |
|
vlofgren
|
c24b978c51
|
first commit
|
2022-05-19 17:45:26 +02:00 |
|