Commit Graph

282 Commits

Author SHA1 Message Date
vlofgren
fc9d9d1bad And revert the previous change as my IP got kicked back to ol' reliable '81.170.128.52' 2022-08-22 17:32:56 +02:00
vlofgren
087ad0124d Update crawler IP file to reflect the fact that the IP changed. 2022-08-22 13:04:07 +02:00
vlofgren
095ed7c6c4 Tweak CSS a tiny bit to add more padding to the right of info cells. 2022-08-19 16:07:26 +02:00
vlofgren
2adbe5f74c Update publicity roll. 2022-08-19 15:55:01 +02:00
vlofgren
56987f6664 Update publicity roll. 2022-08-19 15:50:15 +02:00
vlofgren
7567890708 Update publicity roll. 2022-08-19 15:49:52 +02:00
vlofgren
ede62f2515 Retain cookies for domain. 2022-08-18 20:44:44 +02:00
vlofgren
a1eb8375a2 Exclude wp-content/uploads from crawling 2022-08-18 19:05:07 +02:00
vlofgren
340d80f6c7 Don't try to fetch text/css and text/javascript-files. Refactor fetcher to separate content type sniffing logic. Clean up crawler a smidge. 2022-08-18 18:40:34 +02:00
vlofgren
6b6cd56e3a Don't try to fetch text/css and text/javascript-files. Refactor fetcher to separate content type sniffing logic. Clean up crawler a smidge. 2022-08-18 18:25:12 +02:00
vlofgren
4afccdc536 Don't try to fetch ftp://, webcal://, etc. 2022-08-18 17:25:22 +02:00
vlofgren
5cd552458a Fix fragment bug. 2022-08-18 16:47:59 +02:00
vlofgren
2bc81e8e9a Fix fragment bug. 2022-08-18 16:45:51 +02:00
vlofgren
a034e3245e Fix fragment bug. 2022-08-18 16:43:34 +02:00
vlofgren
0bac422091 Fix bug in redirect handling that caused the crawler to not index some documents. 2022-08-17 00:51:10 +02:00
vlofgren
ce9abc00dc Fix bug in redirect handling that caused the crawler to not index some documents. 2022-08-17 00:49:32 +02:00
vlofgren
5cfef610b0 Preparations for new crawl round 2022-08-16 22:48:16 +02:00
vlofgren
123675d73b More caching 2022-08-15 15:39:10 +02:00
vlofgren
ceacfa5917 Tune down log spam 2022-08-15 15:37:26 +02:00
vlofgren
f6b3e75cee Optimize search service by removing weird query spam 2022-08-15 15:27:22 +02:00
vlofgren
beafdfda9c Index optimizations that should reduce small object churn and IOPS a bit. 2022-08-15 13:58:18 +02:00
vlofgren
460dd098b0 Add advertisement Feature to search,
Add adblock simulation to processor,
Add filename and email address extraction to processor.
2022-08-12 17:12:16 +02:00
vlofgren
30d2a707ff Add advertisement Feature to search,
Add adblock simulation to processor,
Add filename and email address extraction to processor.
2022-08-12 13:50:18 +02:00
vlofgren
0e28ff5a72 Add features to suggestions 2022-08-10 21:32:19 +02:00
vlofgren
ba9e0d9829 Add features to suggestions 2022-08-10 19:50:14 +02:00
vlofgren
ffde8c8305 Faster crawling 2022-08-10 18:46:13 +02:00
vlofgren
ce09fce639 Faster crawling 2022-08-10 17:03:58 +02:00
vlofgren
9c6e3b1772 Topical detection (experimental),
Adblock simulation (experimental)
2022-08-10 15:04:29 +02:00
vlofgren
d7167f956e Adjust search result sort order to penalize scriptiness a bit 2022-08-08 18:59:57 +02:00
vlofgren
0f59675f7c Clean up preconverter code 2022-08-08 18:08:18 +02:00
vlofgren
2af2c50f34 Clean up preconverter code 2022-08-08 15:29:47 +02:00
vlofgren
2bfde9d030 Recipe detection 2022-08-08 15:18:18 +02:00
vlofgren
0dfcf2f7af Recipe detection 2022-08-08 15:18:07 +02:00
vlofgren
5c952d48f4 Speed up conversion 2022-08-08 15:18:07 +02:00
vlofgren
e39320d51d Add support for additional random sets 2022-08-07 17:51:35 +02:00
vlofgren
b9bbda0e2e Add support for additional random sets 2022-08-07 17:49:32 +02:00
vlofgren
743ba23f55 Add support for additional random sets 2022-08-07 17:46:30 +02:00
vlofgren
5fbafa63c1 Add better fallbacks to summary extractor 2022-08-06 15:17:00 +02:00
vlofgren
e22fde69ed Screenshot bot 2022-08-04 21:14:17 +02:00
vlofgren
a6a6bdb013 Test rewarding linked terms. 2022-08-02 17:52:24 +02:00
vlofgren
6e68f930a6 Test rewarding linked terms. 2022-08-02 17:50:25 +02:00
vlofgren
0b61910b84 Test rewarding linked terms. 2022-08-02 17:43:21 +02:00
vlofgren
487d74592d Test rewarding linked terms. 2022-08-02 17:38:18 +02:00
vlofgren
ae2419e2a5 Reduced max domain results for search command,
made it easier to configure.
2022-08-02 12:23:24 +02:00
vlofgren
c9eef92291 Updated opensearch def with hint to use api for automation. 2022-08-02 12:23:24 +02:00
vlofgren
3ccb1c6218 Simplified query builders, preparation for a-tag inclusion. 2022-08-01 20:29:15 +02:00
vlofgren
9a4183a481 A-tags loader 2022-08-01 20:05:55 +02:00
vlofgren
9a6c8339d0 Clean up DAO 2022-08-01 20:05:21 +02:00
vlofgren
7f985c0a57 Experimental domain-searching feature 2022-07-28 21:33:36 +02:00
vlofgren
e17d3015dc Experimental domain-searching feature 2022-07-28 21:29:34 +02:00
vlofgren
8428198e61 Experimental domain-searching feature 2022-07-28 21:09:48 +02:00
vlofgren
c75c1db475 Experimental domain-searching feature 2022-07-28 20:50:40 +02:00
vlofgren
f027a72df9 Experimental domain-searching feature 2022-07-28 20:43:45 +02:00
vlofgren
449bb76c83 Experimental domain-searching feature 2022-07-28 20:26:07 +02:00
vlofgren
913599426f Experimental domain-searching feature 2022-07-28 20:25:57 +02:00
vlofgren
145b02a736 Experimental domain-searching feature 2022-07-28 20:22:38 +02:00
vlofgren
ea5dbb301e Experimental domain-searching feature 2022-07-28 20:06:51 +02:00
vlofgren
3916c05a02 Experimental domain-searching feature 2022-07-28 19:50:02 +02:00
vlofgren
6a2b199604 Experimental domain-searching feature 2022-07-28 19:45:44 +02:00
vlofgren
09aa217451 Experimental domain-searching feature 2022-07-28 19:45:03 +02:00
vlofgren
f1f4674e1c Experimental domain-searching feature 2022-07-28 19:29:03 +02:00
vlofgren
ea312c7b61 Experimental domain-searching feature 2022-07-28 19:26:19 +02:00
vlofgren
806c81a3a3 Experimental domain-searching feature 2022-07-28 19:18:46 +02:00
vlofgren
27222fa192 Experimental domain-searching feature 2022-07-28 19:14:53 +02:00
vlofgren
29a2bc1d9a Experimental domain-searching feature 2022-07-28 19:05:53 +02:00
vlofgren
14a6b60945 Experimental domain-searching feature 2022-07-28 19:02:27 +02:00
vlofgren
e3b2b36f03 Experimental domain-searching feature 2022-07-28 19:01:54 +02:00
vlofgren
e9db8b6c1d Experimental domain-searching feature 2022-07-28 18:58:54 +02:00
vlofgren
e68cee5b58 Experimental domain-searching feature 2022-07-28 18:48:49 +02:00
vlofgren
b49ebda5dd Experimental domain-searching feature 2022-07-28 18:46:06 +02:00
vlofgren
81c72b186b Experimental domain-searching feature 2022-07-28 18:37:10 +02:00
vlofgren
ada11eb849 Experimental domain-searching feature 2022-07-28 18:34:01 +02:00
vlofgren
55b549903f Experimental domain-searching feature 2022-07-28 18:34:01 +02:00
vlofgren
930719583f Experimental domain-searching feature 2022-07-28 18:18:35 +02:00
vlofgren
e0e9f7481e Experimental domain-searching feature 2022-07-28 18:13:31 +02:00
vlofgren
43c7a6790a Experimental domain-searching feature 2022-07-28 18:06:08 +02:00
vlofgren
3b3cca211d Experimental domain-searching feature 2022-07-28 18:03:18 +02:00
vlofgren
bf328a0597 Experimental domain-searching feature 2022-07-28 17:58:45 +02:00
vlofgren
23b7a5fc22 NPE fix for index buckets that aren't loaded, experimental new query mode for domains. 2022-07-28 17:16:23 +02:00
vlofgren
793e917fe4 Fix exclude term duplication from js flag. 2022-07-28 14:57:09 +02:00
vlofgren
fd1f3f796e Fix exclude term duplication from js flag. 2022-07-28 14:51:55 +02:00
vlofgren
667a80a3a0 Deduplicate domains in explore mode 2022-07-27 13:56:08 +02:00
vlofgren
c5c73610df Tweak screenshot service 2022-07-26 17:10:14 +02:00
vlofgren
e4457de606 Update peruse algorithm, make resource store disk configurable. 2022-07-26 16:34:18 +02:00
vlofgren
f4bd754e37 Fix buggy madvise code, clean up preconverter 2022-07-26 13:51:55 +02:00
vlofgren
191b426797 Fix madvise code 2022-07-25 15:20:50 +02:00
vlofgren
da40172c68 Fix madvise code 2022-07-25 15:05:48 +02:00
vlofgren
daec6d9fc0 Fix overflow error 2022-07-25 12:43:03 +02:00
vlofgren
48812d8a4f Store screenshots in database instead of in the filesystem. 2022-07-20 12:02:26 +02:00
vlofgren
6d1e2442b6 Store wiki articles in database instead of in the filesystem. 2022-07-20 11:16:21 +02:00
vlofgren
51d273e39d Store wiki articles in database instead of in the filesystem. 2022-07-20 11:06:06 +02:00
vlofgren
fb91ce84f5 Reduce log spam during conversion 2022-07-19 05:08:06 +02:00
vlofgren
ba375ef769 Tweaks to keyword extraction 2022-07-19 05:02:44 +02:00
vlofgren
825dea839d Tweaks to keyword extraction 2022-07-19 04:50:19 +02:00
vlofgren
64844e1db2 While some might ask, why would the server host IP be available as a search keyword? I only ask you hold my beer as I make it a reality. 2022-07-19 03:01:23 +02:00
vlofgren
e83a7435c6 Raise min document length a tad, we've been getting a few too many almost-empty documents in the index. 2022-07-19 01:42:17 +02:00
vlofgren
9ae76a9264 Retire old and broken gemini support, needs to be re-implemented by having Memex talk to the API service rather than going directly to Search. 2022-07-18 18:36:39 +02:00
vlofgren
15bd54ef9f Tidy up LoaderMain a bit 2022-07-18 17:22:22 +02:00
vlofgren
3d1031f8e4 Add lexicon dumping utility 2022-07-18 17:13:47 +02:00
vlofgren
9f7a28cbdb Made search service more robust toward the case where Encyclopedia or Assistant is down 2022-07-17 22:21:41 +02:00
vlofgren
e22748e990 Better error logging for IO errors during conversion caused by configuration issues. 2022-07-17 22:08:06 +02:00
vlofgren
e30a20bb74 Fix bug in keyword loading when keywords have non-ASCII symbols, cleaner solution 2022-07-17 19:31:49 +02:00
vlofgren
f4966cf1f9 Fix bug in keyword loading when keywords have non-ASCII symbols 2022-07-17 15:18:16 +02:00
vlofgren
c5dbe269f7 Better logging for URL errors 2022-07-17 15:17:39 +02:00
vlofgren
89cca4dbff Better logging for rare parsing exception 2022-07-16 21:27:04 +02:00
vlofgren
80b3ac3dd8 Tweaking the URL block list to exclude git noise better 2022-07-16 21:19:13 +02:00
vlofgren
c71cc3d43a Fix overflow bugs in DictionaryHashMap that only surfaced without small RAM 2022-07-16 18:58:19 +02:00
vlofgren
661577b456 Add Fossil SCM commits to URL blocklist 2022-07-14 14:45:31 +02:00
vlofgren
20970a6161 Make processor more lenient toward quality, accept content-types which specify charset 2022-07-14 12:37:06 +02:00
vlofgren
e9a270c015 Merge branch 'master' into experimental 2022-07-14 10:28:01 +02:00
vlofgren
63d9c70667 Fix Memex Update Form Jank 2022-07-14 10:22:38 +02:00
vlofgren
fed2fa9397 Fix tiny NPE in converting 2022-07-11 23:25:03 +02:00
vlofgren
b0c40136ca Cleaned up HTML features code a bit. 2022-07-08 19:52:12 +02:00
vlofgren
7dea94d36d Cleaned up HTML features code a bit. 2022-07-08 17:25:16 +02:00
vlofgren
2b83e0d754 Block websites with "acceptable ads", as this seems a strong indicator the domain is either parked or spam. 2022-07-08 16:50:00 +02:00
vlofgren
7a4f5c27a6 Merge branch 'master' into experimental
# Conflicts:
#	marginalia_nu/src/e2e/resources/init.sh
2022-07-08 16:37:37 +02:00
vlofgren
f3be865293 Allow query params for *some* path,param combinations, targeted at allowing the crawl of forums. 2022-07-08 16:36:09 +02:00
vlofgren
93c274f1d4 E2E-test for memex 2022-07-08 12:34:31 +02:00
vlofgren
853108028e WIP: Selective URL param strings 2022-07-04 14:47:16 +02:00
vlofgren
ee07c4d94a Refactored s/DictionaryWriter/KeywordLexicon/g to use significantly less memory and (potentially) support UTF-8. 2022-06-26 16:44:08 +02:00
vlofgren
e1b3477115 Experiments in keyword extraction 2022-06-23 17:02:28 +02:00
vlofgren
4516b23f90 Also grab alt text for images in a-tags in anchor text extractor 2022-06-22 13:12:44 +02:00
vlofgren
48e4aa3ee8 Clean up old junk from the WordPatterns class 2022-06-22 13:01:46 +02:00
vlofgren
35878c5102 Anchor text capture work-in-progress 2022-06-22 12:57:58 +02:00
vlofgren
1068694db6 Refactoring BTreeReader and binary search code 2022-06-20 12:35:58 +02:00
vlofgren
8139ab0d1d Refactoring BTreeReader and binary search code 2022-06-20 12:28:15 +02:00
vlofgren
b1eff0107c Refactoring BTreeReader and binary search code 2022-06-20 12:25:34 +02:00
vlofgren
c324c80efc Refactoring BTreeReader and binary search code 2022-06-20 12:04:06 +02:00
vlofgren
420b9bb7e0 Refactoring BTreeReader and binary search code 2022-06-20 12:02:01 +02:00
vlofgren
f76af4ca79 Refactoring conversion 2022-06-18 15:54:58 +02:00
vlofgren
2e55599850 Revert "Revert "Merge branch 'experimental' into master""
This reverts commit 81c77e7fcb.
2022-06-16 14:09:57 +02:00
vlofgren
5ef953ae3d Fixing typo on front page. 2022-06-16 14:01:49 +02:00
vlofgren
81c77e7fcb Revert "Merge branch 'experimental' into master"
This reverts commit c3a432fdd4, reversing
changes made to 1de63f225d.
2022-06-15 16:49:18 +02:00
vlofgren
88908c203d Refactoring conversion 2022-06-15 16:34:03 +02:00
vlofgren
8ba80931a9 Restructuring index code: Move dictionary 2022-06-15 12:59:56 +02:00
vlofgren
89f894eae2 Merge branch 'master' into experimental 2022-06-14 17:55:36 +02:00
vlofgren
1de63f225d Added support for <base href>-style tags. 2022-06-14 17:55:14 +02:00
vlofgren
3e64003252 Re-add quality property to URLs 2022-06-09 22:19:29 +02:00
vlofgren
1ee0c2b572 Merge branch 'master' into experimental 2022-06-09 21:48:19 +02:00
vlofgren
389818c6c3 Make website url configurable for search engine redirects 2022-06-09 21:47:59 +02:00
vlofgren
65aee9419d Tidy up 2022-06-09 21:25:31 +02:00
vlofgren
495e6a1639 Use 64 bit path hash for EC_URL 2022-06-08 16:52:46 +02:00
vlofgren
2faaed3393 Fixed conversion bug SQL->EdgeDomainIndexingState 2022-06-08 16:52:33 +02:00
vlofgren
5e472fe121 WIP: Refactored ranking algorithms to separate database code from ranking code 2022-06-08 16:18:00 +02:00
vlofgren
026ba714b5 WIP: Database refactoring 2022-06-08 15:32:03 +02:00
vlofgren
c915664fcc WIP: Database refactoring 2022-06-07 22:34:53 +02:00
vlofgren
0e65384781 Make WMSA_HOME configurable through an environment variable. 2022-06-03 13:32:08 +02:00
vlofgren
d8d0c0e5b2 Make User-agent configurable. 2022-06-01 14:46:51 +02:00
vlofgren
80dad31753 Merge branch 'release'
# Conflicts:
#	marginalia_nu/src/main/java/nu/marginalia/wmsa/edge/index/service/query/IndexSearchBudget.java
2022-05-31 14:37:49 +02:00
vlofgren
c0e0579c8e Updated index.html for search engine to reflect changes in project status. 2022-05-31 14:35:05 +02:00