Commit Graph

229 Commits

Author SHA1 Message Date
vlofgren
13c8305dc2 Exclude some guaranteed-to-be-noncanonical forum URLs. 2022-09-16 17:12:07 +02:00
vlofgren
324c05fc42 Exclude some guaranteed-to-be-noncanonical forum URLs. 2022-09-16 17:01:06 +02:00
vlofgren
123603b0a3 Some small crawler tweaks, plus a test for examining crawler behavior through a simulated server. 2022-09-16 16:59:06 +02:00
vlofgren
5e67391829 Some small crawler tweaks, plus a test for examining crawler behavior through a simulated server. 2022-09-16 16:52:33 +02:00
vlofgren
23a7d91d5b Better index metrics, fix bug where domain results show up with advisory search terms. 2022-09-15 17:04:15 +02:00
vlofgren
9558077808 UX improvements for "show more results". 2022-09-15 15:56:20 +02:00
vlofgren
2e740bb7bd Add advisory search terms that do not affect ranking. 2022-09-14 16:31:37 +02:00
vlofgren
680693b6db Fix old broken domain search. 2022-09-13 20:57:04 +02:00
vlofgren
8d15ddbab0 Tune query timeouts and fetch window to speed up queries a bit. 2022-09-13 18:50:04 +02:00
vlofgren
6df02f7528 HyperLogLog-tool for figuring out how big the index is. 2022-09-13 18:27:36 +02:00
vlofgren
10d1307dd6 Fix a query variant creation bug that caused the search engine to sometimes drop important words from a query. 2022-09-12 23:32:49 +02:00
vlofgren
297f8e4cd7 Fixing a bug where search terms would sometimes be ignored, tweaking timeouts, adding debug feature for the search service. 2022-09-12 21:08:53 +02:00
vlofgren
7749ce645a Further cleaning. 2022-09-12 10:39:02 +02:00
vlofgren
971089bad3 Cleaning up. 2022-09-11 11:58:39 +02:00
vlofgren
eaef93f4ae Cleaning up and adding better error messages. 2022-09-11 11:31:22 +02:00
vlofgren
fbe17b62ed Giga-refactor of the index query logic 2022-09-10 20:28:45 +02:00
vlofgren
c6976acdfc WIP Loading 2022-09-05 17:51:49 +02:00
vlofgren
c912d3127d Better hints. 2022-09-03 18:35:04 +02:00
vlofgren
2e3d95bcb1 Refactoring and cleanup 2022-09-03 17:32:53 +02:00
vlofgren
5a4d41d414 Refactoring and cleanup, WIP 2022-09-03 15:20:26 +02:00
vlofgren
26e0cfec3a Preparation for conversion 2022-09-02 17:45:03 +02:00
vlofgren
ccf79f47b0 Preparation for conversion 2022-09-02 14:51:11 +02:00
vlofgren
a04d27692e Merge branch 'master' into experimental-22-08 2022-09-02 11:29:30 +02:00
vlofgren
578ecfb27d CSS tweaks for search. 2022-09-02 10:58:07 +02:00
vlofgren
3fd48e0e53 Cleaning the code a bit, fix URL loading bug with multiple fragments in URL 2022-09-02 10:41:02 +02:00
vlofgren
5dd61387bf Merge branch 'master' into experimental-22-08 2022-09-02 09:39:20 +02:00
vlofgren
5b8dc18d81 Fix copy errors in index.hdb 2022-09-02 09:35:19 +02:00
vlofgren
9270230065 WIP logic for detecting significant images in the body of a website. 2022-09-02 09:35:19 +02:00
vlofgren
5f993c72dd Tweaks for search result relevance 2022-09-02 09:34:20 +02:00
vlofgren
813399401e Tweaks for search result relevance 2022-08-29 18:01:07 +02:00
vlofgren
3f2854a5e9 WIP n-gram loader 2022-08-27 20:30:18 +02:00
vlofgren
0282156979 WIP n-gram loader 2022-08-27 19:19:16 +02:00
vlofgren
c865d6c6b2 Change TF-IDF normalization to reduce the amount of not-so-relevant matches. 2022-08-27 11:38:29 +02:00
vlofgren
f4ad7aaf33 Remove accidental import of an unused library, fix build on jdk18-systems. 2022-08-26 20:48:44 +02:00
vlofgren
3200c36072 Experimental changes for 22-08/09 update. 2022-08-26 16:08:46 +02:00
vlofgren
db056be06a WIP logic for detecting significant images in the body of a website. 2022-08-24 22:05:32 +02:00
vlofgren
c6db2aad48 Fixed stylesheet for search to make random websites button more prominent. 2022-08-24 19:29:00 +02:00
vlofgren
69b9f93dc6 Fixed stylesheet for search to make random websites button more prominent. 2022-08-24 19:28:06 +02:00
vlofgren
9cf78d6929 Bugfixes for the crawler: Better charset support, better 429 handling, better error handling, fixed resource leak. 2022-08-24 19:27:46 +02:00
vlofgren
407ec39c0c Use links index for site suggestions. 2022-08-24 04:41:26 +02:00
vlofgren
e1a726babf Use links index for site suggestions. 2022-08-24 03:50:08 +02:00
vlofgren
4c8c8f5140 Use links index for site suggestions. 2022-08-24 03:45:09 +02:00
vlofgren
961ef2a930 Serve assets from search service instead of resource-store, dynamically render index for future goodies, css tweaks. 2022-08-24 00:41:20 +02:00
vlofgren
ee0580273e Serve assets from search service instead of resource-store, dynamically render index for future goodies, css tweaks. 2022-08-24 00:35:22 +02:00
vlofgren
db4cf70784 Reduce resource consumption during crawling, reduce TIME_WAIT sockets with a custom socket factory. 2022-08-23 13:26:37 +02:00
vlofgren
6fc72b3eb8 Clean up feature extraction, fix misidentification of 'application/ld+json' as javascript. 2022-08-23 00:48:48 +02:00
vlofgren
6e2fdb7a77 Reduce crawling memory consumption, increase crawling threads, dynamically adjust crawling rate. 2022-08-23 00:35:45 +02:00
vlofgren
fc9d9d1bad And revert the previous change as my IP got kicked back to ol' reliable '81.170.128.52' 2022-08-22 17:32:56 +02:00
vlofgren
087ad0124d Update crawler IP file to reflect the fact that the IP changed. 2022-08-22 13:04:07 +02:00
vlofgren
095ed7c6c4 Tweak CSS a tiny bit to add more padding to the right of info cells. 2022-08-19 16:07:26 +02:00