Viktor Lofgren
|
df49ccbe59
|
October Release (#118)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Co-authored-by: vlofgren <vlofgren@marginalia.nu>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/118
|
2022-10-19 15:00:04 +02:00 |
|
vlofgren
|
9a7d052c43
|
Adjustments to anchor tag extraction.
|
2022-09-18 10:59:16 +02:00 |
|
vlofgren
|
179d54d50a
|
Processor fixes: Excluding phpinfo()-pages, mastodon feeds.
|
2022-09-16 18:05:54 +02:00 |
|
vlofgren
|
13c8305dc2
|
Exclude some guaranteed-to-be-noncanonical forum URLs.
|
2022-09-16 17:12:07 +02:00 |
|
vlofgren
|
324c05fc42
|
Exclude some guaranteed-to-be-noncanonical forum URLs.
|
2022-09-16 17:01:06 +02:00 |
|
vlofgren
|
123603b0a3
|
Some small crawler tweaks, plus a test for examining crawler behavior through a simulated server.
|
2022-09-16 16:59:06 +02:00 |
|
vlofgren
|
5e67391829
|
Some small crawler tweaks, plus a test for examining crawler behavior through a simulated server.
|
2022-09-16 16:52:33 +02:00 |
|
vlofgren
|
23a7d91d5b
|
Better index metrics, fix bug where domain results show up with advisory search terms.
|
2022-09-15 17:04:15 +02:00 |
|
vlofgren
|
9558077808
|
UX improvements for "show more results".
|
2022-09-15 15:56:20 +02:00 |
|
vlofgren
|
2e740bb7bd
|
Add advisory search terms that do not affect ranking.
|
2022-09-14 16:31:37 +02:00 |
|
vlofgren
|
680693b6db
|
Fix old broken domain search.
|
2022-09-13 20:57:04 +02:00 |
|
vlofgren
|
8d15ddbab0
|
Tune query timeouts and fetch window to speed up queries a bit.
|
2022-09-13 18:50:04 +02:00 |
|
vlofgren
|
6df02f7528
|
HyperLogLog-tool for figuring out how big the index is.
|
2022-09-13 18:27:36 +02:00 |
|
vlofgren
|
10d1307dd6
|
Fix a query variant creation bug that caused the search engine to sometimes drop important words from a query.
|
2022-09-12 23:32:49 +02:00 |
|
vlofgren
|
297f8e4cd7
|
Fixing a bug where search terms would sometimes be ignored, tweaking timeouts, adding debug feature for the search service.
|
2022-09-12 21:08:53 +02:00 |
|
vlofgren
|
7749ce645a
|
Further cleaning
|
2022-09-12 10:39:02 +02:00 |
|
vlofgren
|
971089bad3
|
Cleaning up.
|
2022-09-11 11:58:39 +02:00 |
|
vlofgren
|
eaef93f4ae
|
Cleaning up and adding better error messages.
|
2022-09-11 11:31:22 +02:00 |
|
vlofgren
|
fbe17b62ed
|
Giga-refactor of the index query logic
|
2022-09-10 20:28:45 +02:00 |
|
vlofgren
|
c6976acdfc
|
WIP Loading
|
2022-09-05 17:51:49 +02:00 |
|
vlofgren
|
c912d3127d
|
Better hints.
|
2022-09-03 18:35:04 +02:00 |
|
vlofgren
|
2e3d95bcb1
|
Refactoring and cleanup
|
2022-09-03 17:32:53 +02:00 |
|
vlofgren
|
5a4d41d414
|
Refactoring and cleanup, WIP
|
2022-09-03 15:20:26 +02:00 |
|
vlofgren
|
26e0cfec3a
|
Preparation for conversion
|
2022-09-02 17:45:03 +02:00 |
|
vlofgren
|
ccf79f47b0
|
Preparation for conversion
|
2022-09-02 14:51:11 +02:00 |
|
vlofgren
|
a04d27692e
|
Merge branch 'master' into experimental-22-08
|
2022-09-02 11:29:30 +02:00 |
|
vlofgren
|
578ecfb27d
|
CSS tweaks for search.
|
2022-09-02 10:58:07 +02:00 |
|
vlofgren
|
3fd48e0e53
|
Cleaning the code a bit, fix URL loading bug with multiple fragments in URL
|
2022-09-02 10:41:02 +02:00 |
|
vlofgren
|
5dd61387bf
|
Merge branch 'master' into experimental-22-08
|
2022-09-02 09:39:20 +02:00 |
|
vlofgren
|
5b8dc18d81
|
Fix copy errors in index.hdb
|
2022-09-02 09:35:19 +02:00 |
|
vlofgren
|
9270230065
|
WIP logic for detecting significant images in the body of a website.
|
2022-09-02 09:35:19 +02:00 |
|
vlofgren
|
5f993c72dd
|
Tweaks for search result relevance
|
2022-09-02 09:34:20 +02:00 |
|
vlofgren
|
813399401e
|
Tweaks for search result relevance
|
2022-08-29 18:01:07 +02:00 |
|
vlofgren
|
3f2854a5e9
|
WIP n-gram loader
|
2022-08-27 20:30:18 +02:00 |
|
vlofgren
|
0282156979
|
WIP n-gram loader
|
2022-08-27 19:19:16 +02:00 |
|
vlofgren
|
c865d6c6b2
|
Change TF-IDF normalization to reduce the amount of not-so-relevant matches.
|
2022-08-27 11:38:29 +02:00 |
|
vlofgren
|
f4ad7aaf33
|
Remove accidental import of an unused library,
fix build on jdk18-systems.
|
2022-08-26 20:48:44 +02:00 |
|
vlofgren
|
3200c36072
|
Experimental changes for 22-08/09 update.
|
2022-08-26 16:08:46 +02:00 |
|
vlofgren
|
db056be06a
|
WIP logic for detecting significant images in the body of a website.
|
2022-08-24 22:05:32 +02:00 |
|
vlofgren
|
c6db2aad48
|
Fixed stylesheet for search to make random websites button more prominent.
|
2022-08-24 19:29:00 +02:00 |
|
vlofgren
|
69b9f93dc6
|
Fixed stylesheet for search to make random websites button more prominent.
|
2022-08-24 19:28:06 +02:00 |
|
vlofgren
|
9cf78d6929
|
Bugfixes for the crawler: Better charset support, better 429 handling, better error handling, fixed resource leak.
|
2022-08-24 19:27:46 +02:00 |
|
vlofgren
|
407ec39c0c
|
Use links index for site suggestions.
|
2022-08-24 04:41:26 +02:00 |
|
vlofgren
|
e1a726babf
|
Use links index for site suggestions.
|
2022-08-24 03:50:08 +02:00 |
|
vlofgren
|
4c8c8f5140
|
Use links index for site suggestions.
|
2022-08-24 03:45:09 +02:00 |
|
vlofgren
|
961ef2a930
|
Serve assets from search service instead of resource-store,
dynamically render index for future goodies,
css tweaks.
|
2022-08-24 00:41:20 +02:00 |
|
vlofgren
|
ee0580273e
|
Serve assets from search service instead of resource-store,
dynamically render index for future goodies,
css tweaks.
|
2022-08-24 00:35:22 +02:00 |
|
vlofgren
|
db4cf70784
|
Reduce resource consumption during crawling,
reduce TIME_WAIT sockets with a custom socket
factory.
|
2022-08-23 13:26:37 +02:00 |
|
vlofgren
|
6fc72b3eb8
|
Clean up feature extraction, fix misidentification of 'application/ld+json' as javascript.
|
2022-08-23 00:48:48 +02:00 |
|
vlofgren
|
6e2fdb7a77
|
Reduce crawling memory consumption,
Increase crawling threads,
Dynamically adjust crawling rate.
|
2022-08-23 00:35:45 +02:00 |
|
vlofgren
|
fc9d9d1bad
|
And revert the previous change as my IP got kicked back to ol' reliable '81.170.128.52'
|
2022-08-22 17:32:56 +02:00 |
|
vlofgren
|
087ad0124d
|
Update crawler IP file to reflect the fact that the IP changed.
|
2022-08-22 13:04:07 +02:00 |
|
vlofgren
|
095ed7c6c4
|
Tweak CSS a tiny bit to add more padding to the right of info cells.
|
2022-08-19 16:07:26 +02:00 |
|
vlofgren
|
2adbe5f74c
|
Update publicity roll.
|
2022-08-19 15:55:01 +02:00 |
|
vlofgren
|
56987f6664
|
Update publicity roll.
|
2022-08-19 15:50:15 +02:00 |
|
vlofgren
|
7567890708
|
Update publicity roll.
|
2022-08-19 15:49:52 +02:00 |
|
vlofgren
|
ede62f2515
|
Retain cookies for domain.
|
2022-08-18 20:44:44 +02:00 |
|
vlofgren
|
a1eb8375a2
|
Exclude wp-content/uploads from crawling
|
2022-08-18 19:05:07 +02:00 |
|
vlofgren
|
340d80f6c7
|
Don't try to fetch text/css and text/javascript-files. Refactor fetcher to separate content type sniffing logic. Clean up crawler a smidge.
|
2022-08-18 18:40:34 +02:00 |
|
vlofgren
|
6b6cd56e3a
|
Don't try to fetch text/css and text/javascript-files. Refactor fetcher to separate content type sniffing logic. Clean up crawler a smidge.
|
2022-08-18 18:25:12 +02:00 |
|
vlofgren
|
4afccdc536
|
Don't try to fetch ftp://, webcal://, etc.
|
2022-08-18 17:25:22 +02:00 |
|
vlofgren
|
5cd552458a
|
Fix fragment bug.
|
2022-08-18 16:47:59 +02:00 |
|
vlofgren
|
2bc81e8e9a
|
Fix fragment bug.
|
2022-08-18 16:45:51 +02:00 |
|
vlofgren
|
a034e3245e
|
Fix fragment bug.
|
2022-08-18 16:43:34 +02:00 |
|
vlofgren
|
0bac422091
|
Fix bug in redirect handling that caused the crawler to not index some documents.
|
2022-08-17 00:51:10 +02:00 |
|
vlofgren
|
ce9abc00dc
|
Fix bug in redirect handling that caused the crawler to not index some documents.
|
2022-08-17 00:49:32 +02:00 |
|
vlofgren
|
5cfef610b0
|
Preparations for new crawl round
|
2022-08-16 22:48:16 +02:00 |
|
vlofgren
|
123675d73b
|
More caching
|
2022-08-15 15:39:10 +02:00 |
|
vlofgren
|
ceacfa5917
|
Tune down log spam
|
2022-08-15 15:37:26 +02:00 |
|
vlofgren
|
f6b3e75cee
|
Optimize search service by removing weird query spam
|
2022-08-15 15:27:22 +02:00 |
|
vlofgren
|
beafdfda9c
|
Index optimizations that should reduce small object churn and IOPS a bit.
|
2022-08-15 13:58:18 +02:00 |
|
vlofgren
|
460dd098b0
|
Add advertisement Feature to search,
Add adblock simulation to processor,
Add filename and email address extraction to processor.
|
2022-08-12 17:12:16 +02:00 |
|
vlofgren
|
30d2a707ff
|
Add advertisement Feature to search,
Add adblock simulation to processor,
Add filename and email address extraction to processor.
|
2022-08-12 13:50:18 +02:00 |
|
vlofgren
|
0e28ff5a72
|
Add features to suggestions
|
2022-08-10 21:32:19 +02:00 |
|
vlofgren
|
ba9e0d9829
|
Add features to suggestions
|
2022-08-10 19:50:14 +02:00 |
|
vlofgren
|
ffde8c8305
|
Faster crawling
|
2022-08-10 18:46:13 +02:00 |
|
vlofgren
|
ce09fce639
|
Faster crawling
|
2022-08-10 17:03:58 +02:00 |
|
vlofgren
|
9c6e3b1772
|
Topical detection (experimental),
Adblock simulation (experimental)
|
2022-08-10 15:04:29 +02:00 |
|
vlofgren
|
d7167f956e
|
Adjust search result sort order to penalize scriptiness a bit
|
2022-08-08 18:59:57 +02:00 |
|
vlofgren
|
0f59675f7c
|
Clean up preconverter code
|
2022-08-08 18:08:18 +02:00 |
|
vlofgren
|
2af2c50f34
|
Clean up preconverter code
|
2022-08-08 15:29:47 +02:00 |
|
vlofgren
|
2bfde9d030
|
Recipe detection
|
2022-08-08 15:18:18 +02:00 |
|
vlofgren
|
0dfcf2f7af
|
Recipe detection
|
2022-08-08 15:18:07 +02:00 |
|
vlofgren
|
5c952d48f4
|
Speed up conversion
|
2022-08-08 15:18:07 +02:00 |
|
vlofgren
|
e39320d51d
|
Add support for additional random sets
|
2022-08-07 17:51:35 +02:00 |
|
vlofgren
|
b9bbda0e2e
|
Add support for additional random sets
|
2022-08-07 17:49:32 +02:00 |
|
vlofgren
|
743ba23f55
|
Add support for additional random sets
|
2022-08-07 17:46:30 +02:00 |
|
vlofgren
|
5fbafa63c1
|
Add better fallbacks to summary extractor
|
2022-08-06 15:17:00 +02:00 |
|
vlofgren
|
e22fde69ed
|
Screenshot bot
|
2022-08-04 21:14:17 +02:00 |
|
vlofgren
|
a6a6bdb013
|
Test rewarding linked terms.
|
2022-08-02 17:52:24 +02:00 |
|
vlofgren
|
6e68f930a6
|
Test rewarding linked terms.
|
2022-08-02 17:50:25 +02:00 |
|
vlofgren
|
0b61910b84
|
Test rewarding linked terms.
|
2022-08-02 17:43:21 +02:00 |
|
vlofgren
|
487d74592d
|
Test rewarding linked terms.
|
2022-08-02 17:38:18 +02:00 |
|
vlofgren
|
ae2419e2a5
|
Reduced max domain results for search command,
made it easier to configure.
|
2022-08-02 12:23:24 +02:00 |
|
vlofgren
|
c9eef92291
|
Updated opensearch def with hint to use api for automation.
|
2022-08-02 12:23:24 +02:00 |
|
vlofgren
|
3ccb1c6218
|
Simplified query builders, preparation for a-tag inclusion.
|
2022-08-01 20:29:15 +02:00 |
|
vlofgren
|
9a4183a481
|
A-tags loader
|
2022-08-01 20:05:55 +02:00 |
|
vlofgren
|
9a6c8339d0
|
Clean up DAO
|
2022-08-01 20:05:21 +02:00 |
|
vlofgren
|
7f985c0a57
|
Experimental domain-searching feature
|
2022-07-28 21:33:36 +02:00 |
|
vlofgren
|
e17d3015dc
|
Experimental domain-searching feature
|
2022-07-28 21:29:34 +02:00 |
|