Viktor Lofgren
67c15a34e6
Reduce the amount of expensive operations in HtmlDocumentProcessorPlugin.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
9579cdd151
Improved heuristic for which words are considered important in selecting the summary text.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
443cf0cf1e
Expose additional functionality through WordsTfIdfCounts.
...
Bump requirements for being flagged as high TF-IDF from 2 occurrences to 3.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
4138233ddf
Truncate repeated strings of any non-alnum symbols in SummaryExtractor
2023-06-19 17:58:19 +02:00
Viktor Lofgren
2979f4703e
Allocation-free text utility
2023-06-19 17:58:19 +02:00
Viktor Lofgren
77f2ca51af
Optimize SentenceExtractor.
...
Remove String pool because it's not doing much.
Break out constant.
Use a shared RdrPosTagger.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
ffcbc6c1c9
Reduce the odds of re-allocation by AsciiFlattener
2023-06-19 17:58:19 +02:00
Viktor Lofgren
186a02acfd
Optimize RDRPosTagger to use integer comparisons instead of string comparisons.
...
Also reduce the cache-thrashing by deconstructing the tree's nodes into arrays.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
6f2a7977c1
(Minor) Remove character debris in build.gradle
2023-06-19 17:58:19 +02:00
Viktor Lofgren
266ad2e4de
Re-introduce monkey patched GSON to make converter run better.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
d1a004bea6
(minor) Clean up StringPool
2023-06-19 17:58:19 +02:00
Viktor Lofgren
e4372289a5
Use fixed buffers for BigString compression and decompression to reduce GC churn.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
379bccc1a3
Disable AdblockSimulator since it's slow and doesn't really work. Just wasting CPU cycles until it's fixed.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
21125206b4
Fix some bugs in JSON+LD-heuristics for pub date.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
44b1fe0e6d
Move list-conversion into getDescription method.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
88399e30e2
Consider keyword relevance signals when creating the document summary using the DOM walker.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
7ed3306be3
Make the adjacency calculator behave like it used to in the past, when it gave better results.
2023-06-07 22:03:06 +02:00
Viktor Lofgren
eb2ca942d5
Up the default crawl delay to 1 second.
2023-06-07 22:02:17 +02:00
Viktor Lofgren
2afbdc2269
Adjust the logic for the crawl job extractor to set a relatively low visit limit for websites that are new in the index or have not yielded many good documents previously.
2023-06-07 22:01:35 +02:00
Viktor Lofgren
d82a858491
Don't consider slash to be a sentence separator.
2023-05-31 16:54:30 +02:00
Viktor Lofgren
e332faa07e
Fix test that broke when memex.marginalia.nu started redirecting to www.marginalia.nu.
2023-05-28 13:46:24 +02:00
Viktor Lofgren
4e9e79454f
Fix broken transformation functions in the PagingArray classes.
2023-05-28 13:31:05 +02:00
Viktor Lofgren
b0bc07b4e7
Insertion sort was *super* busted; I don't even know how it worked.
2023-05-28 12:17:50 +02:00
Viktor Lofgren
2cda57355a
More word metadata tests
2023-05-28 11:57:06 +02:00
Viktor Lofgren
fd192d2791
Fix putative overflow error with a large dictionary
2023-05-28 11:57:06 +02:00
Viktor Lofgren
6814c90625
Fix N-width sorting bug
2023-05-28 11:57:06 +02:00
Viktor
a57ab427b3
Update useful-resources.md
2023-05-27 12:01:45 +02:00
Viktor Lofgren
1e184a8372
(search) Make exploration mode more random
2023-05-25 17:40:28 +02:00
Viktor Lofgren
6fae51a8ef
Stopgap fix for a bug in dealing with quote terms containing stop words.
2023-05-02 19:38:59 +02:00
Viktor Lofgren
a9f7b4c457
Add synthetic keywords for same-site files linked from a document (e.g. file:png). Also add category keywords, like file:image or file:document.
2023-04-30 19:29:13 +02:00
Viktor Lofgren
1e3b6934bb
Reduce log noise during loading. Bad URLs don't need to be loaded, they can be grepped from the instructions.
2023-04-30 18:36:44 +02:00
Viktor
0a5e85be8f
Update README.md
2023-04-22 21:02:25 +02:00
Viktor
7694a15f62
Fix kale's unreasonably high weighting factor
2023-04-22 20:55:09 +02:00
Viktor
d72da01a92
Update readme.md
2023-04-22 16:05:57 +02:00
Viktor
112f43b3a1
API service response cache (#16)
...
* Add response caching to the API service to help SearXNG
* Clean up the code a bit.
* Add an endpoint without a terminating slash for getLicense.
* Add tests for API service.
2023-04-22 15:42:32 +02:00
Viktor Lofgren
f12c6fd57e
Add a ranking parameter for biasing toward recent or old content.
2023-04-20 16:00:59 +02:00
Viktor
96bac70b85
Tools for merging sorted lists, and merging btrees. (#14)
...
* Utilities for merging BTrees of entity size 1 and 2.
* Isolate and clean up sorting algorithms.
* Functions for keeping distinct items in a LongArray
2023-04-20 15:28:09 +02:00
Viktor Lofgren
619fb8ba80
(converter) Adjust the pub-date sniffing heuristics' order. Doing HTML5 tags too early puts some sites too early. Also expanded support for JSON+LD.
2023-04-19 15:28:50 +02:00
Viktor
5a5cdaf70e
Improvements to the adjacency calculator and screenshots tool (#13)
...
* WIP: Improvements to website adjacencies loader tool.
* Improving screenshots capture bot.
2023-04-18 22:21:49 +02:00
Viktor Lofgren
bb587ca47f
Reformulate search-header.hdb, s/Support/Donate/. The old wording was apparently leading some people to think they could get support on this page.
2023-04-18 17:04:24 +02:00
Viktor Lofgren
4d298cd5fa
Improving screenshots capture bot.
2023-04-17 18:04:22 +02:00
Viktor Lofgren
fbbaf584ba
Adjustments to screenshot capture tool.
2023-04-16 08:55:57 +02:00
Viktor Lofgren
df1850bd45
Fix bug in index service where tld: and links:-queries wouldn't work.
2023-04-15 18:39:16 +02:00
Viktor Lofgren
d42ab19166
Issue 5: Fix bug where some IPv6 addresses blew up domain loading.
2023-04-15 14:11:08 +02:00
Viktor Lofgren
2ab26f37b8
Bug fix for document metadata encoding that breaks year based queries.
2023-04-14 16:56:49 +02:00
Viktor
ec7ce7b0b3
Update readme.md
2023-04-11 16:31:11 +02:00
Viktor Lofgren
3e9b37c264
Refactor website screenshot tool and website adjacencies calculator into code/tools.
2023-04-11 16:20:27 +02:00
Viktor Lofgren
502713f7a8
Reduce memory churn
2023-04-10 16:51:17 +02:00
Viktor Lofgren
e19256a6b6
Tune settings to retrieve more results.
2023-04-10 15:39:20 +02:00
Viktor Lofgren
ccc41d1717
Clean up of the index query handling related code.
2023-04-10 14:50:57 +02:00