Viktor Lofgren
b0c7480d06
Set default timeouts for java.net.URL-connections
2023-06-27 10:57:54 +02:00
Viktor Lofgren
e7af77e151
Tests for crawler specialization + testdata
2023-06-27 10:57:54 +02:00
Viktor Lofgren
ec940e36d0
Sitemap support, refined crawler specialization
2023-06-27 10:57:54 +02:00
Viktor Lofgren
ed373eef61
Refactor crawler and add special logic for some platforms
...
* Break apart CrawlerRetreiver
* Break apart HttpFetcher into an interface and impl for testing sanity
* Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
bd2c3855ed
Add bits and keywords for generator classes (docs, forum, wiki).
2023-06-23 21:35:28 +02:00
Viktor Lofgren
b5ef67ed28
Categorize generators by type
...
This is a great quality signal!
Add the type as document bitflags by category.
2023-06-22 16:04:37 +02:00
Viktor Lofgren
f140e7d7c7
Use a default tag for unset or invalid generators.
2023-06-21 17:30:14 +02:00
Viktor Lofgren
a9a2960e86
New synthetic keyword for document generator meta tag.
2023-06-20 16:25:49 +02:00
Viktor Lofgren
7326ba74fe
Tweaks to pub date heuristics to make them mostly get the 'historyofphilosophy.net' case right.
...
Use HTML standard for plausibility checks in the more guesswork-like heuristics. Added more class names to look for date strings.
2023-06-20 14:15:05 +02:00
Viktor Lofgren
67c15a34e6
Reduce the amount of expensive operations in HtmlDocumentProcessorPlugin.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
266ad2e4de
Re-introduce monkey patched GSON to make converter run better.
...
fixup! Re-introduce monkey patched GSON to make converter run better.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
e4372289a5
Use fixed buffers for BigString compression and decompression to reduce GC churn.
...
fixup! Use fixed buffers for BigString compression and decompression to reduce GC churn.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
44b1fe0e6d
Move list-conversion into getDescription method.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
88399e30e2
Consider keyword relevance signals when creating the document summary using the DOM walker.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
eb2ca942d5
Up the default crawl delay to 1 second.
2023-06-07 22:02:17 +02:00
Viktor Lofgren
e332faa07e
Fix test that broke when memex.marginalia.nu started redirecting to www.marginalia.nu.
2023-05-28 13:46:24 +02:00
Viktor Lofgren
a9f7b4c457
Add synthetic keywords for same-site files linked from a document (e.g. file:png). Also add category keywords, like file:image or file:document.
2023-04-30 19:29:13 +02:00
Viktor Lofgren
1e3b6934bb
Reduce log noise during loading. Bad URLs don't need to be loaded, they can be grepped from the instructions.
2023-04-30 18:36:44 +02:00
Viktor
112f43b3a1
API service response cache (#16)
...
* Add response caching to the API service to help SearXNG
* Clean up the code a bit.
* Add an endpoint without a terminating slash for getLicense.
* Add tests for API service.
2023-04-22 15:42:32 +02:00
Viktor Lofgren
d42ab19166
Issue 5: Fix bug where some IPv6 addresses blew up domain loading.
2023-04-15 14:11:08 +02:00
Viktor Lofgren
2ab26f37b8
Bug fix for document metadata encoding that breaks year based queries.
2023-04-14 16:56:49 +02:00
Viktor Lofgren
cc4e089a5d
Consider average sentence length when selecting search results. This promotes prose over code listings, tabular data, etc.
2023-03-30 15:46:15 +02:00
Viktor Lofgren
4d05be4095
Refactor InternalLinkGraph
2023-03-30 15:44:23 +02:00
Viktor Lofgren
8f51345a1d
Add experiment runner tool and get rid of the experiments module in processes.
2023-03-28 16:58:46 +02:00
Viktor Lofgren
03bd892b95
Improve document processing in conversion.
...
* Add flags for long and short documents.
* Break out common length logic from plugins.
* Clean up related code.
2023-03-28 16:38:00 +02:00
Viktor
ac1ac3ea57
Move database to a separate module
...
* Move database to a separate project, break apart sql file into separate entities.
* Fix front page news listing.
2023-03-25 15:26:17 +01:00
Viktor Lofgren
ca22c287a5
Make use of DocumentFlags' flags
2023-03-21 16:03:15 +01:00
Viktor Lofgren
2eb972dea1
Remove unrelated code, break tools into their own directory.
2023-03-17 16:03:11 +01:00
Viktor Lofgren
449471a076
Yet more restructuring. Improved search result ranking.
2023-03-16 21:35:54 +01:00
Viktor Lofgren
d82532b7f1
More restructuring, big bug fixes in keyword extraction.
2023-03-13 17:39:53 +01:00