Viktor Lofgren
f59cab300e
(minor) Javadoc comments for MqPersistance and MqMessageState
2023-07-10 21:59:51 +02:00
Viktor Lofgren
ec7826659a
(minor) Javadoc comments for MqPersistance and MqMessageState
2023-07-10 21:52:25 +02:00
Viktor Lofgren
98b5f22104
(control) WIP control service
...
* Set messages to OK when received so they're cleaned up properly.
2023-07-10 21:33:57 +02:00
Viktor Lofgren
2283ceb77d
(control) WIP control service
2023-07-10 18:58:43 +02:00
Viktor Lofgren
fba466d6e2
(crawler) Update URL blocklist
...
* Don't crawl MDN mirrors
* More mailing list variants
2023-07-10 18:58:43 +02:00
Viktor
cbbf60a599
Better fingerprinting (#35)
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
c125d8ab48
(search) Fix a bug where space-like characters weren't normalized in query processing.
2023-07-10 18:58:43 +02:00
Viktor Lofgren
f03146de4b
(crawler) Fix poor handling of duplicate ids
...
* Also clean up the code a bit
2023-07-10 18:58:43 +02:00
Viktor Lofgren
dbb758d1a8
Minor: Better error handling in crawled domain reader
2023-07-10 18:58:43 +02:00
Viktor Lofgren
da8bcc6e24
Minor: Don't blow up the reader on a corrupted file
2023-07-10 18:58:43 +02:00
Viktor Lofgren
96eecc6ea5
Minor: Readability.
2023-07-10 18:58:43 +02:00
Viktor Lofgren
98d1898610
Bugfix: Don't run the xenforo specialization on phpBB.
2023-07-06 18:12:26 +02:00
Viktor Lofgren
b73fcc19fe
Fix so that crawler tests don't sometimes fetch real sitemaps when they're run.
2023-07-06 18:05:03 +02:00
Viktor Lofgren
d9e6c4f266
Trial integration of MQ-FSM into index service.
2023-07-06 18:04:16 +02:00
Viktor Lofgren
34653f03a2
Temporary bugfix, need to find source
2023-07-06 14:13:03 +02:00
Viktor Lofgren
f0a8ca440f
MQFSM Usability WIP
2023-07-06 13:33:11 +02:00
Viktor Lofgren
d89db10645
MQFSM Usability WIP
2023-07-06 13:02:16 +02:00
Adrthegamedev
78f21dd19a
(an attempt to) Add wikidot to wiki generators list
2023-07-05 18:03:36 +02:00
Viktor Lofgren
2cb209ae9c
Better wordpress fingerprinting
2023-07-05 18:03:36 +02:00
Viktor Lofgren
979a620ead
Bugfix: DocumentGeneratorExtractor went out of bounds for generators starting with 'microsoft' or 'adobe' but having no follow-up string.
2023-07-05 18:03:36 +02:00
Viktor Lofgren
7a17933c65
Control service owns message queue garbage collection.
2023-07-04 19:52:30 +02:00
Viktor Lofgren
097a163cf5
Getting a skeleton in place for the control service.
2023-07-04 18:25:42 +02:00
Viktor Lofgren
2ae0b8c159
Message queue based state machine
2023-07-04 17:42:06 +02:00
Viktor Lofgren
31ae71c7d6
Message queue WIP
2023-07-04 14:28:14 +02:00
Viktor Lofgren
62cc9df206
Embryo of new control process
...
* New events and heartbeat tables in mariadb
* Refactored to a cleaner Service interface
2023-07-03 10:40:32 +02:00
Viktor Lofgren
42375f0e53
Specialization for javadocs
2023-07-01 20:16:56 +02:00
Viktor Lofgren
24dce8c03b
Remove link filtering for mediawiki; it's too strict and not every site uses the /wiki/-pattern.
2023-07-01 19:32:25 +02:00
Viktor Lofgren
eda615de0f
Add generator fingerprint for invision.
2023-07-01 14:47:57 +02:00
Viktor Lofgren
a000256223
Add generator fingerprint for xenforo.
...
Also clean up the specializations logic a bit, and add a barebones specialization for phpbb that cleans out paths we aren't interested in but doesn't touch pruning or summarizing logic for now.
2023-07-01 14:43:49 +02:00
Viktor Lofgren
9bd0e3ce58
Add generator fingerprint for xenforo.
2023-07-01 14:04:48 +02:00
Viktor Lofgren
b4d1e0e81e
Add generator fingerprints for phpBB and flarum.
2023-07-01 13:44:42 +02:00
Viktor Lofgren
d2fdaafc7a
Big brain web developers were using onload and onerror handlers to load JS without script tags...
2023-06-30 17:10:25 +02:00
Viktor Lofgren
7d86586594
Remove annoying log spam in sitemap retriever
2023-06-30 17:08:35 +02:00
Viktor Lofgren
11c26e700e
Remove annoying log spam in crawler retriever
2023-06-30 17:08:24 +02:00
Viktor Lofgren
8274e8a953
JVM flags for disabling black and block-lists.
2023-06-30 17:07:47 +02:00
Viktor Lofgren
0f34beb1aa
Update search front page
2023-06-29 17:14:27 +02:00
Viktor Lofgren
baff83912e
Small optimizations that shave an hour off processing time :D
2023-06-28 15:41:10 +02:00
Viktor Lofgren
d71124961e
Better tests for crawling and processing.
2023-06-27 16:11:27 +02:00
Viktor Lofgren
fbdedf53de
Fix bug in CrawlerRetreiver
...
... where the root URL wasn't always added properly to the front of the crawl queue.
2023-06-27 15:50:38 +02:00
Viktor Lofgren
a6a66c6d8a
Improve site info for unknown domains:
...
* Placeholder screenshot should work
* Add a link to git-repo for submitting the site for crawling
2023-06-27 15:32:11 +02:00
Viktor Lofgren
d167ad2017
Remove sitemap related log spam
2023-06-27 13:59:47 +02:00
Viktor Lofgren
7d741ff499
Fix so crawl plan replay doesn't crash if a file is missing.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
f8f9f04158
Specialized logic for processing Lemmy-based websites.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
b0c7480d06
Set default timeouts for java.net.URL-connections
2023-06-27 10:57:54 +02:00
Viktor Lofgren
e7af77e151
Tests for crawler specialization + testdata
2023-06-27 10:57:54 +02:00
Viktor Lofgren
ec940e36d0
Sitemap support, refined crawler specialization
2023-06-27 10:57:54 +02:00
Viktor Lofgren
f92d8a0975
EdgeUrl conversion to/from java.net.URL
2023-06-27 10:57:54 +02:00
Viktor Lofgren
ed373eef61
Refactor crawler and add special logic for some platforms
...
* Break apart CrawlerRetreiver
* Break apart HttpFetcher into an interface and impl for testing sanity
* Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
5abaf13192
Fix serialization bug with CompressedBigString
2023-06-27 10:57:54 +02:00
Viktor Lofgren
d86e8522e2
Add search profiles for wiki, forum and docs.
2023-06-24 12:17:35 +02:00
Viktor Lofgren
bd2c3855ed
Add bits and keywords for generator classes (docs, forum, wiki).
2023-06-23 21:35:28 +02:00
Viktor Lofgren
54c2be893b
TRIVIAL: Remove unused import.
2023-06-22 17:21:47 +02:00
Viktor Lofgren
55c65f0935
Use document generator to complement the document selection.
...
Will let through e.g. a modern SSG in the small web filter.
2023-06-22 17:21:33 +02:00
Viktor Lofgren
b5ef67ed28
Categorize generators by type
...
This is a great quality signal!
Add the type as document bitflags by category.
2023-06-22 16:04:37 +02:00
Viktor Lofgren
f140e7d7c7
Use a default tag for unset or invalid generators.
2023-06-21 17:30:14 +02:00
Viktor Lofgren
a9a2960e86
New synthetic keyword for document generator meta tag.
2023-06-20 16:25:49 +02:00
Viktor Lofgren
7326ba74fe
Tweaks to pub date heuristics to make it mostly get the 'historyofphilosophy.net' case right.
...
Use HTML standard for plausibility checks in the more guesswork-like heuristics. Added more class names to look for date strings.
2023-06-20 14:15:05 +02:00
Viktor Lofgren
a9fabba407
Tell experiment runner to only process some domains.
...
Updated the experiment runner, as well as the script.
2023-06-20 14:14:01 +02:00
Viktor Lofgren
4fc0ddbc45
Improved crawl-job-extractor.
...
Let crawl-job-extractor run offline and allow it to read domains from file.
Improved docs.
2023-06-20 11:37:52 +02:00
Viktor Lofgren
9455100907
Throw a custom exception when WMSA_HOME isn't found
2023-06-20 11:37:52 +02:00
Viktor Lofgren
32a6735d03
Undo change in requirements for counting as a high tf-idf word
2023-06-19 17:58:19 +02:00
Viktor Lofgren
f0b4acb358
Better logic for summarization.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
67c15a34e6
Reduce the amount of expensive operations in HtmlDocumentProcessorPlugin.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
9579cdd151
Improved heuristic for which words are considered important in selecting the summary text.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
443cf0cf1e
Expose additional functionality through WordsTfIdfCounts.
...
Bump requirements for being flagged as high TF-IDF from 2 occurrences to 3.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
4138233ddf
Truncate repeated strings of any non-alnum symbols in SummaryExtractor
2023-06-19 17:58:19 +02:00
Viktor Lofgren
2979f4703e
Allocation-free text utility
2023-06-19 17:58:19 +02:00
Viktor Lofgren
77f2ca51af
Optimize SentenceExtractor.
...
Remove String pool because it's not doing much.
Break out constant.
Use a shared RdrPosTagger.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
ffcbc6c1c9
Reduce the odds of re-allocation by AsciiFlattener
2023-06-19 17:58:19 +02:00
Viktor Lofgren
266ad2e4de
Re-introduce monkey patched GSON to make converter run better.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
d1a004bea6
(minor) Clean up StringPool
2023-06-19 17:58:19 +02:00
Viktor Lofgren
e4372289a5
Use fixed buffers for BigString compression and decompression to reduce GC churn.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
379bccc1a3
Disable AdblockSimulator since it's slow and doesn't really work. Just wasting CPU cycles until it's fixed.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
21125206b4
Fix some bugs in JSON+LD-heuristics for pub date.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
44b1fe0e6d
Move list-conversion into getDescription method.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
88399e30e2
Consider keyword relevance signals when creating the document summary using the DOM walker.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
7ed3306be3
Make the adjacency calculator behave like it used to in the past, when it gave better results.
2023-06-07 22:03:06 +02:00
Viktor Lofgren
eb2ca942d5
Up the default crawl delay to 1 second.
2023-06-07 22:02:17 +02:00
Viktor Lofgren
2afbdc2269
Adjust the logic for the crawl job extractor to set a relatively low visit limit for websites that are new in the index or have not yielded many good documents previously.
2023-06-07 22:01:35 +02:00
Viktor Lofgren
d82a858491
Don't consider slash to be a sentence separator.
2023-05-31 16:54:30 +02:00
Viktor Lofgren
e332faa07e
Fix test that broke when memex.marginalia.nu started redirecting to www.marginalia.nu.
2023-05-28 13:46:24 +02:00
Viktor Lofgren
4e9e79454f
Fix broken transformation functions in the PagingArray classes.
2023-05-28 13:31:05 +02:00
Viktor Lofgren
b0bc07b4e7
Insertion sort was *super* busted I don't even know how it worked
2023-05-28 12:17:50 +02:00
Viktor Lofgren
2cda57355a
More word metadata tests
2023-05-28 11:57:06 +02:00
Viktor Lofgren
fd192d2791
Fix putative overflow error with a large dictionary
2023-05-28 11:57:06 +02:00
Viktor Lofgren
6814c90625
Fix N-width sorting bug
2023-05-28 11:57:06 +02:00
Viktor Lofgren
1e184a8372
(search) Make exploration mode more random
2023-05-25 17:40:28 +02:00
Viktor Lofgren
6fae51a8ef
Stopgap fix for a bug in dealing with quote terms containing stop words.
2023-05-02 19:38:59 +02:00
Viktor Lofgren
a9f7b4c457
Add synthetic keywords for same-site files linked from a document (e.g. file:png). Also add category keywords, like file:image or file:document.
2023-04-30 19:29:13 +02:00
Viktor Lofgren
1e3b6934bb
Reduce log noise during loading. Bad URLs don't need to be loaded, they can be grepped from the instructions.
2023-04-30 18:36:44 +02:00
Viktor
7694a15f62
Fix kale's unreasonably high weighting factor
2023-04-22 20:55:09 +02:00
Viktor
d72da01a92
Update readme.md
2023-04-22 16:05:57 +02:00
Viktor
112f43b3a1
Api service response cache (#16)
...
* Add response caching to the API service to help SearXNG
* Clean up the code a bit.
* Add an endpoint without a terminating slash for getLicense.
* Add tests for API service.
2023-04-22 15:42:32 +02:00
Viktor Lofgren
f12c6fd57e
Add a ranking parameter for biasing toward recent or old content.
2023-04-20 16:00:59 +02:00
Viktor
96bac70b85
Tools for merging sorted lists, and merging btrees. (#14)
...
* Utilities for merging BTrees of entity size 1 and 2.
* Isolate and clean up sorting algorithms.
* Functions for keeping distinct items in a LongArray
2023-04-20 15:28:09 +02:00
Viktor Lofgren
619fb8ba80
(converter) Adjust the pub-date sniffing heuristics' order. Doing HTML5 tags too early puts some sites too early. Also expanded support for JSON+LD.
2023-04-19 15:28:50 +02:00
Viktor
5a5cdaf70e
Improvements to the adjacency calculator and screenshots tool (#13)
...
* WIP: Improvements to website adjacencies loader tool.
* Improving screenshots capture bot.
2023-04-18 22:21:49 +02:00
Viktor Lofgren
bb587ca47f
Reformulate search-header.hdb, s/Support/Donate/. The formulation was apparently confusing some people into thinking they could get support on this page.
2023-04-18 17:04:24 +02:00
Viktor Lofgren
4d298cd5fa
Improving screenshots capture bot.
2023-04-17 18:04:22 +02:00
Viktor Lofgren
fbbaf584ba
Adjustments to screenshot capture tool.
2023-04-16 08:55:57 +02:00
Viktor Lofgren
df1850bd45
Fix bug in index service where tld: and links:-queries wouldn't work.
2023-04-15 18:39:16 +02:00
Viktor Lofgren
d42ab19166
Issue 5: Fix bug where some IPv6 addresses blew up domain loading.
2023-04-15 14:11:08 +02:00
Viktor Lofgren
2ab26f37b8
Bug fix for document metadata encoding that breaks year based queries.
2023-04-14 16:56:49 +02:00
Viktor
ec7ce7b0b3
Update readme.md
2023-04-11 16:31:11 +02:00
Viktor Lofgren
3e9b37c264
Refactor website screenshot tool and website adjacencies calculator into code/tools.
2023-04-11 16:20:27 +02:00
Viktor Lofgren
502713f7a8
Reduce memory churn
2023-04-10 16:51:17 +02:00
Viktor Lofgren
e19256a6b6
Tune settings to retrieve more results.
2023-04-10 15:39:20 +02:00
Viktor Lofgren
ccc41d1717
Clean up the index query handling code.
2023-04-10 14:50:57 +02:00
Viktor Lofgren
e49b1dd155
Better handling of quote terms, fix bug in handling of longer queries.
...
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:20:40 +02:00
Viktor Lofgren
fe419b12b4
Better handling of quote terms, fix bug in handling of longer queries.
...
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:11:40 +02:00
Viktor Lofgren
810515c08d
Clean up artifact extractor.
2023-04-10 13:07:54 +02:00
Viktor Lofgren
535a51a621
Repair broken year query test.
2023-04-08 12:04:09 +02:00
Viktor
a278fc6296
Increase search result relevance (#8)
...
* Increase accuracy of the position bits.
* Increase their width to 56.
* Use a rolling position scheme for bits 16-56 to increase the average accuracy.
* Result ranking overhaul
* Optimized queries
* BM25 in the index service's ranking
* Make gui less jank
* Javadocs for ranking parameters.
2023-04-07 20:18:08 +02:00
Viktor Lofgren
716ab35b4e
Search ranking debuggability improvements.
2023-04-02 13:43:24 +02:00
Viktor Lofgren
3fb249758e
Adjust result ordering.
2023-04-02 12:05:22 +02:00
Viktor Lofgren
f7a6ef2179
Smarter queries, better logging.
2023-04-02 12:05:09 +02:00
Viktor Lofgren
105d93cd85
Index query builder automatically ignores redundant predicates.
2023-04-02 12:04:26 +02:00
Viktor Lofgren
1e4157017d
More helpful descriptions of index queries.
2023-04-02 12:03:58 +02:00
Viktor Lofgren
5fb75adaae
Remove antique result scoring adjustment that makes no sense anymore.
2023-04-02 11:58:04 +02:00
Viktor Lofgren
affcf8cf41
Load test tool
2023-04-02 09:43:43 +02:00
Viktor Lofgren
cc4e089a5d
Consider average sentence length when selecting search results. This promotes prose over code listings, tabular data, etc.
2023-03-30 15:46:15 +02:00
Viktor Lofgren
32b9c2e671
Fix SentenceExtractor jank
2023-03-30 15:45:04 +02:00
Viktor Lofgren
4d05be4095
Refactor InternalLinkGraph
2023-03-30 15:44:23 +02:00
Viktor Lofgren
137adb9c3c
Bitmask calculation improvement. Take sentence length into consideration, not all lines are equal.
2023-03-30 15:42:06 +02:00
Viktor Lofgren
16e37672fc
Bugfix: crawl plan didn't use rewrite() everywhere
2023-03-30 15:41:07 +02:00
Viktor Lofgren
d0c72ceb7e
Improve experiment runner, convenient start script.
2023-03-30 15:40:31 +02:00
Viktor Lofgren
0fcb2b534c
Polish Names
2023-03-29 16:51:47 +02:00
Viktor Lofgren
dcf6218cdb
Fix bugs related to search result selection in the case with multiple search terms.
...
* A deduplication filter step ran too early, and removed many good results on the basis that they partially, but not fully, fit another set of search terms.
* Altered the query creation process to prefer documents where multiple terms appear in the priority index.
2023-03-29 15:18:52 +02:00
Viktor Lofgren
8f51345a1d
Add experiment runner tool and get rid of experiments module in processes.
2023-03-28 16:58:46 +02:00
Viktor Lofgren
03bd892b95
Improve document processing in conversion.
...
* Add flags for long and short documents.
* Break out common length logic from plugins.
* Clean up related code.
2023-03-28 16:38:00 +02:00
Viktor Lofgren
30584887f9
DictionaryMap changes.
...
Add new flag to change the default size to make prod index boot faster. Remove option to select OffHeapDictionaryHashMap.
2023-03-27 17:28:39 +02:00
Viktor Lofgren
17ca4f9eea
Permit search results that are all synthetic to pass relevancy check.
2023-03-27 17:27:35 +02:00
Viktor Lofgren
7fb3db3249
Fix bug where link on front page news listing wouldn't work.
...
... also changed order of date and source to make the UI more consistent.
2023-03-27 17:26:46 +02:00
Viktor Lofgren
862e925d7c
"-Dsmall-ram=TRUE" no longer does anything. Remove references to the flag, which previously reduced the memory footprint of the loader and index service.
2023-03-26 21:37:11 +02:00
Viktor Lofgren
a0027ad32b
Fix broken diagram links after doc/ restructuring.
2023-03-25 16:32:10 +01:00
Viktor Lofgren
c5f4cb34bf
Documentation for DB
2023-03-25 16:14:16 +01:00
Viktor
be3ba3ef37
Update readme.md
2023-03-25 15:27:11 +01:00
Viktor
ac1ac3ea57
Move database to a separate module
...
* Move database to a separate project, break apart sql file into separate entities.
* Fix front page news listing.
2023-03-25 15:26:17 +01:00
Viktor
0b505939ed
Update features-convert/readme.md
2023-03-25 12:43:58 +01:00
Viktor
d2a9e1b644
Add processes link to readme.md for code/common
2023-03-25 12:42:44 +01:00
Viktor Lofgren
3464ca514b
Fix typeahead suggestions
2023-03-25 10:20:52 +01:00
Viktor Lofgren
2f2c86a9f5
Fix bug where WmsaHome wouldn't look in /var/lib/wmsa as a fallback
2023-03-25 10:20:52 +01:00
Viktor
45dd9fea25
Update readme.md
2023-03-22 17:15:36 +01:00
Viktor
c974d72e7e
Update readme.md
2023-03-22 17:09:48 +01:00
Viktor
e3675d2fa9
Update readme.md
2023-03-22 17:02:03 +01:00
Viktor
c4a6bf7672
Update readme.md
2023-03-22 17:01:34 +01:00
Viktor
cb6865924e
Update readme.md
2023-03-22 16:59:38 +01:00
Viktor Lofgren
964014860a
Get suggestions working again
2023-03-22 15:11:22 +01:00
Viktor Lofgren
7c58ddce81
readme.md
2023-03-22 15:10:30 +01:00
Viktor Lofgren
611ba2d35a
Break apart WordPatterns class
2023-03-22 15:10:17 +01:00