1025 Commits

Author SHA1 Message Date
Viktor Lofgren
619fb8ba80 (converter) Adjust the order of the pub-date sniffing heuristics. Checking HTML5 tags too early dates some sites too early. Also expanded support for JSON-LD. 2023-04-19 15:28:50 +02:00
Viktor
5a5cdaf70e Improvements to the adjacency calculator and screenshots tool (#13)
* WIP: Improvements to website adjacencies loader tool.

* Improving screenshots capture bot.
2023-04-18 22:21:49 +02:00
Viktor Lofgren
bb587ca47f Reformulate search-header.hdb, s/Support/Donate/. The old wording apparently led some people to think they could get support on this page. 2023-04-18 17:04:24 +02:00
Viktor Lofgren
4d298cd5fa Improving screenshots capture bot. 2023-04-17 18:04:22 +02:00
Viktor Lofgren
fbbaf584ba Adjustments to screenshot capture tool. 2023-04-16 08:55:57 +02:00
Viktor Lofgren
df1850bd45 Fix bug in index service where tld: and links:-queries wouldn't work. 2023-04-15 18:39:16 +02:00
Viktor Lofgren
d42ab19166 Issue 5: Fix bug where some IPv6 addresses blew up domain loading. 2023-04-15 14:11:08 +02:00
Viktor Lofgren
2ab26f37b8 Bug fix for document metadata encoding that breaks year based queries. 2023-04-14 16:56:49 +02:00
Viktor
ec7ce7b0b3 Update readme.md 2023-04-11 16:31:11 +02:00
Viktor Lofgren
3e9b37c264 Refactor website screenshot tool and website adjacencies calculator into code/tools. 2023-04-11 16:20:27 +02:00
Viktor Lofgren
502713f7a8 Reduce memory churn 2023-04-10 16:51:17 +02:00
Viktor Lofgren
e19256a6b6 Tune settings to retrieve more results. 2023-04-10 15:39:20 +02:00
Viktor Lofgren
ccc41d1717 Clean up of the index query handling related code. 2023-04-10 14:50:57 +02:00
Viktor Lofgren
e49b1dd155 Better handling of quote terms, fix bug in handling of longer queries.
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:20:40 +02:00
Viktor Lofgren
fe419b12b4 Better handling of quote terms, fix bug in handling of longer queries.
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:11:40 +02:00
Viktor Lofgren
810515c08d Clean up artifact extractor. 2023-04-10 13:07:54 +02:00
Viktor Lofgren
535a51a621 Repair broken year query test. 2023-04-08 12:04:09 +02:00
Viktor
a278fc6296 Increase search result relevance (#8)
* Increase accuracy of the position bits.
* Increase their width to 56.
* Use a rolling position scheme for bits 16-56 to increase the average accuracy.
* Result ranking overhaul
* Optimized queries
* BM25 in the index service's ranking
* Make gui less jank
* Javadocs for ranking parameters.
2023-04-07 20:18:08 +02:00
Viktor
f1c6525a50 Update setup.sh 2023-04-02 14:44:43 +02:00
Viktor
ace0d19973 Update README.md 2023-04-02 14:43:14 +02:00
Viktor
40b8c8c128 Update README.md 2023-04-02 14:08:17 +02:00
Viktor Lofgren
716ab35b4e Search ranking debuggability improvements. 2023-04-02 13:43:24 +02:00
Viktor Lofgren
3fb249758e Adjust result ordering. 2023-04-02 12:05:22 +02:00
Viktor Lofgren
f7a6ef2179 Smarter queries, better logging. 2023-04-02 12:05:09 +02:00
Viktor Lofgren
105d93cd85 Index query builder automatically ignores redundant predicates. 2023-04-02 12:04:26 +02:00
Viktor Lofgren
1e4157017d More helpful descriptions of index queries. 2023-04-02 12:03:58 +02:00
Viktor Lofgren
5fb75adaae Remove antique result scoring adjustment that makes no sense anymore. 2023-04-02 11:58:04 +02:00
Viktor Lofgren
affcf8cf41 Load test tool 2023-04-02 09:43:43 +02:00
Viktor Lofgren
cc4e089a5d Consider average sentence length when selecting search results. This promotes prose over code listings, tabular data, etc. 2023-03-30 15:46:15 +02:00
Viktor Lofgren
32b9c2e671 Fix SentenceExtractor jank 2023-03-30 15:45:04 +02:00
Viktor Lofgren
4d05be4095 Refactor InternalLinkGraph 2023-03-30 15:44:23 +02:00
Viktor Lofgren
137adb9c3c Bitmask calculation improvement. Take sentence length into consideration; not all lines are equal. 2023-03-30 15:42:06 +02:00
Viktor Lofgren
16e37672fc Bugfix crawl plan, doesn't use rewrite() everywhere 2023-03-30 15:41:07 +02:00
Viktor Lofgren
d0c72ceb7e Improve experiment runner, convenient start script. 2023-03-30 15:40:31 +02:00
Viktor Lofgren
0fcb2b534c Polish Names 2023-03-29 16:51:47 +02:00
Viktor Lofgren
dcf6218cdb Fix bugs related to search result selection in the case with multiple search terms.
* A deduplication filter step ran too early, and removed many good results on the basis that they partially, but did not fully, fit another set of search terms.

* Altered the query creation process to prefer documents where multiple terms appear in the priority index.
2023-03-29 15:18:52 +02:00
Viktor Lofgren
8f51345a1d Add experiment runner tool and remove the experiments module in processes. 2023-03-28 16:58:46 +02:00
Viktor Lofgren
03bd892b95 Improve document processing in conversion.
* Add flags for long and short documents.
* Break out common length logic from plugins.
* Cleaning up of related code.
2023-03-28 16:38:00 +02:00
Viktor Lofgren
1e65ac3940 Improve useful-resources.md 2023-03-28 16:35:58 +02:00
Viktor
e622437560 Create FUNDING.yml 2023-03-28 13:13:49 +02:00
Viktor Lofgren
30584887f9 DictionaryMap changes.
Add new flag to change the default size to make prod index boot faster. Remove option to select OffHeapDictionaryHashMap.
2023-03-27 17:28:39 +02:00
Viktor Lofgren
17ca4f9eea Permit search results that are all synthetic to pass relevancy check. 2023-03-27 17:27:35 +02:00
Viktor Lofgren
7fb3db3249 Fix bug where link on front page news listing wouldn't work.
... also changed order of date and source to make the UI more consistent.
2023-03-27 17:26:46 +02:00
Viktor Lofgren
b60fcd0918 Documentation improvements 2023-03-27 17:25:27 +02:00
Viktor Lofgren
862e925d7c "-Dsmall-ram=TRUE" no longer does anything. Remove references to the flag, which previously reduced the memory footprint of the loader and index service. 2023-03-26 21:37:11 +02:00
Viktor Lofgren
a0027ad32b Fix broken diagram links after doc/ restructuring. 2023-03-25 16:32:10 +01:00
Viktor Lofgren
c5f4cb34bf Documentation for DB 2023-03-25 16:14:16 +01:00
Viktor
2e69179f12 Update readme.md 2023-03-25 15:47:45 +01:00
Viktor
19000ab339 Create readme.md 2023-03-25 15:46:19 +01:00
Viktor
be3ba3ef37 Update readme.md 2023-03-25 15:27:11 +01:00