Commit Graph

29 Commits

Author SHA1 Message Date
Viktor Lofgren
ebc84c22fb Upgrade antique lombok plugin
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
46d761f34f (language) fasttext based language filter 2023-08-16 15:48:12 +02:00
Viktor Lofgren
a5d980ee56 (converter) Hook crawl job extractor and adjacencies calculator into control service. 2023-07-26 15:46:22 +02:00
Viktor Lofgren
a56953c798 (converter, WIP) Refactor converter to not have to load everything into RAM. 2023-07-24 15:25:09 +02:00
Viktor Lofgren
c069c8c182 (crawler) Clean up crawl data reference and recrawl logic 2023-07-22 18:42:21 +02:00
Viktor Lofgren
f91d92cccb (crawler) WIP 2023-07-20 21:05:16 +02:00
Viktor Lofgren
8b74e3aa0d (*) File Storage WIP 2023-07-14 17:08:10 +02:00
Viktor Lofgren
480abfe966 (minor) Add limit to pol count in MqPersistence, fix test 2023-07-12 18:16:23 +02:00
Viktor Lofgren
74caf9e38a (processes) Remove forEach-constructs in favor of iterators. 2023-07-12 17:47:36 +02:00
Viktor
cbbf60a599 Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
dbb758d1a8 Minor: Better error handling in crawled domain reader 2023-07-10 18:58:43 +02:00
Viktor Lofgren
e7af77e151 Tests for crawler specialization + testdata 2023-06-27 10:57:54 +02:00
Viktor Lofgren
ed373eef61 Refactor crawler and add special logic for some platforms
* Break apart CrawlerRetreiver
* Break apart HttpFetcher into an interface and impl for testing sanity
* Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
a9fabba407 Tell experiment runner to only process some domains.
Updated the experiment runner, as well as the script.
2023-06-20 14:14:01 +02:00
Viktor Lofgren
4fc0ddbc45 Improved crawl-job-extractor.
Let crawl-job-extractor run offline and allow it to read domains from file.
Improved docs.
2023-06-20 11:37:52 +02:00
Viktor Lofgren
7ed3306be3 Make the adjacency calculator behave like it used to in the past, when it gave better results. 2023-06-07 22:03:06 +02:00
Viktor Lofgren
2afbdc2269 Adjust the logic for the crawl job extractor to set a relatively low visit limit for websites that are new in the index or has not yielded many good documents previously. 2023-06-07 22:01:35 +02:00
Viktor
5a5cdaf70e
Improvements to the adjacency calculator and screenshots tool (#13)
* WIP: Improvements to website adjacencies loader tool.

* Improving screenshots capture bot.
2023-04-18 22:21:49 +02:00
Viktor Lofgren
4d298cd5fa Improving screenshots capture bot. 2023-04-17 18:04:22 +02:00
Viktor Lofgren
fbbaf584ba Adjustments to screenshot capture tool. 2023-04-16 08:55:57 +02:00
Viktor Lofgren
3e9b37c264 Refactor website screenshot tool and website adjacencies calculator into code/tools. 2023-04-11 16:20:27 +02:00
Viktor Lofgren
fe419b12b4 Better handling of quote terms, fix bug in handling of longer queries.
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:11:40 +02:00
Viktor Lofgren
716ab35b4e Search ranking debuggability improvements. 2023-04-02 13:43:24 +02:00
Viktor Lofgren
affcf8cf41 Load test tool 2023-04-02 09:43:43 +02:00
Viktor Lofgren
d0c72ceb7e Improve experiment runner, convenient start script. 2023-03-30 15:40:31 +02:00
Viktor Lofgren
8f51345a1d Add experiment runner tool and got rid of experiments module in processes. 2023-03-28 16:58:46 +02:00
Viktor
ac1ac3ea57
Move database to a separate module
* Move database to a separate project, break apart sql file into separate entities.
* Fix front page news listing.
2023-03-25 15:26:17 +01:00
Viktor Lofgren
2eb972dea1 Remove unrelated code, break tools into their own directory. 2023-03-17 16:03:11 +01:00