Viktor Lofgren
ebc84c22fb
Upgrade antique lombok plugin
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a
Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
46d761f34f
(language) fasttext based language filter
2023-08-16 15:48:12 +02:00
Viktor Lofgren
a5d980ee56
(converter) Hook crawl job extractor and adjacencies calculator into control service.
2023-07-26 15:46:22 +02:00
Viktor Lofgren
a56953c798
(converter, WIP) Refactor converter to not have to load everything into RAM.
2023-07-24 15:25:09 +02:00
Viktor Lofgren
c069c8c182
(crawler) Clean up crawl data reference and recrawl logic
2023-07-22 18:42:21 +02:00
Viktor Lofgren
f91d92cccb
(crawler) WIP
2023-07-20 21:05:16 +02:00
Viktor Lofgren
8b74e3aa0d
(*) File Storage WIP
2023-07-14 17:08:10 +02:00
Viktor Lofgren
480abfe966
(minor) Add limit to poll count in MqPersistence, fix test
2023-07-12 18:16:23 +02:00
Viktor Lofgren
74caf9e38a
(processes) Remove forEach-constructs in favor of iterators.
2023-07-12 17:47:36 +02:00
Viktor
cbbf60a599
Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
dbb758d1a8
Minor: Better error handling in crawled domain reader
2023-07-10 18:58:43 +02:00
Viktor Lofgren
e7af77e151
Tests for crawler specialization + testdata
2023-06-27 10:57:54 +02:00
Viktor Lofgren
ed373eef61
Refactor crawler and add special logic for some platforms
* Break apart CrawlerRetreiver
* Break apart HttpFetcher into an interface and impl for testing sanity
* Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
a9fabba407
Tell experiment runner to only process some domains.
Updated the experiment runner, as well as the script.
2023-06-20 14:14:01 +02:00
Viktor Lofgren
4fc0ddbc45
Improved crawl-job-extractor.
Let crawl-job-extractor run offline and allow it to read domains from file.
Improved docs.
2023-06-20 11:37:52 +02:00
Viktor Lofgren
7ed3306be3
Make the adjacency calculator behave as it used to, when it gave better results.
2023-06-07 22:03:06 +02:00
Viktor Lofgren
2afbdc2269
Adjust the logic for the crawl job extractor to set a relatively low visit limit for websites that are new in the index or have not yielded many good documents previously.
2023-06-07 22:01:35 +02:00
Viktor
5a5cdaf70e
Improvements to the adjacency calculator and screenshots tool (#13)
* WIP: Improvements to website adjacencies loader tool.
* Improving screenshots capture bot.
2023-04-18 22:21:49 +02:00
Viktor Lofgren
4d298cd5fa
Improving screenshots capture bot.
2023-04-17 18:04:22 +02:00
Viktor Lofgren
fbbaf584ba
Adjustments to screenshot capture tool.
2023-04-16 08:55:57 +02:00
Viktor Lofgren
3e9b37c264
Refactor website screenshot tool and website adjacencies calculator into code/tools.
2023-04-11 16:20:27 +02:00
Viktor Lofgren
fe419b12b4
Better handling of quote terms, fix bug in handling of longer queries.
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java.
2023-04-10 13:11:40 +02:00
Viktor Lofgren
716ab35b4e
Search ranking debuggability improvements.
2023-04-02 13:43:24 +02:00
Viktor Lofgren
affcf8cf41
Load test tool
2023-04-02 09:43:43 +02:00
Viktor Lofgren
d0c72ceb7e
Improve experiment runner, convenient start script.
2023-03-30 15:40:31 +02:00
Viktor Lofgren
8f51345a1d
Add experiment runner tool and get rid of experiments module in processes.
2023-03-28 16:58:46 +02:00
Viktor
ac1ac3ea57
Move database to a separate module
* Move database to a separate project, break apart sql file into separate entities.
* Fix front page news listing.
2023-03-25 15:26:17 +01:00
Viktor Lofgren
2eb972dea1
Remove unrelated code, break tools into their own directory.
2023-03-17 16:03:11 +01:00