Viktor Lofgren
dbe9235f3a
(*) Upgrade to JDK21 with preview enabled.
... also move some common configuration into the root build.gradle-file.
Support for JDK21 in Lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena-based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
70aa04c047
(converter, stackexchange-xml) Add the ability to sideload stackexchange data
2023-09-21 12:48:33 +02:00
Viktor Lofgren
d895f83520
(blocking-thread-pool) Move DumbThreadPool to its own micro-library
Also rename it to SimpleBlockingThreadPool.
2023-09-20 10:11:49 +02:00
Viktor Lofgren
9b385ec7cc
(converter) Make it possible to sideload documents from a directory tree
2023-09-17 14:35:06 +02:00
Viktor Lofgren
5c040f7a46
(crawl-spec) Parquetify crawl spec
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor Lofgren
eaeb23d41e
(refactor) Remove converting-model package completely
2023-09-14 11:21:44 +02:00
Viktor Lofgren
24b4606f96
(converter,loader) Converter outputs parquet files instead of compressed json.
2023-09-13 16:13:41 +02:00
Viktor Lofgren
ebc84c22fb
Upgrade antique lombok plugin
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a
Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
46d761f34f
(language) fasttext based language filter
2023-08-16 15:48:12 +02:00
Viktor Lofgren
e7192a9cad
(mq) Refactor mq and actor library and move it to libraries out of common
2023-08-15 10:53:23 +02:00
Viktor Lofgren
4ab1cd9502
(*) last touches
2023-08-07 12:57:44 +02:00
Viktor Lofgren
6f4e767a04
(minor) Re-enable monkey-patch-json for converter
2023-07-31 10:31:46 +02:00
Viktor Lofgren
d3f01bd171
(crawler, converter) Remove monkey patched gson from dependencies
2023-07-29 19:18:12 +02:00
Viktor Lofgren
f11103d31d
(WIP) Make it possible to sideload encyclopedia data.
This is mostly a pilot track for sideloading other large websites.
Also change converter to produce a more compact output (Java serialization instead of JSON).
2023-07-28 18:14:43 +02:00
Viktor Lofgren
bca4bbb6c8
(*) Refactor MQ and MQSM
2023-07-17 13:57:32 +02:00
Viktor Lofgren
8b74e3aa0d
(*) File Storage WIP
2023-07-14 17:08:10 +02:00
Viktor
cbbf60a599
Better fingerprinting ( #35 )
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
d71124961e
Better tests for crawling and processing.
2023-06-27 16:11:27 +02:00
Viktor Lofgren
f8f9f04158
Specialized logic for processing Lemmy-based websites.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
266ad2e4de
Re-introduce monkey patched GSON to make converter run better.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
449471a076
Yet more restructuring. Improved search result ranking.
2023-03-16 21:35:54 +01:00
Viktor Lofgren
d82532b7f1
More restructuring, big bug fixes in keyword extraction.
2023-03-13 17:39:53 +01:00