Commit Graph

29 Commits

Author SHA1 Message Date
Viktor Lofgren
45987a1d98 Merge branch 'master' into warc 2023-12-11 14:32:35 +01:00
Viktor Lofgren
f655ec5a5c (*) Refactor GeoIP-related code
In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services.

The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions.

The converter is modified ot query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server.

The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.
2023-12-10 17:30:43 +01:00
Viktor Lofgren
fabffa80f0 (warc) Integrate the crawler's content type parsing and charset logic into the WarcSideloader 2023-12-07 15:26:01 +01:00
Viktor Lofgren
cc813a5624 (convert) Add basic support for Warc file sideloading
This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.
2023-12-06 18:43:55 +01:00
Viktor Lofgren
0152004c42 Initial Commit Anchor Tags
* Added new (optional) model file in $WMSA_HOME/data/atags.parquet
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking
2023-11-04 14:24:17 +01:00
Viktor Lofgren
c51159672e (build) Move unit test configuration to root build.gradle 2023-10-04 12:46:22 +02:00
Viktor Lofgren
dbe9235f3a (*) Upgrade to JDK21 with preview enabled.
... also move some common configuration into the root build.gradle-file.

Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work.  This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
70aa04c047 (converter, stackexchange-xml) Add the ability to sideload stackexchange data 2023-09-21 12:48:33 +02:00
Viktor Lofgren
d895f83520 (blocking-thread-pool) Move DumbThreadPool to its own micro-library
Also rename it to SimpleBlockingThreadPool.
2023-09-20 10:11:49 +02:00
Viktor Lofgren
9b385ec7cc (converter) Make it possible to sideload documents from a directory tree 2023-09-17 14:35:06 +02:00
Viktor Lofgren
5c040f7a46 (crawl-spec) Parquetify crawl spec
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor Lofgren
eaeb23d41e (refactor) Remove converting-model package completely 2023-09-14 11:21:44 +02:00
Viktor Lofgren
24b4606f96 (converter,loader) Converter outputs parquet files instead of compressed json. 2023-09-13 16:13:41 +02:00
Viktor Lofgren
ebc84c22fb Upgrade antique lombok plugin
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
46d761f34f (language) fasttext based language filter 2023-08-16 15:48:12 +02:00
Viktor Lofgren
e7192a9cad (mq) Refactor mq and actor library and move it to libraries out of common 2023-08-15 10:53:23 +02:00
Viktor Lofgren
4ab1cd9502 (*) last touches 2023-08-07 12:57:44 +02:00
Viktor Lofgren
6f4e767a04 (minor) Re-enable monkey-patch-json for converter 2023-07-31 10:31:46 +02:00
Viktor Lofgren
d3f01bd171 (crawler, converter) Remove monkey patched gson from dependencies 2023-07-29 19:18:12 +02:00
Viktor Lofgren
f11103d31d (WIP) Make it possible to sideload encyclopedia data.
This is mostly a pilot track for sideloading other large websites.

Also change coverter to produce a more compact output (java serialization instead of json).
2023-07-28 18:14:43 +02:00
Viktor Lofgren
bca4bbb6c8 (*) Refactor MQ and MQSM 2023-07-17 13:57:32 +02:00
Viktor Lofgren
8b74e3aa0d (*) File Storage WIP 2023-07-14 17:08:10 +02:00
Viktor
cbbf60a599 Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
d71124961e Better tests for crawling and processing. 2023-06-27 16:11:27 +02:00
Viktor Lofgren
f8f9f04158 Specialized logic for processing Lemmy-based websites. 2023-06-27 10:57:54 +02:00
Viktor Lofgren
266ad2e4de Re-introduce monkey patched GSON to make converter run better.
fixup! Re-introduce monkey patched GSON to make converter run better.

fixup! Re-introduce monkey patched GSON to make converter run better.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
449471a076 Yet more restructuring. Improved search result ranking. 2023-03-16 21:35:54 +01:00
Viktor Lofgren
d82532b7f1 More restructuring, big bug fixes in keyword extraction. 2023-03-13 17:39:53 +01:00