Viktor Lofgren
81ae501e73
(converter) Use ThreadLocalSentenceExtractorProvider for PlainText plugin as well
2023-09-25 18:28:34 +02:00
Viktor Lofgren
f797a92f87
(converter, minor) Use domain name in task heartbeat progress
2023-09-25 18:27:04 +02:00
Viktor Lofgren
a433bbbe45
(converter) Fix rare sentence extractor bug
...
It was caused by non-thread safe concurrent memory access in SentenceExtractor.
2023-09-24 19:39:48 +02:00
Viktor Lofgren
8ca20f184d
(keyword-extraction) Chasing my tail looking for a bug
2023-09-24 19:39:48 +02:00
Viktor Lofgren
dbe9235f3a
(*) Upgrade to JDK21 with preview enabled.
...
... also move some common configuration into the root build.gradle-file.
Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
ad660cf420
(converter) Bugfix: Don't try to Path.of() on optional field
2023-09-21 13:27:09 +02:00
Viktor Lofgren
70aa04c047
(converter, stackexchange-xml) Add the ability to sideload stackexchange data
2023-09-21 12:48:33 +02:00
Viktor Lofgren
d895f83520
(blocking-thread-pool) Move DumbThreadPool to its own micro-library
...
Also rename it to SimpleBlockingThreadPool.
2023-09-20 10:11:49 +02:00
Viktor Lofgren
f6b9e8c5eb
(converter) JavadocSpecialization should truncate its summary if it gets too long
2023-09-17 16:25:33 +02:00
Viktor Lofgren
98bcdf6028
(converter) DirtreeSideloader now trims /index.html from the URL if present
...
This is a crawler artifact in 9 cases out of 10, and may lead to bad URLs.
2023-09-17 16:08:16 +02:00
Viktor Lofgren
9b385ec7cc
(converter) Make it possible to sideload documents from a directory tree
2023-09-17 14:35:06 +02:00
Viktor Lofgren
5c040f7a46
(crawl-spec) Parquetify crawl spec
...
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor Lofgren
c67d95c00f
(converter) Write dummy processor log when sideloading
2023-09-14 14:13:03 +02:00
Viktor Lofgren
5e5aaf9a7e
(converter, control) Re-enable sideloading encyclopedia data
2023-09-14 12:12:07 +02:00
Viktor Lofgren
eaeb23d41e
(refactor) Remove converting-model package completely
2023-09-14 11:21:44 +02:00
Viktor Lofgren
c71f6ad417
(converter) Add heartbeats to the loader processes and execute the tasks in parallel for a ~2X speedup
2023-09-14 10:11:57 +02:00
Viktor Lofgren
4799dd769e
(converting) WIP begin to remove converting-model and the old InstructionsCompiler
2023-09-13 19:18:58 +02:00
Viktor Lofgren
24b4606f96
(converter,loader) Converter outputs parquet files instead of compressed json.
2023-09-13 16:13:41 +02:00
Viktor Lofgren
39c1857c61
(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.
2023-08-29 13:07:55 +02:00
Viktor Lofgren
3101b74580
(index) Move to a lexicon-free index design
...
This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it
also added a fairly significant RAM penalty to both the index service and the loader.
The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.
It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
2023-08-28 14:02:23 +02:00
Viktor Lofgren
194a6057dd
(index,control) Recoverable index backups
2023-08-25 14:57:43 +02:00
Viktor Lofgren
e710e057e2
(db) Remove EC_URL and EC_PAGE_DATA from mariadb database
2023-08-25 13:45:03 +02:00
Viktor Lofgren
6a04cdfddf
(loader) Implement new linkdb in loader
...
Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal.
For now, we no longer store new URLs in different domains. We need to re-implement this somehow, probably in a different job or a as a different output.
2023-08-24 13:07:54 +02:00
Viktor Lofgren
ebc84c22fb
Upgrade antique lombok plugin
...
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a
Upgrade code to Java 20.
...
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
bf92c270dc
(language) Rollback language filter change a bit.
...
It appears to lead to too much junk in the lexicon.
2023-08-23 10:16:57 +02:00
Viktor Lofgren
e507844616
(language) Rollback language filter change a bit.
...
It appears to lead to too much junk in the lexicon.
2023-08-23 10:03:25 +02:00
Viktor Lofgren
704de50a9b
(forward-index, valuator) HTML features in valuator
...
Put it in the forward index for easy access during index-side valuation.
2023-08-18 11:54:56 +02:00
Viktor Lofgren
fcfe07fb7d
(valuator) Clean up code
2023-08-18 11:26:56 +02:00
Viktor Lofgren
ccf4990add
(minor) Clean up code
2023-08-18 11:26:39 +02:00
Viktor Lofgren
f2638dd845
(feature-extractor) More adtech nonsense
2023-08-18 11:26:19 +02:00
Viktor Lofgren
239980ecae
(minor) Improve comment
2023-08-18 11:26:05 +02:00
Viktor Lofgren
bee815b1c4
(converter) Add monsterinsights as an adtech tracker
2023-08-17 17:44:11 +02:00
Viktor Lofgren
e296b02649
(converter) Optimize LSH based within-domain deduplication
2023-08-17 17:43:46 +02:00
Viktor Lofgren
46d761f34f
(language) fasttext based language filter
2023-08-16 15:48:12 +02:00
Viktor Lofgren
4598c7f40f
(valuation) Penalize wordpress style kebab case urls
2023-08-16 13:11:24 +02:00
Viktor Lofgren
d8073f0dde
(feature-extractor) Add mail.ru counter to non-adtech trackers
2023-08-15 19:10:43 +02:00
Viktor Lofgren
e7192a9cad
(mq) Refactor mq and actor library and move it to libraries out of common
2023-08-15 10:53:23 +02:00
Viktor Lofgren
ce293029c7
(converter) Treat adtech tracking as advertisement.
2023-08-09 14:23:53 +02:00
Viktor Lofgren
4ab1cd9502
(*) last touches
2023-08-07 12:57:44 +02:00
Viktor
52e2ab45bf
Merge branch 'master' into master-control-program
2023-08-07 12:53:43 +02:00
Viktor Lofgren
2f8488610a
(loader) Fix bug where trailing deferred domain meta inserts weren't executed
2023-07-31 14:23:23 +02:00
Viktor Lofgren
37c4cc68ed
TODO
2023-07-31 10:34:42 +02:00
Viktor Lofgren
1c948eb3d8
(minor) Alter DumbThreadPool in Converter to not claim the threads are crawlers.
2023-07-31 10:33:15 +02:00
Viktor Lofgren
cd90ca820f
YAGNI filter over ConverterDomainTypes
2023-07-31 10:32:47 +02:00
Viktor Lofgren
6f4e767a04
(minor) Re-enable monkey-patch-json for converter
2023-07-31 10:31:46 +02:00
Viktor Lofgren
730e8f74e4
(crawler) Even more memory optimizations.
...
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
2023-07-30 14:19:55 +02:00
Viktor Lofgren
ee143bbc48
(crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended.
2023-07-29 19:19:09 +02:00
Viktor Lofgren
d3f01bd171
(crawler, converter) Remove monkey patched gson from dependencies
2023-07-29 19:18:12 +02:00
Viktor Lofgren
f11103d31d
(WIP) Make it possible to sideload encyclopedia data.
...
This is mostly a pilot track for sideloading other large websites.
Also change coverter to produce a more compact output (java serialization instead of json).
2023-07-28 18:14:43 +02:00