Commit Graph

122 Commits

Author SHA1 Message Date
Viktor Lofgren
bf92c270dc (language) Rollback language filter change a bit.
It appears to lead to too much junk in the lexicon.
2023-08-23 10:16:57 +02:00
Viktor Lofgren
e507844616 (language) Rollback language filter change a bit.
It appears to lead to too much junk in the lexicon.
2023-08-23 10:03:25 +02:00
Viktor Lofgren
ca12dd59f7 (loader) Fix Cleaner resource leak
Apparently Cleaners have an associated native thread, so the way to use them is to have a single static cleaner.
2023-08-22 18:05:00 +02:00
Viktor Lofgren
46409c4c2d (loader) Use the correct interface for InstructionCounter 2023-08-22 11:11:36 +02:00
Viktor Lofgren
704de50a9b (forward-index, valuator) HTML features in valuator
Put it in the forward index for easy access during index-side valuation.
2023-08-18 11:54:56 +02:00
Viktor Lofgren
fcfe07fb7d (valuator) Clean up code 2023-08-18 11:26:56 +02:00
Viktor Lofgren
ccf4990add (minor) Clean up code 2023-08-18 11:26:39 +02:00
Viktor Lofgren
f2638dd845 (feature-extractor) More adtech nonsense 2023-08-18 11:26:19 +02:00
Viktor Lofgren
239980ecae (minor) Improve comment 2023-08-18 11:26:05 +02:00
Viktor Lofgren
bee815b1c4 (converter) Add monsterinsights as an adtech tracker 2023-08-17 17:44:11 +02:00
Viktor Lofgren
e296b02649 (converter) Optimize LSH based within-domain deduplication 2023-08-17 17:43:46 +02:00
Viktor Lofgren
46d761f34f (language) fasttext based language filter 2023-08-16 15:48:12 +02:00
Viktor Lofgren
4598c7f40f (valuation) Penalize wordpress style kebab case urls 2023-08-16 13:11:24 +02:00
Viktor Lofgren
1d486bddee (crawler) Reduce log spam 2023-08-16 11:12:09 +02:00
Viktor Lofgren
d8073f0dde (feature-extractor) Add mail.ru counter to non-adtech trackers 2023-08-15 19:10:43 +02:00
Viktor Lofgren
e7192a9cad (mq) Refactor mq and actor library and move it to libraries out of common 2023-08-15 10:53:23 +02:00
Viktor Lofgren
ce293029c7 (converter) Treat adtech tracking as advertisement. 2023-08-09 14:23:53 +02:00
Viktor Lofgren
251fc63b42 (*) Fix merge gore 2023-08-09 13:33:28 +02:00
Viktor Lofgren
4ab1cd9502 (*) last touches 2023-08-07 12:57:44 +02:00
Viktor
52e2ab45bf
Merge branch 'master' into master-control-program 2023-08-07 12:53:43 +02:00
Viktor Lofgren
c22feaf42e (crawl) Make crawler limiter request a GC when throttling 2023-08-03 17:58:18 +02:00
Viktor Lofgren
e5c9791b14 (crawler) Fix rare ConcurrentModificationError due to HashSet 2023-08-01 17:28:29 +02:00
Viktor Lofgren
58556af6c7 (db) Use flwyay for database migrations. 2023-08-01 17:08:42 +02:00
Viktor Lofgren
ea66195b97 (loader) Optimize loader by using zstd's direct streaming writer and the Murmur3_128 string hash 2023-08-01 15:02:13 +02:00
Viktor Lofgren
8f0cbf267b (loader) Perform instruction reads in a separate thread for extra vroom vroom 2023-07-31 14:24:08 +02:00
Viktor Lofgren
2f8488610a (loader) Fix bug where trailing deferred domain meta inserts weren't executed 2023-07-31 14:23:23 +02:00
Viktor Lofgren
37c4cc68ed TODO 2023-07-31 10:34:42 +02:00
Viktor Lofgren
1c948eb3d8 (minor) Alter DumbThreadPool in Converter to not claim the threads are crawlers. 2023-07-31 10:33:15 +02:00
Viktor Lofgren
cd90ca820f YAGNI filter over ConverterDomainTypes 2023-07-31 10:32:47 +02:00
Viktor Lofgren
6f4e767a04 (minor) Re-enable monkey-patch-json for converter 2023-07-31 10:31:46 +02:00
Viktor Lofgren
5c071ce4d3 (crawler) Clean up the code and remove unnecessary logging 2023-07-30 16:53:39 +02:00
Viktor Lofgren
caf3d231a8 (crawler) Fix rare issue with NPEs if the crawl queue is empty 2023-07-30 16:53:13 +02:00
Viktor Lofgren
730e8f74e4 (crawler) Even more memory optimizations.
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
2023-07-30 14:19:55 +02:00
Viktor Lofgren
aba134284f (crawler) Reduce log spam 2023-07-29 19:22:58 +02:00
Viktor Lofgren
2a6183f9e0 (crawler) Dynamic throttling of the number of active crawl jobs permitted to spawn; reduce queue size. 2023-07-29 19:20:09 +02:00
Viktor Lofgren
ee143bbc48 (crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended. 2023-07-29 19:19:09 +02:00
Viktor Lofgren
d3f01bd171 (crawler, converter) Remove monkey patched gson from dependencies 2023-07-29 19:18:12 +02:00
Viktor Lofgren
05ba3bab96 (crawler) Make SitemapRetriever abort on too large sitemaps. 2023-07-29 19:18:12 +02:00
Viktor Lofgren
d2b6b2044c (crawler) Reduce log spam in HttpFetcherImpl 2023-07-29 19:18:12 +02:00
Viktor Lofgren
7611b7900d (crawler) Reduce long term memory allocation in DomainCrawlFrontier
(crawler) Reduce long term memory allocation in DomainCrawlFrontier
2023-07-29 19:18:12 +02:00
Viktor Lofgren
01476577b8 (loader) Speed up loading back to original speeds with a cascading DELETE FROM EC_URL rather than EC_PAGE_DATA.
* Also clean up code and have proper rollbacks for transactions.
2023-07-28 22:00:07 +02:00
Viktor Lofgren
e237df4a10 (converter) Use a dumb thread pool instead of Java's executor service. 2023-07-28 18:15:16 +02:00
Viktor Lofgren
f11103d31d (WIP) Make it possible to sideload encyclopedia data.
This is mostly a pilot track for sideloading other large websites.

Also change coverter to produce a more compact output (java serialization instead of json).
2023-07-28 18:14:43 +02:00
Viktor Lofgren
507f26ad47 (converter) Refactor converter to not keep instructions list in RAM.
(converter) Refactor converter to not keep instructions list in RAM.

(converter) Refactor converter to not keep instructions list in RAM.
2023-07-25 22:06:46 +02:00
Viktor Lofgren
fd44e09ebd (loader) Don't delete the entire link database when the loader runs 2023-07-24 18:37:35 +02:00
Viktor Lofgren
667b0ca0b0 (converter, WIP) Refactor CrawledDomainReader to not return iterators.
Instead return a closable class SerializableCrawlDataStream.
2023-07-24 16:28:30 +02:00
Viktor Lofgren
a56953c798 (converter, WIP) Refactor converter to not have to load everything into RAM. 2023-07-24 15:25:09 +02:00
Viktor Lofgren
35b29e4f9e (crawler) Clean up and refactor the code a bit 2023-07-23 19:06:37 +02:00
Viktor Lofgren
69f333c0bf (crawler) Clean up and refactor the code a bit 2023-07-23 18:59:14 +02:00
Viktor Lofgren
c069c8c182 (crawler) Clean up crawl data reference and recrawl logic 2023-07-22 18:42:21 +02:00