Commit Graph

933 Commits

Author SHA1 Message Date
Viktor Lofgren
483c2dbb44 (conf) Change default user-agent to not associate it with the project; remove unused disks.properties file. 2023-08-01 17:34:25 +02:00
Viktor Lofgren
e5c9791b14 (crawler) Fix rare ConcurrentModificationError due to HashSet 2023-08-01 17:28:29 +02:00
Viktor Lofgren
58556af6c7 (db) Use flwyay for database migrations. 2023-08-01 17:08:42 +02:00
Viktor Lofgren
2e29038ecd (db) Fix broken insert statement, move file storage defaults to a separate file. 2023-08-01 15:50:08 +02:00
Viktor Lofgren
36a23707c1 (control) Control service should be a core service. 2023-08-01 15:49:50 +02:00
Viktor Lofgren
c1ea60b399 (db) Default values for storage base 2023-08-01 15:05:04 +02:00
Viktor Lofgren
b08e302dd5 (lexicon) Optimize lexicon by using Murmur3_128's hash function 2023-08-01 15:02:13 +02:00
Viktor Lofgren
ea66195b97 (loader) Optimize loader by using zstd's direct streaming writer and the Murmur3_128 string hash 2023-08-01 15:02:13 +02:00
Viktor Lofgren
86a5cc5c5f (hash) Modified version of common codec's Murmur3 hash 2023-08-01 14:57:40 +02:00
Viktor Lofgren
8f0cbf267b (loader) Perform instruction reads in a separate thread for extra vroom vroom 2023-07-31 14:24:08 +02:00
Viktor Lofgren
2f8488610a (loader) Fix bug where trailing deferred domain meta inserts weren't executed 2023-07-31 14:23:23 +02:00
Viktor Lofgren
d95f01b701 (control) Reduce log spam in control svc 2023-07-31 14:21:06 +02:00
Viktor Lofgren
c9d7635370 (control) Aborting an actor that waits on a process request terminates the running job.
(control) Aborting an actor that waits on a process request terminates the running job.
2023-07-31 14:21:06 +02:00
Viktor Lofgren
6b5fb0f841 (control) Disable the start button for actors that aren't directly initializable.
(control) Disable the start button for actors that aren't directly initializable.
2023-07-31 14:21:00 +02:00
Viktor Lofgren
12bd74d4f3 Clean up ProcessService 2023-07-31 10:56:16 +02:00
Viktor Lofgren
37c4cc68ed TODO 2023-07-31 10:34:42 +02:00
Viktor Lofgren
1c948eb3d8 (minor) Alter DumbThreadPool in Converter to not claim the threads are crawlers. 2023-07-31 10:33:15 +02:00
Viktor Lofgren
cd90ca820f YAGNI filter over ConverterDomainTypes 2023-07-31 10:32:47 +02:00
Viktor Lofgren
9786f82220 Fix environment variables to processes so jmc works 2023-07-31 10:32:23 +02:00
Viktor Lofgren
6f4e767a04 (minor) Re-enable monkey-patch-json for converter 2023-07-31 10:31:46 +02:00
Viktor Lofgren
5411950b87 (minor) Tidy up EdgeDomain class a bit, no functional difference 2023-07-31 10:31:29 +02:00
Viktor Lofgren
6ff7e9648f (crawler) Use and pass the proper environment variables to the processes. 2023-07-30 16:54:02 +02:00
Viktor Lofgren
5c071ce4d3 (crawler) Clean up the code and remove unnecessary logging 2023-07-30 16:53:39 +02:00
Viktor Lofgren
caf3d231a8 (crawler) Fix rare issue with NPEs if the crawl queue is empty 2023-07-30 16:53:13 +02:00
Viktor Lofgren
730e8f74e4 (crawler) Even more memory optimizations.
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
2023-07-30 14:19:55 +02:00
Viktor Lofgren
aba134284f (crawler) Reduce log spam 2023-07-29 19:22:58 +02:00
Viktor Lofgren
2a6183f9e0 (crawler) Dynamic throttling of the number of active crawl jobs permitted to spawn; reduce queue size. 2023-07-29 19:20:09 +02:00
Viktor Lofgren
ee143bbc48 (crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended. 2023-07-29 19:19:09 +02:00
Viktor Lofgren
d3f01bd171 (crawler, converter) Remove monkey patched gson from dependencies 2023-07-29 19:18:12 +02:00
Viktor Lofgren
05ba3bab96 (crawler) Make SitemapRetriever abort on too large sitemaps. 2023-07-29 19:18:12 +02:00
Viktor Lofgren
d2b6b2044c (crawler) Reduce log spam in HttpFetcherImpl 2023-07-29 19:18:12 +02:00
Viktor Lofgren
7611b7900d (crawler) Reduce long term memory allocation in DomainCrawlFrontier
(crawler) Reduce long term memory allocation in DomainCrawlFrontier
2023-07-29 19:18:12 +02:00
Viktor Lofgren
9ad32ee9c7 (control) Be more clear about when a process exits and why. 2023-07-29 19:16:00 +02:00
Viktor Lofgren
866db6c63f (control) Dialog for updating message state; clean up file view. 2023-07-28 22:02:05 +02:00
Viktor Lofgren
01476577b8 (loader) Speed up loading back to original speeds with a cascading DELETE FROM EC_URL rather than EC_PAGE_DATA.
* Also clean up code and have proper rollbacks for transactions.
2023-07-28 22:00:07 +02:00
Viktor Lofgren
e237df4a10 (converter) Use a dumb thread pool instead of Java's executor service. 2023-07-28 18:15:16 +02:00
Viktor Lofgren
f11103d31d (WIP) Make it possible to sideload encyclopedia data.
This is mostly a pilot track for sideloading other large websites.

Also change coverter to produce a more compact output (java serialization instead of json).
2023-07-28 18:14:43 +02:00
Viktor Lofgren
9288d311d4 Add buffering to index journal writer 2023-07-28 18:11:19 +02:00
Viktor Lofgren
77d5e39fe0 Make processed data Serializable 2023-07-28 18:11:19 +02:00
Viktor Lofgren
27e781761d (mq single shot inbox) Flag messages as OK if there is no recipient 2023-07-28 12:04:23 +02:00
Viktor Lofgren
92cac52813 (mq) Add indexes to MESSAGE_QUEUE 2023-07-28 12:03:51 +02:00
Viktor Lofgren
66bb12e55a (converter) File listing and download for file storage 2023-07-26 21:59:35 +02:00
Viktor Lofgren
a5d980ee56 (converter) Hook crawl job extractor and adjacencies calculator into control service. 2023-07-26 15:46:22 +02:00
Viktor Lofgren
19c2ceec9b (converter) Use Marginalia Yellow for control service 2023-07-26 11:50:23 +02:00
Viktor Lofgren
507f26ad47 (converter) Refactor converter to not keep instructions list in RAM.
(converter) Refactor converter to not keep instructions list in RAM.

(converter) Refactor converter to not keep instructions list in RAM.
2023-07-25 22:06:46 +02:00
Viktor Lofgren
fd44e09ebd (loader) Don't delete the entire link database when the loader runs 2023-07-24 18:37:35 +02:00
Viktor Lofgren
09fd0a1d0e (converter) Automatically clean stale file storage records if they disappear on disk 2023-07-24 17:04:42 +02:00
Viktor Lofgren
667b0ca0b0 (converter, WIP) Refactor CrawledDomainReader to not return iterators.
Instead return a closable class SerializableCrawlDataStream.
2023-07-24 16:28:30 +02:00
Viktor Lofgren
a56953c798 (converter, WIP) Refactor converter to not have to load everything into RAM. 2023-07-24 15:25:09 +02:00
Viktor Lofgren
7470c170b1 (minor) EdgeUrl.parse() should deal with null 2023-07-24 15:06:57 +02:00