Commit Graph

1110 Commits

Author SHA1 Message Date
Viktor Lofgren
1d0cea1d55 (converter) GUI for dealing with user complaints 2023-08-03 17:59:57 +02:00
Viktor Lofgren
f01f608474 (blacklist) Support blacklists with subdomain 2023-08-03 17:58:52 +02:00
Viktor Lofgren
c22feaf42e (crawl) Make crawler limiter request a GC when throttling 2023-08-03 17:58:18 +02:00
Viktor Lofgren
63e857f7cd (control) Add basic api key management 2023-08-02 20:14:03 +02:00
Viktor Lofgren
9979c9defe (search/index) Add blogosphere filter 2023-08-02 20:13:30 +02:00
Viktor Lofgren
7763df0715 (docs) Add control-service to the main readme.md 2023-08-01 22:52:41 +02:00
Viktor Lofgren
e088eb9ec8 (scripts|docs) Update scripts and documentations for the new operator's gui and file storage workflows. 2023-08-01 22:50:33 +02:00
Viktor Lofgren
19402772fc (scripts|docs) Update scripts and documentations for the new operator's gui and file storage workflows. 2023-08-01 22:50:05 +02:00
Viktor Lofgren
ba724bc1b2 (scripts|docs) Update scripts and documentations for the new operator's gui and file storage workflows. 2023-08-01 22:47:37 +02:00
Viktor Lofgren
8de3e6ab80 (control) Fix bug where CrawlActor and RecrawlActor would steal each others' mail 2023-08-01 22:33:30 +02:00
Viktor Lofgren
659d2134ba (file-storage) Deprecate mustClean flag 2023-08-01 22:32:30 +02:00
Viktor Lofgren
867410c66b (file-storage) Automatic file storage discovery via manifest file 2023-08-01 18:05:43 +02:00
Viktor Lofgren
483c2dbb44 (conf) Change default user-agent to not associate it with the project; remove unused disks.properties file. 2023-08-01 17:34:25 +02:00
Viktor Lofgren
e5c9791b14 (crawler) Fix rare ConcurrentModificationError due to HashSet 2023-08-01 17:28:29 +02:00
Viktor Lofgren
58556af6c7 (db) Use flwyay for database migrations. 2023-08-01 17:08:42 +02:00
Viktor Lofgren
2e29038ecd (db) Fix broken insert statement, move file storage defaults to a separate file. 2023-08-01 15:50:08 +02:00
Viktor Lofgren
36a23707c1 (control) Control service should be a core service. 2023-08-01 15:49:50 +02:00
Viktor Lofgren
c1ea60b399 (db) Default values for storage base 2023-08-01 15:05:04 +02:00
Viktor Lofgren
b08e302dd5 (lexicon) Optimize lexicon by using Murmur3_128's hash function 2023-08-01 15:02:13 +02:00
Viktor Lofgren
ea66195b97 (loader) Optimize loader by using zstd's direct streaming writer and the Murmur3_128 string hash 2023-08-01 15:02:13 +02:00
Viktor Lofgren
86a5cc5c5f (hash) Modified version of common codec's Murmur3 hash 2023-08-01 14:57:40 +02:00
Viktor Lofgren
8f0cbf267b (loader) Perform instruction reads in a separate thread for extra vroom vroom 2023-07-31 14:24:08 +02:00
Viktor Lofgren
2f8488610a (loader) Fix bug where trailing deferred domain meta inserts weren't executed 2023-07-31 14:23:23 +02:00
Viktor Lofgren
d95f01b701 (control) Reduce log spam in control svc 2023-07-31 14:21:06 +02:00
Viktor Lofgren
c9d7635370 (control) Aborting an actor that waits on a process request terminates the running job.
(control) Aborting an actor that waits on a process request terminates the running job.
2023-07-31 14:21:06 +02:00
Viktor Lofgren
6b5fb0f841 (control) Disable the start button for actors that aren't directly initializable.
(control) Disable the start button for actors that aren't directly initializable.
2023-07-31 14:21:00 +02:00
Viktor Lofgren
12bd74d4f3 Clean up ProcessService 2023-07-31 10:56:16 +02:00
Viktor Lofgren
37c4cc68ed TODO 2023-07-31 10:34:42 +02:00
Viktor Lofgren
1c948eb3d8 (minor) Alter DumbThreadPool in Converter to not claim the threads are crawlers. 2023-07-31 10:33:15 +02:00
Viktor Lofgren
cd90ca820f YAGNI filter over ConverterDomainTypes 2023-07-31 10:32:47 +02:00
Viktor Lofgren
9786f82220 Fix environment variables to processes so jmc works 2023-07-31 10:32:23 +02:00
Viktor Lofgren
6f4e767a04 (minor) Re-enable monkey-patch-json for converter 2023-07-31 10:31:46 +02:00
Viktor Lofgren
5411950b87 (minor) Tidy up EdgeDomain class a bit, no functional difference 2023-07-31 10:31:29 +02:00
Viktor Lofgren
6ff7e9648f (crawler) Use and pass the proper environment variables to the processes. 2023-07-30 16:54:02 +02:00
Viktor Lofgren
5c071ce4d3 (crawler) Clean up the code and remove unnecessary logging 2023-07-30 16:53:39 +02:00
Viktor Lofgren
caf3d231a8 (crawler) Fix rare issue with NPEs if the crawl queue is empty 2023-07-30 16:53:13 +02:00
Viktor Lofgren
730e8f74e4 (crawler) Even more memory optimizations.
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
2023-07-30 14:19:55 +02:00
Viktor Lofgren
aba134284f (crawler) Reduce log spam 2023-07-29 19:22:58 +02:00
Viktor Lofgren
2a6183f9e0 (crawler) Dynamic throttling of the number of active crawl jobs permitted to spawn; reduce queue size. 2023-07-29 19:20:09 +02:00
Viktor Lofgren
ee143bbc48 (crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended. 2023-07-29 19:19:09 +02:00
Viktor Lofgren
d3f01bd171 (crawler, converter) Remove monkey patched gson from dependencies 2023-07-29 19:18:12 +02:00
Viktor Lofgren
05ba3bab96 (crawler) Make SitemapRetriever abort on too large sitemaps. 2023-07-29 19:18:12 +02:00
Viktor Lofgren
d2b6b2044c (crawler) Reduce log spam in HttpFetcherImpl 2023-07-29 19:18:12 +02:00
Viktor Lofgren
7611b7900d (crawler) Reduce long term memory allocation in DomainCrawlFrontier
(crawler) Reduce long term memory allocation in DomainCrawlFrontier
2023-07-29 19:18:12 +02:00
Viktor Lofgren
9ad32ee9c7 (control) Be more clear about when a process exits and why. 2023-07-29 19:16:00 +02:00
Viktor Lofgren
866db6c63f (control) Dialog for updating message state; clean up file view. 2023-07-28 22:02:05 +02:00
Viktor Lofgren
01476577b8 (loader) Speed up loading back to original speeds with a cascading DELETE FROM EC_URL rather than EC_PAGE_DATA.
* Also clean up code and have proper rollbacks for transactions.
2023-07-28 22:00:07 +02:00
Viktor Lofgren
e237df4a10 (converter) Use a dumb thread pool instead of Java's executor service. 2023-07-28 18:15:16 +02:00
Viktor Lofgren
f11103d31d (WIP) Make it possible to sideload encyclopedia data.
This is mostly a pilot track for sideloading other large websites.

Also change coverter to produce a more compact output (java serialization instead of json).
2023-07-28 18:14:43 +02:00
Viktor Lofgren
9288d311d4 Add buffering to index journal writer 2023-07-28 18:11:19 +02:00