Viktor Lofgren
1d0cea1d55
(converter) GUI for dealing with user complaints
2023-08-03 17:59:57 +02:00
Viktor Lofgren
f01f608474
(blacklist) Support blacklists with subdomain
2023-08-03 17:58:52 +02:00
Viktor Lofgren
c22feaf42e
(crawl) Make crawler limiter request a GC when throttling
2023-08-03 17:58:18 +02:00
Viktor Lofgren
63e857f7cd
(control) Add basic api key management
2023-08-02 20:14:03 +02:00
Viktor Lofgren
9979c9defe
(search/index) Add blogosphere filter
2023-08-02 20:13:30 +02:00
Viktor Lofgren
7763df0715
(docs) Add control-service to the main readme.md
2023-08-01 22:52:41 +02:00
Viktor Lofgren
e088eb9ec8
(scripts|docs) Update scripts and documentation for the new operator's GUI and file storage workflows.
2023-08-01 22:50:33 +02:00
Viktor Lofgren
19402772fc
(scripts|docs) Update scripts and documentation for the new operator's GUI and file storage workflows.
2023-08-01 22:50:05 +02:00
Viktor Lofgren
ba724bc1b2
(scripts|docs) Update scripts and documentation for the new operator's GUI and file storage workflows.
2023-08-01 22:47:37 +02:00
Viktor Lofgren
8de3e6ab80
(control) Fix bug where CrawlActor and RecrawlActor would steal each others' mail
2023-08-01 22:33:30 +02:00
Viktor Lofgren
659d2134ba
(file-storage) Deprecate mustClean flag
2023-08-01 22:32:30 +02:00
Viktor Lofgren
867410c66b
(file-storage) Automatic file storage discovery via manifest file
2023-08-01 18:05:43 +02:00
Viktor Lofgren
483c2dbb44
(conf) Change default user-agent to not associate it with the project; remove unused disks.properties file.
2023-08-01 17:34:25 +02:00
Viktor Lofgren
e5c9791b14
(crawler) Fix rare ConcurrentModificationException due to HashSet
2023-08-01 17:28:29 +02:00
Viktor Lofgren
58556af6c7
(db) Use Flyway for database migrations.
2023-08-01 17:08:42 +02:00
Viktor Lofgren
2e29038ecd
(db) Fix broken insert statement, move file storage defaults to a separate file.
2023-08-01 15:50:08 +02:00
Viktor Lofgren
36a23707c1
(control) Control service should be a core service.
2023-08-01 15:49:50 +02:00
Viktor Lofgren
c1ea60b399
(db) Default values for storage base
2023-08-01 15:05:04 +02:00
Viktor Lofgren
b08e302dd5
(lexicon) Optimize lexicon by using Murmur3_128's hash function
2023-08-01 15:02:13 +02:00
Viktor Lofgren
ea66195b97
(loader) Optimize loader by using zstd's direct streaming writer and the Murmur3_128 string hash
2023-08-01 15:02:13 +02:00
Viktor Lofgren
86a5cc5c5f
(hash) Modified version of Commons Codec's Murmur3 hash
2023-08-01 14:57:40 +02:00
Viktor Lofgren
8f0cbf267b
(loader) Perform instruction reads in a separate thread for extra vroom vroom
2023-07-31 14:24:08 +02:00
Viktor Lofgren
2f8488610a
(loader) Fix bug where trailing deferred domain meta inserts weren't executed
2023-07-31 14:23:23 +02:00
Viktor Lofgren
d95f01b701
(control) Reduce log spam in control svc
2023-07-31 14:21:06 +02:00
Viktor Lofgren
c9d7635370
(control) Aborting an actor that waits on a process request terminates the running job.
2023-07-31 14:21:06 +02:00
Viktor Lofgren
6b5fb0f841
(control) Disable the start button for actors that aren't directly initializable.
2023-07-31 14:21:00 +02:00
Viktor Lofgren
12bd74d4f3
Clean up ProcessService
2023-07-31 10:56:16 +02:00
Viktor Lofgren
37c4cc68ed
TODO
2023-07-31 10:34:42 +02:00
Viktor Lofgren
1c948eb3d8
(minor) Alter DumbThreadPool in Converter to not claim the threads are crawlers.
2023-07-31 10:33:15 +02:00
Viktor Lofgren
cd90ca820f
YAGNI filter over ConverterDomainTypes
2023-07-31 10:32:47 +02:00
Viktor Lofgren
9786f82220
Fix environment variables to processes so jmc works
2023-07-31 10:32:23 +02:00
Viktor Lofgren
6f4e767a04
(minor) Re-enable monkey-patch-json for converter
2023-07-31 10:31:46 +02:00
Viktor Lofgren
5411950b87
(minor) Tidy up EdgeDomain class a bit, no functional difference
2023-07-31 10:31:29 +02:00
Viktor Lofgren
6ff7e9648f
(crawler) Use and pass the proper environment variables to the processes.
2023-07-30 16:54:02 +02:00
Viktor Lofgren
5c071ce4d3
(crawler) Clean up the code and remove unnecessary logging
2023-07-30 16:53:39 +02:00
Viktor Lofgren
caf3d231a8
(crawler) Fix rare issue with NPEs if the crawl queue is empty
2023-07-30 16:53:13 +02:00
Viktor Lofgren
730e8f74e4
(crawler) Even more memory optimizations.
...
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
2023-07-30 14:19:55 +02:00
Viktor Lofgren
aba134284f
(crawler) Reduce log spam
2023-07-29 19:22:58 +02:00
Viktor Lofgren
2a6183f9e0
(crawler) Dynamic throttling of the number of active crawl jobs permitted to spawn; reduce queue size.
2023-07-29 19:20:09 +02:00
Viktor Lofgren
ee143bbc48
(crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended.
2023-07-29 19:19:09 +02:00
Viktor Lofgren
d3f01bd171
(crawler, converter) Remove monkey patched gson from dependencies
2023-07-29 19:18:12 +02:00
Viktor Lofgren
05ba3bab96
(crawler) Make SitemapRetriever abort on too large sitemaps.
2023-07-29 19:18:12 +02:00
Viktor Lofgren
d2b6b2044c
(crawler) Reduce log spam in HttpFetcherImpl
2023-07-29 19:18:12 +02:00
Viktor Lofgren
7611b7900d
(crawler) Reduce long term memory allocation in DomainCrawlFrontier
2023-07-29 19:18:12 +02:00
Viktor Lofgren
9ad32ee9c7
(control) Be more clear about when a process exits and why.
2023-07-29 19:16:00 +02:00
Viktor Lofgren
866db6c63f
(control) Dialog for updating message state; clean up file view.
2023-07-28 22:02:05 +02:00
Viktor Lofgren
01476577b8
(loader) Speed up loading back to original speeds with a cascading DELETE FROM EC_URL rather than EC_PAGE_DATA.
...
* Also clean up code and have proper rollbacks for transactions.
2023-07-28 22:00:07 +02:00
Viktor Lofgren
e237df4a10
(converter) Use a dumb thread pool instead of Java's executor service.
2023-07-28 18:15:16 +02:00
Viktor Lofgren
f11103d31d
(WIP) Make it possible to sideload encyclopedia data.
...
This is mostly a pilot track for sideloading other large websites.
Also change converter to produce a more compact output (Java serialization instead of JSON).
2023-07-28 18:14:43 +02:00
Viktor Lofgren
9288d311d4
Add buffering to index journal writer
2023-07-28 18:11:19 +02:00