Commit Graph

904 Commits

Author SHA1 Message Date
Viktor Lofgren
05ba3bab96 (crawler) Make SitemapRetriever abort on too large sitemaps. 2023-07-29 19:18:12 +02:00
Viktor Lofgren
d2b6b2044c (crawler) Reduce log spam in HttpFetcherImpl 2023-07-29 19:18:12 +02:00
Viktor Lofgren
7611b7900d (crawler) Reduce long term memory allocation in DomainCrawlFrontier
(crawler) Reduce long term memory allocation in DomainCrawlFrontier
2023-07-29 19:18:12 +02:00
Viktor Lofgren
9ad32ee9c7 (control) Be more clear about when a process exits and why. 2023-07-29 19:16:00 +02:00
Viktor Lofgren
866db6c63f (control) Dialog for updating message state; clean up file view. 2023-07-28 22:02:05 +02:00
Viktor Lofgren
01476577b8 (loader) Speed up loading back to original speeds with a cascading DELETE FROM EC_URL rather than EC_PAGE_DATA.
* Also clean up code and have proper rollbacks for transactions.
2023-07-28 22:00:07 +02:00
Viktor Lofgren
e237df4a10 (converter) Use a dumb thread pool instead of Java's executor service. 2023-07-28 18:15:16 +02:00
Viktor Lofgren
f11103d31d (WIP) Make it possible to sideload encyclopedia data.
This is mostly a pilot track for sideloading other large websites.

Also change coverter to produce a more compact output (java serialization instead of json).
2023-07-28 18:14:43 +02:00
Viktor Lofgren
9288d311d4 Add buffering to index journal writer 2023-07-28 18:11:19 +02:00
Viktor Lofgren
77d5e39fe0 Make processed data Serializable 2023-07-28 18:11:19 +02:00
Viktor Lofgren
27e781761d (mq single shot inbox) Flag messages as OK if there is no recipient 2023-07-28 12:04:23 +02:00
Viktor Lofgren
92cac52813 (mq) Add indexes to MESSAGE_QUEUE 2023-07-28 12:03:51 +02:00
Viktor Lofgren
66bb12e55a (converter) File listing and download for file storage 2023-07-26 21:59:35 +02:00
Viktor Lofgren
a5d980ee56 (converter) Hook crawl job extractor and adjacencies calculator into control service. 2023-07-26 15:46:22 +02:00
Viktor Lofgren
19c2ceec9b (converter) Use Marginalia Yellow for control service 2023-07-26 11:50:23 +02:00
Viktor Lofgren
507f26ad47 (converter) Refactor converter to not keep instructions list in RAM.
(converter) Refactor converter to not keep instructions list in RAM.

(converter) Refactor converter to not keep instructions list in RAM.
2023-07-25 22:06:46 +02:00
Viktor Lofgren
fd44e09ebd (loader) Don't delete the entire link database when the loader runs 2023-07-24 18:37:35 +02:00
Viktor Lofgren
09fd0a1d0e (converter) Automatically clean stale file storage records if they disappear on disk 2023-07-24 17:04:42 +02:00
Viktor Lofgren
667b0ca0b0 (converter, WIP) Refactor CrawledDomainReader to not return iterators.
Instead return a closable class SerializableCrawlDataStream.
2023-07-24 16:28:30 +02:00
Viktor Lofgren
a56953c798 (converter, WIP) Refactor converter to not have to load everything into RAM. 2023-07-24 15:25:09 +02:00
Viktor Lofgren
7470c170b1 (minor) EdgeUrl.parse() should deal with null 2023-07-24 15:06:57 +02:00
Viktor Lofgren
bc330acfc9 (control) Better refresh script that doesn't cause weird artifacts 2023-07-23 19:26:16 +02:00
Viktor Lofgren
789e8eea85 (crawler) Clean up and refactor the code a bit 2023-07-23 19:08:38 +02:00
Viktor Lofgren
35b29e4f9e (crawler) Clean up and refactor the code a bit 2023-07-23 19:06:37 +02:00
Viktor Lofgren
69f333c0bf (crawler) Clean up and refactor the code a bit 2023-07-23 18:59:14 +02:00
Viktor Lofgren
c069c8c182 (crawler) Clean up crawl data reference and recrawl logic 2023-07-22 18:42:21 +02:00
Viktor Lofgren
9e4aa7da7c (crawler) Support for X-Robots-Tag 2023-07-22 18:42:21 +02:00
Viktor Lofgren
e22e65eee4 (index) Fix bug related to debug print statements 2023-07-22 14:33:58 +02:00
Viktor Lofgren
d6b07e4d01 (controller) Improve the storage interface 2023-07-21 19:56:16 +02:00
Viktor Lofgren
995657c6ce (big-string) Make big-string disable:able 2023-07-21 19:50:35 +02:00
Viktor Lofgren
58f2f86ea8 (crawler) Don't read all the data into RAM when doing a refresh-crawl 2023-07-21 19:47:52 +02:00
Viktor Lofgren
7bc1cff286 (minor) code cleanup 2023-07-21 14:28:37 +02:00
Viktor Lofgren
8f455f3b6d (control) Aborting a process spawner actor cancels the message to the actor. 2023-07-21 14:12:32 +02:00
Viktor Lofgren
f91d92cccb (crawler) WIP 2023-07-20 21:05:16 +02:00
Viktor Lofgren
08ca6399ec (converter) WIP 2023-07-19 17:14:45 +02:00
Viktor Lofgren
c0b5ea0e7d Revert "Less spammy default log settings"
This reverts commit f6e2216b87.
2023-07-18 19:28:42 +02:00
Viktor Lofgren
f21a3983aa Abortable processes 2023-07-18 18:40:12 +02:00
Viktor Lofgren
f6e2216b87 Less spammy default log settings 2023-07-17 21:42:13 +02:00
Viktor Lofgren
92ed513e4f Less spammy default log settings 2023-07-17 21:41:56 +02:00
Viktor Lofgren
d7ab21fe34 (*) Refactor Control Service and processes 2023-07-17 21:20:31 +02:00
Viktor Lofgren
bca4bbb6c8 (*) Refactor MQ and MQSM 2023-07-17 13:57:32 +02:00
Viktor Lofgren
e618aa34e9 (control) Name change process->fsm, new fsm:s
* FSM for spawning processes when messages appear for them
* FSM for removing data flagged for purging
2023-07-17 12:27:27 +02:00
Viktor Lofgren
6e41e78f36 (control) Higlight missing processes 2023-07-16 12:03:32 +02:00
Viktor Lofgren
c4dd9a0547 (control) Use MQFSMs to monitor and spawn processes when messages are sent to them 2023-07-16 11:58:47 +02:00
Viktor Lofgren
5ec10634d8 (mqfsm) Abortable state machine 2023-07-15 14:12:16 +02:00
Viktor Lofgren
cdae74d395 (control) Working redirects 2023-07-15 14:11:59 +02:00
Viktor Lofgren
8b74e3aa0d (*) File Storage WIP 2023-07-14 17:08:10 +02:00
Viktor Lofgren
23169ad818 (db) Model for file storage areas 2023-07-14 11:40:05 +02:00
Viktor Lofgren
d36e36c8fd (mq) Bugfix lastNMessages; use Lists.reverse properly 2023-07-14 11:39:15 +02:00
Viktor Lofgren
948d4d5f08 (control) Clean up the number of GUI views, abortable FSM tasks 2023-07-13 17:24:21 +02:00