c73e43f5c9
In the scenario where an operator

* Performs a new crawl from spec
* Doesn't load the data into the index
* Recrawls the data

the recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file, irrecoverably losing the crawl log and making it impossible to load!

To mitigate the impact of similar problems, the change saves a backup of the old crawl log and complains when this happens.

More specifically to this exact scenario, however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl. This should help the DbCrawlSpecProvider find them regardless of loaded state.

This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so it's arguably merited.
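The backup-and-complain mitigation described in the commit message can be illustrated with a minimal sketch. This is not the project's actual code: the class name `CrawlLogBackup`, the method `backupBeforeOverwrite`, the `.bak` suffix, and the file name `crawler.log` are all hypothetical, chosen only to show the idea of preserving a non-empty log before it is replaced.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration; names and file layout are assumptions.
public class CrawlLogBackup {

    /**
     * Copies an existing, non-empty crawl log to "<name>.bak" before the
     * caller overwrites it. Returns true if a backup was written.
     */
    static boolean backupBeforeOverwrite(Path crawlLog) throws IOException {
        if (!Files.exists(crawlLog) || Files.size(crawlLog) == 0) {
            return false; // nothing worth preserving
        }
        Path backup = crawlLog.resolveSibling(crawlLog.getFileName() + ".bak");
        Files.copy(crawlLog, backup, StandardCopyOption.REPLACE_EXISTING);
        // "Complain" loudly so the operator notices the overwrite
        System.err.println("WARNING: overwriting crawl log " + crawlLog
                + ", backup saved to " + backup);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("crawl");
        Path log = dir.resolve("crawler.log");
        Files.writeString(log, "example.com\n");

        backupBeforeOverwrite(log);
        Files.writeString(log, ""); // simulate the recrawl writing an empty log

        // The old contents are still recoverable from the backup
        System.out.print(Files.readString(dir.resolve("crawler.log.bak")));
    }
}
```

A single `.bak` file only protects against one accidental overwrite; a timestamped backup name would be one way to keep several generations, at the cost of needing cleanup.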
The executor service is a partitioned service responsible for executing and keeping track of long-running maintenance and operational tasks, such as crawling or data processing.
It accomplishes this using the message queue and actor library, which permits program state to survive crashes and reboots. The executor service is closely linked to the control-service, which provides a user interface for much of the executor's functionality.
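To illustrate how persisting state lets a long-running task survive a restart, here is a minimal sketch. It does not use the project's message queue or actor library; the class `ResumableTask`, its `State` enum, and the use of a plain file as the durable store are assumptions made purely for illustration. The idea is that each transition is written to durable storage before the task moves on, so a fresh process can pick up from the last recorded state.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration; a file stands in for the durable message queue.
public class ResumableTask {
    enum State { INITIAL, CRAWLING, PROCESSING, DONE }

    private final Path stateFile;

    ResumableTask(Path stateFile) {
        this.stateFile = stateFile;
    }

    /** Reads the last persisted state, or INITIAL if nothing was recorded. */
    State loadState() throws IOException {
        if (!Files.exists(stateFile)) {
            return State.INITIAL;
        }
        return State.valueOf(Files.readString(stateFile).trim());
    }

    /** Advances one step from the persisted state and records the result. */
    State step() throws IOException {
        State next;
        switch (loadState()) {
            case INITIAL:  next = State.CRAWLING;   break;
            case CRAWLING: next = State.PROCESSING; break;
            default:       next = State.DONE;       break;
        }
        Files.writeString(stateFile, next.name());
        return next;
    }

    public static void main(String[] args) throws IOException {
        Path stateFile = Files.createTempDirectory("executor").resolve("task.state");

        State first = new ResumableTask(stateFile).step();
        // Simulate a crash and reboot: a new object with no in-memory state
        State second = new ResumableTask(stateFile).step();

        System.out.println(first + " -> " + second);
    }
}
```

Because the second `ResumableTask` instance holds no in-memory state, it can only continue correctly by reading what the first one persisted, which is the property the paragraph above describes.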