The processor normally retains the domain data in memory after processing so that it can perform additional site-wide analysis. This works well, except that a number of outlier websites have an absurd number of documents and can rapidly fill up the process heap. These websites now receive a simplified treatment, executed in the converter batch writer thread. This is slower, but the documents are not retained in memory.
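The split between the normal in-memory path and the simplified streaming path could be sketched as follows. This is an illustrative sketch only, with hypothetical names and a made-up threshold; the actual converter code is structured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch of the two processing paths (hypothetical names throughout). */
class DomainProcessorSketch {
    // Hypothetical cutoff; the real limit is an implementation detail.
    static final int MAX_IN_MEMORY_DOCS = 10_000;

    /**
     * Normal path: retain all documents for later site-wide analysis.
     * Outlier path: hand each document straight to the batch writer and
     * discard it, so the heap stays bounded (no site-wide analysis).
     */
    static List<String> process(List<String> documents, Consumer<String> batchWriter) {
        if (documents.size() > MAX_IN_MEMORY_DOCS) {
            for (String doc : documents) {
                batchWriter.accept(doc); // write immediately, keep nothing
            }
            return List.of(); // nothing retained in memory
        }
        List<String> retained = new ArrayList<>(documents);
        retained.forEach(batchWriter);
        return retained; // available for site-wide analysis
    }
}
```

The trade-off shown here matches the description above: the streaming path is slower per domain but its memory use no longer scales with the document count.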
Experiment Runner

This tool launches crawl data processing experiments for interacting with crawl data. It is started via run/experiment.sh. New experiments must be registered in ExperimentRunnerMain before the script can run them.
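One plausible shape for such a registration step is a map from experiment name to factory, consulted when the script passes an experiment name on the command line. This is a hypothetical sketch; the actual wiring inside ExperimentRunnerMain may look quite different.

```java
import java.util.Map;
import java.util.function.Supplier;

/** Minimal experiment contract (hypothetical). */
interface Experiment {
    void run();
}

/** Hypothetical registry: new experiments would be added to the map. */
class ExperimentRegistrySketch {
    static final Map<String, Supplier<Experiment>> EXPERIMENTS = Map.of(
        "sample-experiment", () -> () -> System.out.println("running sample-experiment")
    );

    /** Resolve the name given on the command line to an experiment instance. */
    static Experiment lookup(String name) {
        Supplier<Experiment> factory = EXPERIMENTS.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown experiment: " + name);
        }
        return factory.get();
    }
}
```

Under this scheme, forgetting the registration step means the lookup fails with an "Unknown experiment" error, which is consistent with the note above that new experiments must be added before the script can run them.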