CatgirlIntelligenceAgency/code/tools/experiment-runner
Viktor Lofgren 440e097d78 (crawler) WIP integration of WARC files into the crawler and converter process.
This commit is in a pretty rough state.  It refactors the crawler fairly significantly to offer better separation of concerns.  It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data.  This works, -ish.

There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.

A problem is that the WARC files are a bit too large. It will probably be necessary to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.
2023-12-13 15:33:42 +01:00
src/main/java/nu/marginalia/tools (crawler) WIP integration of WARC files into the crawler and converter process. 2023-12-13 15:33:42 +01:00
build.gradle Initial Commit Anchor Tags 2023-11-04 14:24:17 +01:00
readme.md Add experiment runner tool and got rid of experiments module in processes. 2023-03-28 16:58:46 +02:00

Experiment Runner

This tool launches crawl data processing experiments, which offer a way of interacting with crawl data.

It's launched with run/experiment.sh. New experiments need to be added to ExperimentRunnerMain before the script can run them.
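
As a rough illustration of what "adding an experiment" can look like, here is a minimal sketch of a name-to-factory registry. It is not taken from the actual ExperimentRunnerMain source: the Experiment interface, the ExperimentRegistry class, and the register and create methods are all hypothetical names, and the real class may be organised quite differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical interface: stands in for whatever type the real
// experiments implement. Not part of the actual code base.
interface Experiment {
    String describe();
}

// Hypothetical registry: maps the name given on the command line
// to a factory that builds the matching experiment.
class ExperimentRegistry {
    private final Map<String, Supplier<Experiment>> experiments = new LinkedHashMap<>();

    // An experiment can only be launched by name once it is registered here.
    void register(String name, Supplier<Experiment> factory) {
        experiments.put(name, factory);
    }

    // Look up an experiment by name, failing loudly if it was never registered.
    Experiment create(String name) {
        Supplier<Experiment> factory = experiments.get(name);
        if (factory == null) {
            throw new IllegalArgumentException(
                    "Unknown experiment '" + name + "', expected one of " + experiments.keySet());
        }
        return factory.get();
    }
}
```

With a layout like this, an experiment that has not been registered fails with an error listing the known names, which is the kind of behaviour one would expect when the script is given an experiment that was never added.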