CatgirlIntelligenceAgency/code/tools/term-frequency-extractor
Viktor Lofgren c41e68aaab (control) New export actions for RSS/Atom feeds and term frequency data
This commit also refactors the executor a bit, and introduces a new converter-feature called data-extractors for this class of jobs.
2024-01-15 14:54:26 +01:00

Term Frequency Extractor

Generates a term frequency dictionary file from a batch of crawl data.
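To illustrate what a term frequency dictionary can contain, here is a minimal sketch that counts, for each term, how many documents it appears in. It is not the tool's actual implementation: the class name `TermFrequencySketch`, the method `documentFrequencies`, the tokenization rule, and the choice to count documents rather than total occurrences are all assumptions made for illustration. The format of the real output file is not described here.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the term-frequency-extractor code.
public class TermFrequencySketch {

    /** Counts, for each term, the number of documents in which it occurs at least once. */
    public static Map<String, Integer> documentFrequencies(List<String> documents) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String doc : documents) {
            // Track terms already seen in this document so each is counted once per document
            Set<String> seen = new HashSet<>();
            // Assumed tokenization: lowercase, split on anything that is not a letter or digit
            for (String token : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty() && seen.add(token)) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("The cat sat", "the dog sat down", "Cats");
        // Prints each term with the number of documents that contain it
        documentFrequencies(docs).forEach((term, n) -> System.out.println(term + "\t" + n));
    }
}
```

Counts of this kind are typically used to weight rare terms above common ones when ranking search results.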

Usage:

PATH_TO_SAMPLES=run/samples/crawl-s
export JAVA_OPTS="-Dcrawl.rootDirRewrite=/crawl:${PATH_TO_SAMPLES}"

term-frequency-extractor "${PATH_TO_SAMPLES}/plan.yaml" out.dat

See Also