1b8b97b8ec
Tar files will reject entries with filenames over 100 bytes, so we need a limit there. Also added a maximum size limit to keep the file sizes reasonable.
- src/main/java/nu/marginalia/extractor
- build.gradle
- readme.md

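
The commit note above mentions two guards: a filename limit, because tar rejects names over 100 bytes, and a maximum size limit. A minimal sketch of what such a check might look like follows. The class name `TarEntryLimits`, the method `isAcceptable`, and the 1 MiB cap are all hypothetical and chosen for illustration; the actual limits and where they are enforced are not stated here.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not taken from the codebase. */
final class TarEntryLimits {
    // Per the commit note, names over 100 bytes are rejected by tar.
    static final int MAX_NAME_BYTES = 100;

    // Assumed cap for this sketch; the real size limit is not stated.
    static final long MAX_ENTRY_BYTES = 1L << 20; // 1 MiB

    private TarEntryLimits() {}

    /** Returns true if an entry with this name and size passes both guards. */
    static boolean isAcceptable(String entryName, long sizeBytes) {
        // The limit applies to bytes, so measure the encoded name,
        // not the number of characters.
        int nameBytes = entryName.getBytes(StandardCharsets.UTF_8).length;
        return nameBytes <= MAX_NAME_BYTES
                && sizeBytes >= 0
                && sizeBytes <= MAX_ENTRY_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("feeds/example.xml", 2048));
        System.out.println(isAcceptable("x".repeat(101), 2048));
    }
}
```

Measuring the UTF-8 byte length rather than `String.length()` matters for non-ASCII names, where one character can take several bytes.
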
Contains converter-like extraction jobs that operate on crawled data to produce export files.
Important classes
- AtagExporter - extracts anchor texts from the crawled data.
- FeedExporter - tries to find RSS/Atom feeds within the crawled data.
- TermFrequencyExporter - exports the 'TF' part of TF-IDF.
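
For readers unfamiliar with the 'TF' in TF-IDF: in the textbook sense, term frequency is a count of how often each term occurs. The sketch below shows only that textbook idea. The class `TermFrequencySketch` and the method `countTerms` are hypothetical, and how TermFrequencyExporter actually tokenizes, normalizes, or aggregates terms is not described here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of term counting; not the exporter's code. */
final class TermFrequencySketch {
    private TermFrequencySketch() {}

    /** Counts occurrences of each token, ignoring case. */
    static Map<String, Integer> countTerms(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            // Lowercasing is an assumed normalization step for this sketch.
            String term = token.toLowerCase(Locale.ROOT);
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countTerms(List.of("Feed", "feed", "rss")));
    }
}
```
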