1328bc4938
This commit cleans up the WARC-to-Parquet conversion. Records with an HTTP status other than 200 are now included. It also fixes a bug where the robots.txt parser was fed the full HTTP response, rather than just the body, and choked on it. The DocumentBodyExtractor code has been cleaned up as well, and now offers a way to get the byte[] representation directly for later processing, since converting to and from strings is wasteful.
- src
- build.gradle
- readme.md
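The robots.txt fix described in the commit note comes down to separating the HTTP headers from the body before parsing, and keeping the body as raw bytes. The following is a minimal sketch of that idea, not the project's actual DocumentBodyExtractor: the class `HttpBodySplitter` and its methods are hypothetical names, and the sketch assumes an unencoded response (no chunked transfer encoding or compression) with the standard CRLF CRLF separator between headers and body.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the project's code. */
public class HttpBodySplitter {

    /** Returns the index of the first body byte, or -1 if no blank line ends the headers. */
    public static int bodyOffset(byte[] response) {
        for (int i = 0; i + 3 < response.length; i++) {
            if (response[i] == '\r' && response[i + 1] == '\n'
                    && response[i + 2] == '\r' && response[i + 3] == '\n') {
                return i + 4; // skip past the CRLF CRLF separator
            }
        }
        return -1;
    }

    /** Returns the body as bytes, without a round trip through String. */
    public static byte[] bodyBytes(byte[] response) {
        int offset = bodyOffset(response);
        if (offset < 0) {
            return new byte[0]; // assumption: treat a malformed response as having no body
        }
        return Arrays.copyOfRange(response, offset, response.length);
    }

    public static void main(String[] args) {
        byte[] raw = ("HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/plain\r\n"
                + "\r\n"
                + "User-agent: *\nDisallow: /private\n").getBytes(StandardCharsets.US_ASCII);

        // Only these bytes would be handed to a robots.txt parser.
        byte[] body = bodyBytes(raw);
        System.out.print(new String(body, StandardCharsets.US_ASCII));
    }
}
```

Returning `byte[]` lets the caller decide whether and when to decode, which is the saving the commit note points at.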
Crawling Models
Contains models shared by the crawling-process and converting-process.