(process-models) Improve documentation

Viktor Lofgren 2024-02-15 12:21:12 +01:00
parent 300b1a1b84
commit 652d151373
2 changed files with 74 additions and 2 deletions


@@ -0,0 +1,16 @@
# Crawl Spec
A crawl spec is a list of domains to be crawled. It is a parquet file with the following columns:
- `domain`: The domain to be crawled
- `crawlDepth`: The depth to which the domain should be crawled
- `urls`: A list of known URLs to be crawled
Crawl specs are used to define the scope of a crawl when the set of domains to be crawled isn't already known, e.g. from an earlier crawl.
The [CrawlSpecRecord](src/main/java/nu/marginalia/model/crawlspec/CrawlSpecRecord.java) class is
used to represent a record in the crawl spec.
The [CrawlSpecRecordParquetFileReader](src/main/java/nu/marginalia/io/crawlspec/CrawlSpecRecordParquetFileReader.java)
and [CrawlSpecRecordParquetFileWriter](src/main/java/nu/marginalia/io/crawlspec/CrawlSpecRecordParquetFileWriter.java)
classes are used to read and write the crawl spec parquet files.
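As a rough sketch of how these classes fit together (the builder-style construction of `CrawlSpecRecord` and the exact reader/writer method names are assumptions, not confirmed API), writing and reading a spec might look like this:
```java
import java.nio.file.Path;
import java.util.List;

import nu.marginalia.model.crawlspec.CrawlSpecRecord;
import nu.marginalia.io.crawlspec.CrawlSpecRecordParquetFileReader;
import nu.marginalia.io.crawlspec.CrawlSpecRecordParquetFileWriter;

class CrawlSpecExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical output path for the example
        Path specFile = Path.of("crawl-spec.parquet");

        // Write a one-domain spec; builder() and write() are assumed here,
        // check the linked classes for the actual signatures.
        try (var writer = new CrawlSpecRecordParquetFileWriter(specFile)) {
            writer.write(CrawlSpecRecord.builder()
                    .domain("www.marginalia.nu")
                    .crawlDepth(100)
                    .urls(List.of("https://www.marginalia.nu/"))
                    .build());
        }

        // Read the records back as a stream; stream() is likewise assumed.
        try (var records = CrawlSpecRecordParquetFileReader.stream(specFile)) {
            records.forEach(rec -> System.out.println(rec.domain));
        }
    }
}
```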


@@ -1,13 +1,69 @@
# Crawling Models
Contains crawl data models shared by the [crawling-process](../../processes/crawling-process/) and
[converting-process](../../processes/converting-process/).
To ensure backward compatibility with older versions of the data, the serialization is
abstracted away from the model classes.
The new way of serializing the data is to use parquet files.
The old way was to use zstd-compressed JSON. This is still supported
*for now*, but the new way is preferred as it's not only more succinct, but also
significantly faster to read and much more portable. JSON support will be
removed in the future.
## Central Classes
* [CrawledDocument](src/main/java/nu/marginalia/crawling/model/CrawledDocument.java)
* [CrawledDomain](src/main/java/nu/marginalia/crawling/model/CrawledDomain.java)
### Serialization
The serialization classes listed below automatically negotiate the serialization format based on the
file extension.
Data is accessed through a [SerializableCrawlDataStream](src/main/java/nu/marginalia/crawling/io/SerializableCrawlDataStream.java),
which is a somewhat enhanced Iterator that can be used to read data.
* [CrawledDomainReader](src/main/java/nu/marginalia/crawling/io/CrawledDomainReader.java)
* [CrawledDomainWriter](src/main/java/nu/marginalia/crawling/io/CrawledDomainWriter.java)
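A minimal sketch of reading crawl data through this API; the factory method name on `CrawledDomainReader` and the public field access are assumptions based on the class names above, not confirmed signatures:
```java
import java.nio.file.Path;

import nu.marginalia.crawling.io.CrawledDomainReader;
import nu.marginalia.crawling.io.SerializableCrawlDataStream;
import nu.marginalia.crawling.model.CrawledDocument;
import nu.marginalia.crawling.model.CrawledDomain;

class ReadCrawlDataExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical path; the reader picks parquet or zstd-JSON
        // based on the file extension.
        Path dataFile = Path.of("crawl-data.parquet");

        // createDataStream() is an assumed factory method name.
        try (SerializableCrawlDataStream stream = CrawledDomainReader.createDataStream(dataFile)) {
            while (stream.hasNext()) {
                var data = stream.next();

                // The stream interleaves domain-level and document-level records.
                if (data instanceof CrawledDomain domain) {
                    System.out.println("Domain: " + domain.domain);
                }
                else if (data instanceof CrawledDocument doc) {
                    System.out.println("  " + doc.url + " -> " + doc.httpStatus);
                }
            }
        }
    }
}
```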
### Parquet Serialization
The parquet serialization is done using the [CrawledDocumentParquetRecordFileReader](src/main/java/nu/marginalia/crawling/parquet/CrawledDocumentParquetRecordFileReader.java)
and [CrawledDocumentParquetRecordFileWriter](src/main/java/nu/marginalia/crawling/parquet/CrawledDocumentParquetRecordFileWriter.java) classes,
which read and write parquet files respectively.
The model classes are serialized to parquet using the [CrawledDocumentParquetRecord](src/main/java/nu/marginalia/crawling/parquet/CrawledDocumentParquetRecord.java) class.
The record has the following fields:
* `domain` - The domain of the document
* `url` - The URL of the document
* `ip` - The IP address of the server the document was fetched from
* `cookies` - Whether the document has cookies
* `httpStatus` - The HTTP status code of the document
* `timestamp` - The timestamp of the document
* `contentType` - The content type of the document
* `body` - The body of the document
* `etagHeader` - The ETag header of the document
* `lastModifiedHeader` - The Last-Modified header of the document
The easiest way to interact with parquet files is to use [DuckDB](https://duckdb.org/),
which lets you run SQL queries on parquet files (and almost anything else).
e.g.
```sql
select httpStatus, count(*) as cnt
from 'my-file.parquet'
group by httpStatus;
┌────────────┬───────┐
│ httpStatus │  cnt  │
│   int32    │ int64 │
├────────────┼───────┤
│        200 │    43 │
│        304 │     4 │
│        500 │     1 │
└────────────┴───────┘
```