(process-models) Improve documentation
parent 300b1a1b84
commit 652d151373
16
code/process-models/crawl-spec/readme.md
Normal file
@@ -0,0 +1,16 @@
# Crawl Spec

A crawl spec is a list of domains to be crawled. It is a parquet file with the following columns:

- `domain`: The domain to be crawled
- `crawlDepth`: The depth to which the domain should be crawled
- `urls`: A list of known URLs to be crawled

Crawl specs are used to define the scope of a crawl in the absence of known domains.

The [CrawlSpecRecord](src/main/java/nu/marginalia/model/crawlspec/CrawlSpecRecord.java) class is
used to represent a record in the crawl spec.

The [CrawlSpecRecordParquetFileReader](src/main/java/nu/marginalia/io/crawlspec/CrawlSpecRecordParquetFileReader.java)
and [CrawlSpecRecordParquetFileWriter](src/main/java/nu/marginalia/io/crawlspec/CrawlSpecRecordParquetFileWriter.java)
classes are used to read and write the crawl spec parquet files.

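
To make the three columns concrete, here is a minimal sketch of one crawl spec row as a Java record. `CrawlSpecRow` and `CrawlSpecSketch` are hypothetical names used only for illustration. The Java types (`int` for `crawlDepth`, a list of strings for `urls`) and the validation rules are assumptions, and the actual `CrawlSpecRecord` class may be shaped differently.

```java
import java.util.List;

// Hypothetical mirror of one crawl spec row; not the real CrawlSpecRecord.
// Field names follow the documented columns, field types are assumed.
record CrawlSpecRow(String domain, int crawlDepth, List<String> urls) {
    CrawlSpecRow {
        if (domain == null || domain.isBlank())
            throw new IllegalArgumentException("domain is required");
        if (crawlDepth < 0)
            throw new IllegalArgumentException("crawlDepth must be non-negative");
        // Assumption: a missing URL list is treated as "no known URLs".
        urls = (urls == null) ? List.of() : List.copyOf(urls);
    }
}

class CrawlSpecSketch {
    // Builds a tiny in-memory spec with made-up example domains.
    static List<CrawlSpecRow> exampleSpec() {
        return List.of(
            new CrawlSpecRow("www.example.com", 100, List.of("https://www.example.com/")),
            new CrawlSpecRow("blog.example.org", 20, null));
    }

    public static void main(String[] args) {
        for (CrawlSpecRow row : exampleSpec()) {
            System.out.println(row.domain() + " depth=" + row.crawlDepth()
                    + " knownUrls=" + row.urls().size());
        }
    }
}
```

Reading and writing real crawl spec files should go through the reader and writer classes linked above rather than a hand-rolled type like this one.
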
@@ -1,13 +1,69 @@
# Crawling Models

Contains crawl data models shared by the [crawling-process](../../processes/crawling-process/) and
[converting-process](../../processes/converting-process/).

To ensure backward compatibility with older versions of the data, the serialization is
abstracted away from the model classes.

The preferred way of serializing the data is to use parquet files.

The older format is zstd-compressed JSON. It is still supported *for now*, but
parquet is preferred because it is more succinct, significantly faster to read,
and much more portable. The JSON support will be removed in the future.

## Central Classes

* [CrawledDocument](src/main/java/nu/marginalia/crawling/model/CrawledDocument.java)
* [CrawledDomain](src/main/java/nu/marginalia/crawling/model/CrawledDomain.java)

### Serialization

These serialization classes automatically negotiate the serialization format based on the
file extension.

Data is accessed through a [SerializableCrawlDataStream](src/main/java/nu/marginalia/crawling/io/SerializableCrawlDataStream.java),
which is a somewhat enhanced Iterator that can be used to read data.

* [CrawledDomainReader](src/main/java/nu/marginalia/crawling/io/CrawledDomainReader.java)
* [CrawledDomainWriter](src/main/java/nu/marginalia/crawling/io/CrawledDomainWriter.java)

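
Negotiating the format from the file extension could look roughly like the sketch below. It is not the logic in `CrawledDomainReader`. The class and enum names are hypothetical, and the `.parquet` and `.zstd` suffixes are assumptions, since the text does not say which extensions the real classes recognize.

```java
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical sketch of extension-based format selection.
class CrawlDataFormatSketch {
    enum Format { PARQUET, ZSTD_JSON }

    // Picks a format from the file name alone; the suffixes are assumed.
    static Format detect(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        if (name.endsWith(".parquet")) return Format.PARQUET;
        if (name.endsWith(".zstd")) return Format.ZSTD_JSON;
        throw new IllegalArgumentException("Unrecognized crawl data file: " + name);
    }

    public static void main(String[] args) {
        System.out.println(detect(Path.of("crawl-data", "example.parquet")));
        System.out.println(detect(Path.of("crawl-data", "example.zstd")));
    }
}
```

Keeping the decision in one place like this is what lets callers consume both formats through the same stream type without knowing which one is on disk.
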
### Parquet Serialization

The parquet serialization is done using the [CrawledDocumentParquetRecordFileReader](src/main/java/nu/marginalia/crawling/parquet/CrawledDocumentParquetRecordFileReader.java)
and [CrawledDocumentParquetRecordFileWriter](src/main/java/nu/marginalia/crawling/parquet/CrawledDocumentParquetRecordFileWriter.java) classes,
which read and write parquet files respectively.

The model classes are serialized to parquet using the [CrawledDocumentParquetRecord](src/main/java/nu/marginalia/crawling/parquet/CrawledDocumentParquetRecord.java) class.

The record has the following fields:

* `domain` - The domain of the document
* `url` - The URL of the document
* `ip` - The IP address of the document
* `cookies` - Whether the document has cookies
* `httpStatus` - The HTTP status code of the document
* `timestamp` - The timestamp of the document
* `contentType` - The content type of the document
* `body` - The body of the document
* `etagHeader` - The ETag header of the document
* `lastModifiedHeader` - The Last-Modified header of the document

The easiest way to interact with parquet files is to use [DuckDB](https://duckdb.org/),
which lets you run SQL queries on parquet files (and almost anything else).

For example:

```sql
select httpStatus, count(*) as cnt
from 'my-file.parquet'
group by httpStatus;
┌────────────┬───────┐
│ httpStatus │  cnt  │
│   int32    │ int64 │
├────────────┼───────┤
│        200 │    43 │
│        304 │     4 │
│        500 │     1 │
└────────────┴───────┘
```
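
For comparison, the same grouping can be expressed in Java once records are in memory. This is only a sketch: `DocRow` is a hypothetical stand-in that carries two of the fields listed above, the sample rows are made up, and no parquet file is read here.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical, trimmed-down row; the real record has more fields.
record DocRow(String url, int httpStatus) {}

class StatusCountSketch {
    // Equivalent of: select httpStatus, count(*) ... group by httpStatus
    // A TreeMap keeps the status codes in ascending order.
    static Map<Integer, Long> countByStatus(List<DocRow> rows) {
        return rows.stream().collect(
            Collectors.groupingBy(DocRow::httpStatus, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<DocRow> rows = List.of(
            new DocRow("https://www.example.com/", 200),
            new DocRow("https://www.example.com/about", 200),
            new DocRow("https://www.example.com/old", 304));
        countByStatus(rows).forEach(
            (status, cnt) -> System.out.println(status + " " + cnt));
    }
}
```

For ad-hoc inspection DuckDB remains the shorter route. A loop like this is mostly useful when the counts are needed inside a Java process that already holds the records.
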