diff --git a/doc/parquet-howto.md b/doc/parquet-howto.md
new file mode 100644
index 00000000..519739d8
--- /dev/null
+++ b/doc/parquet-howto.md
@@ -0,0 +1,29 @@
+Parquet is used as an intermediate storage format for a lot of processed data.
+
+See [third-party/parquet-floor](../third-party/parquet-floor).
+
+## How to query the data?
+
+[DuckDB](https://duckdb.org/) is probably the best tool for interacting with these files. You can
+query them with SQL, like
+
+```sql
+SELECT foo,bar FROM 'baz.parquet' ...
+```
+
+## How to inspect word metadata from `documentNNNN.parquet` ?
+
+The document keywords records contain repeated values. For debugging these
+repeated values, they can be unnested in e.g. DuckDB with a query like
+
+```sql
+SELECT word, hex(wordMeta) from
+  (
+    SELECT
+        UNNEST(word) AS word,
+        UNNEST(wordMeta) AS wordMeta
+    FROM 'document0000.parquet'
+    WHERE url='...'
+  )
+WHERE word IN ('foo', 'bar')
+```
\ No newline at end of file
diff --git a/doc/readme.md b/doc/readme.md
index a5da9973..bbc64105 100644
--- a/doc/readme.md
+++ b/doc/readme.md
@@ -6,7 +6,10 @@ Start in [📁 ../code/](../code/) and poke around.
 
 ## Operations
 
 * [System Properties](system-properties.md) - JVM property flags
+
+## How-To
 * [Sideloading How-To](sideloading-howto.md) - How to sideload various data sets
+* [Parquet How-To](parquet-howto.md) - Useful tips in working with Parquet files
 
 ## Set-up