Parquet is used as an intermediate storage format for a lot of processed data.
See [third-party/parquet-floor](../third-party/parquet-floor).
## How to query the data?
[DuckDB](https://duckdb.org/) is probably the best tool for interacting with these files. It can
query them directly with SQL, for example:
```sql
SELECT foo,bar FROM 'baz.parquet' ...
```
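Before writing a query it usually helps to know which columns a file contains. A minimal sketch, reusing the placeholder file name `baz.parquet` from above and assuming the query runs in a DuckDB shell started in the directory that holds the file:

```sql
-- List the column names and types DuckDB infers from the file
DESCRIBE SELECT * FROM 'baz.parquet';

-- Peek at a handful of rows without scanning the whole file
SELECT * FROM 'baz.parquet' LIMIT 5;
```

Both statements read the file in place, so nothing needs to be imported first.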
## How to inspect word metadata from `documentNNNN.parquet`?
The document keywords records contain repeated values. To debug these,
the repeated values can be unnested in e.g. DuckDB with a query like
```sql
SELECT word, hex(wordMeta)
FROM (
    SELECT
        UNNEST(word) AS word,
        UNNEST(wordMeta) AS wordMeta
    FROM 'document0000.parquet'
    WHERE url='...'
)
WHERE word IN ('foo', 'bar')
```
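The query above relies on DuckDB expanding several `UNNEST` calls in the same `SELECT` in lock-step, so each word stays paired with its own metadata value. This can be tried without any Parquet file. The sketch below uses made-up list literals in place of a real record; the values are illustrative assumptions, not data from an actual file:

```sql
-- Stand-in for one document record: two parallel lists of equal length.
-- The literals are hypothetical and only demonstrate the pairing.
SELECT word, hex(wordMeta)
FROM (
    SELECT
        UNNEST(['foo', 'bar', 'baz']) AS word,
        UNNEST([1, 2, 3]) AS wordMeta
)
WHERE word IN ('foo', 'bar');
```

The inner query yields one row per list position, and the outer filter then keeps only the rows for `foo` and `bar`.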