CatgirlIntelligenceAgency/doc/parquet-howto.md
2023-09-24 19:40:45 +02:00

Parquet is used as an intermediate storage format for a lot of processed data.

See third-party/parquet-floor.

How to query the data?

DuckDB is probably the best tool for interacting with these files. It can query them directly with SQL, for example:

SELECT foo,bar FROM 'baz.parquet' ...

How to inspect word metadata from documentNNNN.parquet?

The document keywords records contain repeated values. To debug them, the repeated values can be unnested, e.g. in DuckDB, with a query like:

SELECT word, hex(wordMeta) FROM
    (
        SELECT
            UNNEST(word) AS word,
            UNNEST(wordMeta) AS wordMeta
        FROM 'document0000.parquet'
        WHERE url = '...'
    )
WHERE word IN ('foo', 'bar')
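
When the same lookup is repeated for many URLs or words, it may be convenient to generate the query from a script. The following is a minimal sketch, not part of the project: the helper names `sql_literal`, `word_meta_query` and `show_word_meta` are hypothetical, and it assumes the third-party `duckdb` Python package is installed and that the file has the `url`, `word` and `wordMeta` columns used in the query above.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def word_meta_query(parquet_path: str, url: str, words: list[str]) -> str:
    """Build the unnesting query shown above for one file, one URL and a set of words."""
    if not words:
        # An empty IN () list is not valid SQL, so reject it early.
        raise ValueError("at least one word is required")
    word_list = ", ".join(sql_literal(w) for w in words)
    return (
        "SELECT word, hex(wordMeta) FROM (\n"
        "    SELECT UNNEST(word) AS word, UNNEST(wordMeta) AS wordMeta\n"
        f"    FROM {sql_literal(parquet_path)}\n"
        f"    WHERE url = {sql_literal(url)}\n"
        ")\n"
        f"WHERE word IN ({word_list})"
    )


def show_word_meta(parquet_path: str, url: str, words: list[str]) -> None:
    """Run the query with DuckDB and print the result table."""
    import duckdb  # third-party package, imported lazily (assumed installed)

    duckdb.sql(word_meta_query(parquet_path, url, words)).show()
```

A call such as `show_word_meta("document0000.parquet", url, ["foo", "bar"])` would then print the matching words with their metadata in hexadecimal, where `url` is the document URL to inspect.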