Parquet Format Guide
Columnar storage format for large-scale training datasets
Apache Parquet is a columnar storage file format designed for efficient data storage and retrieval in big data processing frameworks. Originally developed jointly at Twitter and Cloudera and released as open source in 2013, it later became a top-level Apache Software Foundation project. Parquet stores data by column rather than by row, enabling dramatically better compression ratios and faster analytical queries than row-oriented formats like CSV or JSONL. It has become the standard format for large-scale dataset storage on the Hugging Face Hub and in enterprise data lakes.
Parquet files are organized into row groups, each containing one column chunk per column. Each column chunk is further divided into pages, the smallest units of storage that can be individually compressed and encoded. This hierarchical structure enables highly efficient partial reads: when a query needs only a few columns from a dataset with hundreds of columns, Parquet reads only the relevant column chunks, skipping all irrelevant data. Column-level statistics (min, max, null count) stored in the metadata enable predicate pushdown, where entire row groups can be skipped if their statistics indicate they cannot contain matching rows.
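This footer metadata is easy to inspect with pyarrow. The sketch below is illustrative (the file name data.parquet is a placeholder): it prints the per-row-group, per-column statistics that query engines consult for predicate pushdown.

import pyarrow.parquet as pq

# Opening the file reads only the footer; no row data is loaded yet
pf = pq.ParquetFile("data.parquet")  # placeholder path
meta = pf.metadata
print(meta.num_rows, "rows in", meta.num_row_groups, "row group(s),", meta.num_columns, "columns")

# Walk the footer: per row group, per column chunk, print the stored stats
for rg_idx in range(meta.num_row_groups):
    rg = meta.row_group(rg_idx)
    for col_idx in range(rg.num_columns):
        col = rg.column(col_idx)
        stats = col.statistics  # may be None if the writer omitted statistics
        if stats is not None:
            print(col.path_in_schema, stats.min, stats.max, stats.null_count)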
Parquet supports multiple compression codecs, including Snappy (fast, moderate compression), Zstandard (excellent compression ratio at good speed), Gzip (wide compatibility), LZ4 (very low latency), and Brotli (very high compression). It supports complex nested data types, including structs, lists, and maps, using the Dremel record-shredding encoding scheme, which efficiently handles repeated and optional fields. The format's self-describing schema, stored in the file footer, means readers can interpret the data without external schema definitions.
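As a brief sketch of the nested-type and codec support (file names and values here are illustrative), the following defines a schema with a struct column and a list column, then writes the same table with two codecs to compare on-disk sizes:

import os
import pyarrow as pa
import pyarrow.parquet as pq

# A schema with nested types: a struct column and a list-of-strings column
schema = pa.schema([
    ("id", pa.int64()),
    ("meta", pa.struct([("source", pa.string()), ("score", pa.float32())])),
    ("tags", pa.list_(pa.string())),
])
table = pa.table({
    "id": [1, 2],
    "meta": [{"source": "web", "score": 0.9}, {"source": "api", "score": 0.4}],
    "tags": [["news", "en"], ["blog"]],
}, schema=schema)

# Write the same data with two codecs and compare file sizes on disk
for codec in ("snappy", "zstd"):
    path = f"nested_{codec}.parquet"  # illustrative file names
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")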
When to Use Parquet
Parquet is the optimal format for storing large training datasets (hundreds of megabytes to terabytes) that benefit from compression and efficient columnar access. On Hugging Face Hub, Parquet is the default format for the datasets library, enabling features like dataset streaming (loading data on-the-fly without downloading entire files) and server-side filtering. If you are publishing datasets for community use or working with datasets that exceed available RAM, Parquet should be your storage format.
Choose Parquet over JSONL when your dataset is large and you need efficient storage, fast column-level access, or integration with big data tools like Apache Spark, DuckDB, Polars, or BigQuery. A JSONL dataset that occupies 10 GB on disk might compress to 2-3 GB in Parquet while also providing faster read access for common operations. Choose Parquet over CSV for any dataset with complex types, Unicode text, or more than a few hundred megabytes of data.
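As one illustration of that tool integration, DuckDB can query a Parquet file in place, reading only the referenced column chunks and using footer statistics to skip row groups. A minimal sketch, assuming the training_data.parquet file produced in the Example Data section below (the 0.8 threshold is illustrative):

import duckdb

# DuckDB scans the Parquet file directly; only the "text" and "label"
# column chunks are read, plus "confidence" for the predicate
result = duckdb.sql("""
    SELECT text, label
    FROM 'training_data.parquet'
    WHERE confidence > 0.8
""").fetchall()
print(result)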
Parquet is less suitable when you need line-by-line streaming into training frameworks that expect JSONL input (convert Parquet to JSONL for the final training step), when you need to manually inspect or edit individual records (Parquet is binary and not human-readable), or when your dataset is very small and the overhead of the Parquet format exceeds the benefits of compression.
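When that Parquet-to-JSONL conversion is needed, it is straightforward with the datasets library. A minimal sketch, with illustrative file names:

from datasets import load_dataset

# Read the Parquet file and re-serialize it as JSONL for training
# frameworks that expect line-delimited records (paths are illustrative)
ds = load_dataset("parquet", data_files="training_data.parquet", split="train")
ds.to_json("training_data.jsonl", lines=True)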
Schema / Structure
Parquet File Structure:
┌──────────────────────────────────────┐
│ Row Group 0 │
│ ├── Column Chunk: "text" │
│ │ ├── Page 0 (data + encoding) │
│ │ └── Page 1 (data + encoding) │
│ ├── Column Chunk: "label" │
│ │ └── Page 0 (data + encoding) │
│ └── Column Chunk: "metadata" │
│ └── Page 0 (data + encoding) │
├──────────────────────────────────────┤
│ Row Group 1 │
│ ├── Column Chunk: "text" │
│ ├── Column Chunk: "label" │
│ └── Column Chunk: "metadata" │
├──────────────────────────────────────┤
│ Footer │
│ ├── File Metadata │
│ │ ├── Schema (column names/types)│
│ │ ├── Row group metadata │
│ │ └── Column statistics │
│ └── Footer length (4 bytes) │
│ Magic number: "PAR1" (4 bytes) │
└──────────────────────────────────────┘

Example Data
import pyarrow as pa
import pyarrow.parquet as pq

# Create a training dataset and save as Parquet
data = {
    "text": [
        "The product quality exceeded my expectations",
        "Terrible customer service, waited 3 hours",
        "Average experience, nothing special",
    ],
    "label": ["positive", "negative", "neutral"],
    "confidence": [0.95, 0.88, 0.72],
}
table = pa.Table.from_pydict(data)
pq.write_table(table, "training_data.parquet", compression="zstd")
# Read specific columns (efficient columnar access)
df = pq.read_table("training_data.parquet", columns=["text", "label"]).to_pandas()
print(df.head())
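# Predicate pushdown sketch: the filters= argument lets pyarrow skip row
# groups whose footer statistics cannot match (the 0.8 threshold is illustrative)
high_conf = pq.read_table(
    "training_data.parquet",
    filters=[("confidence", ">", 0.8)],
).to_pandas()
print(high_conf)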
# Stream large Parquet files with Hugging Face datasets
from datasets import load_dataset
ds = load_dataset("parquet", data_files="training_data.parquet", streaming=True)
for example in ds["train"]:
    print(example["text"], example["label"])

Ertas Support
Ertas Data Suite supports Parquet import and export for efficient handling of large-scale training datasets. You can import Parquet files from data lakes and warehouses, apply PII redaction and data transformations, and export processed datasets in either Parquet (for storage efficiency) or JSONL (for training framework compatibility). The data lineage system tracks transformations at the row group level, maintaining provenance even across format conversions.