Nimble
A columnar file format from Meta, purpose-built for ML feature engineering on wide tables (10K+ columns), using block encoding for bounded memory and Flatbuffers metadata for SIMD/GPU-efficient decode. Open-sourced as `facebookincubator/nimble`.
Summary
Nimble targets a workload Parquet was never designed for — ultra-wide ML feature stores with tens of thousands of columns, where Parquet's metadata overhead and stream encoding become prohibitive. Joins Vortex and Lance as the post-Parquet AI-format trio, each optimizing a different access pattern.
- Nimble is not a general-purpose Parquet replacement. It optimizes specifically for wide-table ML workloads; for narrow analytics tables Parquet remains efficient.
- Ecosystem support is narrower than Parquet — primarily Meta-internal tooling plus the public open-source build.
- Block encoding gives bounded memory but reads more data per block than streaming readers; it is only beneficial when the column count justifies the trade-off.
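The block-vs-stream trade-off above can be illustrated with a minimal sketch. This is not Nimble's actual API; it contrasts a stream-style decoder that materializes a whole column chunk at once with a block-style decoder that yields fixed-size blocks, so peak memory per column is bounded by the (hypothetical) block size rather than the column length.

```python
from itertools import islice

BLOCK_ROWS = 4  # hypothetical block size, in rows


def stream_decode(column_values):
    """Stream-style: materialize the whole column chunk.

    Peak memory scales with the column length.
    """
    return list(column_values)


def block_decode(column_values, block_rows=BLOCK_ROWS):
    """Block-style: yield fixed-size blocks one at a time.

    Peak memory scales with block_rows, regardless of column length.
    """
    it = iter(column_values)
    while block := list(islice(it, block_rows)):
        yield block


col = range(10)
blocks = list(block_decode(col))
assert all(len(b) <= BLOCK_ROWS for b in blocks)           # bounded memory
assert [v for b in blocks for v in b] == stream_decode(col)  # same data
```

The cost, as noted above, is that a block decoder may decode a full block even when only part of it is needed, which streaming readers can sometimes avoid.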
alternative_to: Apache Parquet (wide-table ML workloads) · scoped_to: Table Formats, S3
Definition
A columnar file format developed at **Meta**, purpose-built for **machine-learning feature engineering on wide tables** — datasets with tens of thousands of columns. Replaces Parquet's stream encoding with **block encoding** for predictable, bounded memory usage and substitutes Thrift/Protobuf with **Flatbuffers** for lightweight metadata that decodes efficiently on SIMD CPUs and GPUs. Open-sourced as `facebookincubator/nimble`.
Production ML feature stores at hyperscaler scale routinely exceed 10,000 columns per table — a regime Parquet was never designed for. Parquet's Thrift metadata, eager column-chunk loading, and per-column statistics overhead become prohibitive at that width. Nimble guarantees per-column memory bounds via block encoding and enables fast metadata traversal via Flatbuffers, making wide-table reads viable on accelerator hardware.
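A toy sketch of why metadata handling dominates at this width (hypothetical structures, not Parquet's Thrift footer or Nimble's Flatbuffers layout): an eager reader deserializes every column's metadata up front, so footer work scales with table width, while an offset-table reader in the spirit of Flatbuffers' zero-copy access touches only the columns actually requested.

```python
N_COLS = 10_000  # assumed table width for illustration


def eager_read(footer, wanted):
    """Eager-style: parse metadata for every column before projecting."""
    parsed = {name: meta for name, meta in footer}  # touches all N_COLS entries
    return {name: parsed[name] for name in wanted}, len(footer)


def indexed_read(footer_index, wanted):
    """Offset-table-style: look up only the requested columns."""
    return {name: footer_index[name] for name in wanted}, len(wanted)


footer = [(f"col{i}", {"offset": i * 100}) for i in range(N_COLS)]
footer_index = dict(footer)

_, eager_work = eager_read(footer, ["col42"])
_, indexed_work = indexed_read(footer_index, ["col42"])
assert eager_work == N_COLS  # work scales with table width
assert indexed_work == 1     # work scales with columns read
```

Under these assumptions, reading one column from a 10,000-column table costs 10,000 metadata decodes in the eager model but one lookup in the indexed model, which is the gap the Flatbuffers substitution targets.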
ML feature engineering on wide tables (10K+ columns), recommender system training datasets, large-scale embedding tables, GPU-accelerated columnar reads, Meta-scale feature stores.
Resources (3)
Meta's official Nimble repository with the wide-table block-encoding format spec, Flatbuffers metadata schemas, and SIMD/GPU decode paths.
Concise technical writeup explaining Nimble's block encoding rationale and Flatbuffers vs Thrift trade-offs for ML feature stores.
Comparative file-format analysis covering Lance, Nimble, and Bullion in AI/ML workload context.