Guide 24

The Lance Format — ML-Native Storage Beyond Parquet

Problem Framing

Apache Parquet organizes data into monolithic row groups optimized for sequential columnar scans. This layout causes severe I/O bottlenecks for ML workloads that require random access to individual rows, store multimodal data (text, images, vectors, metadata in the same table), and need high-speed data loading with sub-second latency. The Lance format uses fragmented file layouts, adaptive encodings per column, and embedded vector indices (IVF-PQ, HNSW) to deliver database-like point-lookup performance directly from object storage, at a fraction of the cost of a dedicated database.

Relevant Nodes

  • Topics: S3, Object Storage for AI Data Pipelines
  • Technologies: LanceDB, Apache Iceberg
  • Standards: Lance Format, Apache Parquet
  • Architectures: Decoupled Vector Search
  • Pain Points: Cold Scan Latency

Decision Path

  1. Identify whether your workload is random-access or sequential-scan. Parquet excels at sequential columnar scans — reading all values of a column across millions of rows. Lance excels at random access — reading specific rows by ID or by vector similarity. If your primary access pattern is full-table analytics, Parquet remains optimal. If you need point lookups, nearest-neighbor search, or row-level iteration for ML training, Lance provides substantially lower latency.

  2. Evaluate Lance vs. Parquet for your access pattern. Lance's fragmented layout divides data into small, independently addressable fragments. Each fragment has its own index, enabling O(log n) lookup by row ID without scanning the entire file. Parquet requires scanning row group footers and potentially reading entire row groups to locate a single row.

    • Random access: benchmarks published by the Lance project report on the order of 100x faster single-row retrieval than Parquet on S3.
    • Sequential scan: Parquet and Lance are comparable, with Parquet having a slight edge due to mature columnar encoding optimizations.
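
  The random-access path above can be sketched with the `pylance` package (imported as `lance`), which is an assumption of this example; it writes a small local dataset and fetches arbitrary rows by position with `take`, with no full scan:

  ```python
  # Point lookups against a Lance dataset by row offset.
  import tempfile

  import lance
  import pyarrow as pa

  table = pa.table({
      "id": list(range(10_000)),
      "value": [i * 2 for i in range(10_000)],
  })

  path = tempfile.mkdtemp() + "/demo.lance"
  lance.write_dataset(table, path)

  ds = lance.dataset(path)
  rows = ds.take([7, 4242, 9999])      # fetch specific rows, no scan
  print(rows.to_pydict()["id"])        # -> [7, 4242, 9999]
  ```

  The same `take` call works against an S3 URI, which is where the latency gap versus Parquet row-group scans shows up.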
  3. Understand fragmented vs. monolithic layouts. Parquet files are self-contained: one file, one footer, row groups laid out sequentially. Lance fragments are small (default ~60K rows), each with its own metadata. New writes append new fragments without rewriting existing data (copy-on-write is optional). This enables fast append-heavy workloads common in ML feature stores and embedding pipelines.
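
  The append behavior can be seen directly, again assuming `pylance` is installed: each `mode="append"` write adds new fragments rather than rewriting existing files.

  ```python
  # Appends create new fragments; existing fragments are untouched.
  import tempfile

  import lance
  import pyarrow as pa

  path = tempfile.mkdtemp() + "/features.lance"
  lance.write_dataset(pa.table({"id": [1, 2, 3]}), path)
  lance.write_dataset(pa.table({"id": [4, 5]}), path, mode="append")

  ds = lance.dataset(path)
  print(ds.count_rows())            # -> 5
  print(len(ds.get_fragments()))    # -> 2 (one fragment per write here)
  ```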

  4. Understand adaptive encodings. Lance uses different encodings per column type: dictionary encoding for low-cardinality strings, fixed-width binary for vectors, run-length for sorted columns. Unlike Parquet, where encoding is set at write time per column chunk, Lance can adapt encoding at the fragment level based on data statistics.

  5. Configure LanceDB for S3-backed vector+tabular storage. LanceDB is the primary query engine for Lance format files on S3. It provides SQL and vector search in a single interface, supports zero-copy reads via Arrow, and handles index management (IVF-PQ, HNSW) transparently.

    • LanceDB runs embedded (in-process) or as a serverless cloud service — no separate database cluster required.
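
  A minimal embedded-mode sketch, assuming the `lancedb` package is installed. `lancedb.connect` also accepts an S3 URI such as `s3://my-bucket/lancedb` (the bucket name being a placeholder); a local path is used here so the example runs without credentials:

  ```python
  # Embedded LanceDB: create a table and run a brute-force kNN search.
  import tempfile

  import lancedb

  db = lancedb.connect(tempfile.mkdtemp())
  table = db.create_table(
      "embeddings",
      data=[
          {"id": 1, "vector": [1.0, 0.0], "caption": "cat"},
          {"id": 2, "vector": [0.0, 1.0], "caption": "dog"},
          {"id": 3, "vector": [0.9, 0.1], "caption": "kitten"},
      ],
  )
  # Small tables are searched exhaustively; for large tables, build an
  # approximate index with table.create_index() (IVF-PQ by default).
  hits = table.search([1.0, 0.0]).limit(2).to_list()
  print([h["caption"] for h in hits])   # -> ['cat', 'kitten']
  ```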
  6. Plan migration from Parquet for ML feature stores. For existing Parquet-based feature stores, migration to Lance involves rewriting data files. Evaluate whether the access pattern improvement justifies the migration cost. A common hybrid approach: keep analytical tables in Iceberg/Parquet, store ML feature tables and embedding stores in Lance/LanceDB.
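
  The rewrite itself is a straightforward read-and-write, sketched below with `pylance` and `pyarrow` (both assumed installed; paths are placeholders):

  ```python
  # Rewrite a Parquet feature table into Lance format.
  import tempfile

  import lance
  import pyarrow as pa
  import pyarrow.parquet as pq

  tmp = tempfile.mkdtemp()

  # Stand-in for an existing Parquet feature table.
  pq.write_table(pa.table({"user_id": [1, 2], "feat": [0.5, 0.7]}),
                 f"{tmp}/features.parquet")

  # One-shot rewrite; very large tables can instead be streamed by
  # passing an iterator of RecordBatches to write_dataset.
  table = pq.read_table(f"{tmp}/features.parquet")
  lance.write_dataset(table, f"{tmp}/features.lance")

  migrated = lance.dataset(f"{tmp}/features.lance")
  print(migrated.count_rows())   # -> 2
  ```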

What Changed Over Time

  • Parquet (2013) was designed for Hadoop-era batch analytics. Its monolithic row group layout assumed sequential scan access.
  • ORC provided an alternative with built-in indexes but remained scan-oriented.
  • Lance (2022) was designed from the ground up for ML workloads: random access, multimodal data, and vector search as first-class operations.
  • LanceDB adoption accelerated in 2024–2025 as embedding pipelines became standard in production AI systems, creating demand for storage formats that could handle vectors alongside structured data.
  • The format landscape is now bifurcated: Parquet for analytics, Lance for ML — with Iceberg providing the transaction layer on top of either.
