Vortex
A next-generation open-source columnar file format incubating at the Linux Foundation AI & Data Foundation, designed to supersede Apache Parquet for AI and analytics workloads via zero-copy Arrow integration and compute-on-encoded-data kernels (ALP for floats, FSST for strings).
Summary
Vortex occupies the slot Parquet has historically held, as the file format underneath Iceberg/Delta tables and as DuckDB's input layer, but optimizes for AI access patterns Parquet was never designed for. The project's move from its original developer SpiralDB to the Linux Foundation signals a vendor-neutral path, and first-class DuckDB integration shipped in January 2026.
- Vortex is not a database. It is a file format and encoding layer, equivalent in scope to Parquet.
- The "100× faster" headline applies to random access — sequential scans are also 10–20× faster, but the random-access gap is the differentiator.
- Compute-on-encoded-data requires the engine to understand the encoding tree. DuckDB does (via the official extension); arbitrary Parquet readers do not.
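To make the compute-on-encoded-data point concrete, here is a minimal pure-Python sketch of the idea using dictionary encoding (an illustration of the general technique, not Vortex's actual kernels or API): a predicate over an encoded string column is evaluated once per distinct value, then applied to the integer codes, so the column itself is never decoded.

```python
# Illustrative sketch (not Vortex's implementation): evaluating a filter
# directly on dictionary-encoded data instead of decompressing first.

def dict_encode(values):
    """Encode a column as (dictionary, integer codes)."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]

def filter_encoded(dictionary, codes, predicate):
    """Run `predicate` over the small dictionary once, then scan the codes."""
    matching = {i for i, v in enumerate(dictionary) if predicate(v)}
    return [row for row, code in enumerate(codes) if code in matching]

column = ["parquet", "vortex", "vortex", "arrow", "vortex"]
dictionary, codes = dict_encode(column)
rows = filter_encoded(dictionary, codes, lambda s: s == "vortex")
print(rows)  # -> [1, 2, 4]
```

An engine that understands the encoding (as DuckDB does via the extension) gets this shortcut; a reader that only speaks the decoded representation must materialize every string first.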
- alternative_to: Apache Parquet — successor format for AI workloads
- used_by: DuckDB — official extension since January 2026
- scoped_to: Table Formats, S3
Definition
An open-source columnar file format incubating at the **Linux Foundation AI & Data Foundation**, designed as a next-generation successor to Apache Parquet for AI and analytics workloads. Operates with **zero-copy Apache Arrow integration** so the on-disk and in-memory representations match exactly, eliminating the deserialization tax. Compute kernels execute **directly on encoded data** via specialized encodings (**ALP** for floating-point tensors, **FSST** for variable-length strings) rather than decompressing first. Originally developed at SpiralDB; gained an official **DuckDB extension** in January 2026.
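The intuition behind ALP-style float encoding can be sketched in a few lines (a simplified illustration of the underlying idea, not the real ALP algorithm or Vortex's implementation): many real-world doubles are short decimals, so they round-trip losslessly as scaled integers, and aggregates like SUM can run on the integers with a single rescale at the end.

```python
# Hedged sketch of an ALP-like scheme: store doubles as integers scaled by
# a decimal exponent, chosen so the round-trip is exact. Aggregation then
# happens on integers, i.e. on encoded data.

def alp_like_encode(values, max_exp=10):
    """Find one decimal exponent that losslessly scales every value to an int."""
    for exp in range(max_exp + 1):
        scale = 10 ** exp
        ints = [round(v * scale) for v in values]
        if all(i / scale == v for i, v in zip(ints, values)):
            return exp, ints
    raise ValueError("no lossless decimal exponent found")

def sum_encoded(exp, ints):
    """Aggregate directly on the encoded integers; rescale once at the end."""
    return sum(ints) / 10 ** exp

prices = [1.25, 3.5, 0.75, 2.0]
exp, ints = alp_like_encode(prices)
print(exp, ints)               # -> 2 [125, 350, 75, 200]
print(sum_encoded(exp, ints))  # -> 7.5
```

The real ALP additionally adapts the exponent per block and handles exceptions that do not round-trip; this sketch only shows why integer-domain compute on encoded floats is possible at all.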
Parquet was architected over a decade ago for batch CPU analytics. Modern AI workloads — wide tables, sparse arrays, high-dimensional vectors, random-access RAG retrieval — exposed its structural limits. Vortex replaces Parquet's Thrift-based metadata, eager decompression, and rigid row-group layout with a **pluggable encoding tree**, lazy evaluation, and Arrow-native memory. Published benchmarks show **100× faster random access** and **10–20× faster sequential scans** vs Parquet, with substantially lower CPU and host memory footprint.
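The string side of the encoding tree can be illustrated with a toy FSST-style symbol table (greatly simplified from the published FSST design, and not Vortex's actual code; the symbol table below is an assumed "pre-trained" example): frequent substrings map to one-byte codes, encoding is greedy longest-match, and decoding is a pure table lookup.

```python
# Toy sketch of FSST-style compression: a static table of frequent
# substrings, each replaced by a single code byte; other bytes are escaped.

SYMBOLS = {"http://": 0, "www.": 1, ".com": 2}  # substring -> one-byte code (assumed table)
ESCAPE = 255                                    # escape byte precedes a literal character
DECODE = {code: sym for sym, code in SYMBOLS.items()}

def fsst_like_encode(text):
    out, i = [], 0
    while i < len(text):
        for sym, code in sorted(SYMBOLS.items(), key=lambda kv: -len(kv[0])):
            if text.startswith(sym, i):          # greedy longest match
                out.append(code)
                i += len(sym)
                break
        else:
            out.extend([ESCAPE, ord(text[i])])   # no symbol matched: escaped literal
            i += 1
    return bytes(out)

def fsst_like_decode(data):
    out, i = [], 0
    while i < len(data):
        if data[i] == ESCAPE:
            out.append(chr(data[i + 1]))
            i += 2
        else:
            out.append(DECODE[data[i]])
            i += 1
    return "".join(out)

url = "http://www.example.com"
packed = fsst_like_encode(url)
assert fsst_like_decode(packed) == url
print(len(url), len(packed))  # -> 22 17
```

Because each code is a fixed single byte, an engine can do things like prefix checks or equality against an encoded constant without decompressing, which is the property the compute-on-encoded-data kernels exploit.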
Use cases
Successor format for AI training and inference data; RAG retrieval requiring fast random reads; lakehouse table formats migrating off Parquet for ML feature stores; DuckDB analytical queries with embedded compute kernels.
Resources
- Source repository for the Vortex columnar format with encoding tree spec, ALP/FSST kernel implementations, and Arrow-native memory layout.
- DuckDB's official announcement of the Vortex extension shipped January 2026 — concrete integration path for query engines.
- LF AI & Data Foundation press release on Vortex's transition from SpiralDB — vendor-neutral governance signal.
- Dremio's comparative analysis of Vortex vs Parquet/Lance/Nimble in the AI workload regime, with benchmarks.