Architecture

Feature/Embedding Store on Object Storage

Storing ML feature vectors and embedding tables on S3 in columnar formats (Parquet, Lance), enabling cost-effective persistence and sharing of features across ML models and teams.

Summary

What it is

Storing ML feature vectors and embedding tables on S3 in columnar formats (Parquet, Lance), enabling cost-effective persistence and sharing of features across ML models and teams.

Where it fits

Feature stores on S3 decouple feature engineering from model training. Teams write features to S3 once and read them in multiple training jobs and inference pipelines — avoiding redundant feature computation and ensuring consistency across models.

Misconceptions / Traps

S3-based feature stores have higher read latency than the low-latency key-value stores typically used for online serving (Redis, DynamoDB). For online serving with millisecond-level latency requirements, treat S3 as the offline/batch tier, not the serving tier.
Columnar formats (Parquet) enable efficient feature subset selection (column pruning, also called projection pushdown), but random row access is slow because a single row is spread across every column chunk. Design access patterns around batch reads.
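
The trade-off between column selection and row lookup can be illustrated with a toy columnar layout. This is a minimal sketch using only the Python standard library, not the Parquet format itself: the layout (one contiguous chunk per column plus a JSON footer) and the names write_columnar, read_columns, and read_row are assumptions made for illustration. Each byte-range slice stands in for one ranged GET against object storage.

```python
import io
import json
import struct


def write_columnar(columns):
    """Toy columnar file: one chunk of float64 values per column, followed by
    a JSON footer mapping column name -> [offset, length], then the footer size."""
    buf = io.BytesIO()
    footer = {}
    for name, values in columns.items():
        chunk = struct.pack(f"<{len(values)}d", *values)
        footer[name] = [buf.tell(), len(chunk)]
        buf.write(chunk)
    footer_bytes = json.dumps(footer).encode("utf-8")
    buf.write(footer_bytes)
    buf.write(struct.pack("<I", len(footer_bytes)))
    return buf.getvalue()


def read_footer(blob):
    (size,) = struct.unpack("<I", blob[-4:])
    return json.loads(blob[-4 - size:-4])


def read_columns(blob, wanted):
    """Column projection: fetch only the byte ranges of the requested columns."""
    footer = read_footer(blob)
    out, bytes_read = {}, 0
    for name in wanted:
        offset, length = footer[name]
        chunk = blob[offset:offset + length]  # stands in for one ranged GET
        out[name] = list(struct.unpack(f"<{length // 8}d", chunk))
        bytes_read += length
    return out, bytes_read


def read_row(blob, row_index):
    """Row lookup: one small read per column, so the request count grows with
    table width. Real Parquet pages are also compressed and encoded, so a
    reader has to decode a whole page to extract a single value."""
    footer = read_footer(blob)
    row, ranges_touched = {}, 0
    for name, (offset, _length) in footer.items():
        start = offset + 8 * row_index
        (row[name],) = struct.unpack("<d", blob[start:start + 8])
        ranges_touched += 1
    return row, ranges_touched


table = {
    "user_age": [31.0, 45.0, 27.0, 52.0],
    "avg_basket": [12.5, 80.0, 33.3, 7.25],
    "days_since_login": [1.0, 14.0, 3.0, 60.0],
}
blob = write_columnar(table)
subset, bytes_read = read_columns(blob, ["avg_basket"])
row, ranges_touched = read_row(blob, 2)
```

Reading one of three columns touches a third of the data bytes, while fetching a single row fans out to one read per column. That fan-out is why batch reads fit this layout better than point lookups.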

Key Connections

scoped_to Object Storage for AI Data Pipelines — feature and embedding persistence
depends_on Apache Parquet — columnar storage format for features
LanceDB scoped_to Feature/Embedding Store on Object Storage — vector-native feature storage

Definition

What it is

Storing pre-computed ML features and vector embeddings on S3 in columnar formats (Parquet, Lance) for offline training, batch inference, and batch retrieval workloads.

Why it exists

Features and embeddings are expensive to compute but reusable across many training runs and inference pipelines. Storing them on S3 in columnar format provides cheap, durable storage with efficient analytical access patterns.

Primary use cases

Offline feature store for ML training, batch embedding storage for RAG, feature versioning and lineage, shared feature repository across teams.
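
One common way to express feature versioning on object storage is through the key layout itself. The sketch below assumes a hypothetical convention, features/<feature_set>/v=<version>/part-<n>.parquet; the prefix, the helper names feature_key and latest_version, and the zero-padding widths are illustrative choices, not part of any library or of the pattern described above.

```python
import re

# Hypothetical key convention: features/<feature_set>/v=<version>/part-<n>.parquet
_VERSION_RE = re.compile(r"^features/(?P<feature_set>[^/]+)/v=(?P<version>\d+)/")


def feature_key(feature_set, version, part):
    """Build the object key for one file of a versioned feature set."""
    return f"features/{feature_set}/v={version:04d}/part-{part:05d}.parquet"


def latest_version(keys, feature_set):
    """Return the highest version present for a feature set, or None."""
    versions = []
    for key in keys:
        match = _VERSION_RE.match(key)
        if match and match.group("feature_set") == feature_set:
            versions.append(int(match.group("version")))
    return max(versions) if versions else None


# Keys as they might come back from a bucket listing.
keys = [
    feature_key("user_profile", 1, 0),
    feature_key("user_profile", 2, 0),
    feature_key("user_profile", 2, 1),
    feature_key("item_stats", 7, 0),
]
newest = latest_version(keys, "user_profile")
example_key = feature_key("user_profile", 2, 1)
```

Because each version lives under its own prefix, a training job can pin an exact version for reproducibility while other consumers resolve the newest one from a listing.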

Recent developments

Latest signals

Lance is an ML-native columnar format; the project reports up to 100× faster random access than Parquet. It is positioned as an open lakehouse format for multimodal AI, combining a file format, a table format, and a catalog spec. Lance stores ANN indexes (IVF-PQ, HNSW) inside the dataset, so tabular features and vector embeddings live together and can be indexed and queried without an external vector DB. Per GitHub (lance-format/lance) and Medium (Beyond Parquet: Lance, the ML-Native Data Format).
AWS published a reference architecture for 1B+ vectors on LanceDB + S3. The AWS Architecture Blog describes a scalable, elastic vector database and search solution for more than 1B vectors, built on LanceDB on top of Amazon S3. It is the cloud-vendor-endorsed reference for the "feature store + vector store on object storage" pattern. Per AWS Architecture Blog (Scalable Elastic Database for 1B+ vectors on LanceDB + S3).
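Convert Parquet → Lance in 2 lines of code. The migration path is intentionally short: open the existing Parquet data as a dataset and pass it to lance.write_dataset(parquet_dataset, uri), where the second argument is the destination path or URI. Existing Parquet pipelines then gain the faster random access the project reports (up to 100×) and native vector support without rewriting upstream code. Per GitHub (lance-format/lance).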
Lance + Iceberg are now framed as complementary (May 2026 framing). Lance targets multimodal AI (vectors, features, and raw artifacts colocated); Iceberg targets the structured-data lakehouse. In the 2026 architecture, Iceberg holds the structured relational source of truth and Lance serves as the AI-native feature/embedding layer, with both stored on the same underlying S3 buckets. Per DataLakehouseHub (Lance and Iceberg for Multimodal AI Data, May 2026).
Compatible with Pandas, DuckDB, Polars, PyArrow, PyTorch. Lance ships integrations with these common Python data-science tools, so there is no proprietary access path. The same Lance dataset can feed PyTorch training, back LanceDB retrieval, and be queried from DuckDB for analytics. Per LanceDB (AI-Native Multimodal Lakehouse).
S3 Express One Zone as the high-performance storage path for LanceDB. Soumil Shah's 2026 walkthrough builds an open lakehouse for multimodal AI by pairing LanceDB with S3 Express One Zone, the low-latency, single-Availability-Zone S3 storage class, which suits the random-access patterns Lance is designed for. Per Medium (Building an Open Lakehouse for Multimodal AI with LanceDB on S3 Express One Zone).
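
The IVF part of the IVF-PQ index mentioned above can be sketched in a few lines. This is a toy illustration in pure Python, not Lance's implementation: it omits product quantization entirely, picks centroids by sampling data points instead of training k-means, and all function names are assumptions made for the example.

```python
import math
import random


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Inverted file: assign each vector id to the partition of its nearest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        partitions[nearest].append(vid)
    return partitions


def search_ivf(query, vectors, centroids, partitions, k, nprobe):
    """Scan only the nprobe partitions whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in partitions[c]]
    # Sort by (distance, id) so tie-breaking does not depend on candidate order.
    return sorted(candidates, key=lambda vid: (l2(query, vectors[vid]), vid))[:k]


def search_exact(query, vectors, k):
    """Brute-force baseline: scan every vector."""
    ids = range(len(vectors))
    return sorted(ids, key=lambda vid: (l2(query, vectors[vid]), vid))[:k]


rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
centroids = rng.sample(vectors, 8)  # assumption: sampled, not k-means trained
partitions = build_ivf(vectors, centroids)
query = [rng.gauss(0, 1) for _ in range(8)]

approx = search_ivf(query, vectors, centroids, partitions, k=5, nprobe=2)
full_probe = search_ivf(query, vectors, centroids, partitions, k=5, nprobe=len(centroids))
exact = search_exact(query, vectors, k=5)
```

Probing every partition reproduces the brute-force result, while a small nprobe scans only a fraction of the vectors and may miss some true neighbors. That recall-versus-work dial is the core idea behind IVF-style indexes.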