Feature/Embedding Store on Object Storage
Storing ML feature vectors and embedding tables on S3 in columnar formats (Parquet, Lance), enabling cost-effective persistence and sharing of features across ML models and teams.
Summary
Feature stores on S3 decouple feature engineering from model training. Teams write features to S3 once and read them in multiple training jobs and inference pipelines — avoiding redundant feature computation and ensuring consistency across models.
- S3-based feature stores have higher read latency than in-memory feature stores (Redis, DynamoDB). For online serving with sub-millisecond requirements, S3 is the offline/batch tier, not the serving tier.
- Columnar formats (Parquet) enable efficient feature subset selection (projection pruning), but random row access is slow. Design access patterns around batch reads.
scoped_to: Object Storage for AI Data Pipelines (feature and embedding persistence)
depends_on: Apache Parquet (columnar storage format for features)
LanceDB scoped_to: Feature/Embedding Store on Object Storage (vector-native feature storage)
Definition
Storing pre-computed ML features and vector embeddings on S3 in columnar formats (Parquet, Lance) for offline training, batch inference, and batch retrieval workloads.
Features and embeddings are expensive to compute but reusable across many training runs and inference pipelines. Storing them on S3 in columnar format provides cheap, durable storage with efficient analytical access patterns.
Typical use cases: offline feature store for ML training, batch embedding storage for RAG, feature versioning and lineage, and a shared feature repository across teams.
Recent developments
- Lance is an ML-native columnar format, reported by its maintainers as roughly 100× faster than Parquet for random access. It is positioned as an open lakehouse format for multimodal AI, combining a file format, a table format, and a catalog spec. Lance embeds ANN indexes (IVF-PQ, HNSW) directly inside the dataset, so tabular features and vector embeddings live together, indexed and queryable without an external vector DB. Per GitHub (lance-format/lance) and Medium (Beyond Parquet: Lance, the ML-Native Data Format).
- AWS published a 1B+ vector reference architecture on LanceDB + S3. The AWS Architecture Blog describes a scalable, elastic vector database and search solution for 1B+ vectors built on LanceDB on top of Amazon S3. It is the cloud-vendor-endorsed reference for the "feature store + vector store on object storage" pattern. Per AWS Architecture Blog (Scalable Elastic Database for 1B+ vectors on LanceDB + S3).
- Convert Parquet → Lance in 2 lines of code. The migration path is intentionally easy: open the existing Parquet data as a dataset and pass it to lance.write_dataset(parquet_dataset, uri). Existing Parquet pipelines then get the reported 100× faster random access plus native vector support without rewriting upstream code. Per GitHub (lance-format/lance).
- Lance + Iceberg are now complementary (May 2026 framing). Lance serves multimodal AI (vectors, features, and raw artifacts colocated); Iceberg serves the structured-data lakehouse. In the 2026 architecture, Iceberg is the structured-relational source of truth and Lance is the AI-native feature/embedding layer, both pointing at the same S3 storage underneath. Per DataLakehouseHub (Lance and Iceberg for Multimodal AI Data, May 2026).
- Compatible with Pandas, DuckDB, Polars, PyArrow, PyTorch. Lance ships first-class integrations across the Python data-science stack, with no proprietary access path. Pull a Lance dataset into PyTorch training, push vectors into LanceDB for retrieval, and query the same data from DuckDB for analytics. Per LanceDB (AI-Native Multimodal Lakehouse).
- S3 Express One Zone as the high-performance storage path for LanceDB. Soumil Shah's 2026 walkthrough builds an open lakehouse for multimodal AI by pairing LanceDB with S3 Express One Zone; the low-latency, NVMe-backed object tier serves the random-access patterns Lance is designed for. Per Medium (Building an Open Lakehouse for Multimodal AI with LanceDB on S3 Express One Zone).
Resources
Feast feature store documentation with S3 as an offline store backend for feature retrieval and training dataset generation.
Hopsworks feature store overview with S3-backed feature group storage and online/offline serving architecture.
LanceDB documentation for serverless embedding storage and vector search directly on S3.