Feature/Embedding Store on Object Storage
Storing ML feature vectors and embedding tables on S3 in columnar formats (Parquet, Lance), enabling cost-effective persistence and sharing of features across ML models and teams.
Summary
Feature stores on S3 decouple feature engineering from model training. Teams write features to S3 once and read them in multiple training jobs and inference pipelines — avoiding redundant feature computation and ensuring consistency across models.
- S3-based feature stores have higher read latency than low-latency online stores (in-memory Redis, or key-value stores like DynamoDB). For online serving with sub-millisecond requirements, S3 is the offline/batch tier, not the serving tier.
- Columnar formats (Parquet) enable efficient feature subset selection (projection pruning), but random row access is slow. Design access patterns around batch reads.
Scoped to: Object Storage for AI Data Pipelines — feature and embedding persistence
Depends on:
- Apache Parquet — columnar storage format for features
- LanceDB (scoped to Feature/Embedding Store on Object Storage) — vector-native feature storage
Definition
Storing pre-computed ML features and vector embeddings on S3 in columnar formats (Parquet, Lance) for offline training, batch inference, and batch retrieval workloads.
Features and embeddings are expensive to compute but reusable across many training runs and inference pipelines. Storing them on S3 in columnar format provides cheap, durable storage with efficient analytical access patterns.
Typical uses: offline feature store for ML training, batch embedding storage for RAG, feature versioning and lineage, and a shared feature repository across teams.
Resources
Feast feature store documentation with S3 as an offline store backend for feature retrieval and training dataset generation.
Hopsworks feature store overview with S3-backed feature group storage and online/offline serving architecture.
LanceDB documentation for serverless embedding storage and vector search directly on S3.