Architecture

Feature/Embedding Store on Object Storage

Storing ML feature vectors and embedding tables on S3 in columnar formats (Parquet, Lance), enabling cost-effective persistence and sharing of features across ML models and teams.


Summary

Where it fits

Feature stores on S3 decouple feature engineering from model training. Teams write features to S3 once and read them in multiple training jobs and inference pipelines — avoiding redundant feature computation and ensuring consistency across models.

Misconceptions / Traps
  • S3-based feature stores have higher read latency than low-latency online stores (Redis, DynamoDB). For online serving with sub-millisecond requirements, S3 is the offline/batch tier, not the serving tier.
  • Columnar formats (Parquet) enable efficient feature subset selection (projection pruning), but random row access is slow. Design access patterns around batch reads.
Key Connections
  • scoped_to Object Storage for AI Data Pipelines — feature and embedding persistence
  • depends_on Apache Parquet — columnar storage format for features
  • LanceDB scoped_to Feature/Embedding Store on Object Storage — vector-native feature storage

Definition

What it is

Storing pre-computed ML features and vector embeddings on S3 in columnar formats (Parquet, Lance) for offline training, batch inference, and batch retrieval workloads.

Why it exists

Features and embeddings are expensive to compute but reusable across many training runs and inference pipelines. Storing them on S3 in columnar format provides cheap, durable storage with efficient analytical access patterns.

Primary use cases

Offline feature store for ML training, batch embedding storage for RAG, feature versioning and lineage, shared feature repository across teams.
