Architecture

Lakehouse for AI Workflows

The architectural pattern of using governed, ACID-transactional lakehouse tables on S3 as the single data substrate for AI/ML pipelines — including training data management, feature engineering, embedding storage, and model artifact versioning.

Summary

Where it fits

Bridges the gap between analytics-oriented lakehouse infrastructure and AI/ML platforms. Instead of copying data from a lakehouse into a separate ML platform, this pattern keeps everything on S3 in open table formats, using the same catalog, access control, and lineage infrastructure. Training runs use time-travel for reproducibility; feature stores are Iceberg tables; embeddings are governed like any other dataset.
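The time-travel mechanic behind reproducible training runs can be sketched in a few lines. This is a minimal in-memory stand-in for an Iceberg table (the class and field names are illustrative, not a real API): every commit produces an immutable snapshot, and a training run records the snapshot ID it read so the exact inputs can be re-read later, even after the table has moved on.

```python
from dataclasses import dataclass, field

@dataclass
class SnapshotTable:
    """Toy stand-in for an Iceberg table: each commit creates an
    immutable snapshot that can be read back by ID (time travel)."""
    snapshots: dict = field(default_factory=dict)
    current_id: int = 0

    def commit(self, rows):
        self.current_id += 1
        self.snapshots[self.current_id] = list(rows)
        return self.current_id

    def read(self, snapshot_id=None):
        sid = snapshot_id if snapshot_id is not None else self.current_id
        return self.snapshots[sid]

features = SnapshotTable()

# A training run pins the snapshot ID it consumed...
run_snapshot = features.commit([{"user": 1, "clicks": 3}])

# ...so later commits don't affect reproducibility.
features.commit([{"user": 1, "clicks": 9}])

assert features.read(run_snapshot) == [{"user": 1, "clicks": 3}]
```

In a real pipeline the same idea is a `VERSION AS OF <snapshot_id>` query (Spark/Trino) or a `snapshot_id` argument to a table scan, with the ID stored alongside the model's training metadata.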

Misconceptions / Traps
  • A lakehouse-for-AI is not just "put Parquet files in S3." It requires table format governance (Iceberg/Delta), catalog-based access control, and lineage tracking.
  • Feature stores backed by lakehouse tables may have higher latency than purpose-built feature stores for online serving. The pattern works best for batch/offline ML.
  • Model artifacts (checkpoints, weights) are large binary blobs that don't benefit from table format features. Store them as plain S3 objects with versioning.
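One way to act on the last point is a plain object-key convention for artifacts, relying on S3 bucket versioning instead of a table format. The layout below is a hypothetical convention, not a standard; tying `run_id` to the training run also links the weights back to the data snapshot they were trained on.

```python
def artifact_key(model: str, run_id: str, filename: str) -> str:
    """Key layout for model artifacts stored as plain S3 objects
    (illustrative convention): group files by model and training run.
    Lineage comes from run_id, which can reference the Iceberg
    snapshot ID of the training data; durability and rollback come
    from S3 bucket versioning, not from a table format.
    """
    return f"models/{model}/runs/{run_id}/{filename}"

assert artifact_key("ranker", "run-042", "weights.safetensors") == \
    "models/ranker/runs/run-042/weights.safetensors"
```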
Key Connections
  • Extends Lakehouse Architecture into AI/ML territory.
  • Depends on Feature/Embedding Store on Object Storage and Training Data Streaming from Object Storage.
  • Complements RAG over Structured Data by providing governed source data.

Definition

What it is

An architecture that extends lakehouse infrastructure — governed, versioned, ACID-transactional tables on object storage — to serve as the data substrate for AI/ML training, fine-tuning, feature engineering, and inference pipelines.

Why it exists

AI workloads need governed, reproducible access to training data, feature tables, embedding stores, and model artifacts. Rather than duplicating data into specialized AI platforms, a lakehouse-for-AI approach keeps everything on S3 in open table formats, using the same catalog and access control infrastructure that serves analytics. This eliminates data copies, enforces lineage, and allows time-travel for reproducible training runs.

Primary use cases

  • ML training data management on S3
  • Feature store backed by Iceberg tables
  • Embedding pipelines reading from governed lakehouse tables
  • Model artifact versioning alongside training data
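The offline feature-store use case boils down to a point-in-time join: each label event gets the latest feature value known at or before the event time, which prevents leakage from future data. A minimal pure-Python sketch of that semantics (in practice this would be a Spark or Trino query over Iceberg tables, not hand-rolled code):

```python
from bisect import bisect_right

def point_in_time_join(labels, feature_rows):
    """Join each label event with the most recent feature value at or
    before the label's timestamp (offline feature-store semantics)."""
    # Index feature history per entity, sorted by timestamp.
    by_entity = {}
    for row in sorted(feature_rows, key=lambda r: r["ts"]):
        by_entity.setdefault(row["entity"], []).append(row)

    joined = []
    for lbl in labels:
        hist = by_entity.get(lbl["entity"], [])
        ts_list = [r["ts"] for r in hist]
        # Rightmost feature row with ts <= label ts, or None if none exists.
        i = bisect_right(ts_list, lbl["ts"]) - 1
        feature = hist[i]["value"] if i >= 0 else None
        joined.append({**lbl, "feature": feature})
    return joined

labels = [{"entity": "u1", "ts": 10}, {"entity": "u1", "ts": 5}]
feats = [{"entity": "u1", "ts": 4, "value": 0.2},
         {"entity": "u1", "ts": 8, "value": 0.7}]

assert [r["feature"] for r in point_in_time_join(labels, feats)] == [0.7, 0.2]
```

Because the feature table is a governed lakehouse table, the same join can be pinned to a snapshot for reproducibility, which is exactly where the batch/offline strength of this pattern comes from.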
