Local AI on S3
You're running local inference. Your data lives on disk, in folders, maybe in a database. At some point, you need durable, searchable, shared storage. This page maps the path from local files to an S3-based data layer — the technologies, formats, and tradeoffs that matter.
Your models generate artifacts that outgrow local disk
Your retrieval pipeline needs persistent, searchable storage
Your scattered files, embeddings, and metadata turn into pipeline chaos
S3-compatible storage is the common protocol — self-hosted or cloud
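Because the S3 API is a shared protocol, the same client code can point at a self-hosted store or a cloud provider just by swapping the endpoint. One practical wrinkle: self-hosted stores such as MinIO typically use path-style addressing, while AWS defaults to virtual-hosted style. A minimal sketch of the difference (the endpoints, bucket, and key below are illustrative assumptions, not from this page):

```python
def object_url(endpoint: str, bucket: str, key: str, path_style: bool = True) -> str:
    """Build an object URL for an S3-compatible store.

    path_style=True  -> http(s)://host/bucket/key   (common for self-hosted, e.g. MinIO)
    path_style=False -> http(s)://bucket.host/key   (virtual-hosted style, AWS default)
    """
    scheme, host = endpoint.split("://", 1)
    if path_style:
        return f"{scheme}://{host}/{bucket}/{key}"
    return f"{scheme}://{bucket}.{host}/{key}"

# A self-hosted endpoint (hypothetical local MinIO on port 9000):
print(object_url("http://localhost:9000", "models", "llama.gguf"))
# -> http://localhost:9000/models/llama.gguf

# The same bucket/key against AWS, virtual-hosted style:
print(object_url("https://s3.amazonaws.com", "models", "llama.gguf", path_style=False))
# -> https://models.s3.amazonaws.com/llama.gguf
```

With an SDK such as boto3, the equivalent switch is passing `endpoint_url` when creating the client; everything downstream (uploads, listings, presigned URLs) stays the same.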
I need durable storage for local AI
Self-hosted and cloud S3-compatible object stores — the foundation layer everything else builds on.
I need retrieval over many files
Vector databases, hybrid indexes, and ML-native formats for search and RAG on S3-stored data.
I need structured analytics over S3
Embedded and distributed query engines, table formats, and lakehouse patterns for analytical workloads.
I need metadata and indexing
Catalogs, governance, and metadata services that control what lives in object storage.
I need to compare tools and formats
Side-by-side evaluations of table formats, vector databases, and streaming engines.
I need to avoid vendor lock-in
Zero-egress providers, open formats, and strategies to keep your data portable.