The Local-First S3 Index for LLM Data Infrastructure
397 concepts · 1775 relationships · 48 guides
Each technology, standard, and architecture in the index belongs to one or more topics, the conceptual anchors that define the S3 / AI-memory-infrastructure ecosystem. The seven topics added in the May 16, 2026 wave are highlighted.
Amazon's Simple Storage Service and the broader ecosystem of S3-compatible object storage. The root concept of this e...
The storage paradigm of flat-namespace, HTTP-accessible binary objects with metadata. Data is addressed by bucket and...
The emerging tier of persistent, object-storage-backed memory architecture sitting between GPU HBM and cold S3 — the ...
The category of specifications (Iceberg, Delta, Hudi) that bring table semantics — schema, partitioning, ACID transac...
The convergence of data lake storage (raw files on object storage) with data warehouse capabilities — ACID transactio...
The intersection of large language models and S3-centric data infrastructure. Scoped strictly to cases where LLMs ope...
The layer of standardized orchestration fabrics, communication protocols, model gateways, and agent runtimes that sit...
The practice of building and querying vector indexes over embeddings derived from data stored in S3.
Using S3 as the central data layer for machine learning workflows: storing training data, model checkpoints, feature ...
The discipline of maintaining catalogs, schemas, statistics, and descriptive information about objects and datasets s...
The compliance, audit, lineage, and retention discipline applied to persistent AI memory — extending traditional data...
The practice of deploying S3-compatible object storage on infrastructure that is fully controlled by a specific organ...
The pattern of storing raw, heterogeneous data in object storage for later processing. Data arrives in its original f...
Deploying S3-compatible object storage at geographically distributed edge locations with synchronization to a central...
The architectural shift toward minimizing data movement between storage and inference compute — placing computation a...
The set of technologies eliminating CPU bounce-buffers between object storage and GPU memory — establishing direct me...
The discipline of building production retrieval systems that go beyond basic Retrieval-Augmented Generation (RAG) — o...
Techniques for tracking and managing changes to datasets stored in object storage over time, including snapshots, bra...
A purpose-built storage tier designed for single-digit millisecond latency, using a directory-based namespace within ...
Kubernetes-native provisioning and management of S3 buckets using operators, the Container Object Storage Interface (...
The orchestration of memory and shared state across multi-agent environments — the architectural pattern that enables...
A design philosophy that treats object metadata as a first-class, queryable resource rather than an afterthought. Ena...
The ability to query a dataset as it existed at a previous point in time by leveraging immutable snapshots and metada...
I run local AI. Why do I care about S3?
Guided path from local inference to the S3 storage ecosystem — storage, formats, retrieval, and the tradeoffs that matter.
Architectural shifts as they happen. Each post anchors on a pre-existing pain point and walks through what changed.
The Training I/O Tax: Storage Just Got Repriced by the GPU
Three June 2026 signals — Alibaba's 30% CPFS price hike, a MinIO-vs-Dell RDMA throughput benchmark, and LanceDB's first published Enterprise latency numbers — show the same force at work: AI training I/O is repricing the storage layer. Managed parallel file storage is becoming a premium good, while commodity RDMA object storage closes the throughput gap at ~1% host CPU. The bottleneck moved, and so did the bill.
The Whole Stack Went Open — Weights, Storage, and Sovereignty
In the same quarter that a 1.6-trillion-parameter open-weights model landed on top of last year's closed frontier — and Anthropic's Claude Fable 5 promptly sprinted ahead again — the storage layer underneath it went open too: DeepSeek open-sourced its file system, Europe stood up sovereign S3, and the post-MinIO self-hosted stack matured. The frontier and the floor are pulling apart, and the pattern underneath all of it is the oldest pain point this index maps: vendor lock-in.
The Frontier and the Floor: The AI Stack Just Split in Two
In a single quarter, Anthropic's Claude Fable 5 reset the closed frontier while DeepSeek V4 put frontier-of-last-year capability into open weights at roughly one-fiftieth the price. The two events look like a race. They're the opposite: the AI stack is bifurcating into a closed, expensive frontier for the hardest autonomous work and an open, cheap floor for the high-volume inference that actually runs your data infrastructure. The question stopped being 'which model' and became 'which tier.'
How S3 Shapes Lakehouse Design
Every lakehouse architecture sits on object storage — almost always S3 or an S3-compatible store. But S3 is not a database, and its constrai...
Choosing a Table Format: Iceberg vs. Delta vs. Hudi
The three major open table formats — Apache Iceberg, Delta Lake, and Apache Hudi — all solve the same fundamental problem: adding transactio...
Small Files Problem: Why It Exists and the Common Mitigations
A dataset with 10 million 10KB files performs worse on S3 than the same data in 100 files of 1GB each. The small files problem is the most c...
Where DuckDB Fits (and Where It Doesn't)
Engineers encounter S3-stored data constantly — Parquet files in data lakes, Iceberg tables in lakehouses, ad-hoc exports. Historically, exp...
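To make the small-files comparison in the guide list above concrete (10 million 10KB files versus 100 files of 1GB each), here is a rough Python sketch of the request arithmetic. The helper name `layout_summary` and the 20 ms fixed per-request overhead are illustrative assumptions, not figures from the index; real S3 latency depends on region, client, and concurrency.

```python
# Rough arithmetic for the small-files comparison: same bytes, very different request counts.
# ASSUMPTION: a fixed 20 ms overhead per GET request, chosen only for illustration.
ASSUMED_REQUEST_OVERHEAD_MS = 20

KB = 1000        # decimal units, so 10 million x 10KB equals 100 x 1GB exactly
GB = 1000 ** 3


def layout_summary(num_files, file_size_bytes, overhead_ms=ASSUMED_REQUEST_OVERHEAD_MS):
    """Summarize a dataset layout (hypothetical helper, one GET per file)."""
    return {
        "get_requests": num_files,
        "total_bytes": num_files * file_size_bytes,
        # Overhead if requests were issued one after another; parallelism reduces
        # wall-clock time but not the number of requests made.
        "serial_overhead_ms": num_files * overhead_ms,
    }


small = layout_summary(10_000_000, 10 * KB)   # 10 million 10KB files
large = layout_summary(100, 1 * GB)           # 100 files of 1GB each

if __name__ == "__main__":
    for name, layout in (("small files", small), ("large files", large)):
        print(name, layout)
```

Both layouts hold the same 100GB, but under these assumptions the small-file layout issues 100,000 times as many GET requests, which is where the fixed per-request cost accumulates.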