Engineer Guides

8 cross-cutting guides anchored to the S3 node graph

Guide 1

How S3 Shapes Lakehouse Design

Every lakehouse architecture sits on object storage — almost always S3 or an S3-compatible store. But S3 is not a database, and its constraints fundamentally shape how lakehouses are designed. Enginee...

5 related nodes, 7 sources

Guide 2

Small Files Problem — Why It Exists and the Common Mitigations

A dataset with 10 million 10 KB files performs worse on S3 than the same data in 100 files of 1 GB each. The small files problem is the most common performance issue in S3-based systems, and it is cause...

5 related nodes, 6 sources

Guide 3

Why Iceberg Exists (and What It Replaces)

Before Iceberg, querying data on S3 meant pointing a Hive Metastore at a directory of Parquet files and hoping for the best. There were no transactions, schema changes required rewriting data, partiti...

5 related nodes, 7 sources

Guide 4

Where DuckDB Fits (and Where It Doesn't)

Engineers encounter S3-stored data constantly — Parquet files in data lakes, Iceberg tables in lakehouses, ad-hoc exports. Historically, exploring this data required setting up Spark clusters or Trino...

4 related nodes, 4 sources

Guide 5

Vector Indexing on Object Storage — What's Real vs. Hype

Vector databases and semantic search are heavily marketed features in the AI ecosystem. For engineers building on S3, the question is practical: can you build production vector search over S3-stored d...

7 related nodes, 8 sources

Guide 6

LLMs over S3 Data — Embeddings, Metadata, and Local Inference Constraints

LLMs can extract value from S3-stored data — generating embeddings, extracting metadata, classifying documents, inferring schemas, and translating natural language to SQL. But every one of these opera...

6 related nodes, 10 sources

Guide 7

Choosing a Table Format — Iceberg vs. Delta vs. Hudi

The three major open table formats — Apache Iceberg, Delta Lake, and Apache Hudi — all solve the same fundamental problem: adding transactional table semantics to files on S3. But they solve it differ...

5 related nodes, 8 sources

Guide 8

Egress, Lock-In, and the Case for S3-Compatible Alternatives

AWS S3 egress pricing and proprietary feature creep create a gravitational well: data flows in cheaply but flows out expensively. For organizations with multi-cloud strategies, data sovereignty requir...

5 related nodes, 9 sources