Guide 4

Where DuckDB Fits (and Where It Doesn't)

Problem Framing

Engineers encounter S3-stored data constantly: Parquet files in data lakes, Iceberg tables in lakehouses, ad-hoc exports. Historically, exploring this data meant standing up a Spark cluster or a Trino coordinator. DuckDB changes the equation by bringing fast columnar analytics to a single machine, reading directly from S3. But knowing when DuckDB is the right tool, and when it is not, prevents both over-engineering (a cluster for a 10 GB scan) and under-powering (a laptop for a multi-TB shuffle).

Relevant Nodes

  • Topics: S3, Lakehouse
  • Technologies: DuckDB, Trino, Apache Spark, ClickHouse, StarRocks
  • Standards: Apache Parquet, Apache Arrow
  • Pain Points: Small Files Problem, Object Listing Performance, Cold Scan Latency

Decision Path

  1. Use DuckDB when:

    • You need ad-hoc exploration of S3 data (quick SELECT against a few Parquet files)
    • You are developing and testing queries before deploying them to Spark or Trino
    • You need embedded analytics in an application (DuckDB runs in-process, no server needed)
    • Your data fits a single machine's processing capacity (result sets up to roughly 100 GB; streaming scans can cover far larger inputs)
    • You want to query Iceberg tables on S3 without deploying a cluster (see the Iceberg sketch after this list)
  2. Do not use DuckDB when:

    • Data volume requires distributed processing (petabyte-scale joins, multi-TB shuffles)
    • You need concurrent multi-user access (DuckDB runs in-process with a single writer; there is no server to coordinate many clients)
    • You need to write to table formats on S3 in production pipelines (use Spark/Flink)
    • You are querying millions of small files on S3 (DuckDB is constrained by S3 listing performance)
  3. DuckDB + S3 configuration (see the sketches after this list):

    • Use the httpfs extension for S3 access; configure credentials via DuckDB's secrets manager (CREATE SECRET) or the s3_* session settings
    • DuckDB supports reading Parquet, CSV, JSON, and Iceberg directly from S3 URIs
    • Arrow integration enables zero-copy data exchange with Python analytics libraries
    • Parallel S3 reads improve throughput for larger datasets
  4. DuckDB vs. alternatives (quick reference):

    • DuckDB vs. Spark: DuckDB for single-machine, interactive; Spark for distributed, production pipelines
    • DuckDB vs. Trino: DuckDB for local exploration; Trino for multi-user, multi-source, federated queries
    • DuckDB vs. ClickHouse: DuckDB for embedded/serverless; ClickHouse for persistent, low-latency dashboards
    • DuckDB vs. StarRocks: DuckDB for development; StarRocks for production analytics with caching
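
A minimal sketch of the configuration in step 3, using DuckDB's Python client. The bucket, key path, column names, and credential values are placeholders, and CREATE SECRET assumes DuckDB 0.10 or newer (older versions configure credentials through SET s3_access_key_id and related settings):

```python
# Minimal sketch: ad-hoc exploration of S3 Parquet with DuckDB + httpfs.
# Bucket, path, columns, and credential values below are placeholders.
import duckdb

con = duckdb.connect()  # in-process; no server to deploy

# httpfs provides the s3:// scheme. INSTALL downloads the extension once;
# LOAD activates it for this connection.
con.execute("INSTALL httpfs; LOAD httpfs;")

# Credentials via the secrets manager (DuckDB >= 0.10). The aws
# extension's CREDENTIAL_CHAIN provider can alternatively pick up
# standard AWS env vars / profiles instead of inline keys.
con.execute("""
    CREATE SECRET s3_creds (
        TYPE S3,
        KEY_ID 'AKIA...',
        SECRET '<secret-access-key>',
        REGION 'us-east-1'
    );
""")

# Quick SELECT against a few Parquet files. The glob fans out into
# parallel range reads, which is what keeps larger scans fast.
con.sql("""
    SELECT customer_id, sum(amount) AS total
    FROM read_parquet('s3://example-bucket/sales/year=2024/*.parquet')
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""").show()
```

Because DuckDB resolves the glob itself, the same query runs unchanged against a local directory of Parquet files, which is what makes the develop-locally-then-deploy workflow from step 1 practical.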
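The Iceberg case from step 1 follows the same pattern, assuming credentials are configured as above and that the placeholder path points at an existing table; depending on how the table was written, iceberg_scan may also need its allow_moved_paths option:

```python
# Sketch: querying an Iceberg table on S3 without a cluster.
# Assumes credentials are already configured (see the previous sketch)
# and that the placeholder path below is a real Iceberg table root.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("INSTALL iceberg; LOAD iceberg;")

# iceberg_scan resolves the table's metadata layer (snapshots, manifests)
# rather than globbing raw Parquet files, so it sees the table the same
# way Spark or Trino would.
con.sql("""
    SELECT count(*) AS row_count
    FROM iceberg_scan('s3://example-bucket/warehouse/db/events')
""").show()
```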
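And a small illustration of the Arrow bullet: query results come back as pyarrow tables without a copy, and DuckDB's replacement scans let SQL refer to an Arrow table held in a Python variable (the range() data here is a stand-in for a real S3 scan):

```python
# Sketch: zero-copy exchange between DuckDB and the Arrow ecosystem.
# Requires pyarrow to be installed alongside duckdb.
import duckdb

con = duckdb.connect()

# DuckDB -> Arrow: fetch a query result as a pyarrow.Table.
tbl = con.sql("SELECT range AS n, range * 2 AS doubled FROM range(5)").arrow()

# Arrow -> DuckDB: a replacement scan lets SQL reference the Python
# variable 'tbl' directly, without importing or copying the data.
total = con.execute("SELECT sum(doubled) FROM tbl").fetchone()[0]
print(total)  # 20
```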

What Changed Over Time

  • DuckDB started as an academic project (CWI Amsterdam) focused on in-process OLAP — the "SQLite for analytics."
  • S3 support came via the httpfs extension, making DuckDB immediately useful for data lake exploration.
  • Iceberg support expanded DuckDB from "Parquet file reader" to "lakehouse query tool" — querying table format metadata, not just raw files.
  • The "DuckDB for everything" trend has led to engineers using it beyond its design envelope. Single-machine performance is excellent but has a ceiling.
  • Integration with Python (pandas, Polars, Arrow) has made DuckDB the default local analytics tool for data engineers.

Sources