Guide 4

Where DuckDB Fits (and Where It Doesn't)

Problem Framing

Engineers encounter S3-stored data constantly: Parquet files in data lakes, Iceberg tables in lakehouses, ad-hoc exports. Historically, exploring this data meant standing up a Spark cluster or a Trino coordinator. DuckDB changes the equation by bringing fast columnar analytics to a single machine, reading directly from S3. But knowing when DuckDB is the right tool, and when it is not, prevents both over-engineering (a cluster for a 10 GB scan) and under-powering (a laptop for a multi-TB shuffle).

Relevant Nodes

  • Topics: S3, Lakehouse
  • Technologies: DuckDB, Trino, Apache Spark, ClickHouse, StarRocks
  • Standards: Apache Parquet, Apache Arrow
  • Pain Points: Small Files Problem, Object Listing Performance, Cold Scan Latency

Decision Path

  1. Use DuckDB when:

    • You need ad-hoc exploration of S3 data (quick SELECT against a few Parquet files)
    • You are developing and testing queries before deploying them to Spark or Trino
    • You need embedded analytics in an application (DuckDB runs in-process, no server needed)
    • Your data fits a single machine's processing capacity (result sets up to roughly 100 GB; streaming scans can cover far larger inputs)
    • You want to query Iceberg tables on S3 without deploying a cluster (see the Iceberg sketch after this list)
  2. Do not use DuckDB when:

    • Data volume requires distributed processing (petabyte-scale joins, multi-TB shuffles)
    • You need concurrent multi-user access (DuckDB runs in-process with a single writer; there is no server to coordinate many clients)
    • You need to write to table formats on S3 in production pipelines (use Spark/Flink)
    • You are querying millions of small files on S3 (DuckDB is constrained by S3 listing performance)
  3. DuckDB + S3 configuration (see the sketches after this list):

    • Use the httpfs extension for S3 access; configure credentials via DuckDB's secrets manager (CREATE SECRET) or the s3_* session settings
    • DuckDB supports reading Parquet, CSV, JSON, and Iceberg directly from S3 URIs
    • Arrow integration enables zero-copy data exchange with Python analytics libraries
    • Parallel S3 reads improve throughput for larger datasets
  4. DuckDB vs. alternatives (quick reference):

    • DuckDB vs. Spark: DuckDB for single-machine, interactive; Spark for distributed, production pipelines
    • DuckDB vs. Trino: DuckDB for local exploration; Trino for multi-user, multi-source, federated queries
    • DuckDB vs. ClickHouse: DuckDB for embedded/serverless; ClickHouse for persistent, low-latency dashboards
    • DuckDB vs. StarRocks: DuckDB for development; StarRocks for production analytics with caching
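
A minimal sketch of the configuration in step 3, using DuckDB's Python client. The bucket, key path, column names, and credential values are placeholders, and CREATE SECRET assumes DuckDB 0.10 or newer (older versions configure credentials through SET s3_access_key_id and related settings):

```python
# Minimal sketch: ad-hoc exploration of S3 Parquet with DuckDB + httpfs.
# Bucket, path, columns, and credential values below are placeholders.
import duckdb

con = duckdb.connect()  # in-process; no server to deploy

# httpfs provides the s3:// scheme. INSTALL downloads the extension once;
# LOAD activates it for this connection.
con.execute("INSTALL httpfs; LOAD httpfs;")

# Credentials via the secrets manager (DuckDB >= 0.10). The aws
# extension's CREDENTIAL_CHAIN provider can alternatively pick up
# standard AWS env vars / profiles instead of inline keys.
con.execute("""
    CREATE SECRET s3_creds (
        TYPE S3,
        KEY_ID 'AKIA...',
        SECRET '<secret-access-key>',
        REGION 'us-east-1'
    );
""")

# Quick SELECT against a few Parquet files. The glob fans out into
# parallel range reads, which is what keeps larger scans fast.
con.sql("""
    SELECT customer_id, sum(amount) AS total
    FROM read_parquet('s3://example-bucket/sales/year=2024/*.parquet')
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""").show()
```

Because DuckDB resolves the glob itself, the same query runs unchanged against a local directory of Parquet files, which is what makes the develop-locally-then-deploy workflow from step 1 practical.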
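The Iceberg case from step 1 follows the same pattern, assuming credentials are configured as above and that the placeholder path points at an existing table; depending on how the table was written, iceberg_scan may also need its allow_moved_paths option:

```python
# Sketch: querying an Iceberg table on S3 without a cluster.
# Assumes credentials are already configured (see the previous sketch)
# and that the placeholder path below is a real Iceberg table root.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("INSTALL iceberg; LOAD iceberg;")

# iceberg_scan resolves the table's metadata layer (snapshots, manifests)
# rather than globbing raw Parquet files, so it sees the table the same
# way Spark or Trino would.
con.sql("""
    SELECT count(*) AS row_count
    FROM iceberg_scan('s3://example-bucket/warehouse/db/events')
""").show()
```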
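And a small illustration of the Arrow bullet: query results come back as pyarrow tables without a copy, and DuckDB's replacement scans let SQL refer to an Arrow table held in a Python variable (the range() data here is a stand-in for a real S3 scan):

```python
# Sketch: zero-copy exchange between DuckDB and the Arrow ecosystem.
# Requires pyarrow to be installed alongside duckdb.
import duckdb

con = duckdb.connect()

# DuckDB -> Arrow: fetch a query result as a pyarrow.Table.
tbl = con.sql("SELECT range AS n, range * 2 AS doubled FROM range(5)").arrow()

# Arrow -> DuckDB: a replacement scan lets SQL reference the Python
# variable 'tbl' directly, without importing or copying the data.
total = con.execute("SELECT sum(doubled) FROM tbl").fetchone()[0]
print(total)  # 20
```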

What Changed Over Time

  • DuckDB started as an academic project (CWI Amsterdam) focused on in-process OLAP — the "SQLite for analytics."
  • S3 support came via the httpfs extension, making DuckDB immediately useful for data lake exploration.
  • Iceberg support expanded DuckDB from "Parquet file reader" to "lakehouse query tool" — querying table format metadata, not just raw files.
  • The "DuckDB for everything" trend has led to engineers using it beyond its design envelope. Single-machine performance is excellent but has a ceiling.
  • Integration with Python (pandas, Polars, Arrow) has made DuckDB the default local analytics tool for data engineers.

Sources