Technology

DuckDB

Summary

What it is

An in-process analytical database engine (like SQLite for analytics) that reads Parquet, Iceberg, and other formats directly from S3 without requiring a server or cluster.

Where it fits

DuckDB fills the gap between "I need to explore this S3 data" and "I need to deploy a Spark cluster." It brings fast columnar analytics to a single machine, reading S3 data directly — ideal for development, ad-hoc analysis, and embedded analytics.

Misconceptions / Traps

  • DuckDB is single-node. It does not scale horizontally. For petabyte-scale queries, you still need Spark, Trino, or StarRocks.
  • DuckDB reads from S3 over HTTP. Performance is bottlenecked by network throughput and S3 request latency, especially with many small files.

Key Connections

  • depends_on Apache Parquet, Apache Arrow — reads Parquet directly; exchanges results with Arrow zero-copy
  • constrained_by Small Files Problem, Object Listing Performance — performance degrades with too many small S3 objects
  • Natural Language Querying augments DuckDB — LLMs can generate SQL for DuckDB
  • scoped_to S3, Lakehouse

Definition

What it is

An in-process analytical database engine (similar to SQLite for analytics) that can directly read Parquet, Iceberg, and other formats from S3 without requiring a server or cluster.

Why it exists

Not every analytical query requires a distributed cluster. DuckDB brings fast columnar analytics to a single machine, reading directly from S3 — eliminating the need to copy data to a local database or set up distributed infrastructure.

Primary use cases

Local S3 data exploration, ad-hoc analytics over Parquet files on S3, development and testing of queries before deploying to distributed engines, embedded analytics.

Relationships

Outbound Relationships

depends_on Apache Parquet, Apache Arrow
constrained_by Small Files Problem, Object Listing Performance
scoped_to S3, Lakehouse

Inbound Relationships

Natural Language Querying augments DuckDB

Resources