Technology

DuckDB

Summary

What it is

An in-process analytical database engine (like SQLite for analytics) that reads Parquet, Iceberg, and other formats directly from S3 without requiring a server or cluster.

Where it fits

DuckDB fills the gap between "I need to explore this S3 data" and "I need to deploy a Spark cluster." It brings fast columnar analytics to a single machine, reading S3 data directly — ideal for development, ad-hoc analysis, and embedded analytics.

Misconceptions / Traps

  • DuckDB is single-node. It does not scale horizontally. For petabyte-scale queries, you still need Spark, Trino, or StarRocks.
  • DuckDB reads from S3 over HTTP. Performance is bottlenecked by network throughput and S3 request latency, especially with many small files.

Key Connections

  • depends_on Apache Parquet, Apache Arrow — reads Parquet directly; exchanges results with Arrow zero-copy
  • constrained_by Small Files Problem, Object Listing Performance — performance degrades with too many small S3 objects
  • Natural Language Querying augments DuckDB — LLMs can generate SQL for DuckDB
  • scoped_to S3, Lakehouse

Definition

What it is

An in-process analytical database engine (similar to SQLite for analytics) that can directly read Parquet, Iceberg, and other formats from S3 without requiring a server or cluster.

Why it exists

Not every analytical query requires a distributed cluster. DuckDB brings fast columnar analytics to a single machine, reading directly from S3 — eliminating the need to copy data to a local database or set up distributed infrastructure.

Primary use cases

Local S3 data exploration, ad-hoc analytics over Parquet files on S3, development and testing of queries before deploying to distributed engines, embedded analytics.

Relationships

Outbound Relationships

depends_on Apache Parquet, Apache Arrow
constrained_by Small Files Problem, Object Listing Performance
scoped_to S3, Lakehouse

Inbound Relationships

Natural Language Querying augments DuckDB

Resources