Data Lake
The pattern of storing raw, heterogeneous data in object storage for later processing. Data arrives in its original form and is transformed downstream.
Summary
Data lakes are the precursor to lakehouses. In the S3 world, a data lake is the simplest architecture: land everything in S3 and work out the schema later. Lakehouses add the structure that data lakes lack.
- "Schema-on-read" does not mean "no schema." Without any schema management, data lakes become data swamps — undiscoverable and untrusted.
- Data lakes and lakehouses are not mutually exclusive. Most lakehouses include raw data lake zones (e.g., the Bronze layer of a Medallion architecture).
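The schema-on-read idea above can be illustrated with a minimal sketch. It assumes a local temporary directory standing in for an object storage bucket, and a hypothetical EVENT_SCHEMA with made-up field names; nothing here is a specific lake engine's API. Records are landed exactly as received, and a schema is applied only when they are read.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical schema applied at read time: field name -> (cast, default).
EVENT_SCHEMA = {
    "user_id": (str, None),
    "amount": (float, 0.0),
    "currency": (str, "USD"),
}


def land_raw(lake_dir: Path, name: str, records: list[dict]) -> Path:
    """Write records exactly as received, one JSON object per line."""
    path = lake_dir / name
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def read_with_schema(path: Path, schema: dict) -> list[dict]:
    """Project each raw record onto the schema, coercing types on read."""
    rows = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            raw = json.loads(line)
            row = {}
            for field, (cast, default) in schema.items():
                value = raw.get(field, default)
                row[field] = cast(value) if value is not None else None
            rows.append(row)
    return rows


def demo() -> list[dict]:
    # A temporary directory stands in for an object storage bucket.
    with tempfile.TemporaryDirectory() as tmp:
        path = land_raw(Path(tmp), "events.jsonl", [
            {"user_id": 1, "amount": "9.50", "extra": "ignored"},
            {"user_id": "u2", "amount": 3, "currency": "EUR"},
        ])
        return read_with_schema(path, EVENT_SCHEMA)
```

The raw file keeps every field, including ones the schema ignores, so a different schema can be applied to the same data later. The schema still exists; it simply lives with the reader rather than the writer, which is why it has to be managed somewhere.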
- is_a Object Storage: a data lake is a use of object storage
- scoped_to S3: S3 is the dominant storage layer for data lakes
- Apache Spark scoped_to Data Lake: the primary compute engine for lake workloads
- Apache Flink scoped_to Data Lake: streaming ingestion into lakes
- Write-Audit-Publish scoped_to Data Lake: quality gating pattern for lake data
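As a rough illustration of the Write-Audit-Publish quality gate mentioned above, the sketch below uses local directories in place of lake zones. The directory names ("staging", "published"), the audit rule (a non-empty batch where every row has an "id"), and all function names are assumptions made for this example. On a real object store, the publish step is usually not a file rename, so that part is deliberately simplified.

```python
import json
import os
import tempfile
from pathlib import Path


def write_staged(staging: Path, name: str, rows: list[dict]) -> Path:
    """Write: land the batch in a location consumers do not read."""
    staging.mkdir(parents=True, exist_ok=True)
    path = staging / name
    path.write_text("\n".join(json.dumps(r) for r in rows), encoding="utf-8")
    return path


def audit(path: Path) -> bool:
    """Audit: hypothetical checks (non-empty batch, every row has an id)."""
    lines = path.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line]
    return bool(rows) and all(r.get("id") is not None for r in rows)


def publish(path: Path, published: Path) -> Path:
    """Publish: move the audited file into the consumer-facing location."""
    published.mkdir(parents=True, exist_ok=True)
    target = published / path.name
    os.replace(path, target)
    return target


def write_audit_publish(root: Path, name: str, rows: list[dict]) -> bool:
    staged = write_staged(root / "staging", name, rows)
    if not audit(staged):
        return False  # the failed batch stays in staging for inspection
    publish(staged, root / "published")
    return True


def demo() -> tuple[bool, bool, list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        good = write_audit_publish(root, "good.jsonl", [{"id": 1}, {"id": 2}])
        bad = write_audit_publish(root, "bad.jsonl", [{"id": 1}, {"name": "x"}])
        visible = sorted(p.name for p in (root / "published").iterdir())
        return good, bad, visible
```

Consumers only ever read from the published location, so a batch that fails its audit never becomes visible to them.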
Definition
The pattern of storing raw, heterogeneous data in object storage for later processing. Data arrives in its original form and is transformed downstream.
It emerged because organizations needed a central, low-cost repository for all data types (structured, semi-structured, and unstructured) that did not require schema decisions at write time.
Resources
AWS's official whitepaper on building data lakes defines architecture patterns, ingestion strategies, and governance frameworks for production data lakes on S3.
AWS's conceptual overview explains what a data lake is, how it differs from a data warehouse, and the key design principles.
Microsoft Azure's data lake overview provides an alternative cloud vendor's perspective, reinforcing the vendor-agnostic nature of the concept.