Topic

Data Lake

The pattern of storing raw, heterogeneous data in object storage for later processing. Data arrives in its original form and is transformed downstream.

Summary

Where it fits

Data lakes are the precursor to lakehouses. In the S3 world, a data lake is the simplest form: write everything to S3 in its original format and decide on the schema at read time. Lakehouses add the structure that data lakes lack.
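The "store first, decide on the schema later" approach can be sketched as follows. This is a minimal illustration, not a reference layout: the in-memory `Bucket` dict stands in for an S3 bucket, and the `raw/source=.../dt=...` key scheme, `raw_key`, and `land_raw` are hypothetical names chosen for the example.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical stand-in for an S3 bucket: object key -> raw bytes.
Bucket = dict[str, bytes]


def raw_key(source: str, payload: bytes, ingested_at: datetime, ext: str) -> str:
    """Build a date-partitioned key; the content hash makes re-ingestion idempotent."""
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"raw/source={source}/dt={ingested_at:%Y-%m-%d}/{digest}.{ext}"


def land_raw(bucket: Bucket, source: str, payload: bytes,
             ingested_at: datetime, ext: str) -> str:
    """Store the payload byte-for-byte; no parsing or schema decision at write time."""
    key = raw_key(source, payload, ingested_at, ext)
    bucket[key] = payload
    return key


bucket: Bucket = {}
ts = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)

# Heterogeneous inputs land side by side in their original formats.
json_key = land_raw(bucket, "orders", b'{"id": 1, "total": "19.90"}', ts, "json")
csv_key = land_raw(bucket, "clicks", b"user,page\n7,/home\n", ts, "csv")
```

Nothing in the write path inspects the payload, which is what keeps ingestion cheap and also what makes downstream schema management necessary.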

Misconceptions / Traps

"Schema-on-read" does not mean "no schema." The schema is still defined and applied, only at read time rather than at write time. Without any schema management, data lakes become data swamps: undiscoverable and untrusted.
Data lakes and lakehouses are not mutually exclusive. Most lakehouses include raw data lake zones (e.g., Medallion Bronze layer).
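To make the first point concrete, here is a minimal sketch of schema-on-read with an explicit, declared schema. The field names, `ORDER_SCHEMA`, and `read_with_schema` are assumptions made for illustration; a real lake would more likely keep the schema in a catalog or a data contract than in code.

```python
import json
from decimal import Decimal, InvalidOperation

# Hypothetical read-time schema: field name -> callable that coerces the raw value.
ORDER_SCHEMA = {"id": int, "total": Decimal}


def read_with_schema(raw_lines, schema):
    """Parse JSON lines, coerce each field, and separate rows that do not conform."""
    valid, rejected = [], []
    for line in raw_lines:
        try:
            record = json.loads(line)
            row = {name: cast(record[name]) for name, cast in schema.items()}
        except (json.JSONDecodeError, KeyError, TypeError, ValueError, InvalidOperation):
            # Keep the raw line so rejected data can be inspected and replayed.
            rejected.append(line)
            continue
        valid.append(row)
    return valid, rejected


raw_lines = [
    '{"id": 1, "total": "19.90"}',
    '{"id": "2", "total": "5"}',
    '{"id": 3}',            # missing "total"
    'not json at all',      # unparseable
]
valid, rejected = read_with_schema(raw_lines, ORDER_SCHEMA)
```

The raw lines stay untouched in storage; the schema lives with the reader. Dropping `ORDER_SCHEMA` entirely is the "no schema" case that leads to a swamp.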

Key Connections

is_a Object Storage — a data lake is a use of object storage
scoped_to S3 — S3 is the dominant storage layer for data lakes
Apache Spark scoped_to Data Lake — the primary compute engine for lake workloads
Apache Flink scoped_to Data Lake — streaming ingestion into lakes
Write-Audit-Publish scoped_to Data Lake — quality gating pattern for lake data
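The Write-Audit-Publish connection listed above can be illustrated with a small sketch. Everything here is a simplifying assumption: the in-memory `Bucket` dict stands in for object storage, the `staging/` and `published/` prefixes are hypothetical, and the two checks in `audit` are placeholders for real quality rules. In a real object store the publish step would typically be a copy or a metadata pointer switch rather than a dictionary move.

```python
# Hypothetical stand-in for a bucket: object key -> list of records.
Bucket = dict[str, list[dict]]


def audit(records: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the batch passes."""
    problems = []
    if not records:
        problems.append("batch is empty")
    ids = [r.get("id") for r in records]
    if any(i is None for i in ids):
        problems.append("missing id")
    if len(set(ids)) != len(ids):
        problems.append("duplicate id")
    return problems


def write_audit_publish(bucket: Bucket, batch_id: str, records: list[dict]) -> list[str]:
    staging_key = f"staging/{batch_id}.json"
    published_key = f"published/{batch_id}.json"

    bucket[staging_key] = records            # 1. write where consumers do not read
    problems = audit(bucket[staging_key])    # 2. audit the staged copy
    if not problems:
        # 3. publish: only audited data becomes visible to consumers
        bucket[published_key] = bucket.pop(staging_key)
    return problems


bucket: Bucket = {}
ok = write_audit_publish(bucket, "batch-001", [{"id": 1}, {"id": 2}])
bad = write_audit_publish(bucket, "batch-002", [{"id": 1}, {"id": 1}])
```

A failed batch stays under the staging prefix for inspection, so consumers reading only the published prefix never see it.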

Definition

Why it exists

Organizations needed a central, low-cost repository for all data types (structured, semi-structured, unstructured) without requiring schema decisions at write time.

Connections 15

Outbound 2

is_a 1

Object Storage

scoped_to 1

S3

Inbound 13

scoped_to 13

Technology 5

Apache Spark, Apache Flink, Debezium, Airbyte, dlt

Standard 1

Data Contracts

Architecture 4

Medallion Architecture, Write-Audit-Publish, PII Tokenization, Event-Driven Ingestion

Pain Point 2

Schema Evolution, Legacy Ingestion Bottlenecks

Model Class 1

Data Quality Validation Models

Resources 3

Docs | High

docs.aws.amazon.com/whitepapers/latest/building-data-lakes/b...

AWS's official whitepaper on building data lakes defines architecture patterns, ingestion strategies, and governance frameworks for production data lakes on S3.

Docs | High

aws.amazon.com/what-is/data-lake/

AWS's conceptual overview explains what a data lake is, how it differs from a data warehouse, and the key design principles.

Docs | High

azure.microsoft.com/en-us/solutions/data-lake/

Microsoft Azure's data lake overview provides an alternative cloud vendor's perspective, reinforcing the vendor-agnostic nature of the concept.