Technology

Kafka Tiered Storage

Summary

What it is

An Apache Kafka feature (KIP-405) that offloads older log segments from broker-local disks to S3-compatible object storage, extending Kafka's retention capacity without scaling broker storage proportionally.

Where it fits

Kafka Tiered Storage bridges the gap between real-time event streaming and long-term S3 storage. By transparently moving cold log segments to S3, it allows Kafka to serve as both the streaming platform and a long-retention event archive, reducing the need for separate S3 sink connectors for archival.

Misconceptions / Traps
  • Tiered storage does not eliminate the need for local disk. The active segment and recent (hot) data stay on broker disks for low-latency consumption; only closed segments are eligible for offload.
  • Reading from the remote (S3) tier has higher latency than reading from local disk. Consumer applications that replay old data will incur S3 GET latency.
  • Not all Kafka distributions implement KIP-405 identically. Confluent's implementation differs from Apache Kafka's in configuration and maturity.
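
The split between local and remote data described in the first two points can be sketched as a small Python model. This is a simplification under stated assumptions: it looks only at time-based retention, it assumes every closed segment has already been copied to the remote tier, and the Segment class and placement function are hypothetical names for illustration, not part of any Kafka API.

```python
from dataclasses import dataclass

HOUR_MS = 3_600_000
DAY_MS = 24 * HOUR_MS


@dataclass
class Segment:
    base_offset: int
    last_timestamp_ms: int  # timestamp of the newest record in the segment
    active: bool = False    # True for the segment currently being appended to


def placement(seg: Segment, now_ms: int, local_retention_ms: int, retention_ms: int) -> str:
    """Classify where a segment lives under a simplified time-based model."""
    if seg.active:
        # Assumption: the active segment is never offloaded.
        return "local only (active)"
    age_ms = now_ms - seg.last_timestamp_ms
    if age_ms > retention_ms:
        return "deleted"
    if age_ms > local_retention_ms:
        # Reads of this segment go to object storage and pay remote latency.
        return "remote only"
    return "local + remote"


now = 100 * DAY_MS
segments = [
    Segment(0, now - 10 * DAY_MS),
    Segment(1_000, now - 2 * DAY_MS),
    Segment(2_000, now - 1 * HOUR_MS),
    Segment(3_000, now, active=True),
]
for s in segments:
    print(s.base_offset, placement(s, now, 6 * HOUR_MS, 7 * DAY_MS))
```

In this model a consumer replaying from offset 1,000 would read from the remote tier until it reaches the segments still held on broker disk.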

Key Connections
  • scoped_to S3, Object Storage — offloads Kafka log segments to S3
  • enables Event-Driven Ingestion — long-retention event streams without broker scaling
  • used_by Debezium — CDC events benefit from extended retention on S3
  • relates_to Tiered Storage — Kafka-specific instance of the tiered storage pattern

Definition

What it is

A Kafka feature (KIP-405) that offloads older log segments from local broker disks to S3-compatible object storage, so that retention is bounded by object-store capacity and cost rather than by broker disk size.

Why it exists

Kafka brokers traditionally store all retained data on local disk, forcing a tradeoff between retention period and disk cost. Tiered storage breaks this constraint by moving cold segments to S3, keeping only hot data on fast local storage.
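
A rough sizing calculation makes the tradeoff concrete. The sketch below assumes a steady ingest rate, time-based retention only, and a remote tier that holds a single logical copy of each segment (the object store provides its own durability). The workload numbers and the retained_gb helper are illustrative assumptions, not measurements.

```python
def retained_gb(ingest_mb_per_s: float, retention_hours: float, copies: int) -> float:
    """Approximate steady-state data volume in decimal GB."""
    return ingest_mb_per_s * retention_hours * 3600 * copies / 1000


INGEST_MB_PER_S = 50         # assumed workload
REPLICATION_FACTOR = 3       # assumed
TOTAL_RETENTION_H = 30 * 24  # 30 days of total retention
LOCAL_RETENTION_H = 12       # 12 hours kept on broker disk

# All retained data on broker disk, replicated.
without_tiering = retained_gb(INGEST_MB_PER_S, TOTAL_RETENTION_H, REPLICATION_FACTOR)
# Only the hot window on broker disk, replicated.
local_with_tiering = retained_gb(INGEST_MB_PER_S, LOCAL_RETENTION_H, REPLICATION_FACTOR)
# Full retention window in object storage, one logical copy.
remote_with_tiering = retained_gb(INGEST_MB_PER_S, TOTAL_RETENTION_H, 1)

print(f"broker disk without tiering: {without_tiering:,.0f} GB")
print(f"broker disk with tiering:    {local_with_tiering:,.0f} GB")
print(f"object storage with tiering: {remote_with_tiering:,.0f} GB")
```

Under these assumptions, cluster-wide broker disk drops from about 388,800 GB to about 6,480 GB, while about 129,600 GB sits in object storage.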

Primary use cases

Long-term Kafka log retention on S3, cost-effective event replay from object storage, decoupling Kafka retention from broker disk capacity.

Recent developments

Latest signals
  • Apache Kafka 4.0 tiered-storage operations doc: production-grade configuration guidance shipped. The Apache Kafka 4.0 Tiered Storage operations guide formalizes the operator-facing configuration surface for tiered storage. Setting remote.log.storage.system.enable=true on the broker turns on two-tier mode; tiering is then enabled per topic with remote.storage.enable=true. RemoteStorageManager is the pluggable interface for the remote backend lifecycle (S3, HDFS, custom). Kafka does not ship an out-of-the-box RemoteStorageManager, so operators must choose one. RemoteLogMetadataManager defaults to an implementation backed by an internal Kafka topic, with strongly consistent semantics. Practical implication: for 4.0, the operations doc is the canonical reference, and it closes gaps in older 3.x guidance (especially around metadata-listener configuration, partial-segment retention, and remote-log indexing).
  • KIP-405 reaches operational maturity in 2026. Per the Kafka Monthly Digest March 2026 (Red Hat Developer), tiered storage moved from "early-adopter" to "default consideration" for new long-retention deployments through 2025-2026 — driven by the cost calculus (S3 storage at ~10× lower $/GB than broker-attached SSD) and by the architectural maturity of pluggable RemoteStorageManager implementations from AWS, Confluent, and Aiven. The competitive frame: WarpStream and Redpanda's tiered storage went straight to S3-as-primary, while Apache Kafka's tiered storage is S3-as-cold-tier — both architectures coexist, and the choice depends on whether sub-millisecond tail-read latency on cold data matters.
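
The configuration surface named in the first signal can be sketched as follows. The property keys are the ones the Apache Kafka tiered storage documentation describes, but the RemoteStorageManager class name, the plugin path, the listener name, and the retention values are placeholders chosen for illustration. The render_properties and check_topic_retention helpers are hypothetical, and the retention check reflects an assumed rule that local retention must not exceed total retention.

```python
# Broker-level settings (server.properties). Values marked "placeholder" are assumptions.
broker_props = {
    "remote.log.storage.system.enable": "true",
    "remote.log.storage.manager.class.name": "com.example.S3RemoteStorageManager",  # placeholder
    "remote.log.storage.manager.class.path": "/opt/kafka/plugins/rsm/*",            # placeholder
    "remote.log.metadata.manager.listener.name": "PLAINTEXT",                       # placeholder
}

# Topic-level settings for one tiered topic.
topic_configs = {
    "remote.storage.enable": "true",
    "local.retention.ms": str(6 * 60 * 60 * 1000),   # keep 6 hours on broker disk
    "retention.ms": str(30 * 24 * 60 * 60 * 1000),   # keep 30 days in total
}


def render_properties(cfg: dict) -> str:
    """Render a dict as key=value lines, the format of a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in cfg.items())


def check_topic_retention(cfg: dict) -> bool:
    """Return True if local retention does not exceed total retention.

    -2 for local.retention.ms means "use retention.ms";
    -1 for retention.ms means unlimited retention.
    """
    local_ms = int(cfg.get("local.retention.ms", "-2"))
    total_ms = int(cfg.get("retention.ms", "604800000"))
    if local_ms == -2 or total_ms == -1:
        return True
    return 0 <= local_ms <= total_ms


print(render_properties(broker_props))
print(render_properties(topic_configs))
print("retention settings consistent:", check_topic_retention(topic_configs))
```

Keeping the broker and topic settings in separate maps mirrors how they are applied: the broker keys go into the broker configuration, while the topic keys are set when a topic is created or altered.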
