Apache Avro
A row-based data serialization format with rich schema definition and built-in schema evolution support. Schemas are stored with the data.
Summary
Avro is a common ingestion format in the S3 ecosystem. Data flowing from Kafka, operational databases, and streaming systems into S3 often arrives in Avro, because its schema-with-data approach handles the frequent schema changes typical of event streams.
- Avro is a row-oriented format. It is efficient for writing and ingestion but inefficient for analytical queries compared to Parquet, because a query that needs only a few columns must still read whole rows. Convert to Parquet after landing in S3.
- Avro's schema evolution rules (backward/forward compatibility) are powerful but strict. If compatibility modes are misconfigured, a breaking change can go through without warning, and readers then fail to decode the data or misread it.
used_by: Apache Spark (a supported input/output format)
solves: Schema Evolution (schema-with-data approach supports evolution)
scoped_to: S3, Table Formats
Definition
A row-based data serialization format specification with rich schema definition and built-in schema evolution support. Schemas are stored with the data, making files self-describing.
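To show what "self-describing" means in practice, the sketch below builds and then parses the header of an Avro object container file using only the Python standard library. It follows the layout given in the Avro specification (the magic bytes `Obj` plus `0x01`, a metadata map carrying `avro.schema`, then a 16-byte sync marker), but the helper names and the `Click` schema are hypothetical, and real workloads would normally use an Avro library rather than hand-rolled parsing.

```python
import io
import json
import os

MAGIC = b"Obj\x01"  # container file magic: 'O', 'b', 'j', 1


def write_long(out, n):
    """Write n with Avro's zig-zag, variable-length encoding."""
    n = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    while n & ~0x7F:
        out.write(bytes([(n & 0x7F) | 0x80]))  # 7 payload bits, continuation bit set
        n >>= 7
    out.write(bytes([n]))


def read_long(src):
    """Read one zig-zag, variable-length long."""
    shift = result = 0
    while True:
        byte = src.read(1)
        if not byte:
            raise EOFError("truncated varint")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo zig-zag


def write_bytes(out, data):
    write_long(out, len(data))  # bytes and strings are length-prefixed
    out.write(data)


def read_bytes(src):
    return src.read(read_long(src))


def build_header(schema, sync_marker):
    out = io.BytesIO()
    out.write(MAGIC)
    meta = {"avro.schema": json.dumps(schema).encode("utf-8"), "avro.codec": b"null"}
    write_long(out, len(meta))                 # one map block holding every entry
    for key, value in meta.items():
        write_bytes(out, key.encode("utf-8"))  # map keys are Avro strings
        write_bytes(out, value)                # metadata values are Avro bytes
    write_long(out, 0)                         # a zero count terminates the map
    out.write(sync_marker)                     # 16-byte marker between data blocks
    return out.getvalue()


def read_header(src):
    if src.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(src)
        if count == 0:
            break
        if count < 0:
            count = -count
            read_long(src)  # a negative count is followed by the block's byte size
        for _ in range(count):
            key = read_bytes(src).decode("utf-8")
            meta[key] = read_bytes(src)
    sync_marker = src.read(16)
    return json.loads(meta["avro.schema"]), meta.get("avro.codec", b"null"), sync_marker


# Hypothetical schema used only for this round trip.
schema = {"type": "record", "name": "Click",
          "fields": [{"name": "url", "type": "string"}]}
header = build_header(schema, os.urandom(16))
recovered_schema, codec, sync = read_header(io.BytesIO(header))
print(recovered_schema["name"], codec)
```

Because the writer's schema travels in the header, a reader needs nothing but the file itself to decode the rows that follow.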
Data arriving into S3 often comes from streaming systems (Kafka) and operational databases where the schema changes frequently. Avro's schema-with-data approach and backward/forward compatibility rules make it ideal for ingestion layers where schema stability cannot be guaranteed.
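A rough picture of how a reader copes with records written under an older schema is sketched below. This is a simplification under stated assumptions: real Avro resolution works on the binary data using both the writer's and the reader's schema, whereas this hypothetical `resolve_record` helper works on an already-decoded dict, and the `user_id`, `region`, and `legacy_flag` fields are invented for illustration.

```python
def resolve_record(record, reader_schema):
    """Project a decoded record onto the reader's schema (flat records only)."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]      # field known to both sides
        elif "default" in field:
            resolved[name] = field["default"]  # reader-only field: use its default
        else:
            raise ValueError(f"no value or default for field '{name}'")
    # Writer fields that the reader schema does not list are dropped.
    return resolved


# Hypothetical reader schema, newer than the data it is reading.
reader_schema = {"type": "record", "name": "UserEvent", "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "region", "type": "string", "default": "unknown"},
]}

# An event written before 'region' existed and while 'legacy_flag' still did.
old_event = {"user_id": "u-17", "legacy_flag": True}

print(resolve_record(old_event, reader_schema))
```

The reader sees a record in its own shape: the missing `region` is filled from the default and the retired `legacy_flag` is ignored, which is why ingestion pipelines can keep running while producers and consumers upgrade at different times.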
Typical uses: streaming data ingestion into S3 (Kafka → S3), schema-evolving event logs, and interchange between systems writing to object storage.
Recent developments
Source mix note: Avro's recent corpus is dominated by integration documentation rather than primary engineering posts on the format itself.
- Schema-registry ecosystem is stable across vendors. Per Aiven's Avro Java-class generation docs, the Confluent 8.0.0 dependency line remains the reference toolchain for Avro + Java workloads. Per Debezium's Avro serialization reference, Debezium supports both Apicurio and Confluent Schema Registry as backends. Per Evoila's schema-driven messaging guide, the main schema-registry options are now Confluent, Apicurio, AWS Glue, Azure, and Karapace, so vendor neutrality has become a meaningful procurement criterion in a space that was Confluent-only a few years ago.
Resources
The authoritative Apache Avro specification defining the schema language (JSON-based), binary encoding, container file format, schema resolution rules, and RPC protocol.
Canonical monorepo containing Avro implementations in Java, Python, C, C++, C#, Ruby, and more, along with the specification source.
Official Apache Avro project homepage and entry point to language-specific getting-started guides, API docs, and community resources.