From Flink to Bytewax: The Python-Native Shift in S3 Data Ingestion

The Mismatch

The dominant language for building AI applications in 2026 is Python. The dominant framework for streaming data into S3 lakehouses is Apache Flink — a JVM-based distributed system requiring Java or Scala expertise, cluster management, and memory footprints measured in tens of gigabytes.

This creates a practical mismatch. A team of three ML engineers building a RAG pipeline — ingesting documents from S3, generating embeddings with HuggingFace, writing vectors to LanceDB — should not need to hire a JVM specialist to operate Flink. But until recently, there was no production-viable alternative for streaming data into Apache Iceberg tables on S3.

Bytewax is the clearest challenge to that status quo. Built on a Rust-based Timely Dataflow engine with a pure Python API, it provides streaming semantics — windowing, sessions, stateful transformations — without JVM infrastructure. Benchmarks show 25x less memory consumption than Flink for comparable workloads.1 The question is what you give up.

What Bytewax Actually Is

Bytewax is not a port of Flink to Python. It is a different architecture.

Flink is a distributed stream processor. It runs on a cluster of JVM worker nodes, shards state across them, and provides exactly-once processing guarantees through distributed checkpointing and two-phase commit. It can process millions of events per second across terabytes of state.2

Bytewax is a single-node dataflow engine. The Python API defines a directed graph of operators — map, filter, window, reduce — and the Rust engine executes them with thread-level parallelism on a single machine. State lives in-process. There is no distributed coordination, no cluster manager, no Zookeeper or KRaft.3
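The operator-graph idea can be sketched in pure Python: each operator wraps the stream produced by the previous one, and nothing runs until a sink pulls events through the chain. This is a toy model of the shape of such a dataflow, not Bytewax's actual API.

```python
# Toy sketch of a dataflow as a chain of operators over a lazy stream.
# Each operator is a generator wrapping the previous stage; the sink
# (list()) drives execution by pulling events through the chain.

def map_op(stream, fn):
    for event in stream:
        yield fn(event)

def filter_op(stream, pred):
    for event in stream:
        if pred(event):
            yield event

# Source -> map -> filter -> sink, as a linear operator graph.
source = iter([{"id": 1, "text": " hello "}, {"id": 2, "text": ""}])
stream = map_op(source, lambda e: {**e, "text": e["text"].strip()})
stream = filter_op(stream, lambda e: e["text"])  # drop empty documents
result = list(stream)  # the "sink" drives execution
```

A real Bytewax dataflow expresses the same chain declaratively and executes it on the Rust engine with thread-level parallelism, but the mental model — a directed graph of map/filter/window/reduce stages — is the same.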

This is a feature, not a limitation — for the workloads Bytewax targets. Most Python AI/ML teams are not processing petabytes of CDC events. They are ingesting thousands to tens of thousands of documents per second, generating embeddings, and writing to S3. Bytewax handles this at a fraction of the resource cost.

The 25x Memory Claim

Bytewax's headline benchmark is striking: 4.1 GB total memory consumption for a workload that pushes Flink to 97.9 GB.1 The 25x difference is real but context-dependent.

Where it holds: Moderate-throughput streaming workloads — CDC event processing, real-time embedding generation, log transformation, lightweight ETL. Bytewax's Rust engine manages memory allocation tightly, and the Python operators process events without the garbage collection pauses that plague JVM-based systems under memory pressure.

Where it breaks: The comparison assumes equivalent single-node workloads. Flink's memory overhead is partly the cost of distributed coordination — state backends, checkpoint storage, network buffers for cross-node shuffling. If you actually need distributed processing (because your throughput exceeds what one machine handles), Flink's overhead is the price of distributed exactly-once semantics. Bytewax cannot distribute — scaling up means a bigger machine, not more machines.

On cloud infrastructure, the 25x memory gap translates to roughly 4x lower compute costs — a meaningful difference for teams running ingestion pipelines 24/7 on EC2 or equivalent.1

The Real-World Pipeline

A concrete CDC pipeline illustrates where Bytewax fits:

  1. Debezium captures row-level changes from a PostgreSQL WAL — inserts, updates, deletes — and publishes them as events to Redpanda.

  2. Bytewax consumes from the Redpanda topic. A Python dataflow applies transformations: data masking for PII fields, currency conversion, timestamp normalization. For AI-relevant fields (product descriptions, support tickets), the dataflow calls a local embedding model to generate vectors.

  3. S3 write. Bytewax micro-batches the transformed events and writes them as Parquet files to S3. A separate commit step appends the new files to an Iceberg table's metadata manifest.

  4. Apache Airflow runs a scheduled DAG that triggers Iceberg compaction — merging the small Parquet files produced by micro-batch writes into optimally sized 128-512 MB files.
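The write path in step 3 can be sketched as follows, with in-memory dicts standing in for S3 and the Iceberg manifest. All names here are illustrative; real code would use boto3 or pyarrow for the Parquet write and an Iceberg catalog client for the commit.

```python
# Schematic of step 3: micro-batch events, write one "Parquet file" per
# batch to a fake S3, then append the new files to a fake Iceberg
# manifest in a separate commit step.
import json

s3 = {}          # stands in for the S3 bucket
manifest = []    # stands in for the Iceberg table's manifest

def write_batch(events, batch_id):
    key = f"data/part-{batch_id:05d}.parquet"
    s3[key] = json.dumps(events)   # real code: pyarrow.parquet + boto3
    return key

def commit(keys):
    manifest.extend(keys)          # real code: Iceberg catalog commit

events = [{"op": "insert", "id": i} for i in range(10)]
batch_size = 4
pending = []
for batch_id, start in enumerate(range(0, len(events), batch_size)):
    pending.append(write_batch(events[start:start + batch_size], batch_id))
commit(pending)
```

Note the gap between `write_batch` and `commit`: a crash in that window leaves files in S3 that no manifest references — the orphaned-file failure mode discussed below.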

This pipeline handles the ingestion volume of a typical mid-scale application (thousands of events/second) on a single node with ~4GB of memory. The equivalent Flink pipeline would deliver exactly-once guarantees via its Dynamic Iceberg Sink — stronger consistency, but requiring a JVM cluster and an order of magnitude more resources.4

Exactly-Once: The Gap That Matters

The most important technical difference between Bytewax and Flink for lakehouse ingestion is transactional guarantees.

Flink's Iceberg sink provides exactly-once semantics through two-phase commit. When a Flink checkpoint completes, all Parquet files written since the last checkpoint are atomically committed to the Iceberg manifest. If the pipeline crashes mid-checkpoint, incomplete files are discarded and processing resumes from the last committed offset. No duplicates. No data loss.2
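The checkpoint protocol can be modeled in a few lines: files written since the last checkpoint stay pending until the checkpoint completes, and crash recovery discards pending files and rewinds the source offset. This is a simplified model of the behavior described above, not Flink's implementation.

```python
# Minimal simulation of checkpoint-based two-phase commit: pending
# files are promoted atomically at checkpoint time; recovery discards
# anything uncommitted and resumes from the last committed offset.

class CheckpointedSink:
    def __init__(self):
        self.committed = []        # files in the Iceberg manifest
        self.pending = []          # files written since last checkpoint
        self.committed_offset = 0  # last committed source offset

    def write(self, path):
        self.pending.append(path)

    def checkpoint(self, offset):
        # Commit phase: atomically promote pending files and the offset.
        self.committed.extend(self.pending)
        self.pending = []
        self.committed_offset = offset

    def recover(self):
        # Crash mid-checkpoint: discard uncommitted files and resume
        # from the committed offset -> no duplicates, no data loss.
        self.pending = []
        return self.committed_offset

sink = CheckpointedSink()
sink.write("part-0.parquet")
sink.checkpoint(offset=100)
sink.write("part-1.parquet")   # crash before the next checkpoint
resume_from = sink.recover()   # part-1 discarded, replay from offset 100
```

The key property is that the file promotion and the offset advance happen as one atomic step, so the manifest and the consumer position can never disagree.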

Bytewax does not have a native Iceberg sink with two-phase commit. Writing to Iceberg from Bytewax requires application-level logic: micro-batch events into Parquet files, write them to S3, then commit to the Iceberg catalog. If the process crashes between the S3 write and the catalog commit, you get orphaned files. If it crashes after the catalog commit but before acknowledging the Redpanda offset, you get duplicate processing on restart.

For many workloads — document ingestion, embedding generation, log analytics — duplicates are tolerable and idempotency is achievable at the application level. For financial transactions, compliance-critical CDC, and workloads where every event must be processed exactly once, Flink's guarantees are non-negotiable.
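Application-level idempotency usually comes down to deterministic keys: derive the key from event content so that reprocessing overwrites instead of duplicating. A minimal sketch, with a dict standing in for a key-addressable store such as a vector index:

```python
# Sketch of idempotent writes: the same input always produces the same
# key, so a replayed event overwrites its earlier copy rather than
# creating a duplicate. Store and field names are illustrative.
import hashlib

index = {}  # stands in for a key-addressable store (e.g. vector index)

def upsert(doc):
    key = hashlib.sha256(doc["text"].encode()).hexdigest()
    index[key] = doc  # same input -> same key -> overwrite, not append

upsert({"text": "support ticket: login fails"})
upsert({"text": "support ticket: login fails"})  # replayed after a crash
```

This turns at-least-once delivery into effectively-once results for deterministic transformations — which is exactly why duplicates are tolerable for embedding and document-ingestion workloads but not for event-counting or financial ones.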

Airflow, Bytewax, and Flink: Different Tools for Different Problems

These three tools are often mentioned together but solve different problems:

Apache Airflow is a batch orchestrator. It defines DAGs of tasks that run on a schedule — hourly ETL jobs, daily compaction, weekly data quality checks. Airflow does not process streaming data. It triggers and monitors the tools that do.

Bytewax is a streaming processor for moderate-throughput, Python-native workloads. It handles continuous event streams — CDC, real-time embeddings, log transformation — on a single node.

Flink is a distributed streaming processor for high-throughput workloads with strong consistency requirements. It handles millions of events per second across a cluster with exactly-once semantics.

Many production architectures use all three:

  • Flink or Bytewax for continuous streaming ingestion into S3
  • Airflow for scheduled batch operations (compaction, reindexing, data quality)
  • Airflow triggering Flink jobs for complex distributed streaming tasks

The mistake is treating them as alternatives when they are complementary layers.

When to Choose Bytewax

Your team is Python-first. The engineers building the pipeline are ML engineers, data scientists, or Python backend developers. They know pandas, HuggingFace, and FastAPI. They do not know Java, Maven, or JVM garbage collection tuning.

Your throughput is moderate. Thousands to low tens of thousands of events per second. Document ingestion, embedding generation, feature extraction. Not millions-of-events-per-second financial tick data.

Your infrastructure budget is constrained. You are running on a single server, a small VPS cluster, or edge hardware. Bytewax runs in ~4GB of memory. Flink's minimum viable cluster requires 10-20x more.

Duplicates are tolerable. Your workload can handle idempotent reprocessing — embedding the same document twice produces the same vector, inserting the same log entry twice is deduplicated downstream.

When to Stay with Flink

You need exactly-once delivery to Iceberg. Financial data, compliance CDC, audit trails — any workload where a duplicate or lost event has real consequences.

Your throughput requires distribution. If a single machine cannot handle your event volume, Flink's distributed execution is the only open-source streaming option with mature S3/Iceberg integration.

Your organization already operates Flink. If you have a platform team that manages Flink clusters, tuned connectors, and established monitoring — the operational cost is already paid. Switching to Bytewax gains memory efficiency but loses ecosystem maturity.

You need the Flink connector ecosystem. Flink has mature connectors for Kafka, Kinesis, JDBC, Elasticsearch, and dozens of other systems. Bytewax's connector ecosystem is smaller and younger.

The Trend Line

The direction is clear even if the destination is not. Python's dominance in AI/ML engineering is pulling the streaming ecosystem toward Python-native tools. Bytewax is the furthest along, but it is not alone — the broader trend includes dlt for Python-native data loading and Polars for Python-native analytical processing.

The JVM is not going away. Flink will remain the standard for enterprise-scale streaming with strong consistency guarantees. But for the growing population of teams that need to get data from operational systems into S3 lakehouses and vector indexes — and whose engineers think in Python — the JVM tax is no longer mandatory.


Footnotes

  1. How Bytewax Beats Flink in Efficiency, Cost, and Ease of Use — Memory and cost benchmarks

  2. CDC Strategies in Apache Iceberg — Flink Iceberg sink exactly-once semantics

  3. Bytewax Documentation — Dataflow API and architecture

  4. The Rise of the Streaming Data Lakehouse — Streaming lakehouse architecture patterns