DualPipe
Definition
Bidirectional pipeline-parallelism algorithm introduced in the DeepSeek-V3 Technical Report and open-sourced on Day 4 of DeepSeek's Open Source Week (February 2025); designed for the V3/R1 training stack. Micro-batches are fed in from both ends of the pipeline, so while one set is running its forward pass, another set is running its backward pass on the same devices, and the communication of one can be hidden behind the computation of the other. Each chunk is divided into **four components: attention, all-to-all dispatch, MLP, all-to-all combine**, which allows fine-grained overlap of computation with all-to-all communication (NVLink within a node, InfiniBand across nodes in the V3 cluster). The reference implementation requires PyTorch 2.0 or later and is meant to be integrated into existing training pipelines.
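As a toy illustration of the four-component split described above, the sketch below lines up a forward chunk against a backward chunk (which visits the same components in reverse order) and checks that every time slot pairs one compute component with one communication component. The `Component` type and `pair_slots` helper are hypothetical names, and the simple one-to-one pairing is an assumption made for illustration; the real schedule is more involved and is not reproduced here.

```python
from typing import List, NamedTuple, Tuple


class Component(NamedTuple):
    name: str
    kind: str  # "compute" or "comm"


# The four components of a chunk, in forward order (from the definition above).
FORWARD_CHUNK = [
    Component("attention", "compute"),
    Component("all-to-all dispatch", "comm"),
    Component("MLP", "compute"),
    Component("all-to-all combine", "comm"),
]

# Assumption for this toy model: the backward pass walks the same
# components in reverse order, each with the same compute/comm kind.
BACKWARD_CHUNK = list(reversed(FORWARD_CHUNK))


def pair_slots(
    forward: List[Component], backward: List[Component]
) -> List[Tuple[Component, Component]]:
    """Pair the i-th forward component with the i-th backward component.

    Raises ValueError if a slot would run two compute components or two
    communication components at once, since that slot would not overlap
    computation with communication.
    """
    slots = []
    for fwd, bwd in zip(forward, backward):
        if fwd.kind == bwd.kind:
            raise ValueError(f"no overlap possible: {fwd.name} vs {bwd.name}")
        slots.append((fwd, bwd))
    return slots


SLOTS = pair_slots(FORWARD_CHUNK, BACKWARD_CHUNK)
for i, (fwd, bwd) in enumerate(SLOTS):
    print(f"slot {i}: forward {fwd.name} ({fwd.kind}) "
          f"|| backward {bwd.name} ({bwd.kind})")
```

In this toy model every slot keeps both the compute units and the interconnect busy, which is the property the four-way split is meant to enable.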
Cross-node expert-parallel MoE training carries heavy all-to-all communication overhead: dispatching tokens to experts on remote GPUs and then combining the results can take up a large share of training-step wall-time. Traditional pipeline schedules already leave devices idle in "pipeline bubbles", and blocking all-to-all calls add further stalls. DualPipe's bet is to structurally overlap the communication phases with computation running in the *other direction* of the pipeline, so that neither GPUs nor NICs sit idle. DeepSeek presents this overlap as one of the efficiency levers behind V3's reported training cost of under $6M.
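To make the bubble argument concrete, the sketch below compares idle time under three schedules. The formulas are the ones given in the comparison table of the DeepSeek-V3 Technical Report as understood here, and should be checked against the report. `f`, `b` and `w` are the times of a forward chunk, a full backward chunk, and a weight-gradient backward chunk; `fb_overlapped` is the time of a mutually overlapped forward and backward chunk pair. The function names and all timing values are made-up placeholders, not measurements.

```python
def bubble_1f1b(pp: int, f: float, b: float) -> float:
    """Bubble time for a 1F1B schedule: (PP - 1) * (F + B)."""
    return (pp - 1) * (f + b)


def bubble_zb1p(pp: int, f: float, b: float, w: float) -> float:
    """Bubble time for a ZB1P schedule: (PP - 1) * (F + B - 2W)."""
    return (pp - 1) * (f + b - 2 * w)


def bubble_dualpipe(pp: int, fb_overlapped: float, b: float, w: float) -> float:
    """Bubble time for DualPipe: (PP/2 - 1) * (F&B + B - 3W)."""
    if pp % 2 != 0:
        raise ValueError("the bidirectional schedule assumes an even number of stages")
    return (pp // 2 - 1) * (fb_overlapped + b - 3 * w)


# Hypothetical timings in arbitrary units, chosen only for illustration.
PP = 16            # pipeline stages, matching V3's 16-way PP
F, B, W = 1.0, 2.0, 1.0
FB_OVERLAPPED = 2.5

print("1F1B    :", bubble_1f1b(PP, F, B))                      # 45.0
print("ZB1P    :", bubble_zb1p(PP, F, B, W))                   # 15.0
print("DualPipe:", bubble_dualpipe(PP, FB_OVERLAPPED, B, W))   # 10.5
```

With these placeholder numbers the DualPipe bubble is the smallest of the three; how large the gap is in practice depends on the real values of the four timings.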
Typical uses: frontier-scale MoE training where pipeline bubbles dominate wall-time; large-scale cross-node expert parallelism (V3 uses 16-way PP, 64-way EP, and ZeRO-1 DP); distributed training where communication-computation overlap is the limiting efficiency factor; and serving as a reference design for any pipeline-parallel training framework that adopts bidirectional scheduling.
Recent developments
- Open-sourced Feb 27, 2025 as part of DeepSeek Open Source Week Day 4. Public release of the reference implementation; the algorithm itself was described earlier in the DeepSeek-V3 Technical Report. Per MarkTechPost (DualPipe announcement).
- GitHub reference implementation (deepseek-ai/DualPipe). PyTorch implementation of the schedule, requiring PyTorch 2.0 or later. Per GitHub (deepseek-ai/DualPipe).
- Released alongside EPLB as a parallel-strategy upgrade. The same release included EPLB (Expert Parallelism Load Balancer); together the two pieces upgrade DeepSeek's parallel-strategy stack. Per AIBase (Parallel Strategy Upgrade).
- Used in V3 with 16-way PP, 64-way EP, and ZeRO-1 DP. The V3 training configuration combines 16-way Pipeline Parallelism, 64-way Expert Parallelism spanning 8 nodes, and ZeRO-1 Data Parallelism; the 8-node figure describes the expert-parallel group, not the whole cluster. Per DeepSeek-V3 Technical Report (arXiv 2412.19437).
- PyTorch Conference Europe 2026: DualPipe talk by NVIDIA. A featured talk titled "Optimizing Large MoE Inference on NVIDIA Blackwell: NVFP4, ADP, and DualPipe Strategies" suggests that DualPipe-style scheduling is also being discussed for MoE inference, not only for training. Per DeepSeek on X (Day 4 announcement).
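The V3 parallelism figures quoted in the list above can be written down as a small config object, which makes the "64-way EP spanning 8 nodes" statement easy to sanity-check. This is only a sketch: `ParallelLayout` and its field names are hypothetical and do not come from DualPipe or any DeepSeek code, and the data-parallel degree is left out because this entry does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical container for the parallelism degrees quoted above."""

    pipeline_parallel: int       # PP degree
    expert_parallel: int         # EP degree
    expert_parallel_nodes: int   # nodes spanned by one EP group

    def __post_init__(self) -> None:
        # An EP group is assumed to be spread evenly over its nodes.
        if self.expert_parallel % self.expert_parallel_nodes != 0:
            raise ValueError("EP ranks must divide evenly across nodes")

    @property
    def ep_ranks_per_node(self) -> int:
        """Number of expert-parallel ranks hosted on each node of the group."""
        return self.expert_parallel // self.expert_parallel_nodes


# Values taken from the V3 bullet: 16-way PP, 64-way EP spanning 8 nodes.
V3_LAYOUT = ParallelLayout(
    pipeline_parallel=16,
    expert_parallel=64,
    expert_parallel_nodes=8,
)

print(V3_LAYOUT.ep_ranks_per_node)  # 8
```

Under these figures each node in an expert-parallel group hosts 8 ranks, so the dispatch and combine steps necessarily cross node boundaries.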