Architecture

DeepGEMM

Definition

What it is

DeepGEMM is a clean, FP8-first GEMM (general matrix multiplication) library from DeepSeek, hand-tuned on top of NVIDIA CuTe and CUTLASS primitives. It targets a small number of well-chosen FP8 GEMM and MoE-shape kernels that DeepSeek's models actually use, packaged behind a JIT-compiled Python API. The April 2026 release (PR #304, "Public release 26/04") marks DeepGEMM's evolution from a static kernel library into a runtime. It adds **Mega MoE** (dispatch, linear1, SwiGLU, linear2, and combine fused into one mega-kernel), **FP4 Indexer** (for Multi-Query Attention (MQA) logits and larger Multi-Token Prediction (MTP)), and **FP8×FP4 GEMM**.

Why it exists

General-purpose CUDA libraries (cuBLAS, and even CUTLASS reference paths) under-utilize tensor cores for the specific kernel shapes that show up in MoE inference, particularly the mixed FP8×FP4 paths that DeepSeek's models need after FP8 training. DeepGEMM is a bet that hand-tuning a narrow set of kernels for the exact shapes the models use beats general libraries by enough to matter for production serving cost. Mega MoE in particular fuses what were previously five sequential CUDA kernels (with their launch overhead and NVLink serialization) into a single mega-kernel that overlaps communication with compute.
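To make the five fused stages concrete, here is a minimal unfused reference in plain Python. It is only a sketch of the data flow that Mega MoE is described as collapsing into one kernel, not DeepGEMM code: the names `moe_unfused`, `matmul`, and `swiglu`, the nested-list tensors, and the convention that the first half of linear1's output is the SwiGLU gate are all assumptions made for illustration. In a real deployment dispatch and combine would involve cross-GPU communication; here everything runs in one process.

```python
import math


def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def swiglu(h):
    """Rows of width 2*d -> rows of width d: silu(gate) * up (gate = first half)."""
    d = len(h[0]) // 2
    return [[(row[i] / (1.0 + math.exp(-row[i]))) * row[d + i] for i in range(d)]
            for row in h]


def moe_unfused(tokens, routes, w1, w2):
    """Reference MoE layer run as five separate stages.

    tokens: list of hidden vectors; routes[t]: list of (expert, weight) pairs;
    w1[e]: (hidden x 2*inter) matrix; w2[e]: (inter x hidden) matrix.
    """
    hidden = len(tokens[0])
    out = [[0.0] * hidden for _ in tokens]

    # Stage 1, dispatch: group token indices by the expert they are routed to.
    buckets = {}
    for t, picks in enumerate(routes):
        for e, weight in picks:
            buckets.setdefault(e, []).append((t, weight))

    for e, members in buckets.items():
        x = [tokens[t] for t, _ in members]
        h = matmul(x, w1[e])   # Stage 2, linear1
        a = swiglu(h)          # Stage 3, SwiGLU
        y = matmul(a, w2[e])   # Stage 4, linear2
        # Stage 5, combine: weighted scatter-add back to each token's slot.
        for (t, weight), row in zip(members, y):
            for j, v in enumerate(row):
                out[t][j] += weight * v
    return out
```

Each stage here waits for the previous one to finish, which is the sequential pattern the fused kernel is meant to avoid.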

Primary use cases

  • Production serving of DeepSeek V3, R1, and V3.2 inference.
  • MoE inference workloads with FP8 weight and FP4 indexer paths.
  • RTX Pro 6000 Blackwell deployments (the community is actively patching in SM120 support).
  • MQA logits computation for V3.2's lightning indexer.
  • Any inference stack where FP8×FP4 GEMM kernels are on the hot path.
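As a rough illustration of what a mixed FP8×FP4 GEMM computes, the sketch below simulates the numerics in plain Python. It makes several assumptions that do not come from DeepGEMM itself: FP8 is taken to be the E4M3 format (maximum 448, three mantissa bits), FP4 is taken to be E2M1 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), each row gets a single scale factor, and the function names (`to_fp8`, `to_fp4`, `quantize`, `mixed_gemm_reference`) are hypothetical. Real kernels operate on packed low-precision data on tensor cores; this only mimics the rounding and rescaling.

```python
import math

FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative E2M1 magnitudes
FP8_E4M3_MAX = 448.0


def to_fp4(x):
    """Round to the nearest E2M1 value, saturating at +/-6 (ties go to the smaller magnitude)."""
    mag = min(FP4_E2M1, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x)


def to_fp8(x):
    """Round to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), FP8_E4M3_MAX)
    exp = max(math.frexp(a)[1] - 1, -6)  # exponent of the leading bit, floored at the minimum normal exponent
    step = 2.0 ** (exp - 3)              # spacing given 3 mantissa bits
    return math.copysign(round(a / step) * step, x)


def quantize(row, qmax, cast):
    """Scale a row so its largest magnitude maps to qmax, then cast each element."""
    amax = max(abs(v) for v in row) or 1.0
    scale = amax / qmax
    return [cast(v / scale) for v in row], scale


def mixed_gemm_reference(a, b_t):
    """C = A @ B^T with rows of A rounded to FP8 and rows of B^T rounded to FP4."""
    qa = [quantize(r, FP8_E4M3_MAX, to_fp8) for r in a]
    qb = [quantize(r, 6.0, to_fp4) for r in b_t]
    # Accumulate in full precision, then undo both scale factors.
    return [[sa * sb * sum(x * y for x, y in zip(ra, rb)) for rb, sb in qb]
            for ra, sa in qa]
```

The per-row scale is an illustrative choice only; the source does not state which scaling granularity DeepGEMM uses.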

Recent developments

Latest signals
  • April 2026 release: Mega MoE, FP4 Indexer, and FP8×FP4 GEMM (PR #304). Marks the transition from kernel library to runtime. Fuses dispatch, linear1, SwiGLU, linear2, and combine into one mega-kernel, overlapping NVLink communication with tensor-core compute. Requires PyTorch ≥ 2.9. Per antigravity.codes (DeepGEMM guide).
  • Hand-tuned on CuTe and CUTLASS primitives. A small CUDA library that takes a few well-chosen FP8 GEMM and MoE shapes seriously, packaged behind a JIT-compiled Python API. Per GitHub (deepseek-ai/DeepGEMM).
  • Community SM120 support for RTX Pro 6000 Blackwell. The most active community discussion concerns SM120 support, with multiple users running manually patched kernels on RTX Pro 6000 Blackwell. Per antigravity.codes (DeepGEMM guide).
  • FP4 Indexer for MQA logits and larger MTP. Per the April 2026 release notes, FP4 Indexer supports Multi-Query Attention (MQA) logits computation and enables larger Multi-Token Prediction (MTP) in V3.2 deployments. Per AIToolly (DeepGEMM launch).
  • DeepEPv2 MoE GEMM layout shipped in the April release. The updated MoE GEMM layout matches DeepEPv2's communication topology. Per PyShine (DeepGEMM FP8 kernels).
