Definition

What it is

**Fire-Flyer File System (3FS)** is DeepSeek's high-performance distributed file system, purpose-built for AI training and inference and **open-sourced in February 2025** at [github.com/deepseek-ai/3fs](https://github.com/deepseek-ai/3fs). Architecturally it is a kernel/userspace hybrid that uses **NVMe SSDs + RDMA** for the data plane, **CRAQ** (Chain Replication with Apportioned Queries) for strong consistency without a leader bottleneck, and **FoundationDB** for metadata. Published benchmarks report **6.6 TB/s aggregate read throughput** on a 180-node DeepSeek production cluster. It also supports a **KV cache mode** for inference, which positions the same substrate as a cost-effective alternative to DRAM-based caching for LLM serving workloads that lean heavily on the KV cache.
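To make the CRAQ idea concrete, here is a minimal toy model of the textbook protocol: writes flow from the head of a replica chain to the tail, the tail commits, and an acknowledgement flows back. Any replica can serve a read locally when its newest version is committed; otherwise it asks the tail which version is committed. This is a sketch under stated assumptions, not 3FS's code: the `Replica` and `Chain` classes and their methods are hypothetical names, the caller supplies version numbers (a real head node would assign them), and 3FS's own read path may differ in detail.

```python
class Replica:
    """One storage node in the chain (toy model)."""

    def __init__(self):
        self.versions = {}   # key -> {version: value}, may hold uncommitted writes
        self.committed = {}  # key -> highest version this node knows is committed


class Chain:
    """A fixed chain of replicas: index 0 is the head, the last index is the tail."""

    def __init__(self, length):
        self.nodes = [Replica() for _ in range(length)]
        self.tail = self.nodes[-1]

    def write_to(self, key, version, value, upto):
        """Propagate a write from the head through node index `upto` (inclusive)."""
        for node in self.nodes[: upto + 1]:
            node.versions.setdefault(key, {})[version] = value
        if upto == len(self.nodes) - 1:
            # The write is committed the moment the tail receives it.
            self.tail.committed[key] = version

    def ack(self, key, version):
        """Send the acknowledgement tail -> head: mark clean, drop older versions."""
        for node in reversed(self.nodes):
            node.committed[key] = version
            kept = {v: val for v, val in node.versions.get(key, {}).items() if v >= version}
            node.versions[key] = kept

    def read(self, key, index):
        """Read from any replica; only 'dirty' reads need to consult the tail."""
        node = self.nodes[index]
        versions = node.versions.get(key)
        if not versions:
            return None
        latest = max(versions)
        if node.committed.get(key) == latest:
            return versions[latest]              # clean: answer locally
        committed = self.tail.committed.get(key)  # dirty: ask the tail
        return versions.get(committed)


if __name__ == "__main__":
    chain = Chain(3)
    chain.write_to("chunk", 1, "a", upto=2)
    chain.ack("chunk", 1)
    chain.write_to("chunk", 2, "b", upto=1)  # write 2 is in flight, not yet at the tail
    print(chain.read("chunk", 0))            # head holds a dirty version, so it asks the tail
    chain.write_to("chunk", 2, "b", upto=2)  # write 2 reaches the tail and commits
    print(chain.read("chunk", 0))
```

The point of the design is that read load spreads across every replica in the chain while all readers still agree on the committed value, which is how CRAQ avoids funnelling reads through a single node.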

Why it exists

AWS S3 Files (April 2026) closed the POSIX gap for AWS, and JuiceFS offers a self-hostable equivalent, but DeepSeek's training scale needed something tuned at the protocol layer for **GPU-direct + RDMA**: neither S3-compatible APIs nor traditional HDFS could keep up. 3FS is the storage half of the same vertical bet that produced DeepSeek-V3 and R1: the assumption that **storage architecture is now part of the LLM training problem**, not an independent layer.

Primary use cases

Distributed file system for foundation-model training (datasets, checkpoints, activations); inference-side KV cache for memory-augmented serving; GPU-direct data pipelines that require sustained TB/s aggregate throughput; and an AI storage substrate for clusters where S3-compatible APIs add latency the workload cannot tolerate.
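As an illustration of the inference-side KV cache use case, the sketch below stores serialized KV-cache blocks as ordinary files keyed by a hash of the token prefix, using only standard file I/O against a directory. It assumes the file system is exposed as a mounted directory; the `FileKVCache` class, the `.kv` suffix, and the example path `/mnt/3fs/kvcache` are hypothetical, and a production integration would more likely use 3FS's native client interfaces than plain `pathlib` calls.

```python
import hashlib
from pathlib import Path


class FileKVCache:
    """Toy prefix-keyed cache that keeps one file per cached KV block."""

    def __init__(self, root):
        # `root` is assumed to be a directory on the shared file system.
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, token_ids):
        # Hash the token prefix so identical prefixes map to the same file.
        digest = hashlib.sha256(",".join(map(str, token_ids)).encode()).hexdigest()
        # Fan out into subdirectories to avoid one huge flat directory.
        return self.root / digest[:2] / f"{digest}.kv"

    def put(self, token_ids, blob):
        path = self._path(token_ids)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(blob)

    def get(self, token_ids):
        try:
            return self._path(token_ids).read_bytes()
        except FileNotFoundError:
            return None  # cache miss: the caller recomputes the prefix


if __name__ == "__main__":
    import tempfile

    # A temporary directory stands in for a mount such as /mnt/3fs/kvcache.
    with tempfile.TemporaryDirectory() as tmp:
        cache = FileKVCache(tmp)
        cache.put([1, 2, 3], b"serialized-kv-block")
        print(cache.get([1, 2, 3]))
        print(cache.get([9, 9, 9]))
```

Because every serving node sees the same directory, a prefix computed by one node becomes a cache hit for any other node, which is the property that makes a shared file system a candidate replacement for per-node DRAM caches.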

Recent developments

Latest signals

3FS saturates 400 Gbps storage NIC bandwidth without an internal DRAM cache. The arXiv paper Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference (arXiv:2602.21548v2) reports that cluster-wide 3FS, which has no internal DRAM cache, can saturate the 400 Gbps storage NIC under sustained AI-inference workloads. The paper evaluates 3FS as a storage backend for SGL(MC) and other agentic-inference frameworks; its headline finding is that 3FS's design (NVMe + RDMA, no DRAM hop, CRAQ for consistency) maps directly onto what the inference path needs and can keep storage a non-blocking resource even at extreme GPU concurrency.
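A quick back-of-the-envelope check puts the 400 Gbps figure next to the 6.6 TB/s over 180 nodes quoted in the definition. The helper names below are made up for illustration, decimal units are assumed throughout (1 GB = 10^9 bytes), and the comparison assumes each node in the benchmark cluster has 400 Gbps of storage NIC bandwidth, which the two sources do not jointly confirm.

```python
def gbps_to_gbytes_per_s(gbps):
    """Convert a link rate in gigabits per second to gigabytes per second."""
    return gbps / 8


def per_node_gbytes_per_s(aggregate_tbytes_per_s, nodes):
    """Average per-node throughput in GB/s for a given aggregate in TB/s."""
    return aggregate_tbytes_per_s * 1000 / nodes


nic_limit = gbps_to_gbytes_per_s(400)       # line rate of one 400 Gbps NIC
per_node = per_node_gbytes_per_s(6.6, 180)  # average share of the 6.6 TB/s aggregate
utilisation = per_node / nic_limit          # fraction of line rate, under the assumption above

print(f"NIC line rate:      {nic_limit:.1f} GB/s")
print(f"Per-node average:   {per_node:.1f} GB/s")
print(f"Implied NIC usage:  {utilisation:.0%}")
```

Under those assumptions the aggregate benchmark averages roughly 37 GB/s per node against a 50 GB/s line rate, so the two figures are at least consistent in order of magnitude.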