For two years, the conversation about object storage was about cost. Egress charges, storage classes, lifecycle rules, the small-files problem. Storage was a budget item.
Then in January 2025, DeepSeek shipped R1 — built on a base model trained for a reported $5.6 million[^1] — and NVIDIA lost $589 billion in market cap in a day. The market priced in the fact that algorithmic efficiency had outpaced hardware scaling — and quietly, underneath that headline, the storage layer became the critical path. GPUs that cost $40,000 each were sitting idle 50% of the time waiting on data, and the fix was no longer "buy more storage." The fix was "rethink the data plane."
This post is a reading of what changed in the storage layer over the last six months, why the China direction is now the first-class axis of the index, and which 13 new nodes were added in the May 7 wave to track it.
The pain point that became load-bearing
Cold Scan Latency has been a pain point in this index since launch. So has High Cloud Inference Cost. Both were framed for analytics workloads — slow first queries, expensive embedding generation. Neither was framed as the dominant cost driver in training.
That changed. Profiling at Uber, Shopee, and AliPay attributes ~80% of end-to-end AI training wall-clock to data loading[^2] — the steady-state read throughput from S3 to GPU HBM during a synchronous training step. Distinct from cold-scan latency, distinct from ETL throughput, this is its own pain point, and it's the one foundation-model labs are spending engineering quarters fixing. We've added Data Loading Bottleneck as a first-class node in this update.
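To see whether that pain point applies to your own cluster, the measurement is cheap. Here is a minimal PyTorch-style sketch (model, dataset, and hyperparameters are placeholders) that splits each step into time spent waiting on the loader versus time spent computing:

```python
import time
import torch
from torch.utils.data import DataLoader

def measure_data_wait(model, dataset, steps=100, batch_size=32):
    """Split each training step into 'waiting on data' vs 'compute'.

    If the wait share dominates, the bottleneck is the data plane, not the
    GPU. Model, dataset, and loss are placeholders for your own pipeline.
    """
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=8)
    optimizer = torch.optim.AdamW(model.parameters())
    wait = compute = 0.0

    batches = iter(loader)
    for _ in range(steps):
        t0 = time.perf_counter()
        inputs, labels = next(batches)       # blocks until the shard arrives from storage
        t1 = time.perf_counter()

        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if torch.cuda.is_available():
            torch.cuda.synchronize()         # make async GPU work visible to the timer
        t2 = time.perf_counter()

        wait += t1 - t0
        compute += t2 - t1

    total = wait + compute
    print(f"data wait: {wait / total:.0%} of step time, compute: {compute / total:.0%}")
```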
The architectural answer was already half-built in the index. GPU-Direct Storage Pipeline and NVIDIA GPUDirect RDMA for S3 were on the map. What we hadn't tracked was the cache tier between S3 and the GPU fleet — the layer that absorbs the first read, retains hot tensors and shard files on local NVMe, and replays subsequent reads at near-local-disk speed. That's Alluxio, now a node. Public case studies report 10× faster GPU data loading versus direct S3 reads[^3], and Meta reported sustaining 192 GB/s to 2,048 H100 GPUs simultaneously when GPUDirect Storage 2.0 went into production — cutting end-to-end training wall-clock by 3.8× versus the prior CPU-mediated path[^4].
Tiered Storage was already a node, but its definition was about cost optimization. We've enriched it to capture the AI training tier reference shape — Netflix runs 5 PB NVMe → 100 PB HDD → 500 PB S3 → 2 EB tape, with NVMe holding 1–2% of capacity but absorbing 60% of requests and tape holding 10–20% of capacity but seeing under 1% of reads. That four-tier shape saves a reported $45M/year over a flat all-S3 layout. Storage tiering used to be operations. Now it's training-cluster scheduling.
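The cost logic behind that shape is easy to sanity-check. Below is a back-of-the-envelope sketch of a four-tier layout; the capacity and request shares are round numbers in the spirit of the tiers above, and the per-GB prices and 1 EB total are illustrative assumptions, not Netflix's actual footprint or rates.

```python
# Back-of-the-envelope model of a four-tier layout. Capacity shares,
# request shares, prices, and the 1 EB total are illustrative assumptions.
TOTAL_GB = 1e9  # ~1 EB

tiers = {
    #           capacity share, request share, $/GB-month (assumed)
    "nvme":    (0.015,          0.60,          0.100),
    "hdd":     (0.25,           0.30,          0.020),
    "s3":      (0.55,           0.09,          0.021),
    "archive": (0.185,          0.01,          0.002),
}

tiered = sum(cap * TOTAL_GB * price for cap, _, price in tiers.values())
flat = TOTAL_GB * 0.021  # everything left in standard S3

print(f"tiered:  ${tiered / 1e6:5.1f}M / month")
print(f"flat S3: ${flat / 1e6:5.1f}M / month")
# The hot tier is tiny in bytes but absorbs most reads, which is what lets
# the cheap cold tiers carry most of the capacity without starving the GPUs.
```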
The 2026 architecture map
By early 2026, every major frontier model uses some combination of DeepSeek's innovations. The exact mix reveals strategic bets about which efficiency lever matters most.
| Model | Total params | Active params | MLA? | MoE? | FP8? | Region |
|---|---|---|---|---|---|---|
| DeepSeek-V3 | 671B | 37B | ✅ | ✅ (257 experts) | ✅ | 🇨🇳 China |
| GLM-5 | 744B | 40B | ✅ | ✅ (256 experts) | ? | 🇨🇳 China |
| Kimi K2 | ~1T | ? | ✅ | ✅ | ? | 🇨🇳 China |
| Mistral 3 Large | ? | ? | ✅ | ✅ | ? | 🇫🇷 EU |
| Llama 4 Maverick | 400B | 17B | ❌ (GQA) | ✅ (2 active) | ✅ | 🇺🇸 US |
| Qwen3 | 235B | 22B | ❌ (GQA) | ✅ (MoE variants) | ? | 🇨🇳 China |
Three strategic groups have emerged.
China labs — DeepSeek, Zhipu (GLM), Moonshot (Kimi) — adopted Multi-Head Latent Attention plus fine-grained Mixture-of-Experts: many small experts, each highly specialized. The parameter efficiency is extreme: GLM-5 activates only 5.4% of its 744B weights per token. The bet is that routing precision beats expert size.
Meta (US) took the opposite path with Llama 4 Maverick: GQA plus coarse MoE — only two experts active per token, versus the hundreds of small routed experts in the China designs. Fewer, larger experts mean simpler routing and better compatibility with existing GQA inference stacks. Meta traded absolute efficiency for deployability.
Alibaba (Qwen3) chose a hybrid: Gated DeltaNet linear attention plus GQA, deliberately skipping MLA. DeltaNet uses a gating mechanism to compress historical states without the full KV cache, aiming for long-context efficiency without the MLA infrastructure rewrite.
⚠ Note on disputed values: GLM-5 exact counts (744B/40B) are cited from industry coverage; Zhipu AI has not published official parameter breakdowns. Kimi K2's "~1T" figure is an estimate — Moonshot AI has not confirmed exact total or active parameters.
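As a sanity check on the parameter-efficiency claim, the active fractions fall straight out of the table values (including the unconfirmed ones):

```python
# Active-parameter fraction per token, straight from the table above.
# GLM-5's figures are unconfirmed (see note); Kimi K2 is omitted because
# its active-parameter count is unknown.
models = {
    "DeepSeek-V3":      (671e9, 37e9),
    "GLM-5":            (744e9, 40e9),   # unconfirmed
    "Llama 4 Maverick": (400e9, 17e9),
    "Qwen3":            (235e9, 22e9),
}
for name, (total, active) in models.items():
    print(f"{name:18} {active / total:5.1%} of weights active per token")
# DeepSeek-V3 and GLM-5 land around 5.5%, Llama 4 Maverick around 4%, and
# Qwen3 around 9%; coarse MoE is not less sparse, it just routes differently.
```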
The divergence is not academic. It determines which inference engine you buy, which storage tier you need, and whether you can migrate an existing GQA fleet without retraining.
Sources: DeepSeek-V3 arXiv:2412.19437; GLM-5 industry coverage (Feb 2026); Llama 4 technical release; Qwen3 architecture paper; Mistral 3 adoption notes; Sebastian Raschka architecture comparison.
FP8: the new default precision
In 2024, BF16 was the training standard and FP8 was a research curiosity. By 2026, FP8 is becoming the de facto baseline for new model training — and the shift is driven by hardware economics, not just DeepSeek's proof point.
The quantitative case is simple: FP8 halves compute and memory requirements versus BF16. On NVIDIA Hopper and Blackwell, this translates to up to 4× training throughput and up to 6× inference throughput. The datacenter TCO advantage compounds across model iterations, checkpoint storage, and serving fleet size.
Hardware support (confirmed production):
- NVIDIA Hopper (H100, H200, H800) — native FP8 tensor cores
- NVIDIA Blackwell (B100, B200, RTX 5000 series) — FP8-optimized paths
- Intel Gaudi — HPU with FP8 support
- AMD MI300X+ — adding FP8 support in current generation
The challenge is accumulation precision. Hopper's FP8 tensor cores maintain only 13 fraction bits for addition — insufficient for the large matrix accumulations in transformer training. DeepSeek solved this with fine-grained quantization: tile-wise 1×128 scaling for activations, block-wise 128×128 scaling for weights. The implementation is open-sourced as DeepGEMM.
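To make the scaling granularity concrete, here is a minimal NumPy sketch of the idea, not DeepGEMM itself and without an actual FP8 cast: each 1×128 activation tile and each 128×128 weight block gets its own scale, pegged to E4M3's maximum finite value of 448, so a single outlier only costs resolution inside its own tile.

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite value representable in E4M3

def quantize_activations_tilewise(x, tile=128):
    """Per-(1 x tile) scaling along the inner dim, mirroring the recipe above.

    Returns the scaled payload (a stand-in for the FP8 tensor; no actual FP8
    cast happens here) plus one scale per tile for dequantization.
    """
    rows, cols = x.shape
    assert cols % tile == 0
    xt = x.reshape(rows, cols // tile, tile)
    scales = np.abs(xt).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scales = np.maximum(scales, 1e-12)           # guard all-zero tiles
    return xt / scales, scales                   # dequant: payload * scales

def quantize_weights_blockwise(w, block=128):
    """Per-(block x block) scaling for weights."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    wb = w.reshape(rows // block, block, cols // block, block)
    scales = np.abs(wb).max(axis=(1, 3), keepdims=True) / FP8_E4M3_MAX
    scales = np.maximum(scales, 1e-12)
    return wb / scales, scales

# One activation outlier only inflates the scale of its own 1x128 tile,
# instead of crushing the resolution of the entire tensor.
x = np.random.randn(4, 512).astype(np.float32)
x[0, 3] = 1e4
_, scales = quantize_activations_tilewise(x)
print(scales.min(), scales.max())
```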
Llama 4 followed DeepSeek-V3 into native FP8 training. The direction is now clear: successive hardware generations (H200 → B300) deliver diminishing returns for BF16 throughput. The next leap requires precision reduction.
For object storage, this matters because FP8-trained models produce smaller checkpoints (half the size of BF16 equivalents) and enable larger batch sizes during data loading — increasing the I/O pressure on the storage tier that feeds GPUs.
Sources: FP8 Datacenter TCO arXiv:2502.01070; DeepGEMM GitHub; NVIDIA Hopper architecture documentation; Llama 4 training precision disclosure.
The inference engine divergence
You cannot serve MLA and GQA models on the same optimized inference stack. This is an operational truth that most architecture comparisons ignore, and it changes how you design data-center infrastructure.
Multi-Head Latent Attention compresses keys and values into a low-rank latent vector — roughly 512 dimensions — and caches that latent instead of the full per-head keys and values. This achieves the advertised 93% cache reduction, but it also changes the memory layout in ways that break standard inference-engine assumptions. The table below shows where the stacks diverge:
| Feature | GQA Models | MLA Models |
|---|---|---|
| KV cache offloading | Supported (to CPU/SSD) | Not supported |
| Block size | Larger blocks = higher throughput | Block size = 1 (required) |
| FlashAttention | Native kernels | Custom kernels needed |
| vLLM support | Native PagedAttention | Requires plugin or SGLang |
| Quantization | Standard INT8/FP8 KV quant | Needs MLA-aware quant (SnapMLA) |
The KV-offloading incompatibility is the sharpest constraint. GQA models can spill KV cache to CPU memory or NVMe when context grows; MLA models cannot, because the compressed latent representation is tightly coupled to the attention kernel. If you plan to serve 128K+ context on MLA, you must fit the entire KV working set in GPU HBM — and that lands directly on Data Loading Bottleneck when the working set spills past available HBM into a streaming-reload pattern.
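To put numbers on that constraint, here is a rough sizing sketch. The layer counts, head counts, and dimensions below are illustrative assumptions rather than any specific model's published config, and the baseline choice (full MHA versus GQA) is exactly why quoted reduction percentages vary.

```python
def kv_bytes_mha(seq, layers=60, heads=64, head_dim=128, elem=2):
    return seq * layers * heads * head_dim * 2 * elem      # K + V, one entry per head

def kv_bytes_gqa(seq, layers=60, kv_heads=8, head_dim=128, elem=2):
    return seq * layers * kv_heads * head_dim * 2 * elem   # K + V, shared KV heads

def kv_bytes_mla(seq, layers=60, latent_dim=512, elem=2):
    return seq * layers * latent_dim * elem                # one compressed latent per token

ctx = 131_072  # 128K-token context, single sequence
for name, fn in [("MHA", kv_bytes_mha), ("GQA", kv_bytes_gqa), ("MLA", kv_bytes_mla)]:
    print(f"{name}: {fn(ctx) / 2**30:6.1f} GiB of KV cache")
# With these assumed dims: MHA ~240 GiB, GQA ~30 GiB, MLA ~7.5 GiB per sequence.
# GQA can spill its cache to CPU or NVMe when it outgrows HBM; MLA cannot,
# so the whole MLA working set has to fit on the GPU.
```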
Block size is equally rigid. vLLM's PagedAttention assumes larger blocks for contiguous memory access. MLA forces block size to 1, which hurts throughput unless the inference engine was built for it. SGLang is currently the only production engine with native MLA optimization.
TransMLA is the migration bridge. Introduced at NeurIPS 2025 (Spotlight), it converts any GQA model — Llama, Qwen, Gemma, Mistral — into MLA with just 6B tokens of fine-tuning. On Llama-2-7B at 8K context it yields a 10× speedup. But it is not yet upstreamed into vLLM. Until that lands, data centers must choose: build an MLA-native stack, or stay on the GQA stack and accept the efficiency ceiling.
Sources: MLA paper arXiv:2405.04434; TransMLA NeurIPS 2025; SGLang MLA documentation; vLLM plugin status (April 2026); SnapMLA arXiv:2602.10718.
Why the China direction is now the first-class axis
Three forces converged. (1) US export controls cut Chinese labs off from H100/H200/Blackwell silicon, forcing them to compensate with raw scale and cheap power — which only the western Chinese provinces have. (2) The CLOUD Act (CLOUD Act Data Access, now a node) made every US-headquartered cloud provider a sovereign-disclosure risk for non-US data, regardless of which region the bucket physically lived in. (3) China Data Localization (now a node too) closed the loop on the other side: PRC-citizen data legally cannot egress without per-dataset Cyberspace Administration of China review.
What sits in the middle of all three forces is object storage. Not analytics. Not vector databases. Not the inference engines. The storage tier is the layer that has to physically be in a particular jurisdiction, on a particular silicon family, owned by a particular entity. Everything else can move; storage cannot.
That's why we've added Aliyun OSS, Tencent COS, and Huawei OBS as first-class technology nodes in this update. These are not regional curiosities — they're the silent majority of foundation-model training storage in 2026. DeepSeek, Zhipu AI (GLM-5, trained on 100,000 Huawei Ascend 910B chips with OBS as the storage tier[^5]), Moonshot AI, and Alibaba's own Qwen team — none of them are training against AWS S3. They're training against the China S3 trio. The index was missing this entire half of the storage market until this expansion.
We've also added East Data West Computing as an architecture node — the placement strategy that connects the China S3 trio's coastal hot tiers to the western-province training clusters drawing on a projected ~400 GW of spare grid capacity by 2030[^6] at electricity rates as low as 3¢/kWh. This is the macro shape that explains why the Chinese cloud providers cluster their regions where they do, and why the US-side equivalent doesn't exist: in the US, a projected ~44 GW power shortfall by 2030[^7] and 8+ year interconnection queues are actively blocking the same shape from being built. We've added Datacenter Power Shortfall and Datacenter Water Consumption as pain points to capture the US-side constraint.
Three data gravity wells
Object storage is no longer just an engineering choice. It is a geopolitical design decision. Three incompatible regulatory regimes now constrain how AI training data moves, where it lives, and who can access it.
United States — CLOUD Act. US law enforcement can compel cloud providers to hand over data stored on American companies' servers, regardless of where that data physically resides. For object storage, this means any data in AWS S3, Azure Blob, or Google Cloud Storage is reachable via US legal process. Multi-region replication does not escape this — if the provider is US-headquartered, the CLOUD Act applies globally.
European Union — GDPR. Cross-border data transfers require adequacy decisions or standard contractual clauses. AI training datasets containing personal data face strict purpose-limitation and data-minimization rules. The Schrems II ruling invalidated Privacy Shield, making US cloud transfers legally precarious. Object storage architects in the EU must design for data residency by default.
China — Data Localization. China's Data Security Law and Personal Information Protection Law require that "important data" and personal information collected in China be stored domestically. An estimated 78% of cross-border data partnerships have been impacted by localization requirements since 2021. Aliyun OSS, Tencent COS, and Huawei OBS are the domestic standards — S3-like in API, but not AWS S3 compatible in every edge case.
The East–West infrastructure split amplifies the storage tension:
| Dimension | 🇺🇸 West (US-led) | 🇨🇳 East (China-led) |
|---|---|---|
| AI capital | $67.2B | $43.8B |
| GPU access | H100/H200/Blackwell unrestricted | H800 (cut-down), export controls |
| Energy | ~44 GW projected shortfall by 2030 | ~400 GW spare capacity by 2030 |
| Electricity cost | Higher | ~4–5× lower (western provinces) |
| S3 standard | AWS S3 API | Aliyun OSS / Tencent COS / Huawei OBS |
| License preference | AGPL controversy (MinIO → RustFS) | Apache/BSD preferred |
DeepSeek trained on H800s with reduced interconnect bandwidth, then compensated with DualPipe, custom all-to-all kernels, and FP8 — proving that algorithmic efficiency can beat hardware restrictions. The $5.6M training cost was not a subsidy. It was structural efficiency. (SemiAnalysis disputes the marginal-only framing and estimates $12–15M for the full R&D cost; either figure is an order of magnitude below comparable Western training runs.) And it sent NVIDIA's market cap down $589 billion in a single trading day.
Engineers building multi-region AI infrastructure must now optimize for three gravity wells simultaneously. This is not a compliance checkbox — it is a fundamental constraint on storage architecture, and it's the reason Sovereign Storage sits where it does in this index.
Sources: CLOUD Act text and DOJ guidance; GDPR Chapter V (transfers); China Data Security Law (2021); SemiAnalysis NVIDIA market-cap analysis (Jan 2025); DeepSeek-V3 training cost arXiv:2412.19437.
The licensing axis the index didn't have a name for
When MinIO archived its main repo on April 25, 2026 and tightened AGPL v3 enforcement on the remaining open code, the post-MinIO ecosystem we mapped in the last wave — RustFS, Alarik, Garage, SeaweedFS — wasn't just a competing-storage story. It was a licensing migration, driven by a specific architectural exposure that the index didn't have a node for.
We do now: AGPL Licensing Risk. The "network use is distribution" clause of AGPL v3 makes self-hosted storage embedded in commercial products legally treacherous enough that Apache 2.0 alternatives became architecturally mandatory, not just preferable. The "license preference" axis on the East–West infrastructure split tracks the same way — China's storage stack tilts heavily toward Apache/BSD licensing for the same reason. RustFS published benchmarks showing 2.3× faster small-object performance vs MinIO for 4 KB payloads and peak read throughput up to 323 GB/s[^8], and we've enriched the RustFS node with those numbers along with the AGPL-vs-Apache-2.0 framing.
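Vendor benchmarks are worth reproducing, and the shape of the test is simple. Here is a minimal boto3 sketch against any S3-compatible endpoint (the endpoint, bucket, and credential variables are placeholders) that times 4 KB PUT/GET round trips:

```python
import os
import time
import boto3

# Placeholders: point these at the S3-compatible endpoint under test
# (RustFS, MinIO, Garage, ...) and a bucket you own.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT"],
    aws_access_key_id=os.environ["S3_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET"],
)
BUCKET = "bench"
PAYLOAD = os.urandom(4 * 1024)   # 4 KB objects, matching the cited benchmark

def bench(n=1000):
    puts, gets = [], []
    for i in range(n):
        key = f"bench/{i:06d}"
        t0 = time.perf_counter()
        s3.put_object(Bucket=BUCKET, Key=key, Body=PAYLOAD)
        t1 = time.perf_counter()
        s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        t2 = time.perf_counter()
        puts.append(t1 - t0)
        gets.append(t2 - t1)
    puts.sort()
    gets.sort()
    print(f"p50 PUT {puts[n // 2] * 1e3:.2f} ms, p50 GET {gets[n // 2] * 1e3:.2f} ms")

bench()
```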
Storage as the AI inference tier
The last new technology node is Wasabi AiR. On its surface it's a media-storage product with auto-tagging — facial recognition, speech-to-text, OCR, logo detection — bundled at $6.99/TB/month with zero egress[^9]. Underneath, it's an early example of the storage layer becoming an AI inference tier. The traditional pattern (pull data out of cold storage, run it through a third-party tagging API, write metadata back) gets collapsed into the bucket itself. No egress to the tagging service, no metadata pipeline glue, no per-call AI billing.
Wasabi AiR is small in absolute scale, but it's the structural shape we expect to see show up everywhere: object storage that does inference, not just storage. It anchors a future cluster of nodes around in-bucket AI compute that doesn't exist yet — but the seed is now in the index.
The Iceberg V3 sub-spec we extracted
Last on the technical side: Puffin File Format is now its own Standard node, separate from Iceberg V3 Spec. Puffin had been folded into Iceberg V3's body text, but it's the load-bearing format underneath the deletion vector capability that turns Iceberg V3 from "Iceberg V2 plus features" into something with order-of-magnitude-faster MERGE/UPDATE. Roaring-bitmap deletes encoded as Puffin blobs replace the copy-on-write file rewrites that made V2 economically punishing for CDC. We also enriched Amazon S3 Tables with the April 2026 Intelligent-Tiering integration — automatic cold-data migration on table buckets without disrupting compaction.
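A toy sketch of why that changes the MERGE/UPDATE economics: instead of rewriting the data file, the writer records deleted row positions in a bitmap (in the real format, a serialized roaring bitmap inside a Puffin blob) and readers apply it at scan time. Plain Python stands in for both the Parquet file and the bitmap here.

```python
# Toy merge-on-read with a deletion vector. In Iceberg V3 the vector is a
# roaring bitmap serialized into a Puffin blob; a plain set stands in here.
data_file = ["row-0", "row-1", "row-2", "row-3", "row-4"]   # imagine a Parquet file

# Copy-on-write (V2 economics): deleting row 2 means rewriting the whole file.
rewritten = [row for pos, row in enumerate(data_file) if pos != 2]

# Merge-on-read (V3 with deletion vectors): write only the positions to skip.
deletion_vector = {2}            # would be a roaring bitmap in a Puffin blob

def scan(file_rows, deletes):
    """Reader applies the deletion vector at scan time; no file rewrite."""
    for pos, row in enumerate(file_rows):
        if pos not in deletes:
            yield row

assert list(scan(data_file, deletion_vector)) == rewritten
# For CDC workloads with many small updates, writing a few-KB bitmap per
# commit instead of rewriting multi-GB data files is the whole win.
```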
What the index looks like after this wave
229 → 242 nodes. The shape of the delta:
- 5 new technologies — Aliyun OSS, Tencent COS, Huawei OBS, Wasabi AiR, Alluxio
- 1 new architecture — East Data West Computing
- 6 new pain points — Data Loading Bottleneck, AGPL Licensing Risk, CLOUD Act Data Access, China Data Localization, Datacenter Power Shortfall, Datacenter Water Consumption
- 1 new standard — Puffin File Format
- 9 enriched nodes — Amazon S3 Files, Amazon S3 Tables, Iceberg V3 Spec, NVIDIA GPUDirect RDMA for S3, GPU-Direct Storage Pipeline, RustFS, Wasabi, Tiered Storage, MinIO
No new relationship verbs. The existing 12 — scoped_to, implements, solves, constrained_by, enables, depends_on, accelerates, bypasses, and the rest — covered every edge in this wave.
What engineers should do now
The architecture shifts above are not future speculation. They are live decisions that infrastructure teams are making this quarter. Three lenses on the same constraint set, depending on where you sit.
If you are choosing a model architecture
MLA plus MoE is the 2026 default for any frontier model above 100B parameters. GQA is not wrong — it is legacy. The question is whether your use case justifies the infrastructure investment to support MLA's stricter deployment requirements. If you are training from scratch, use FP8 from the first forward pass; retrofitting precision is harder than starting with it. And if you are serving multiple model families, your inference stack must be attention-aware — you cannot optimize for both MLA and GQA simultaneously without accepting tradeoffs on at least one.
If you are building infrastructure
Tier your storage. The Netflix model is the reference: 1–2% of capacity on NVMe for active training scratch, 20–30% on HDD for recent checkpoints, 50–60% in S3 for training datasets, 10–20% in archive for model weights. Single-tier storage is leaving money on the table — Netflix saves a reported $45M annually against a single-tier design.
Add a caching layer between S3 and compute. Raw S3 throughput is too slow to keep GPUs fed. Alluxio, JuiceFS, or a similar distributed cache can claw back most of the ~80% of training time currently lost to data loading. Evaluate S3 Express One Zone for workloads that need single-digit-millisecond latency with S3 semantics. And design for multi-region from day one — data sovereignty is not a feature you add later; it is a constraint that shapes your replication topology.
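Here is a minimal read-through sketch of the behavior such caches provide at fleet scale; this is not Alluxio's or JuiceFS's API, and the bucket, key, and cache mount point are placeholders.

```python
import hashlib
import os
import boto3

s3 = boto3.client("s3")
CACHE_DIR = "/mnt/nvme/shard-cache"     # placeholder local NVMe mount

def cached_get(bucket: str, key: str) -> str:
    """Read-through cache: hit local NVMe if present, else fetch from S3 once.

    Dataloaders then read shards at local-disk speed on every epoch after
    the first, which is the role Alluxio-style caches play at much larger
    scale (sharding, eviction, consistency, multi-node coordination).
    """
    name = hashlib.sha256(f"{bucket}/{key}".encode()).hexdigest()
    local = os.path.join(CACHE_DIR, name)
    if not os.path.exists(local):
        os.makedirs(CACHE_DIR, exist_ok=True)
        tmp = local + ".part"
        s3.download_file(bucket, key, tmp)   # cold read: one trip to S3
        os.replace(tmp, local)               # atomic publish into the cache
    return local                             # warm reads never touch S3

# shard_path = cached_get("training-data", "shards/part-00042.tar")
```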
If you are watching the market
Watch TransMLA for upstream vLLM support. When it lands, the migration path from GQA to MLA becomes viable for existing fleets without retraining from scratch. Watch TurboQuant — already merged into vLLM (April 2026) — for production-grade KV compression on GQA models. Watch Routing-Free MoE (April 2026 research) as the next paradigm after DeepSeek's shared-plus-routed expert design; it eliminates the router bottleneck entirely. And watch IndexCache for 200K-plus context applications — the prefill latency problem is being solved without retraining or extra GPU memory.
The interesting pattern from this wave is that the index already had the vocabulary for what changed. Sovereign Storage and Data Residency were nodes. Vendor Lock-In was a node. Cold Scan Latency was a node. The DeepSeek + China direction made those nodes load-bearing in a way they hadn't been when we mapped them. The 13 new nodes are the things the news cycle made visible enough that they earned their own surface in the graph.
Additional sources for this section: Netflix tiered-storage engineering blog; Alluxio GPU caching benchmarks; S3 Express One Zone latency specifications; TransMLA NeurIPS 2025; TurboQuant vLLM merge (April 2026); Routing-Free MoE research (April 2026); IndexCache arXiv:2603.12201.
Works cited
[^1]: DeepSeek-V3 Technical Report — primary source for the $5.6M figure and the H800 hour count.
[^2]: Object Storage for AI: Implementing GPU Direct Storage with 200GB/s Throughput — empirical attribution of ~80% of training wall-clock to data loading.
[^3]: Alluxio AI/ML Acceleration — Uber, Shopee, AliPay GPU data-loading benchmarks.
[^4]: Accelerating AI With High Performance Storage — 192 GB/s sustained Meta deployment, 3.8× training speedup attribution.
[^5]: How China's GLM-5 Works: 744B Model on Huawei Chips — GLM-5 training on 100,000 Ascend 910B chips with OBS as storage substrate.
[^6]: China's data center capacity set to top 60 GW by 2030, driving a doubling of power demand — Rystad Energy projection of ~400 GW spare capacity.
[^7]: Powering the US Data Center Boom — World Resources Institute, 44 GW shortfall projection.
[^8]: What Is RustFS? Apache 2.0 MinIO Alternative (2026) — 2.3× small-object benchmark, 323 GB/s peak read.
[^9]: Introducing Wasabi AiR — feature set, $6.99/TB/month inclusive pricing, zero-egress positioning.