A node added to this index is a deliberate act. Each one represents an answer to a question engineers running real S3 systems are asking — not a topic of interest, not a reference category, but a thing somebody had to make a decision about. The May 7 wave added 13 nodes, and rather than walk through them by category (5 Technologies, 1 Standard, 1 Architecture, 6 Pain Points), this post walks through them by the question each one answers. That's the order they arrive in.
If you read the May 7 post on the DeepSeek storage shift, this is the field guide that sits beside it. That post explained why the index needed expanding. This post is the map for using the new vocabulary.
## What does a state-directed S3 stack look like?
While Western hyperscalers compete on egress fees and API compatibility, China is building something structurally different. The East Data West Computing (EDWC) policy — launched in 2022 and now accelerating — treats object storage as national infrastructure rather than a vendor product. Eight national computing hubs span Inner Mongolia, Guizhou, Gansu, and Ningxia in the west, connected by 400G all-optical backbone to eastern demand centers. By 2024, 1.95 million standard racks were operational across these hubs, with a target of 300 EFLOP/s by 2025.[^1] The strategy is explicit: western hubs, powered by abundant wind and solar, handle AI training and cold storage; eastern hubs near coastal enterprises handle low-latency inference and real-time workloads.
Aliyun OSS is the clearest expression of this AI-native direction. In February 2026, Alibaba Cloud introduced AI Content Perception and Vector Bucket — native vector storage and semantic indexing directly inside the object storage layer, not bolted on top.[^2] The Vector Bucket claims hundreds of billions of vectors per account, pay-as-you-go pricing, and integration with Tablestore for high-throughput workloads. For RAG pipelines and AI agent retrieval, this collapses the traditional "S3 → ETL → vector DB" chain into a single API surface. ossfs 2.0, released March 2025, adds POSIX compatibility with FUSE-level caching — a direct parallel to the POSIX-native storage trend tracked in our other coverage. Tencent COS and Huawei OBS follow similar trajectories, integrating native AI primitives at the storage layer.
⚠ Vendor claim to watch: Alibaba reports up to 2 billion vector rows per bucket, but independent latency benchmarks against dedicated vector databases (Pinecone, Milvus) are not yet published. The tradeoff is cost versus query speed — Vector Bucket is positioned as "hundreds of milliseconds" tolerant, which is acceptable for batch RAG but may not serve real-time agent retrieval.
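To make the collapse concrete, here is a minimal sketch of the two shapes. Everything in it is an illustrative stand-in: the class and method names are invented for this post (they are not the OSS SDK), and the toy embedding stands in for a real model.

```python
# Illustrative stand-ins only -- none of this is the Alibaba OSS SDK.

def embed(text: str) -> list[float]:
    # Toy embedding: vowel-frequency vector, a stand-in for a real model.
    return [text.count(c) / max(len(text), 1) for c in "aeiou"]

def dist(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Traditional chain: object store, ETL hop, and vector DB are three systems.
class ObjectStore:                      # stand-in for S3/OSS
    def __init__(self): self._objects = {}
    def put(self, key, body): self._objects[key] = body
    def get(self, key): return self._objects[key]

class VectorDB:                         # stand-in for Pinecone/Milvus
    def __init__(self): self._index = {}
    def upsert(self, key, vec): self._index[key] = vec
    def query(self, vec, k=1):
        return sorted(self._index, key=lambda key: dist(self._index[key], vec))[:k]

store, vdb = ObjectStore(), VectorDB()
store.put("doc/1", "object storage moves onto the hot path")  # 1. ingest
vdb.upsert("doc/1", embed(store.get("doc/1")))                # 2. ETL hop + egress
print(vdb.query(embed("hot path storage")))                   # 3. retrieval

# Vector Bucket pattern: indexing happens at ingest, inside the bucket.
class VectorBucket(ObjectStore):
    def __init__(self):
        super().__init__()
        self._vdb = VectorDB()
    def put(self, key, body):
        super().put(key, body)
        self._vdb.upsert(key, embed(body))    # no separate pipeline to operate
    def query(self, text, k=1):
        return self._vdb.query(embed(text), k)

bucket = VectorBucket()
bucket.put("doc/1", "object storage moves onto the hot path")
print(bucket.query("hot path storage"))
```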
The geopolitical dimension matters. EDWC concentrates AI infrastructure in western provinces that are economically lagging and geographically difficult to disrupt — a resilience strategy Western democracies have struggled to replicate. For engineers building AI workloads in 2026, the four new nodes (Aliyun OSS, Tencent COS, Huawei OBS, East Data West Computing) name a parallel S3 ecosystem that is no longer a regional curiosity. It is a parallel stack with different cost curves, different latency assumptions, and state-level backing that Western vendors cannot match on price alone.
## Why did GPUs starve and what's solving it?
For most of the 2010s, "object storage" meant "the cheap layer." Hot training scratch lived on NVMe. S3 was for archived datasets and compliance backups — the cold tier you got data out of before doing real work. That cost model assumed copy time was small relative to compute time.
It isn't anymore. Profiling at Uber, Shopee, and AliPay attributes roughly 80% of end-to-end AI training wall-clock to data loading — the steady-state read throughput from S3 to GPU memory during a synchronous training step. GPUs that cost $40,000 each sit idle below 50% utilization because storage can't feed them fast enough. The new node we added to name this is Data Loading Bottleneck, and it's distinct from both Cold Scan Latency (a first-query problem on analytics) and Legacy Ingestion Bottlenecks (an ETL throughput problem). This one is about the steady-state path from S3 to GPU HBM during training.
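A back-of-envelope model shows why this becomes an 80% line item: steady-state utilization is capped by the ratio of storage supply to aggregate GPU demand. Every number in the sketch below is an illustrative assumption, not a figure from the cited profiles.

```python
# Back-of-envelope: steady-state read throughput caps GPU utilization.
# Every number here is an illustrative assumption, not a measured figure.

gpus         = 8      # GPUs per training node
need_per_gpu = 1.0    # GiB/s each GPU must ingest to avoid stalling (assumed)
s3_supply    = 2.5    # GiB/s sustained direct-from-S3 for the node (assumed)

demand = gpus * need_per_gpu           # 8 GiB/s aggregate input demand
util   = min(1.0, s3_supply / demand)  # storage-bound utilization ceiling

print(f"GPU busy fraction:        {util:.0%}")      # ~31%
print(f"wall-clock spent loading: {1 - util:.0%}")  # ~69%, same shape as the profiles
```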
The architectural answer is a cache tier between S3 and the compute fleet. Alluxio is the one this wave added — a distributed caching layer that absorbs the first read against S3, retains hot tensors and shard files on local NVMe across the fleet, and replays subsequent reads at near-NVMe speed. Production deployments at Uber, Shopee, and AliPay report ~10× faster GPU data loading versus direct S3 reads. The August 2025 Alluxio Enterprise AI 3.7 release pushed the published numbers further: sub-millisecond TTFB, ~45× lower latency than S3 Standard and ~5× lower than S3 Express One Zone, with 11.5 GiB/s throughput per worker node. The Safetensors-optimized model-loading path reports an 11× speedup (DeepSeek-R1-Distill-Llama-70B: 536s → 49s — 91% of local-disk speed). Customer footprint expanded 50% in H1 2025 with Salesforce, Dyna Robotics, and Geely as named additions.
Pair Alluxio with GPU-Direct Storage Pipeline and NVIDIA GPUDirect RDMA for S3 on the read path, and Meta's published numbers go to 192 GB/s sustained to 2,048 H100 GPUs with GPU utilization in the 80%+ range.
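Underneath those numbers is plain read-through caching. The sketch below shows the pattern, not Alluxio's API: the cache path is an assumed NVMe mount, and `fetch` is whatever slow remote read your stack already has.

```python
import os

CACHE_DIR = "/tmp/nvme-cache"   # stand-in for a local NVMe mount

def read_through(key: str, fetch) -> bytes:
    """First read pays the S3 round trip; every later read hits local NVMe.

    `fetch` is whatever slow remote read the stack already has
    (a boto3 get_object wrapper, an OSS client, etc.).
    """
    local = os.path.join(CACHE_DIR, key.replace("/", "_"))
    if not os.path.exists(local):               # miss: absorb the S3 read once
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(local, "wb") as f:
            f.write(fetch(key))
    with open(local, "rb") as f:                # hit: near-NVMe replay
        return f.read()

# Usage with a simulated remote read:
blob = read_through("datasets/shard-00042.safetensors",
                    fetch=lambda key: b"...tensor bytes...")
```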
The "cold layer" framing is gone. Object storage is now on the performance-critical path. Engineers designing 2026 systems design their cache topology before they design their training loop.
## What replaced MinIO and why was it inevitable?
When MinIO archived its main repo on April 25, 2026, and tightened AGPL v3 enforcement on the remaining open code, the migration that followed wasn't really about performance. The post-MinIO ecosystem we mapped in the May 1 post — RustFS, Alarik, Garage, SeaweedFS — wasn't a flock of teams chasing better benchmarks. It was a flock of teams handling license risk.
The new node added in this wave is AGPL Licensing Risk. The "network use is distribution" clause of AGPL v3 makes self-hosted storage embedded in commercial products legally treacherous to the point that Apache 2.0 alternatives became architecturally mandatory, not just preferable. If your product ships MinIO inside, AGPL exposure is architectural — you can't fix it by editing your config; you have to swap the storage engine.
That's why the post-MinIO migration was inevitable rather than performance-driven. Yes, RustFS published benchmarks showing 2.3× faster small-object performance vs MinIO at 4 KB payloads and peak read throughput up to 323 GB/s. But the legal-team meeting happened first; the benchmarks made the migration easier to defend, they didn't motivate it. The "license preference" axis on the East–West infrastructure split tracks the same dynamic — China's storage stack tilts heavily toward Apache/BSD licensing for the same reason, before any performance comparison even starts.
The lesson the new node encodes: license is a first-class architectural constraint for self-hosted storage in 2026, not a footnote.
## What are the three data gravity wells?
Object storage architecture is now a geopolitical design decision, not just a technical one. Three incompatible regulatory regimes constrain how AI training data moves, where it lives, and who can access it. Two of them got first-class nodes in this wave; the third was already mapped.
CLOUD Act Data Access — the US side. The Clarifying Lawful Overseas Use of Data Act (2018) lets US law enforcement compel cloud providers to disclose customer data regardless of where the data physically resides. For object storage, this means data in AWS S3, Azure Blob, or Google Cloud Storage is reachable via US legal process — multi-region replication does not escape this if the provider is US-headquartered. Schrems II made the EU position clear: US cloud transfers carrying personal data are legally precarious.
China Data Localization — the PRC side. China's Data Security Law and Personal Information Protection Law require that "important data" and personal information collected in China be stored domestically. Cross-border data partnerships have been heavily constrained under these laws since 2021. Aliyun OSS, Tencent COS, and Huawei OBS — all newly added in the May 7 wave — are the domestic standards.
The third well — GDPR — was already on the map under existing nodes (Data Residency, Sovereign Storage, Compliance-Aware Architectures). Together, the three regimes mean engineers building multi-region AI infrastructure must architect for three incompatible legal constraints simultaneously. This is not a compliance checkbox. It's a fundamental constraint on how you store, replicate, and access training data.
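In practice, this lands in code as a placement decision made at write time. The sketch below is deliberately minimal: the bucket endpoints and origin rules are hypothetical, and a real policy engine would also handle contractual clauses, SCCs, and audit logging.

```python
# Hypothetical placement policy: bucket endpoints and origin rules are
# invented for illustration, not a real deployment.

REGIME_BUCKETS = {
    "us_cloud_act":  "s3://train-us-east-1",     # reachable by US legal process
    "eu_gdpr":       "s3://train-eu-central-1",  # Schrems II constraints apply
    "prc_localized": "oss://train-cn-hangzhou",  # DSL/PIPL: must stay domestic
}

EU_ORIGINS = {"DE", "FR", "NL", "IE", "ES", "IT"}  # illustrative subset

def placement(origin: str, contains_pii: bool) -> str:
    """Pick a storage target from where the data was collected,
    not from where compute is cheapest."""
    if origin == "CN":
        return REGIME_BUCKETS["prc_localized"]   # may not leave, PII or not
    if origin in EU_ORIGINS and contains_pii:
        return REGIME_BUCKETS["eu_gdpr"]
    return REGIME_BUCKETS["us_cloud_act"]

print(placement("CN", contains_pii=False))  # oss://train-cn-hangzhou
```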
## Storage as AI inference tier — the preview
Most new technologies on this index are reactive — they describe what exists. Wasabi AiR is here as a forward signal. It's a single product launch (Wasabi's AI-augmented object storage tier) that runs facial recognition, speech-to-text, OCR, and logo detection inline as objects are ingested. At $6.99/TB/month with zero egress, the AI compute is bundled into the storage subscription rather than billed separately.
Why we added it now, before there's a category of similar products: this is the structural shape we expect to show up everywhere — object storage that does inference, not just storage. The traditional pattern (pull data out of cold storage, run it through a third-party tagging API, write metadata back) treats the AI layer as a separate billed service, with egress costs at every step. Wasabi AiR collapses the pattern into the bucket itself.
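The cost shape is the easiest way to see the difference. In the sketch below, every rate except the $6.99/TB/month figure is a placeholder (the egress and tagging fees are assumptions in the range of typical list pricing), and it compares a one-time external tagging pass against one month of the bundle, so read it as shape, not a quote.

```python
# Cost shape only -- every rate here except the $6.99 figure is a placeholder.

dataset_tb       = 100      # arbitrary example corpus
egress_per_gb    = 0.09     # assumed hyperscaler list-price egress
tagging_per_gb   = 0.05     # assumed third-party tagging API fee
wasabi_per_tb_mo = 6.99     # bundled storage + inline AI, zero egress (from the post)

# Traditional pattern: every object leaves the bucket, gets tagged, comes back.
external_pass = dataset_tb * 1000 * (egress_per_gb + tagging_per_gb)

# In-bucket pattern: inference rides the storage subscription.
bundle_month = dataset_tb * wasabi_per_tb_mo

print(f"external tagging pass: ${external_pass:,.0f} one-time")  # $14,000
print(f"in-bucket bundle:      ${bundle_month:,.2f} per month")  # $699.00
```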
The node anchors a future cluster of "in-bucket AI compute" nodes that don't exist yet but will. AWS's S3 Vectors went GA along similar lines. Storj acquired Valdi for the same reason — vertically integrating compute into the storage tier. The seed is in the index now so the future cluster has a place to attach.
## The Iceberg V3 sub-spec we extracted
Most table-format engineering material describes Iceberg V3 as a single spec change. It isn't — there's a sub-spec doing most of the load-bearing work. We extracted Puffin File Format as its own Standard node to track this directly.
Puffin is a binary format defined inside the Apache Iceberg specification for storing table-level statistics, indexes, and deletion vectors as auxiliary blobs alongside Parquet data files. A Puffin file is a sequence of typed blobs (each tagged with a blob-type identifier such as deletion-vector-v1, bloom-filter-v1) plus a small footer cataloging blob offsets and types. Iceberg V2 supported only positional and equality delete files, which accumulate as separate small files that must be merged at read time; keeping reads fast means compaction jobs that rewrite one or more Parquet data files. At CDC scale, that turned a steady stream of small changes into recurring multi-gigabyte rewrites.
Iceberg V3 elevates Puffin from auxiliary to load-bearing by storing Roaring-bitmap deletion vectors in Puffin blobs instead of rewriting full Parquet data files for every UPDATE/DELETE. A million-row UPDATE writes a kilobyte-scale Puffin blob instead of regenerating data files — yielding up to 10× faster MERGE/UPDATE in V3 implementations.
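For tooling authors, the footer is the entry point. The sketch below parses a Puffin footer using the layout as we read the published spec (trailing magic, then flags, then a little-endian payload size, then the JSON payload); it assumes an uncompressed payload, with a flag bit signaling compression, so treat it as a sketch against the spec rather than a production parser.

```python
import json
import struct

MAGIC = b"PFA1"  # Puffin magic bytes

def read_puffin_footer(path: str) -> dict:
    """Parse the footer of a Puffin file. Reading backward from EOF:
    magic(4) <- flags(4) <- payload_size(4, little-endian) <- JSON payload."""
    with open(path, "rb") as f:
        data = f.read()
    if data[-4:] != MAGIC:
        raise ValueError("not a Puffin file")
    flags = struct.unpack("<i", data[-8:-4])[0]
    payload_size = struct.unpack("<i", data[-12:-8])[0]
    if flags & 1:  # low bit signals a compressed footer payload
        raise NotImplementedError("compressed footer payload not handled here")
    payload = data[-12 - payload_size:-12]
    return json.loads(payload)

# footer["blobs"] entries carry "type" (e.g. "deletion-vector-v1"),
# "offset", and "length" -- enough to seek directly to a single blob
# without touching the Parquet data files at all.
```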
This is why the node lives separately from Iceberg V3 Spec on this index: the spec is the contract, but Puffin is the mechanism that makes V3's headline capability work. Engineers writing Iceberg V3-aware tooling will hit Puffin specifically.
## Where will the water come from?
The physical constraints on AI infrastructure are tightening faster than the software stack can adapt. Every S3-compatible layer we track — from Tigris Data to MinIO to Aliyun OSS — assumes continued datacenter expansion. But the permits are slowing down.
In Arizona, hyperscalers have announced 26 datacenter projects since 2022. Ceres analysis found that existing Phoenix facilities already consume approximately 385 million gallons of water annually for direct cooling. Once planned facilities come online, that figure is projected to reach 3.7 billion gallons per year — an 870% increase, nearly double the water required for a city the size of Flagstaff.[^3] Two-thirds of datacenters built since 2022 sit in water-stressed regions. A single large facility can evaporate 5 million gallons per day — equivalent to ~16,000 U.S. households. We named this constraint with the Datacenter Water Consumption node.
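The cited figures are internally consistent, which is worth checking given how often water numbers get garbled in transit. The only input below that isn't from the sources above is the ~300 gallons/day household figure, a common US estimate.

```python
# All inputs from the cited sources except the household figure, a common
# US estimate of ~300 gallons/day.

current_gal_yr = 385e6   # existing Phoenix facilities, direct cooling
planned_gal_yr = 3.7e9   # projection with announced facilities online

increase = (planned_gal_yr - current_gal_yr) / current_gal_yr
print(f"projected increase: {increase:.0%}")   # ~861%, matching the cited ~870%

facility_gal_day  = 5e6  # one large evaporative-cooled facility
household_gal_day = 300  # assumed average US household
print(f"household equivalent: {facility_gal_day / household_gal_day:,.0f}")  # ~16,667
```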
The response from local governments is hardening. Aurora, Illinois imposed a 180-day moratorium on new datacenter permits to study power-grid impact. In 2026, lawmakers in over 30 U.S. states introduced more than 300 bills addressing datacenter moratoriums, tax incentives, and energy policy.[^4] The era of unchecked expansion is ending. The companion node, Datacenter Power Shortfall, tracks the parallel grid constraint — a projected 44 GW US shortfall by 2030 with PJM-region interconnection queues stretching past 8 years.
Tech companies are making defensive pledges — Microsoft committed to ratepayer protection mechanisms, and a March 2026 White House "ratepayer protection pledge" was signed by multiple hyperscalers — but consumer advocates note the pledge is nonbinding, and neither the President nor tech companies control utility rate decisions.[^5]
For object storage architecture, this creates a data gravity problem with a physical twist. The cheapest compute is increasingly where power and water are abundant — northern Europe, western China (per the East Data West Computing pattern), potentially nuclear-powered sites. But the lowest-latency S3 tiers need to stay close to users. The tension between "store cheap in the west" and "serve fast in the east" is no longer just a network engineering problem. It is a watershed engineering problem.
⚠ Qualifier: Water-usage figures vary dramatically by cooling design. Closed-loop liquid cooling can reduce consumption by ~90% versus traditional evaporative towers. Google's Belgium facility runs on industrial canal water; Microsoft is piloting zero-water designs in Phoenix. The 5M gal/day figure applies to legacy evaporative cooling — modern facilities will operate well below this. The constraint is real, but the technology to mitigate it exists. The question is whether permitting and community acceptance will allow deployment at AI scale.
## The index as a forward-looking map
The pattern across all 13 new nodes is the same: each one represents a question engineers are now asking that the index didn't have a place to answer in the prior wave. The China S3 stack is operational reality for half the AI workloads in production but had no nodes. Data Loading Bottleneck is the cost driver of the year but didn't have a name distinct from Cold Scan Latency. AGPL Licensing Risk drove the post-MinIO migration, but the migration was happening for months before we had a label for the cause. The three data gravity wells were the actual constraint on cloud architecture, but the index had only one of the three named.
The wave wasn't about reaching some node-count milestone. It was about catching up to questions the field had already started asking. The index is a forward-looking map by design — when you find yourself drafting a blog post and reaching for a slug that doesn't exist, that's the signal a node is missing. Eight of these thirteen had been signaled in the last 30 days of editorial work.
The next layer is already visible. The "in-bucket AI compute" pattern doesn't have a category yet — Wasabi AiR is its first member. The "POSIX-on-S3" technology cluster has its anchor in Amazon S3 Files but the comparison nodes (Mountpoint, S3FS, Goofys) aren't all on the map yet. The relationship-mapping work between the new China S3 nodes and the existing global providers is ongoing. Each of those is a future wave.
For now: the 13 new nodes are answers to questions that were already being asked. If you build on object storage in 2026, all 13 will land in your design conversations within the next quarter. The map is drawn ahead of where the news cycle will arrive next.
## Works cited
- Object Storage for AI: Implementing GPU Direct Storage with 200GB/s Throughput — empirical attribution of ~80% of training wall-clock to data loading.
- Alluxio AI/ML Acceleration — Uber, Shopee, AliPay GPU data-loading benchmarks.
- Accelerating AI With High Performance Storage — 192 GB/s Meta deployment.
- What Is RustFS? — 2.3× small-object performance, 323 GB/s peak read, AGPL-vs-Apache-2.0 framing.
- Apache Iceberg Puffin Spec — auxiliary-blob format spec.
- Working with Apache Iceberg V3 — deletion vectors via Puffin, 10× MERGE/UPDATE.
- Introducing Wasabi AiR — feature set, $6.99/TB/month inclusive pricing.
## Footnotes
[^1]: NDRC / Baidu Wiki summary — East Data West Computing 8-hub structure, 1.95M racks (2024), 300 EFLOP/s 2025 target. China datacenter market reported at $38.57B (2025) by IMARC Group with 9.16% CAGR through 2034.
[^2]: Alibaba Cloud Community — Vector Bucket public-beta announcement (Feb 2026): "AI Content Perception" plus Vector Bucket multimodal indexing inside OSS. ossfs 2.0 POSIX FUSE release, March 2025, per Alibaba Cloud documentation.
[^3]: Ceres analysis cited via Consumer Reports — Phoenix datacenter water-consumption projection, 385M → 3.7B gal/year. Bloomberg investigation: roughly two-thirds of post-2022 datacenters sit in water-stressed regions. EPA estimates: ~5M gal/day evaporative load per large facility.
[^4]: Multistate.us 2026 datacenter-policy tracker — 30+ U.S. states, 300+ bills covering moratoria, tax incentives, and energy policy. Aurora, IL 180-day moratorium per local government records.
[^5]: White House March 2026 "ratepayer protection pledge" — analyzed by Consumer Reports and Harvard Law as nonbinding, since utility rate decisions sit with state PUCs, not the federal government.