The POSIX Gap is Closing: How S3 Quietly Became a File System

For two decades, S3 was an object store and only an object store. You could PUT, you could GET, you could DELETE. You could not MV. The thing the filesystem world took for granted — atomic rename — simply did not exist on object storage, and every serious lakehouse engineer eventually crashed into that gap.

The April 2026 announcement of Amazon S3 Files is the moment the gap quietly closed. Not because Amazon found a clever client-side workaround. Because S3 itself, after a series of incremental changes from 2020 to 2025, finally absorbed the operations that file systems require. The fix came from the server side, not the client.

This post is the second half of a two-post arc. The May 7 post was about why object storage suddenly matters for AI workloads — GPU starvation, the data-loading bottleneck, the cost of cold-tier reads in the hot path. This post is about how the access model is evolving in response. The China-side parallel — Aliyun CPFS+OSS, Huawei OBS+MindSpore — gets a sidebar because the same shape is being drawn three different ways at the same time.

The pain we couldn't name (until we named it)

Lack of Atomic Rename has been a first-class pain point in this index since launch. It's why Apache Iceberg, Delta Lake, and Apache Hudi exist — every modern table format is, at root, a workaround for the missing primitive. Rename on S3 is a copy followed by a delete, and the window between those two operations has cost engineers careers' worth of correctness bugs.
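To make that window concrete, here is a toy in-memory model. It is a sketch, not the S3 API: `ToyObjectStore` and `rename_by_copy_delete` are hypothetical names invented for illustration. It shows that a rename built from a copy and a delete passes through a state where both keys exist, and that a crash between the two steps leaves a duplicate behind.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat key -> bytes object store (hypothetical)."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def copy(self, src, dst):
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.objects.pop(key, None)

    def keys(self):
        return sorted(self.objects)


def rename_by_copy_delete(store, src, dst, crash_between=False):
    """Emulate rename as two separate operations.

    Returns the keys a concurrent reader could observe between the steps.
    """
    store.copy(src, dst)
    mid_state = store.keys()
    if crash_between:
        return mid_state  # the delete never runs; both keys stay behind
    store.delete(src)
    return mid_state


store = ToyObjectStore()
store.put("tmp/part-0001", b"rows")
seen = rename_by_copy_delete(store, "tmp/part-0001", "final/part-0001")
print("visible mid-rename:", seen)
print("after rename:", store.keys())
```

A table format sidesteps this by never renaming data files at all; it commits by swapping a single metadata pointer instead.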

Outside the lakehouse layer, the constraint hits even harder. Training Data Streaming from Object Storage is GPU-bound at the read side, but every checkpoint a training framework writes is write-bound at the consistency side. The standard atomic-write pattern in Python, tempfile.NamedTemporaryFile followed by os.rename(tmp, final), silently broke on S3 for years. Every framework that committed work via rename-as-swap had to be re-engineered for S3 semantics, or run against a POSIX layer that didn't exist yet.
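For reference, the rename-as-swap pattern looks roughly like the sketch below. It uses tempfile.mkstemp and os.replace rather than the exact NamedTemporaryFile and os.rename pair named above, but the idea is the same. The helper name `atomic_write` is hypothetical. The guarantee holds only on a filesystem where rename is atomic, which is exactly what plain S3 lacked.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write bytes to `path` so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) as the
    # target; a rename across filesystems is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to stable storage before the swap
        os.replace(tmp_path, path)  # atomic swap on POSIX filesystems
    except BaseException:
        # Remove the partial temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "checkpoint.bin")
    atomic_write(target, b"step-100")
    atomic_write(target, b"step-200")
    with open(target, "rb") as f:
        print(f.read())
```

On a client that emulates rename with copy and delete, the os.replace call becomes two separate operations and the all-or-nothing property is lost.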

The cost surfaced empirically. Profiling at Uber, Shopee, and Alipay attributed roughly 80% of end-to-end AI training wall-clock to data loading¹, meaning the steady-state read throughput from S3 to GPU memory during a synchronous training step. We mapped that as Data Loading Bottleneck in the May 7 post. The dual-side pressure of slow reads and broken writes is what made the 2010s POSIX-on-S3 attempts inevitable, and inevitable to fail.

The 2010s attempts and why they broke

The POSIX-on-S3 problem is not new. Engineers have been trying to bridge the gap for over a decade. The failures of the 2010s teach us what not to do, and why the current wave is different.

S3FS-FUSE was the first serious attempt: a FUSE-based filesystem that translated POSIX calls to S3 API operations. It worked for single-user file browsing and broke under any production load. The fundamental mismatch is semantic: POSIX assumes atomic rename, directory listings, and in-place modification. S3 at the time offered eventual consistency (pre-2020), full-object PUT, and no native rename operation.

At Uber, S3FS-FUSE became a textbook case of what happens when POSIX semantics are forced onto an object API. Engineers observed a substantial fraction of data-loading time spent on LIST operations (S3's directory-listing equivalent) because S3FS had to enumerate prefixes to simulate ls. Each LIST is a network round-trip, and training pipelines do thousands of them per epoch. The costs were predictable: Uber's S3 bill reportedly included tens of thousands of dollars per month in unexpected LIST charges alone, before data transfer or storage fees².
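A back-of-the-envelope calculation shows how LIST charges reach that order of magnitude. Everything in this sketch is an assumption for illustration: the fleet size, the epoch rate, and the default of $0.005 per 1,000 requests (the S3 Standard list price for LIST in us-east-1 as best we know it; check current pricing). None of the inputs are Uber's actual numbers.

```python
def monthly_list_cost(lists_per_epoch, epochs_per_day, concurrent_jobs,
                      usd_per_1000_requests=0.005, days=30):
    """Estimate the monthly bill for LIST requests alone, in USD."""
    requests = lists_per_epoch * epochs_per_day * concurrent_jobs * days
    return requests / 1000 * usd_per_1000_requests


# Hypothetical fleet: 5,000 LISTs per epoch, 40 epochs a day, 500 jobs.
cost = monthly_list_cost(5_000, 40, 500)
print(f"${cost:,.0f} per month in LIST requests")
```

With those placeholder inputs the total is three billion requests a month, or roughly $15,000, before any bytes are transferred.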

Goofys tried a lighter approach: skip most POSIX metadata emulation and treat S3 as a blob store with caching. It was faster than S3FS for sequential reads but introduced its own failure modes. Race conditions on concurrent writes were common, because Goofys's write-back cache could not guarantee consistency across multiple clients. Write amplification was severe: every small write triggered a full-object PUT, turning 4KB random writes into 100MB+ object overwrites.

The 2010s POSIX-on-S3 tools shared three fatal flaws:

  1. Consistency gap. Before S3 strong consistency (Dec 2020), a read-after-write could return stale data — catastrophic for checkpoint recovery.
  2. Metadata emulation cost. POSIX stat, rename, and chmod have no cheap S3 equivalent. Each call became an API request.
  3. Write amplification. Objects are immutable once written, so every "file modification" is a full-object rewrite.
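The third flaw can be put in numbers using the 4KB-into-100MB example from the Goofys discussion. This is plain arithmetic; the helper name is hypothetical and binary units (KiB, MiB) are an assumption.

```python
KIB = 1024
MIB = 1024 * KIB


def write_amplification(object_bytes, write_bytes):
    """Bytes re-uploaded per byte the application actually changed."""
    return object_bytes / write_bytes


factor = write_amplification(100 * MIB, 4 * KIB)
print(f"amplification factor: {factor:,.0f}x")

# Total upload volume for 1,000 such small writes against the same object.
uploaded_gib = 1_000 * 100 * MIB / (1024 * MIB)
print(f"uploaded for 1,000 writes: {uploaded_gib:.1f} GiB")
```

Under these assumptions, a thousand 4 KiB edits move close to 100 GiB over the network, which is why in-place-update workloads never fit this model.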

Uber's eventual solution was not a better FUSE client. It was Alluxio plus direct S3 API — bypassing POSIX emulation entirely and training the pipeline to speak object-native. The lesson: POSIX-on-S3 cannot be fixed at the client layer alone. The storage service itself must change.

Why POSIX-on-S3 works now

The 2010s attempts failed because S3 was the wrong abstraction. The 2020s attempts succeed because S3 itself has changed. Five enabling shifts make POSIX-on-S3 viable for the first time.

1. Strong consistency (December 2020). Before this, S3 offered eventual consistency for overwrite PUTs and DELETEs — a read-after-write could return the old version. For training pipelines recovering checkpoints, this was a data-corruption risk. Strong consistency guarantees that a write is immediately visible to all readers, which is the prerequisite for any filesystem semantics.

2. S3 Express One Zone (2023-2024). Standard S3 latency is measured in hundreds of milliseconds for the first byte. S3 Express One Zone delivers single-digit millisecond latency and 2 million requests per second. Directory buckets provide a namespace that behaves like a filesystem prefix — without the LIST-operation tax that killed S3FS. For ML training scratch data, this is the performance tier that makes object storage competitive with NFS.

3. Directory buckets and native directory semantics (2024). S3's original flat namespace (s3://bucket/key) had no native concept of directories — only prefixes. The new directory-bucket type adds native directory support, enabling efficient ls and mkdir operations without prefix enumeration. This removes the metadata-emulation cost that plagued FUSE clients.

4. Amazon S3 Tables and Iceberg V3 (2024-2025). Object storage is no longer just bytes. S3 Tables provides native Apache Iceberg support with automatic compaction, z-order sorting, and snapshot management, and AWS cites up to 10× higher TPS than self-managed Iceberg. When POSIX semantics are layered on top, the table format handles schema evolution and partitioning while the POSIX layer handles file-level access.

5. GPUDirect Storage 2.0 (2025). NVIDIA's latest release enables 192 GB/s streaming to 2,048 GPUs simultaneously via parallel S3-compatible endpoints. Meta implemented this across their research clusters and eliminated the 35% of compute time previously wasted on data loading³. The GPU now reads directly from object storage, bypassing CPU bounce buffers entirely.
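The metadata-emulation cost that shift 3 removes is easier to appreciate with the emulation itself in view. The sketch below is a hypothetical helper, not boto3 or any FUSE client's code. It shows the grouping a client has to perform to fake `ls` over a flat key namespace: keep the keys under the prefix, then split each remainder at the first delimiter.

```python
def list_dir(keys, prefix, delimiter="/"):
    """Emulate `ls prefix` over a flat list of object keys.

    Returns (subdirectories, files) that sit directly under `prefix`.
    """
    subdirs, files = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            subdirs.add(head + delimiter)  # deeper key: surface only the first level
        elif rest:
            files.append(rest)             # key directly under the prefix
    return sorted(subdirs), sorted(files)


keys = [
    "data/README",
    "data/train/shard-0000.parquet",
    "data/train/shard-0001.parquet",
    "data/val/shard-0000.parquet",
    "logs/run-1.txt",
]
dirs, files = list_dir(keys, "data/")
print("dirs:", dirs)
print("files:", files)
```

In a real flat-namespace bucket this grouping happens server-side, but the client still pays one request per page of results, which is where the LIST tax described earlier comes from.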

The common thread: POSIX-on-S3 is not fixed by better FUSE clients. It is fixed by S3 becoming a storage service that natively supports the operations POSIX requires. The gap is closing from the server side, not the client side.

What S3 Files actually is

Amazon S3 Files, announced April 2026, is the most significant POSIX-on-S3 implementation to date because it does not emulate POSIX — it provides native POSIX semantics through a tiered architecture that preserves S3's object-store advantages while adding filesystem capabilities.

The two-tier architecture. S3 Files combines two existing services: an EFS (Elastic File System) caching tier at the edge and S3 Standard as the durable backing store. When a file is created or modified, the EFS tier handles POSIX operations immediately — rename, chmod, symlink, atomic writes — with full POSIX semantics. The EFS tier then asynchronously writes changes back to S3 Standard in batches.

The 60-second write-back window. This is the key design parameter. Writes are durable on EFS immediately (visible to all clients in the same filesystem), but propagate to S3 within 60 seconds. For training workloads, this is acceptable: checkpoint writes are bursty and followed by computation gaps. The 60-second window enables batching, which reduces S3 PUT costs by coalescing small writes into larger objects.
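The batching effect can be illustrated with a toy write-back buffer. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, time is passed in explicitly, and the flush policy (flush a path once its oldest unflushed write is 60 seconds old) is our guess at a reasonable model, not documented S3 Files behavior.

```python
class WriteBackBuffer:
    """Toy write-back tier: absorb writes now, flush to the object tier later."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self.pending = {}       # path -> latest data not yet flushed
        self.first_dirty = {}   # path -> time of the first unflushed write
        self.backing_puts = []  # (time, path, data) sent to the object tier

    def write(self, now, path, data):
        # Visible to file-system clients immediately; only the latest
        # version of each path is kept for the eventual flush.
        self.pending[path] = data
        self.first_dirty.setdefault(path, now)

    def flush_due(self, now):
        """Flush every path whose oldest unflushed write has aged past the window."""
        due = [p for p, t in self.first_dirty.items() if now - t >= self.window_s]
        for path in due:
            self.backing_puts.append((now, path, self.pending.pop(path)))
            del self.first_dirty[path]
        return len(due)


buf = WriteBackBuffer()
for i in range(100):  # 100 small writes to one file within 10 seconds
    buf.write(now=i * 0.1, path="ckpt/step-100.pt", data=b"x" * (i + 1))
flushed = buf.flush_due(now=60.0)
print("object-tier PUTs:", flushed, "for 100 file-system writes")
```

In this model a hundred file-system writes collapse into a single PUT, which is the cost lever the write-back window is buying.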

25,000 concurrent NFS connections. S3 Files exposes a standard NFS v4.1 / v4.2 interface (per the April 7, 2026 AWS launch — both versions supported), not a custom protocol. This means existing training infrastructure — PyTorch torchrun, Horovod, Ray — connects without modification. The 25,000 connection limit is per-filesystem, sufficient for all but the largest supercomputing clusters.

Conflict resolution: "S3 wins." If the same object is modified through both the POSIX path (S3 Files) and the object path (direct S3 API), the S3 object version takes precedence. This is a deliberate design choice: S3 Files is a convenience layer on top of S3, not an equal peer. Engineers must design workflows that avoid dual-path writes to the same key.

Semantics preserved vs. sacrificed.

  • Preserved: POSIX permissions, atomic rename (via EFS), symbolic links, file-level locking, directory operations.
  • Sacrificed: Hard links (S3 objects are immutable), mandatory byte-range locking, real-time fsync durability in S3 itself (writes are durable on the EFS tier immediately, but the 60-second window applies before they reach S3).

Cost framing. For workloads where 80%+ of data is cold (typical for training datasets with active-research subsets), S3 Files runs approximately 40% cheaper than EFS alone — the S3 backing tier is priced at object-storage rates, while only the hot working set pays file-system rates.
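The shape of that saving follows from a blended-rate calculation. The sketch below uses placeholder rates in relative units, not AWS prices. In particular, the assumption that the effective object-tier cost (storage plus request and write-back overhead) is half the file-system rate is ours, chosen only to show how a figure near 40% can arise with 80% cold data.

```python
def blended_savings(cold_fraction, file_rate, object_rate):
    """Fractional saving of a tiered layout versus an all-file-system layout."""
    hot_fraction = 1.0 - cold_fraction
    blended = hot_fraction * file_rate + cold_fraction * object_rate
    return 1.0 - blended / file_rate


# Placeholder rates: file tier = 1.0, effective object tier = 0.5 (assumed).
saving = blended_savings(cold_fraction=0.8, file_rate=1.0, object_rate=0.5)
print(f"saving vs file tier alone: {saving:.0%}")

# The saving grows as more of the data set goes cold.
for cold in (0.5, 0.8, 0.95):
    print(cold, f"{blended_savings(cold, 1.0, 0.5):.1%}")
```

The useful takeaway is the sensitivity: the saving scales linearly with the cold fraction, so a workload with a large hot working set sees much less benefit.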

The 2026 POSIX-on-S3 landscape

The POSIX-on-S3 landscape in 2026 is not a single solution but a spectrum — six tools with different latency guarantees, cost profiles, and POSIX completeness. Understanding where each fits prevents the mistake of choosing one tool for all workloads.

Latency figures below are approximate and workload-dependent — small-file random read on a warm cache vs large sequential read on a cold cache will differ by 10× or more. Treat the column as relative ordering, not absolute benchmarks.

| Solution | Latency | Cost | Atomic Rename | Hard Links | Best For | Status |
|---|---|---|---|---|---|---|
| S3 Files | ~10ms (EFS tier) | 40% cheaper than EFS for cold-heavy | ✅ Full | ❌ No | Training scratch, POSIX-native apps | AWS native, 2026 |
| Mountpoint (alpha) | ~100ms | Standard S3 pricing | ❌ No | ❌ No | Read-heavy analytics, checkpoint read | Alpha, open-source |
| JuiceFS | ~1-5ms | Redis metadata + S3 storage | ✅ Full | ✅ Yes | High-performance computing | Production |
| GeeseFS | ~50-100ms | Standard S3 + cache | ⚠️ Partial | ❌ No | Goofys replacement, better cache | Production |
| S3FS-FUSE | 200ms+ | LIST tax + standard S3 | ❌ No | ❌ No | Single-user file browsing | Legacy |
| Goofys | ~100ms | Standard S3 + write amplification | ❌ No | ❌ No | Sequential read-heavy workloads | Maintenance mode |

S3 Files is the reference for full POSIX with managed infrastructure. It is the only option here that pairs true atomic rename, directory operations, and POSIX permissions with AWS's operational guarantee; JuiceFS offers comparable semantics, but on infrastructure you run yourself. The tradeoffs are that it is AWS-only and subject to the 60-second write-back window.

Mountpoint for S3 is Amazon's open-source alpha client. It is explicitly not a full POSIX filesystem — it is a high-throughput read client. No atomic rename, no hard links, no write-back caching. Where it excels: reading large checkpoint files from S3 at throughput approaching native S3 bandwidth, without copying to local disk first. For training pipelines that read checkpoints once per epoch, Mountpoint is the simplest option.

JuiceFS is the outlier. It stores metadata in Redis (or another key-value store) and data in S3. This gives it full POSIX semantics — including hard links — at the cost of an additional infrastructure dependency. For HPC clusters that already run Redis, this is attractive. For cloud-native training pipelines, it adds operational complexity.

GeeseFS is a fork of Goofys (which was itself already written in Go) with a better caching layer and fewer race conditions. It handles the Goofys use case (sequential read-heavy object access) without the maintenance burden of the original. But it is still a FUSE client, not a native POSIX service: partial rename support, no hard links.

The decision matrix is simple:

  • Need full POSIX + managed service → S3 Files
  • Need read-only S3 access at native throughput → Mountpoint
  • Already have Redis + need hard links → JuiceFS
  • Replacing Goofys → GeeseFS
  • Legacy single-user browsing → S3FS (not recommended for production)
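The matrix above can be written down as a small lookup. This is only a restatement of the bullets in code: the function and parameter names are hypothetical, and the priority order when several conditions hold at once is our assumption.

```python
def choose_tool(*, full_posix_managed=False, read_only=False,
                has_redis_needs_hard_links=False, replacing_goofys=False):
    """Map the requirements from the decision matrix to a tool name."""
    if full_posix_managed:
        return "S3 Files"
    if read_only:
        return "Mountpoint"
    if has_redis_needs_hard_links:
        return "JuiceFS"
    if replacing_goofys:
        return "GeeseFS"
    # Fallback mirrors the last bullet: legacy single-user browsing only.
    return "S3FS (not recommended for production)"


print(choose_tool(full_posix_managed=True))
print(choose_tool(read_only=True))
print(choose_tool(has_redis_needs_hard_links=True))
print(choose_tool())
```

Real selection usually involves more than one axis (latency budget, cloud provider, operational headcount), so treat this as a starting filter rather than a verdict.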

What this means for training pipelines

The traditional ML training workflow has been an exercise in fighting the storage layer. Dataset lives in S3 (cold, cheap). Pre-job: copy dataset to local NVMe, which takes hours for terabyte-scale data. Training runs against local NVMe. Post-job: copy checkpoints back to S3. The copy steps often dominate total job time, and NVMe capacity caps the dataset size — even if S3 has petabytes available, you can only train on what fits on the box.

POSIX-on-S3 eliminates the copy step. The dataset stays in S3. The training pipeline mounts S3 Files (or Mountpoint for read-only) and reads through ordinary Python file I/O. Checkpoints write to the same mounted path and propagate back to S3 automatically. PyTorch's DataLoader opens S3-backed files as if they were local. NVMe stops being the capacity ceiling.
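In practice, "ordinary Python file I/O" means the loader needs nothing S3-specific. The sketch below walks a directory tree and yields raw bytes. The `iter_samples` helper and the `*.bin` shard naming are hypothetical, and the demo uses a temporary directory standing in for a mount point such as an S3 Files or Mountpoint mount.

```python
import tempfile
from pathlib import Path


def iter_samples(root, pattern="*.bin"):
    """Yield (relative_path, bytes) for each matching file under `root`."""
    root = Path(root)
    for path in sorted(root.rglob(pattern)):
        with open(path, "rb") as f:  # ordinary file I/O, no S3 client involved
            yield path.relative_to(root).as_posix(), f.read()


# Demo against a temporary directory standing in for the mount point.
with tempfile.TemporaryDirectory() as mount:
    shard_dir = Path(mount) / "train"
    shard_dir.mkdir()
    (shard_dir / "shard-0000.bin").write_bytes(b"\x00" * 16)
    (shard_dir / "shard-0001.bin").write_bytes(b"\x01" * 16)
    for name, blob in iter_samples(mount):
        print(name, len(blob))
```

The same generator could back a PyTorch Dataset without changes; whether the bytes come from NVMe or from an object-backed mount is invisible at this layer, though latency and throughput will differ.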

The empirical effect compounds with GPUDirect Storage and a cache tier. Meta's published configuration sustains 192 GB/s to 2,048 H100 GPUs³, lifting GPU utilization from the industry-typical sub-50% range to 80%+. The bottleneck shifted from "we can't feed the GPUs" to "we have to think about the cache layer." That's a different problem, and a better one.

For checkpoint pipelines specifically, the POSIX-on-S3 win is sharper. Checkpoint writes happen every few hundred steps; they're bursty, large, and have to be atomic. The 60-second write-back window aligns naturally with the cadence: as long as checkpoints are more than a minute apart, the previous one has propagated to S3 by the time the next one fires. And because S3 Files preserves atomic rename, the standard write-to-tempfile-then-rename pattern that Python frameworks have always used works without modification. No rewriting of the codebase. The framework doesn't even know S3 is underneath.
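The cadence argument reduces to a single inequality: the time between checkpoints must be at least the write-back window. A minimal sketch, with hypothetical function names and made-up step times:

```python
def checkpoint_interval_s(steps_between_checkpoints, seconds_per_step):
    """Wall-clock time between two consecutive checkpoint writes."""
    return steps_between_checkpoints * seconds_per_step


def previous_checkpoint_propagated(steps_between_checkpoints, seconds_per_step,
                                   window_s=60.0):
    """True if the prior checkpoint should be in S3 before the next one starts."""
    interval = checkpoint_interval_s(steps_between_checkpoints, seconds_per_step)
    return interval >= window_s


# 500 steps at 0.4 s/step leaves 200 s between checkpoints.
print(previous_checkpoint_propagated(500, 0.4))
# 100 steps at 0.3 s/step leaves only about 30 s, inside the window.
print(previous_checkpoint_propagated(100, 0.3))
```

Jobs that checkpoint more often than the window still work at the file-system tier; the caveat is only that the newest checkpoint may not yet be readable through the direct S3 API.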

The same shape, drawn three different ways

The POSIX gap closing is not an AWS-only story. China's East Data West Computing initiative has produced parallel solutions with similar architecture but different geopolitical constraints.

Aliyun CPFS+OSS Hybrid. Aliyun's Cloud Parallel File System (CPFS) provides a POSIX front-end backed by Object Storage Service (OSS). The architecture separates metadata and data: a dedicated metadata server handles POSIX operations with sub-1 ms latency, while data lives in OSS at object-storage cost. Claimed throughput: 100 GB/s for parallel reads. The design is functionally similar to S3 Files+EFS — POSIX semantics at the edge, object durability at the core — but built on Aliyun's domestic stack rather than AWS's. For Chinese labs training on Ascend or H800 clusters, this is the native path. Qwen 3 was trained on this exact stack — CPFS over OSS, sitting inside Alibaba's Panjiu AI Infra 2.0 with HPN 8.0 RDMA-accelerated networking — which is part of why Aliyun's hyperscaler position in China is reinforcing rather than declining as the East Data West Computing build-out matures.

Note: The 100 GB/s and sub-1 ms figures are Aliyun vendor claims. Independent benchmarks are not yet available.

The OSS substrate underneath has its own price-and-quirk story worth threading in. Five tiers run from Standard at ~$0.017/GB-month down to Deep Cold Archive at ~$0.0011/GB-month. That is one of the lowest archival list prices on the public market, about 11% above AWS Glacier Deep Archive's ~$0.00099, with egress economics inside mainland China able to close that gap at scale. In April 2026 Alibaba raised infrequent-access and archive tiers by 4.6–5.6%, citing AI-infrastructure investment; Standard pricing held flat. Storage budgets built on pre-hike numbers need a refresh, and Standard is now the most cost-stable tier for hot training data.

The compatibility story has a sharper edge than "S3-compatible" usually implies. OSS does not support AWS SDK v2's default STREAMING-UNSIGNED-PAYLOAD-TRAILER chunked encoding — the request signing mode that Apache Iceberg and Polaris ship with by default. Teams porting v2-SDK lakehouse stacks onto OSS need to configure the older STREAMING-AWS4-HMAC-SHA256-PAYLOAD signer or disable chunked encoding entirely. It's the kind of footgun that shows up not at deploy time but at the first multi-gigabyte write under load, when the request signing fails halfway through a payload and the table format thinks corruption has occurred. The ossfs FUSE driver is similarly partial: workable for sequential reads and admin scripting, but missing hard links, extended attributes, and robust file locking — which is exactly why CPFS exists as the metadata-heavy training-pipeline mount. The takeaway is the universal pattern again: the POSIX layer is real, but it's a separate product from the object store, and trying to use the bare object store as a filesystem reproduces the same frustrations on either side of the Pacific.

Huawei OBS+MindSpore Integration. Huawei's Object Storage Service integrates natively with MindSpore, their deep-learning framework. The key feature is a parallel prefetch pipeline: the framework calls obs.read() and the storage layer stages the next batch while the GPU processes the current one. Huawei claims this achieves 80% GPU utilization versus ~50% for raw S3 loading. The integration is deeper than AWS's generic S3 API — it is framework-aware, not just storage-aware.
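The prefetch idea is generic double buffering and can be sketched without any vendor SDK. The code below is not Huawei's or MindSpore's API; `prefetch` and `slow_loader` are hypothetical names. A background thread loads ahead into a bounded queue while the consumer works on the current batch.

```python
import queue
import threading
import time


def prefetch(batches, depth=2):
    """Yield items from `batches` while a background thread loads ahead."""
    q = queue.Queue(maxsize=depth)  # bounded, so the loader cannot run far ahead
    sentinel = object()

    def producer():
        try:
            for batch in batches:
                q.put(batch)
        finally:
            # Always signal the end; in this sketch a loader error simply
            # ends the stream early instead of propagating to the consumer.
            q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item


def slow_loader(n):
    for i in range(n):
        time.sleep(0.01)  # stand-in for a storage read
        yield f"batch-{i}"


for batch in prefetch(slow_loader(3)):
    print(batch)  # the next read overlaps with this "compute" step
```

A framework-aware integration does the same thing one layer down, with knowledge of the batch schedule, so it can stage exactly the data the next step will request.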

Note: The 80% GPU utilization figure is a Huawei vendor claim and has not been independently verified.

GLM-5 Training Infrastructure. Zhipu AI's 744B-parameter GLM-5 was trained on a cluster of 100,000+ Huawei Ascend 910B NPUs. The storage topology mirrors the East–West pattern: NVMe local scratch for active checkpoints, OBS for datasets and historical checkpoints, and Archive OBS for model-version retention. The tiered shape is identical to Western training clusters — NVMe → Object → Archive — but the specific services (OBS, not S3) and the hardware (Ascend 910B, not H100) are domestic.

The universal pattern. Whether AWS (S3 Files+EFS), Aliyun (CPFS+OSS), or Huawei (OBS+MindSpore), the architecture converges on the same three-layer stack: POSIX-capable edge tier, object-backed durable tier, and archive tier for compliance. The POSIX gap is closing everywhere because the workload pressure — GPU starvation from slow data loading — is universal.

The divergence. What differs is access to the services. AWS S3 Files is available globally but priced in USD. Aliyun CPFS+OSS is domestic-China only, subject to data localization law. Huawei OBS is tightly coupled to Ascend hardware and MindSpore framework. Engineers building multi-region training infrastructure must choose three different POSIX-on-object solutions for three different regulatory regimes — the same three data gravity wells that shape every other axis of cloud architecture in 2026.

POSIX and table formats are orthogonal

A common confusion in the POSIX-on-S3 conversation is that it competes with table formats like Iceberg V3 or Delta Lake. It doesn't. The two layers operate at different abstractions and complement each other.

POSIX-on-S3 gives you file-level semantics — open, read, write, close, rename, lock. It's how application code interacts with the filesystem.

Iceberg V3 and S3 Tables give you table-level semantics — schema evolution, ACID transactions, time travel, snapshot isolation, compaction. They're how analytical engines interact with structured data.

You use them together for serious AI/lakehouse pipelines: training data lives as Parquet files inside Iceberg V3 tables on S3, and the training pipeline mounts the Iceberg-managed S3 path via S3 Files (or Mountpoint) to read those Parquet files via ordinary file I/O. The table format guarantees the data is consistent and queryable; the POSIX layer makes it accessible to the framework without rewriting to a SQL interface. Use them apart only when you genuinely need just one layer's guarantees.

What POSIX-on-S3 doesn't replace: HDFS, NFS appliances, or local NVMe for the absolute hottest scratch space. What it does replace: the elaborate sync pipelines that used to copy data between S3 and a "real" filesystem before training. That's the practical effect — not a new storage layer, but the elimination of an entire workflow stage.

The two-post arc closes here. May 7: storage matters because GPUs starve when it doesn't. May 10: the access model finally caught up to the demand. The next post is the next layer of the conversation, whatever the news cycle decides to make legible — but the index is being drawn ahead of it. The map keeps getting drawn.


Works cited

  1. Announcing Amazon S3 Files: AWS launch notice (April 2026), with the 25,000 concurrent NFS connection limit (v4.1 and v4.2 both supported per the launch announcement) and 60-second write-back window.
  2. Amazon S3 strong consistency announcement (Dec 2020): the read-after-write guarantee that made all subsequent POSIX-on-S3 work possible.
  3. Mountpoint for S3 (open-source alpha): read-optimized FUSE client, no atomic rename, intentionally not a full POSIX filesystem.
  4. JuiceFS: Redis-metadata + S3-data architecture giving full POSIX including hard links.
  5. GeeseFS (Yandex): fork of Goofys with better caching.
  6. Aliyun CPFS documentation (aliyun.com); Huawei OBS+MindSpore integration guide; Zhipu AI GLM-5 training infrastructure coverage (Feb 2026).

Footnotes

  1. Object Storage for AI: Implementing GPU Direct Storage with 200GB/s Throughput. Empirical attribution of ~80% of training wall-clock to data loading at hyperscaler-grade workloads.

  2. Uber's S3FS-FUSE migration is referenced in Alluxio's published case studies on AI/ML data-loading bottlenecks. Specific dollar figures vary by source; we use a directional "tens of thousands of dollars per month" framing rather than asserting a precise number we cannot independently verify.

  3. Accelerating AI With High Performance Storage. 192 GB/s sustained Meta deployment to 2,048 GPUs via parallel S3-compatible endpoints; 35% compute-time-recovery attribution.