Guide 41

Composing the AI Agent Stack on S3 — Memory, Orchestration, and Integration

Problem Framing

Guides 37–40 cover the AI agent stack one layer at a time: memory (Guide 37), KV-cache (Guide 38), MCP integration (Guide 39), and GPUDirect transport (Guide 40). But production agents are compositions, not single layers. The May 2026 Wave 3 + Wave 4 additions (MCP Gateway, Durable Agent Runtime, Letta / Cognee / OpenMemory MCP, Bedrock AgentCore Runtime, A2A protocol, "Agent State Loss on Pod Eviction" as a named pain point) made the layer cake explicit: a production agent in 2026 is a six-layer stack (substrate, memory, orchestrator, tool fabric, AI gateway, durable runtime) plus an optional cross-cutting agent-to-agent dimension. The choice at each layer constrains the next. This guide walks the composition: which layers go together, where they bind, and what S3 holds at each seam.

Relevant Nodes

  • Topics: AI Memory Infrastructure, AI Memory Governance, AI Runtime Infrastructure, Distributed Context Systems
  • Technologies: LangGraph, Letta, Kitaru, Mem0, Zep, Graphiti, Chroma, Cognee, Supermemory, OpenMemory MCP, LMCache, Mooncake, Vestige, LiteLLM, Helicone AI Gateway, Traefik AI Gateway, Amazon Bedrock AgentCore Runtime
  • Standards: Model Context Protocol (MCP), MCP Tasks Primitive (SEP-1686), Agent2Agent (A2A) Protocol, Agent Communication Protocol (ACP), Agent Network Protocol (ANP), S3 API
  • Architectures: MCP Gateway, Durable Agent Runtime, Inner/Outer Harness Pattern, MCP Knowledge Graph, Animesis CMA, Tier 3.5 (Inference Context Memory Storage), KV-Cache Disaggregation, Hierarchical KV Cache Architecture
  • Pain Points: Agent State Loss on Pod Eviction, Confused Deputy Problem (MCP), Tool Discovery Governance Gap, Context Bottleneck, Memory Wall, Memory Lineage Gap, Retrieval Freshness Decay, Embedding Drift

Decision Path

  1. Anchor the stack on the S3 substrate.

    • Every layer below assumes S3-compatible object storage as the durable spine. Memory engines persist embeddings and raw events there. KV-cache pools spill there. MCP servers cache tool artifacts there. Audit logs and traces land there.
    • This is the only decision that's hard to reverse: switching engines is a migration, but switching the substrate is a rewrite. Pick S3 first, then make the engine choices below independently of one another.
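One way to keep the substrate shared without layers colliding is to give each layer its own key prefix. The sketch below is a minimal illustration, assuming a single bucket; the prefix names and the `object_key` helper are hypothetical and not taken from any of the tools named in this guide.

```python
# Hypothetical prefix layout: one top-level prefix per stack layer.
LAYER_PREFIXES = {
    "memory": "memory/events",
    "kv_cache": "kv-cache/spill",
    "tool_artifacts": "mcp/artifacts",
    "traces": "observability/traces",
}


def object_key(layer: str, tenant: str, object_id: str) -> str:
    """Return the S3 object key for one artifact written by a given layer."""
    try:
        prefix = LAYER_PREFIXES[layer]
    except KeyError:
        raise ValueError(f"unknown layer: {layer!r}") from None
    return f"{prefix}/{tenant}/{object_id}"


print(object_key("memory", "acme", "evt-0001.json"))
# prints: memory/events/acme/evt-0001.json
```

Keeping the layer name first in the key means lifecycle rules and access policies can be scoped per layer, which matters later when engines are swapped but the bucket stays.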
  2. Choose the memory layer's binding shape, not just the engine.

    • Defer the engine choice itself to Guide 37 (Mem0 / Zep / Graphiti / Vestige / build-your-own), now joined by Letta (OS-style core/recall/archival split), Cognee (dual graph + vector index), OpenMemory MCP (local-first MCP-delivered memory), and Supermemory (managed SaaS for non-infra teams).
    • The composition question is how memory is reachable. Mem0, Zep, and Supermemory expose REST APIs — any orchestrator can hit them. Graphiti, Chroma, and Cognee are libraries — they bind into the orchestrator's process. Vestige and OpenMemory MCP deliver as MCP servers — orchestrator-agnostic and discoverable at runtime. Letta sits in both worlds: SDK plus MCP endpoints.
    • Composition rule: memory must be reachable from every node in the agent graph, not just the entrypoint. Deploy it as a service (REST or MCP) the orchestrator calls, not as an in-process dependency that locks the orchestrator choice. MCP-delivered memory has the longest reach because it bridges all MCP-aware clients (Claude Desktop, Cursor, Cline) onto one shared backend.
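The reachability rule can be sketched as an interface that every graph node receives, rather than a library the entrypoint imports. This is a minimal sketch under stated assumptions: `MemoryService`, `InMemoryStub`, `intake_node`, and `planner_node` are hypothetical names, and the stub stands in for a REST or MCP client to a real memory engine.

```python
from typing import Protocol


class MemoryService(Protocol):
    """Hypothetical interface; a REST or MCP client would implement it."""

    def remember(self, user_id: str, text: str) -> None: ...
    def recall(self, user_id: str, query: str) -> list[str]: ...


class InMemoryStub:
    """Dict-backed stand-in for a remote memory service, for local testing."""

    def __init__(self) -> None:
        self._facts: dict[str, list[str]] = {}

    def remember(self, user_id: str, text: str) -> None:
        self._facts.setdefault(user_id, []).append(text)

    def recall(self, user_id: str, query: str) -> list[str]:
        # Naive substring match; a real engine would do vector or graph search.
        needle = query.lower()
        return [t for t in self._facts.get(user_id, []) if needle in t.lower()]


def intake_node(memory: MemoryService, user_id: str, message: str) -> None:
    """Entry node: stores what the user said."""
    memory.remember(user_id, message)


def planner_node(memory: MemoryService, user_id: str, topic: str) -> str:
    """Downstream node: reads the same backend without going through intake."""
    facts = memory.recall(user_id, topic)
    return f"{len(facts)} relevant fact(s) about {topic}"
```

Because both nodes depend only on the interface, replacing the stub with a networked client would not change the node code, which is the property the composition rule is after.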
  3. Pick the orchestrator.

    • LangGraph — graph-based agent runtime; nodes are tool calls or LLM invocations, edges are control flow with explicit state. Best for multi-step agents with branching logic, human-in-the-loop checkpoints, or replayable state machines. Persists graph state to a checkpointer (often Postgres + S3 for artifacts).
    • Build-your-own async loop with an MCP client — cheaper than a framework when the agent loop is well-understood and single-purpose. Trade-off: you re-implement retry, checkpointing, and observability.
    • Composition rule: the orchestrator should not also own memory. Memory living outside the orchestrator lets you swap orchestrators without losing user state, and lets multiple orchestrators (a chat agent and a batch summarizer) share the same memory backend.
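To make the build-your-own trade-off concrete, here is a minimal sketch of the retry piece you would have to write yourself. It uses only the standard library; `call_with_retry` and `run_agent` are hypothetical names, each step is assumed to be an async callable returning a string, and checkpointing and observability are deliberately left out.

```python
import asyncio
from typing import Awaitable, Callable

Step = Callable[[], Awaitable[str]]


async def call_with_retry(step: Step, attempts: int = 3, base_delay: float = 0.01) -> str:
    """Run one step, retrying with exponential backoff on any exception."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await step()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


async def run_agent(steps: list[Step]) -> list[str]:
    """Execute steps in order and collect their outputs."""
    outputs = []
    for step in steps:
        outputs.append(await call_with_retry(step))
    return outputs
```

Even this small loop has no memory of completed steps across process restarts, which is the gap a checkpointer or durable runtime fills.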
  4. Pick the tool-integration fabric — and a Gateway in front of it.

    • MCP (Model Context Protocol) — 2026's de-facto standard. Tools become MCP servers; agents become MCP clients. Defer to Guide 39 for protocol detail. SEP-1686 (MCP Tasks Primitive) adds standard long-running tool invocation; reach for it when individual tool calls take minutes-to-hours.
    • MCP Gateway: once you have more than three or four backend MCP servers, add a state-aware gateway (Bifrost, Tyk MCP, Amazon API Gateway MCP proxy). It multiplexes tools across servers, applies per-event policy, enforces OAuth 2.1, runs semantic caching against tool calls, and mitigates the Tool Discovery Governance Gap and Confused Deputy Problem (MCP) pain points. Traditional REST gateways (Kong, Apigee) are not protocol-aware enough: they assume stateless request/response, whereas MCP exchanges bidirectional JSON-RPC messages, with servers able to stream to clients over SSE.
    • Custom function-calling — viable when the tool surface is small and stable. Avoids MCP scaffolding cost.
    • Composition rule: if any tool needs to be reused across more than one agent, ship it as an MCP server. If you have multiple MCP servers, put a gateway in front of them — the gateway is also the right place to centralize tool-call observability for the entire fleet.
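The multiplexing and policy roles of a gateway can be illustrated with a toy routing table. This sketch does not speak the MCP wire protocol and does not reflect how Bifrost, Tyk MCP, or any other product is implemented; `ToolGateway` and its methods are hypothetical, and the `server.tool` namespacing is an assumption made for the example.

```python
class ToolGateway:
    """Toy routing table; it does not speak the MCP wire protocol."""

    def __init__(self) -> None:
        self._routes: dict[str, str] = {}       # qualified tool name -> backend server
        self._grants: dict[str, set[str]] = {}  # agent id -> allowed qualified tool names

    def register(self, server: str, tools: list[str]) -> None:
        """Expose a backend server's tools under a 'server.tool' namespace."""
        for tool in tools:
            self._routes[f"{server}.{tool}"] = server

    def grant(self, agent_id: str, qualified_tool: str) -> None:
        """Allow one agent to call one registered tool."""
        if qualified_tool not in self._routes:
            raise KeyError(f"unknown tool: {qualified_tool}")
        self._grants.setdefault(agent_id, set()).add(qualified_tool)

    def list_tools(self, agent_id: str) -> list[str]:
        """Discovery is filtered per agent, so ungranted tools stay invisible."""
        return sorted(self._grants.get(agent_id, set()))

    def route(self, agent_id: str, qualified_tool: str) -> str:
        """Return the backend that should handle the call, or refuse it."""
        if qualified_tool not in self._grants.get(agent_id, set()):
            raise PermissionError(f"{agent_id} may not call {qualified_tool}")
        return self._routes[qualified_tool]
```

Filtering discovery and checking grants at the same choke point is what lets one component address both the discovery-governance and confused-deputy concerns, and it is also where per-call logging would naturally attach.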
  5. Wire in an AI gateway between orchestrator and LLM provider.

    • LiteLLM: unified OpenAI-compatible interface in front of any provider (OpenAI, Anthropic, Bedrock, or self-hosted models with weights stored on S3). Adds routing, retries, per-tenant rate limiting, and cost attribution. Often the first gateway teams reach for because the API surface is identical to what they were already calling.
    • Helicone AI Gateway — observability-first; intercepts LLM traffic to log every prompt, response, latency, and cost. Trace storage lands in S3. Best when post-hoc analysis ("what did this agent actually send?") matters more than routing flexibility.
    • Traefik AI Gateway — extension of the existing Traefik HTTP proxy with AI-specific routing rules and token-bucket rate limits. Best when you already run Traefik for the rest of your traffic and want one fewer thing to operate.
    • Composition rule: the AI gateway sits between the orchestrator and the model provider — distinct from the MCP Gateway (which sits between agents and tools). Picking one early means observability and cost-tracking are built into the agent stack rather than bolted on six months later.
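Token-bucket rate limiting, mentioned above for the gateway tier, can be sketched in a few lines. This is a generic illustration of the algorithm, not the implementation used by LiteLLM, Helicone, or Traefik; the `TokenBucket` class is hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Generic token bucket; one instance per tenant is the usual arrangement."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start full
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then admit the request if tokens suffice."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=2, refill_per_sec=1.0)
print([bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)])
# prints: [True, True, False, True]
```

The `cost` parameter could carry an estimated token count rather than a request count, which is one way a gateway might tie rate limits to model spend.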
  6. Add a durable runtime — the outer harness — once agents run longer than a request.

    • The Inner/Outer Harness Pattern separates the agent's inner concerns (prompt shape, tool selection, model choice) from its outer concerns (failure recovery, resumability, async suspension). The outer harness is the durable runtime layer.
    • Kitaru (ZenML) — agent-shape-optimized open-source runtime; checkpoints at each step boundary to S3, resumes from last successful boundary after pod eviction or function timeout. Pairs with Pydantic AI, LangGraph, LlamaIndex.
    • Restate / Temporal / Inngest — heritage durable-execution frameworks adapted for agent use. Restate brings strongly consistent virtual objects; Temporal has the deepest workflow-engineering history.
    • Amazon Bedrock AgentCore Runtime — managed AWS path; isolated Firecracker microVMs per agent session preserve stateful MCP features across the otherwise stateless MCP 2026-07-28 transport. Use when AWS-native IAM/VPC/KMS integration is a hard requirement.
    • Composition rule: the durable runtime layer is what kills the Agent State Loss on Pod Eviction pain point. Without it, a pod eviction at step 11 of a 12-step research synthesis burns 30 minutes of LLM compute and the entire token spend. With it, spot instances and aggressive autoscaling become viable for agent workloads. Adopt as soon as any agent run exceeds the median pod lifetime.
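The step-boundary checkpointing described above can be sketched as follows. This is a minimal illustration, not Kitaru's or any other runtime's actual API: `CheckpointStore` is a dict-backed stand-in for an S3 bucket, and `run_durable`, the key layout, and the `fail_at` switch used to simulate an eviction are all hypothetical.

```python
import json
from typing import Callable, Optional


class CheckpointStore:
    """Dict-backed stand-in for an S3 bucket (hypothetical)."""

    def __init__(self) -> None:
        self._objects: dict[str, str] = {}

    def put(self, key: str, body: str) -> None:
        self._objects[key] = body

    def get(self, key: str) -> Optional[str]:
        return self._objects.get(key)


def run_durable(run_id: str, steps: list[Callable[[], int]],
                store: CheckpointStore, fail_at: Optional[int] = None) -> list[int]:
    """Run steps in order, checkpointing after each; resume from the last checkpoint."""
    key = f"checkpoints/{run_id}.json"
    raw = store.get(key)
    state = json.loads(raw) if raw else {"next_step": 0, "outputs": []}
    for i in range(state["next_step"], len(steps)):
        if fail_at == i:
            raise RuntimeError(f"simulated pod eviction before step index {i}")
        state["outputs"].append(steps[i]())
        state["next_step"] = i + 1
        store.put(key, json.dumps(state))  # durable boundary: step i is now safe
    return state["outputs"]
```

Calling `run_durable` once with `fail_at=10` on a 12-step run and then again without it re-executes only the last two steps; the first ten come back from the checkpoint. A real runtime would also need to handle a crash between a step's side effect and its checkpoint write, which this sketch ignores.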
  7. For multi-agent systems, add A2A (or ACP / ANP) above MCP.

    • MCP standardizes how an agent talks to its tools. Agent2Agent (A2A) standardizes how one agent talks to another agent. They are complementary, not competing — the canonical 2026 stack uses MCP for agent→tool I/O and A2A for agent→agent communication.
    • ACP (Agent Communication Protocol) targets high-throughput local multi-agent deployments. ANP (Agent Network Protocol) targets trust-decentralized federation. Pick based on the deployment shape; the four-protocol taxonomy (MCP, A2A, ACP, ANP) is formalized in arXiv:2505.02279.
    • Composition rule: if your system has a single agent, MCP alone is enough. The moment a second agent enters the picture, pick an inter-agent protocol on day one — retrofitting pair-wise integration becomes O(N²) glue code.
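The O(N²) claim follows from counting agent pairs: N agents integrated pair-wise need N(N-1)/2 bespoke links, while a shared protocol needs one adapter per agent. A small sketch of the arithmetic (the function names are illustrative):

```python
def pairwise_integrations(n_agents: int) -> int:
    """Bespoke links needed when every pair of agents integrates directly."""
    return n_agents * (n_agents - 1) // 2


def protocol_adapters(n_agents: int) -> int:
    """Adapters needed when each agent implements one shared protocol."""
    return n_agents


for n in (2, 5, 10, 20):
    print(n, pairwise_integrations(n), protocol_adapters(n))
# prints:
# 2 1 2
# 5 10 5
# 10 45 10
# 20 190 20
```

With two agents the pair-wise approach is actually cheaper (one link versus two adapters), which is why the cost of skipping the protocol is easy to miss at the start and only shows up as the fleet grows.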
  8. Plan KV-cache and governance from the start, not as afterthoughts.

    • For high-prefix-reuse workloads (shared system prompts, few-shot examples, persistent agent identity), layer in LMCache or Mooncake (Guide 38) at the model-serving tier. The KV-Cache Disaggregation and Hierarchical KV Cache Architecture patterns generalize this: separate prefill from decode, tier KV state across HBM → DRAM → CXL → NVMe → S3. These sit below the gateway and are invisible to the orchestrator — the gateway is where you observe whether they're firing.
    • For governance: Animesis CMA framing splits memory into a Constitution + Core (immutable) and Peripheral + Raw Event Log (prunable). Forgetting-as-a-Service is the deletion-path obligation. The MCP Knowledge Graph pattern adds tool-call provenance and authorization auditing at the gateway tier.
    • Composition rule: every layer that writes to S3 should write with provenance metadata (which agent run, which user, which session, which durable-runtime checkpoint). This is what makes the Memory Lineage Gap pain point tractable later — and it's much cheaper to wire in on day one than to backfill.
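The provenance rule can be enforced with a small helper that refuses to build metadata unless all four fields are present. This is a sketch under stated assumptions: the `provenance_metadata` function and the field names are hypothetical. S3 user-defined metadata values are strings, so the helper returns a string-to-string mapping.

```python
def provenance_metadata(agent_run_id: str, user_id: str,
                        session_id: str, checkpoint_id: str) -> dict[str, str]:
    """Build user-defined object metadata recording which run wrote an object."""
    fields = {
        "agent-run-id": agent_run_id,
        "user-id": user_id,
        "session-id": session_id,
        "checkpoint-id": checkpoint_id,
    }
    missing = [name for name, value in fields.items() if not value]
    if missing:
        raise ValueError(f"missing provenance fields: {', '.join(missing)}")
    return fields


# With boto3, a dict like this could be passed as the Metadata argument of
# put_object, which stores each entry as an x-amz-meta-* header on the object.
print(provenance_metadata("run-42", "user-7", "sess-9", "ckpt-11"))
```

Failing the write when a field is missing is a deliberate choice in this sketch: an object without provenance is exactly what makes lineage backfills expensive later.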
  9. Default stack for greenfield projects in 2026 (opinionated — placeholder for J's editorial pick, see TODO below)

What Changed Over Time

  • 2024: Agents were LangChain monoliths — orchestration, memory, and tool calling lived in the same Python process. Composition wasn't a question; you got whatever the framework gave you. Pod evictions silently destroyed agent state and nobody had a name for it.
  • Mid-2025: LangGraph reframed orchestration as durable graph state separate from agent logic. Mem0 reframed memory as temporal-by-default rather than overwrite-by-default. Letta (formerly MemGPT) reframed memory as OS-style core/recall/archival. Tool integration was still ad hoc.
  • Late 2025: MCP, first published in November 2024, reached broad adoption; tools became discoverable servers rather than function definitions baked into agent code. Composition became possible because every layer now had a wire protocol. A2A, announced in April 2025, separated the tool protocol from the agent-to-agent protocol.
  • 2026 (Q1–Q2): AI gateways (LiteLLM, Helicone, Traefik AI) emerged as the LLM-side traffic-management tier. MCP Gateways (Bifrost, Tyk MCP, Amazon API Gateway MCP proxy) emerged as the tool-side traffic-management tier, closing the federated-tool-discovery and policy-enforcement gaps. Durable Agent Runtimes (Kitaru, Restate, Bedrock AgentCore) crystallized the inner/outer harness pattern and named "Agent State Loss on Pod Eviction" as a first-class pain point. The agent stack is now six distinct layers (substrate, memory, orchestrator, tool fabric with MCP Gateway, AI gateway, durable runtime), each independently swappable.
  • Forward: Expect convergence on MCP for both tool and memory integration (Letta, Mem0, OpenMemory MCP, and Zep already ship MCP endpoints). Expect AI gateway and MCP gateway features to merge in some implementations (Bifrost already runs semantic caching on the tool side; LiteLLM offers the same primitive on the model side). Expect durable-runtime checkpointing to become a default rather than an add-on layer.

Sources