Durable Agent Runtime

Definition

What it is

An architectural pattern in which an LLM agent's execution loop is decomposed into discrete, **checkpointed step boundaries**. At each boundary the runtime persists inputs, intermediate outputs, and LLM responses to a durable substrate (typically S3-compatible object storage), so the run can *resume from the last successful boundary* if the underlying worker fails, is evicted, times out, or is intentionally suspended. The pattern explicitly separates the agent's "inner harness" (prompt shape, tool choice, model selection) from the "outer harness" (failure recovery, resumability, infrastructure binding).

Why it exists

Autonomous agents are, in effect, while-loops that can run for anywhere from minutes to days. The classical stateless-microservice model assumes a request lasts seconds, and long-running agents break the assumptions built on that. Without a durable runtime, a Kubernetes pod eviction at step 11 of a 12-step research synthesis discards every completed step, for example 30 minutes of LLM compute and dozens of dollars of token spend. Durable Agent Runtimes are the architectural fix: treat the agent's execution trace as a versioned artifact, treat a failure as a resume point, and treat suspension as a feature.

Recent developments

Latest signals

Kitaru, Restate, Temporal, AWS Step Functions Express, and Inngest are the active implementations. Kitaru is optimized for agent-shaped workloads; Restate brings strongly consistent virtual objects from the traditional workflow world; Temporal is the long-established durable-execution framework, adapted for agent use. Per ZenML — Kitaru vs Restate.
Step boundary as the unit of persistence. All implementations converge on the same primitive: a code decorator or annotation that marks a boundary. The runtime intercepts the call, persists state, and returns the result; after a failure, the resumed run reads the persisted state from object storage and replays from the last completed boundary. Per Pydantic — Runtime layer for Pydantic AI agents.
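
The decorator primitive can be sketched in a few lines of Python. This is a minimal illustration, not the API of Kitaru, Restate, Temporal, or any other runtime: the names `step`, `STORE`, and `run_agent` are hypothetical, a local temporary directory stands in for the object store, and the sketch assumes each step is called at most once per run and returns JSON-serializable data.

```python
import functools
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for an S3-compatible bucket: one JSON file per step.
STORE = Path(tempfile.mkdtemp(prefix="agent-run-"))


def step(fn):
    """Mark fn as a step boundary: persist its result, reuse it on replay."""

    @functools.wraps(fn)
    def wrapper(run_id, *args):
        # Keyed by run and step name only (assumes one call per step per run).
        key = STORE / run_id / f"{fn.__name__}.json"
        if key.exists():
            # Boundary completed in an earlier attempt: return the saved result.
            return json.loads(key.read_text())
        result = fn(*args)
        key.parent.mkdir(parents=True, exist_ok=True)
        key.write_text(json.dumps(result))
        return result

    return wrapper


calls = {"plan": 0, "search": 0, "synthesize": 0}
crash_once = {"armed": True}


@step
def plan(topic):
    calls["plan"] += 1
    return [f"{topic} overview", f"{topic} failure modes"]


@step
def search(queries):
    calls["search"] += 1
    return [f"notes on {q}" for q in queries]


@step
def synthesize(notes):
    calls["synthesize"] += 1
    if crash_once["armed"]:
        # Simulate the worker dying in the middle of the last step.
        crash_once["armed"] = False
        raise RuntimeError("simulated worker eviction")
    return " | ".join(notes)


def run_agent(run_id, topic):
    queries = plan(run_id, topic)
    notes = search(run_id, queries)
    return synthesize(run_id, notes)


def run_with_resume(run_id, topic):
    """Retry once after a failure; completed steps are read back, not re-run."""
    try:
        return run_agent(run_id, topic)
    except RuntimeError:
        return run_agent(run_id, topic)
```

Calling `run_with_resume("run-1", "durable agents")` fails once inside `synthesize`, then succeeds on the second attempt without re-executing `plan` or `search`. A production runtime would also need atomic writes and a key scheme that distinguishes repeated calls to the same step, both of which this sketch omits.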
Async suspension is the killer feature. An agent waiting for human approval, a webhook callback, or a secondary agent's completion no longer holds a worker hot: the runtime persists state and tears down the worker, and an inbound event later rehydrates that state on a freshly provisioned worker, hours or days afterwards. Per ZenML — Kitaru product page.
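
A rough sketch of the suspend-and-rehydrate cycle, under stated assumptions: `start_run`, `handle_event`, and `STATE_DIR` are hypothetical names, a temporary directory stands in for durable storage, and the "publish" step is a placeholder for whatever the agent does after approval. No real runtime's API is implied.

```python
import json
import tempfile
import uuid
from pathlib import Path

# Hypothetical stand-in for durable storage shared by all workers.
STATE_DIR = Path(tempfile.mkdtemp(prefix="suspended-runs-"))


def start_run(draft):
    """Run up to the approval gate, persist state, and return without waiting."""
    run_id = uuid.uuid4().hex
    state = {"status": "awaiting_approval", "draft": draft}
    (STATE_DIR / f"{run_id}.json").write_text(json.dumps(state))
    # The worker can be torn down here; nothing about the run lives in memory.
    return run_id


def handle_event(run_id, event):
    """Invoked on any worker when the callback arrives: rehydrate and continue."""
    path = STATE_DIR / f"{run_id}.json"
    state = json.loads(path.read_text())
    if state["status"] != "awaiting_approval":
        # Duplicate or late delivery: the run already moved past the gate.
        return state
    if event.get("approved"):
        state.update(status="completed", output=f"PUBLISHED: {state['draft']}")
    else:
        state.update(status="rejected")
    path.write_text(json.dumps(state))
    return state
```

Between `start_run` and `handle_event` the only thing that exists is a small JSON document, which is why the wait can last hours or days at no compute cost. The status check makes a repeated event delivery harmless in this sketch.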
Object storage is the universal substrate. Every Durable Agent Runtime ends up pointing its artifact store at S3-compatible storage: the size and write-frequency profile of agent artifacts (multimodal tool outputs, intermediate LLM responses) makes traditional databases impractical. Per AWS Blog — Building AI Agents from Zero to Hero.
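
One way such an artifact store might lay out its keys is sketched below. The layout is an assumption made for illustration and is not taken from any of the runtimes named above: a plain dict stands in for the bucket, large payloads are stored once under a content hash, and each step keeps a small JSON manifest that points at them. The function names and key prefixes are hypothetical.

```python
import hashlib
import json

# Hypothetical stand-in for an S3-compatible bucket: object key -> bytes.
bucket = {}


def put_artifact(run_id, step, name, payload: bytes):
    """Store payload under a content-addressed key and record it in the step manifest."""
    digest = hashlib.sha256(payload).hexdigest()
    blob_key = f"blobs/{digest}"
    bucket.setdefault(blob_key, payload)  # identical payloads are stored once
    manifest_key = f"runs/{run_id}/steps/{step:04d}/manifest.json"
    manifest = json.loads(bucket.get(manifest_key, b"{}"))
    manifest[name] = {"blob": blob_key, "bytes": len(payload)}
    bucket[manifest_key] = json.dumps(manifest).encode()
    return blob_key


def get_artifact(run_id, step, name) -> bytes:
    """Resolve an artifact through its step manifest."""
    manifest = json.loads(bucket[f"runs/{run_id}/steps/{step:04d}/manifest.json"])
    return bucket[manifest[name]["blob"]]
```

Keeping the manifests small and the blobs immutable means a resume only has to read a few short JSON objects to find its place, while the bulky tool outputs are fetched on demand.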