Guide 44

Choosing a Durable Agent Runtime — Kitaru vs. Temporal vs. Restate

Problem Framing

An autonomous agent is essentially a long-running loop that can run for minutes to days; a Kubernetes pod eviction at step 11 of 12 can burn 30 minutes of LLM compute and dozens of dollars in token spend. Agent State Loss on Pod Eviction is the load-bearing pain point. Durable Agent Runtimes solve it by persisting the inputs, intermediate outputs, and LLM responses at every step boundary to S3-compatible object storage, then resuming from the last successful boundary on failure. The 2026 question is which runtime fits which workload shape.
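
The persist-and-resume mechanism can be sketched in a few lines. This is a minimal illustration, not the API of Kitaru, Temporal, or Restate: `StepStore`, `run_durable`, and `demo` are hypothetical names, a local temporary directory stands in for an S3-compatible bucket, and each step's output is assumed to be JSON-serializable.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any, Callable


class StepStore:
    """Hypothetical checkpoint store; a local directory stands in for an S3 bucket."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _path(self, run_id: str, step: int) -> Path:
        return self.root / run_id / f"step-{step:04d}.json"

    def has(self, run_id: str, step: int) -> bool:
        return self._path(run_id, step).exists()

    def load(self, run_id: str, step: int) -> Any:
        return json.loads(self._path(run_id, step).read_text())["output"]

    def save(self, run_id: str, step: int, output: Any) -> None:
        path = self._path(run_id, step)
        path.parent.mkdir(parents=True, exist_ok=True)
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"output": output}))
        tmp.replace(path)  # rename last, so a crash never leaves a partial checkpoint


def run_durable(
    store: StepStore, run_id: str, steps: list[Callable[[Any], Any]], state: Any
) -> Any:
    """Run steps in order, skipping any step whose output is already persisted."""
    for index, step in enumerate(steps):
        if store.has(run_id, index):
            state = store.load(run_id, index)  # replay the recorded result
            continue
        state = step(state)
        store.save(run_id, index, state)  # the step boundary becomes durable here
    return state


def demo() -> tuple[list[str], list[str]]:
    """Simulate one eviction during the second step, then resume the same run."""
    executed: list[str] = []
    evicted_once = False

    def plan(state: list[str]) -> list[str]:
        executed.append("plan")
        return state + ["plan"]

    def act(state: list[str]) -> list[str]:
        nonlocal evicted_once
        if not evicted_once:
            evicted_once = True
            raise RuntimeError("simulated pod eviction")
        executed.append("act")
        return state + ["act"]

    with tempfile.TemporaryDirectory() as tmp:
        store = StepStore(Path(tmp))
        try:
            run_durable(store, "run-1", [plan, act], [])
        except RuntimeError:
            pass  # the pod is gone; a replacement pod retries the same run_id
        final = run_durable(store, "run-1", [plan, act], [])
    return final, executed
```

In `demo()`, the `plan` step runs only once: the second call to `run_durable` loads its checkpoint and goes straight to `act`. A production runtime additionally has to handle concurrent workers, non-deterministic steps, and large artifacts, all of which this sketch ignores.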

Relevant Nodes

  • Topics: Agent Orchestration
  • Technologies: Kitaru, Amazon Bedrock AgentCore Runtime
  • Architectures: Durable Agent Runtime, Inner/Outer Harness Pattern, FAME Architecture
  • Pain Points: Agent State Loss on Pod Eviction

Decision Path

  1. Confirm you need a durable runtime at all. If your agent runs in well under a minute and is idempotent on retry, you may not need this layer. As a rough economic break-even, the runtime pays for itself when (per-run cost × eviction probability × tail length) > runtime operational cost. Multi-minute agentic runs on elastic compute almost always clear that bar.

  2. Option A — Kitaru (ZenML, open source, Python-first):

    • Best for: Python agent stacks (Pydantic AI, LangGraph, LlamaIndex, custom). Multi-modal artifact persistence. Workloads that want a single tool-and-step-boundary primitive without a separate workflow language.
    • Differentiator: Built around the agent shape, with versioned artifact storage in S3, pause / resume aligned with LLM generation cycles, and replay debugging in which a failed run becomes an inspectable S3 artifact rather than a stack trace.
    • Trade-off: Younger than Temporal / Restate; ecosystem still growing. Tight Python coupling.
  3. Option B — Temporal (open source, polyglot, mature):

    • Best for: Mixed workflow + agent estates. Teams that already run Temporal for non-AI workflows. Polyglot environments (Go / Java / TypeScript / Python).
    • Differentiator: Heritage durable-execution framework. Strongest production track record. Rich workflow primitives (signals, queries, child workflows).
    • Trade-off: Workflow-shaped rather than agent-shaped; expressing an "agent loop with tool calls" takes more boilerplate than it does in Kitaru. Multi-modal artifact handling is a bolt-on.
  4. Option C — Restate (open source, strongly consistent virtual objects):

    • Best for: Workloads where consistency guarantees matter as much as resumability: financial agents, regulated transactional pipelines, and multi-actor coordination that needs exactly-once semantics.
    • Differentiator: Journaled event log, strongly consistent virtual objects, integrated with traditional microservices.
    • Trade-off: More transactional in flavor than agent-shaped; ZenML's positioning explicitly frames Restate as workflow-optimized and Kitaru as agent-optimized.
  5. Option D — Managed: Amazon Bedrock AgentCore Runtime:

    • Best for: AWS-native workloads. Teams that want managed Firecracker-style microVM isolation per agent session without operating the infrastructure.
    • Differentiator: A stateful runtime over a stateless MCP transport. It preserves elicitation, sampling, and progress notifications across the stateless protocol layer of MCP 2026-07-28. First managed product to abstract this away.
    • Trade-off: AWS lock-in; less flexible than self-hosted Kitaru / Temporal.
  6. Always layer the Inner / Outer Harness Pattern. Whichever runtime you choose, the durable runtime is your outer harness; the agent SDK (Pydantic AI, LangGraph, etc.) is the inner harness. Keep them independent so you can swap models without rewriting infrastructure and swap runtimes without rewriting tool-calling logic. See Guide 47.
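
The separation described in step 6 can be sketched as two small interfaces that never import each other. Everything here is hypothetical and for illustration only: `InnerAgent`, `OuterRuntime`, `ScriptedAgent`, `JournalRuntime`, and `drive` are invented names, the scripted agent stands in for a real agent SDK, and the in-memory journal stands in for a real durable runtime.

```python
from __future__ import annotations

from typing import Callable, Optional, Protocol


class InnerAgent(Protocol):
    """Inner harness: model and tool-calling logic. Knows nothing about durability."""

    def next_step(self, history: list[str]) -> Optional[str]: ...


class OuterRuntime(Protocol):
    """Outer harness: executes one step durably. Knows nothing about models or tools."""

    def run_step(self, key: str, fn: Callable[[], Optional[str]]) -> Optional[str]: ...


class ScriptedAgent:
    """Stand-in inner harness: replays a fixed script instead of calling a model."""

    def __init__(self, script: list[str]) -> None:
        self.script = script
        self.calls = 0

    def next_step(self, history: list[str]) -> Optional[str]:
        self.calls += 1
        if len(history) < len(self.script):
            return self.script[len(history)]
        return None  # None signals that the agent loop is finished


class JournalRuntime:
    """Stand-in outer harness: an in-memory journal instead of durable storage."""

    def __init__(self) -> None:
        self.journal: dict[str, Optional[str]] = {}

    def run_step(self, key: str, fn: Callable[[], Optional[str]]) -> Optional[str]:
        if key not in self.journal:
            self.journal[key] = fn()  # execute once, then record the result
        return self.journal[key]


def drive(agent: InnerAgent, runtime: OuterRuntime, run_id: str) -> list[str]:
    """The only place where the two harnesses meet."""
    history: list[str] = []
    while True:
        key = f"{run_id}/step-{len(history)}"
        result = runtime.run_step(key, lambda: agent.next_step(history))
        if result is None:
            return history
        history.append(result)
```

Because `drive` depends only on the two protocols, a different agent implementation or a different runtime adapter can be substituted without touching the other side, which is the property the pattern is meant to protect.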

What Changed Over Time

  • 2024: Long-running agents on elastic compute meant losing work to pod eviction. No standard answer.
  • 2025: Temporal adapted for agent workloads; Restate emerged with strongly consistent virtual objects.
  • 2026: Kitaru (ZenML) shipped explicitly as the agent-shaped outer harness. Amazon Bedrock AgentCore Runtime launched as the first managed product. The Inner / Outer Harness Pattern became consensus framing.
  • Forward: Mature durable-runtime adoption makes spot-instance agent economics viable (60–80% cost reduction at moderate eviction rates).
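
The break-even inequality from step 1 of the Decision Path is the arithmetic behind this kind of spot-instance decision, since moving to spot capacity mainly raises the eviction-probability term. The sketch below encodes that inequality directly. It rests on assumptions: "tail length" is read here as the fraction of a run's cost lost when an eviction lands, the runtime's operational cost is amortized per run, and every number in the usage lines is hypothetical rather than measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunEconomics:
    per_run_cost: float          # cost of one complete run (compute plus tokens)
    eviction_probability: float  # chance that a run is evicted before finishing
    tail_fraction: float         # assumed reading of "tail length": share of the run lost
    runtime_cost_per_run: float  # durable runtime operating cost, amortized per run


def expected_loss_without_runtime(e: RunEconomics) -> float:
    """Left-hand side of the inequality: expected rework cost per run."""
    return e.per_run_cost * e.eviction_probability * e.tail_fraction


def runtime_pays_off(e: RunEconomics) -> bool:
    """True when expected rework cost exceeds the runtime's per-run cost."""
    return expected_loss_without_runtime(e) > e.runtime_cost_per_run


# Hypothetical inputs: a long, expensive agent run versus a short, cheap one.
long_run = RunEconomics(
    per_run_cost=40.0, eviction_probability=0.25,
    tail_fraction=0.5, runtime_cost_per_run=1.0,
)
short_run = RunEconomics(
    per_run_cost=0.02, eviction_probability=0.25,
    tail_fraction=0.5, runtime_cost_per_run=1.0,
)
```

With these made-up inputs, `runtime_pays_off(long_run)` is true and `runtime_pays_off(short_run)` is false, matching the guidance that short, idempotent agents may not need a durable layer at all.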

Sources