Agent State Loss on Pod Eviction

Definition

What it is

A pain point characteristic of long-running autonomous agents deployed on elastic compute substrates (Kubernetes, AWS Fargate, Cloud Run, Lambda, EC2 spot instances). Any infrastructure-initiated interruption (pod eviction during a rolling deploy, spot-instance reclamation, function timeout, autoscaler downscale, node failure) destroys *all* in-memory agent state and forces the run to restart from step zero. Every token spent before the interruption is wasted, and end-to-end latency grows by the time that had already elapsed when the failure hit.
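The cost described above can be made concrete with a small worked example. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, the per-step token and latency figures are illustrative rather than measured, and the multiplier uses a deliberately simple model in which a run is interrupted at most once.

```python
from dataclasses import dataclass


@dataclass
class InterruptedRun:
    """Hypothetical description of one agent run that was interrupted once."""
    total_steps: int            # steps needed to finish the run
    steps_done_at_failure: int  # steps completed when the interruption hit
    tokens_per_step: int        # illustrative average, not a measured value
    seconds_per_step: float     # illustrative average, not a measured value


def restart_from_zero_waste(run: InterruptedRun) -> tuple[int, float]:
    """Return (wasted_tokens, added_latency_seconds) when all state is lost.

    Everything done before the interruption has to be redone, so the waste
    is simply the work completed up to the failure point.
    """
    wasted_tokens = run.steps_done_at_failure * run.tokens_per_step
    added_latency = run.steps_done_at_failure * run.seconds_per_step
    return wasted_tokens, added_latency


def expected_cost_multiplier(eviction_rate: float, lost_fraction: float) -> float:
    """Expected compute cost relative to an uninterrupted run.

    Assumed model: each run is interrupted at most once, with probability
    `eviction_rate`, and loses `lost_fraction` of a full run's work.
    """
    return 1.0 + eviction_rate * lost_fraction


# A 20-step run interrupted after 15 steps, with made-up per-step figures.
run = InterruptedRun(total_steps=20, steps_done_at_failure=15,
                     tokens_per_step=2000, seconds_per_step=10.0)
wasted_tokens, added_latency = restart_from_zero_waste(run)
print(wasted_tokens, added_latency)
```

Under this assumed model, a 50% eviction rate that loses a whole run's work gives a multiplier of 1.5, and losing only a tenth of a run gives 1.05. Those values line up with the vendor figures quoted under Latest signals, but the source does not say how the vendors derive theirs.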

Recent developments

Latest signals

Durable Agent Runtimes (Kitaru, Temporal, Restate, AWS Step Functions Express, Inngest) are the architectural answer. Step-boundary checkpoints to S3, combined with resume-from-last-success, make pod eviction recoverable rather than catastrophic. Per ZenML: Kitaru product page.
The "restart-tax" framing has become the load-bearing concept in agent-infrastructure marketing. Vendors quantify their value proposition as the percentage reduction in restart cost: at a 50% pod-eviction rate, effective compute cost is roughly 1.5x without a durable runtime and roughly 1.05x with one. Per Pydantic: Runtime layer for Pydantic AI agents.
Spot-instance economics are now viable for agents. With Kitaru-style runtimes underneath, agent workloads can run on cheap but volatile substrates with no operational penalty, and the cost per completed run drops 60-80% at moderate eviction rates. Per AWS Builder Center: Building AI Agents from Zero to Hero.
The FAME architecture provides a parallel mitigation path for serverless deployments. Bi-tier state routing (hot state to a KV store, cold state to S3) lets stateless functions resume cheaply after a timeout. Per arXiv 2601.14735: FAME.
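The step-boundary checkpoint and resume-from-last-success pattern mentioned in the first signal can be sketched as follows. This is an illustration only, not the API of Kitaru, Temporal, or any other runtime named here: every name is hypothetical, a local JSON file stands in for an S3 object, and the eviction is simulated with an exception.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional

Step = Callable[[dict], dict]


class Evicted(Exception):
    """Simulated infrastructure interruption (stand-in for a pod eviction)."""


def run_with_checkpoints(steps: list[Step], checkpoint: Path,
                         evict_before_step: Optional[int] = None) -> dict:
    """Run steps in order, persisting state after each successful step."""
    # Resume from the last successful step if a checkpoint already exists.
    if checkpoint.exists():
        saved = json.loads(checkpoint.read_text())
    else:
        saved = {"next_step": 0, "state": {}}
    state = saved["state"]

    for index in range(saved["next_step"], len(steps)):
        if evict_before_step is not None and index == evict_before_step:
            raise Evicted(f"interrupted before step {index}")
        state = steps[index](state)
        # Write to a temp file, then rename, so an interruption mid-write
        # never leaves a half-written checkpoint behind.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps({"next_step": index + 1, "state": state}))
        tmp.replace(checkpoint)
    return state


calls: list[str] = []  # records which steps actually executed


def make_step(name: str) -> Step:
    def step(state: dict) -> dict:
        calls.append(name)
        return {**state, name: "done"}
    return step


steps = [make_step(n) for n in ("plan", "search", "draft", "review")]

with tempfile.TemporaryDirectory() as workdir:
    ckpt = Path(workdir) / "run-001.json"
    try:
        run_with_checkpoints(steps, ckpt, evict_before_step=3)
    except Evicted:
        pass  # a replacement worker would pick the run up from here
    final_state = run_with_checkpoints(steps, ckpt)

print(calls)
```

In this sketch the second invocation re-executes only the step that had not yet completed, so each step runs once. A production runtime would also need idempotent steps or deduplication, since a step can finish its side effects and still be interrupted before its checkpoint is written.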

Connections (8)

Outbound (4)

scoped_to (1): AI Runtime Infrastructure

constrained_by (3): Durable Agent Runtime, Kitaru, FAME Architecture

Inbound (4)

solves (4): Kitaru, MCP Tasks Primitive (SEP-1686), Durable Agent Runtime, FAME Architecture
