
Crash Recovery — Building Agents That Survive Failures

Core 8 min +35 XP
💡
THE ANALOGY

Think of a flight plan with checkpoints. If the plane has to divert due to weather, it can resume from the last checkpoint rather than returning to the departure airport. Your agent's state checkpoints are those waypoints: after a crash, the agent resumes from the last known good state.

⚠️ EXAM TRAP — The Wrong Answer People Choose

Assuming that crash recovery requires complex infrastructure. The simplest effective pattern is to save state after every iteration and load it on startup. If the agent crashed, it loads the last saved state and continues from there.
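
The save-then-load pattern can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the JSON file layout, the function names (save_state, load_state, run_agent), and the step callable (standing in for one model call plus tool execution) are all assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write state atomically: temp file in the same directory, then rename."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step, so a crash mid-write
    # never leaves a half-written checkpoint at `path`.
    os.replace(tmp, path)


def load_state(path: Path) -> dict:
    """Return the saved state if one exists, otherwise a fresh state."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"iteration": 0, "messages": [], "done": False}


def run_agent(path: Path, step, max_iterations: int = 10) -> dict:
    """Run the loop, checkpointing after every iteration.

    `step` is a hypothetical callable: it takes the state dict and
    returns the updated state dict for one loop iteration.
    """
    state = load_state(path)  # resumes automatically after a crash
    while not state["done"] and state["iteration"] < max_iterations:
        state = step(state)
        state["iteration"] += 1
        save_state(path, state)  # checkpoint before the next iteration
    return state
```

If the process dies inside step, the file still holds the state from the previous iteration, so calling run_agent again with the same path repeats only the interrupted iteration.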

KEY POINTS
1 Crash recovery requires saving agent state (messages array + metadata) after every loop iteration, before returning control to the caller.
2 On agent startup, check for an existing saved state — if found, resume from that point rather than starting over.
3 Each agent in a multi-agent system should export its state independently — the coordinator loads a manifest of agent states on resume.
4 State export format matters: include enough context to reconstruct the agent's situation without the raw conversation history.
5 Idempotent tool calls are safe to re-execute on recovery — non-idempotent calls (sending emails, processing payments) require deduplication.
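
One way to deduplicate non-idempotent calls on recovery is a persistent ledger keyed by the call's identity. The sketch below is illustrative only: ToolLedger, run_once, and the call_id parameter are hypothetical names, and it assumes tool results are JSON-serializable.

```python
import hashlib
import json
from pathlib import Path


class ToolLedger:
    """Persistent record of completed non-idempotent tool calls (hypothetical)."""

    def __init__(self, path: Path):
        self.path = path
        # Reload earlier results so a restarted agent sees past calls.
        self.results = json.loads(path.read_text()) if path.exists() else {}

    @staticmethod
    def key(tool_name: str, args: dict, call_id: str) -> str:
        """Derive a stable key from the tool name, arguments, and call id."""
        payload = json.dumps([tool_name, args, call_id], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run_once(self, tool_name: str, args: dict, call_id: str, fn):
        """Execute fn(**args) unless this exact call already completed."""
        k = self.key(tool_name, args, call_id)
        if k in self.results:
            return self.results[k]  # replay the recorded result, no side effect
        result = fn(**args)
        self.results[k] = result
        self.path.write_text(json.dumps(self.results))
        return result
```

A crash between the tool call and the ledger write can still cause one duplicate. Where the downstream service accepts an idempotency key, passing the same key to it closes that gap.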