Durable runs — crash mid-workflow, resume where you left off¶
The most common question about agent runtimes: "what happens if a step fails in the middle?"
AgentFlow4J checkpoints after every node. If a step fails, the process crashes, or you deliberately pause for a human, you resume(runId) and execution continues from the node that didn't finish — the steps that already succeeded do not re-run (and their LLM calls aren't paid for again).
The store¶
State is persisted through a CheckpointStore. Two production backends ship in agentflow4j-checkpoint (JDBC and Redis); InMemoryCheckpointStore lives in the graph module for tests.
Typed state (StateKey<T>) is serialized with a Jackson codec. Register the keys you put in state so they survive the round-trip:
StateKey<String> ROUTE = StateKey.of("route", String.class);
JacksonCheckpointCodec codec =
new JacksonCheckpointCodec(new StateTypeRegistry().register(ROUTE));
JdbcCheckpointStore store =
new JdbcCheckpointStore(jdbcTemplate, new DataSourceTransactionManager(dataSource), codec);
store.createTableIfMissing();
(For Redis: new RedisCheckpointStore(redisOperations, codec) — same CheckpointStore interface, nothing else changes.)
Run with a runId¶
AgentGraph graph = AgentGraph.builder()
.addNode("plan", planAgent)
.addNode("approve", approveAgent) // interrupts for human sign-off
.addNode("dispatch", dispatchAgent)
.addEdge("plan", "approve")
.addEdge("approve", "dispatch")
.checkpointStore(store)
.build();
AgentResult result = graph.invoke(AgentContext.of("ticket-42"), "run-42");
plan runs and writes ROUTE=team-A to state. approve returns AgentResult.interrupted("need human approval"). The graph persists a checkpoint whose nextNode is approve, with the state captured, and returns — dispatch never fires.
result.isInterrupted(); // true
store.load("run-42").orElseThrow().nextNode(); // "approve"
store.load("run-42").orElseThrow().context().get(ROUTE); // "team-A" — survived serialization
Resume — even after a full restart¶
The checkpoint is the only thing that needs to survive. You can rebuild the graph from scratch (new process, new agent instances) and resume off the store:
// fresh JVM, brand-new AgentGraph + agents, same CheckpointStore
AgentResult done = graph.resume("run-42", new UserMessage("approved by alice"));
done.text(); // "dispatched to team-A"
What happens:
- plan does not re-run — it already completed before the checkpoint
- approve re-enters (it's the node that interrupted), now sees the human's message and proceeds
- dispatch runs with the state restored from the checkpoint (ROUTE=team-A)
- on success, the checkpoint is deleted automatically
The three failure modes, handled¶
| Failure | Behaviour |
|---|---|
| Transient (network blip, rate limit) | RetryPolicy retries with backoff — RetryPolicy.exponential(3, Duration.ofMillis(200)), per-node override, predicate for which exceptions are worth retrying |
| Crash mid-run (JVM dies at step 3 of 5) | resume(runId) picks up at step 3; steps 1–2 don't re-run |
| Permanent (bad input, logic error) | ErrorPolicy decides: FAIL_FAST stops with the error surfaced, SKIP_NODE logs and continues, RETRY_ONCE tries once more |
Combine them: a flaky step retries, a crashed run resumes, an unrecoverable step stops cleanly — and the run log shows exactly which node failed and why.
See also¶
- Resilience & error handling — RetryPolicy, ErrorPolicy, circuit breaker
- Approval gate — the human-in-the-loop variant of "interrupt then resume"
- Run log — the timeline that makes a failed run debuggable