Durable runs: crash mid-workflow, resume where you left off

The most common question about agent runtimes: "what happens if a step fails in the middle?"

AgentFlow4J checkpoints after every node. If a step fails, the process crashes, or you deliberately pause for a human, you call resume(runId) and execution continues from the node that didn't finish. Steps that already succeeded do not re-run, so their LLM calls aren't paid for again.
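
The mechanism can be pictured in plain Java. This is a minimal sketch, not AgentFlow4J's implementation: `CheckpointLoopSketch`, its `Step` interface and the in-memory `checkpoints` map are hypothetical names invented for illustration. It only shows why recording the index of the next step after each success is enough to skip completed work on resume.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: these types are hypothetical, not AgentFlow4J classes.
class CheckpointLoopSketch {
    interface Step { void run(); }

    // runId -> index of the next step that still has to run
    static final Map<String, Integer> checkpoints = new HashMap<>();

    // Runs steps from the saved position, checkpointing after each success.
    static void run(String runId, List<Step> steps) {
        int start = checkpoints.getOrDefault(runId, 0);
        for (int i = start; i < steps.size(); i++) {
            steps.get(i).run();             // if this throws, the checkpoint still points at i
            checkpoints.put(runId, i + 1);  // step i is done; the next run starts at i + 1
        }
        checkpoints.remove(runId);          // run finished, nothing left to resume
    }

    // Simulates a crash in step2, then a resume, and returns the execution log.
    static List<String> demo() {
        checkpoints.clear();
        List<String> log = new ArrayList<>();
        boolean[] crashOnce = {true};
        List<Step> steps = List.of(
                () -> log.add("step1"),
                () -> {
                    if (crashOnce[0]) {
                        crashOnce[0] = false;
                        throw new IllegalStateException("simulated crash");
                    }
                    log.add("step2");
                },
                () -> log.add("step3"));
        try {
            run("run-1", steps);
        } catch (IllegalStateException crash) {
            // the "process" died; only the checkpoint map survives
        }
        run("run-1", steps);                // resume: starts at step2, step1 is skipped
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch `step1` appears in the log exactly once even though `run` is called twice, which is the property the real checkpoint store is there to provide across process restarts.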

The store

State is persisted through a CheckpointStore. Two production backends ship in agentflow4j-checkpoint (JDBC and Redis); InMemoryCheckpointStore lives in the graph module for tests.

Typed state (StateKey<T>) is serialized with a Jackson codec. Register the keys you put in state so they survive the round-trip:

StateKey<String> ROUTE = StateKey.of("route", String.class);

JacksonCheckpointCodec codec =
        new JacksonCheckpointCodec(new StateTypeRegistry().register(ROUTE));

JdbcCheckpointStore store =
        new JdbcCheckpointStore(jdbcTemplate, new DataSourceTransactionManager(dataSource), codec);
store.createTableIfMissing();

(For Redis: new RedisCheckpointStore(redisOperations, codec) — same CheckpointStore interface, nothing else changes.)
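
Why registration matters can be sketched without Jackson. The persisted form of a checkpoint carries only key names and serialized values, so something has to say which Java type each key decodes to. The class below is a hypothetical stand-in (`TypeRegistrySketch`, `register` and `decode` are not AgentFlow4J or Jackson APIs), and failing loudly on an unregistered key is an assumption of the sketch; the library may behave differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustration only: hypothetical stand-ins, not AgentFlow4J or Jackson classes.
class TypeRegistrySketch {
    // key name -> how to turn the stored text back into a typed value
    static final Map<String, Function<String, Object>> decoders = new HashMap<>();

    static void register(String key, Function<String, Object> decoder) {
        decoders.put(key, decoder);
    }

    // Decodes a persisted entry; an unregistered key has no known target type.
    static Object decode(String key, String raw) {
        Function<String, Object> decoder = decoders.get(key);
        if (decoder == null) {
            throw new IllegalArgumentException("no type registered for key: " + key);
        }
        return decoder.apply(raw);
    }

    public static void main(String[] args) {
        register("route", s -> s);               // String stays a String
        register("attempts", Integer::valueOf);  // text is parsed back into an Integer
        System.out.println(decode("route", "team-A"));
        System.out.println(decode("attempts", "3"));
    }
}
```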

Run with a runId

AgentGraph graph = AgentGraph.builder()
        .addNode("plan", planAgent)
        .addNode("approve", approveAgent)     // interrupts for human sign-off
        .addNode("dispatch", dispatchAgent)
        .addEdge("plan", "approve")
        .addEdge("approve", "dispatch")
        .checkpointStore(store)
        .build();

AgentResult result = graph.invoke(AgentContext.of("ticket-42"), "run-42");

plan runs and writes ROUTE=team-A to state. approve returns AgentResult.interrupted("need human approval"). The graph persists a checkpoint whose nextNode is approve, with the state captured, and returns — dispatch never fires.

result.isInterrupted();                          // true
store.load("run-42").orElseThrow().nextNode();   // "approve"
store.load("run-42").orElseThrow().context().get(ROUTE);  // "team-A" — survived serialization

Resume, even after a full restart

The checkpoint is the only thing that needs to survive. You can rebuild the graph from scratch (new process, new agent instances) and resume off the store:

// fresh JVM, brand-new AgentGraph + agents, same CheckpointStore
AgentResult done = graph.resume("run-42", new UserMessage("approved by alice"));

done.text();   // "dispatched to team-A"

What happens:

- plan does not re-run, because it already completed before the checkpoint
- approve re-enters (it's the node that interrupted), now sees the human's message and proceeds
- dispatch runs with the state restored from the checkpoint (ROUTE=team-A)
- on success, the checkpoint is deleted automatically
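
The same sequence can be traced in a self-contained sketch. Everything here is hypothetical (`InterruptResumeSketch`, `Checkpoint`, `Node`, and the convention that a node returns `false` to interrupt); it is not the AgentFlow4J API, only a model of the behaviour described: the checkpoint records the node to re-enter plus the state, and a later call with a human message continues from there.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: hypothetical types, not the AgentFlow4J API.
class InterruptResumeSketch {
    // What survives between processes: where to re-enter and the state so far.
    static final class Checkpoint {
        final String nextNode;
        final Map<String, String> state;
        Checkpoint(String nextNode, Map<String, String> state) {
            this.nextNode = nextNode;
            this.state = state;
        }
    }

    // A node sees the state and an optional human message; returning false means "interrupt".
    interface Node { boolean run(Map<String, String> state, String message); }

    static final Map<String, Checkpoint> store = new HashMap<>();
    static int planRuns = 0;

    // Builds the nodes from scratch, as a fresh process would.
    static Map<String, Node> nodes() {
        Map<String, Node> nodes = new LinkedHashMap<>();
        nodes.put("plan", (state, msg) -> { planRuns++; state.put("route", "team-A"); return true; });
        nodes.put("approve", (state, msg) -> msg != null);  // waits until a human message arrives
        nodes.put("dispatch", (state, msg) -> {
            state.put("result", "dispatched to " + state.get("route"));
            return true;
        });
        return nodes;
    }

    // Runs from the checkpoint (or the start); returns the result, or null if interrupted.
    static String run(String runId, String message) {
        Checkpoint cp = store.get(runId);
        Map<String, String> state = cp == null ? new HashMap<>() : new HashMap<>(cp.state);
        boolean reached = cp == null;
        for (Map.Entry<String, Node> e : nodes().entrySet()) {
            if (!reached && !e.getKey().equals(cp.nextNode)) continue;  // skip completed nodes
            reached = true;
            if (!e.getValue().run(state, message)) {
                store.put(runId, new Checkpoint(e.getKey(), new HashMap<>(state)));
                return null;                 // interrupted: later nodes never fire
            }
        }
        store.remove(runId);                 // success: the checkpoint is no longer needed
        return state.get("result");
    }

    public static void main(String[] args) {
        System.out.println(run("run-42", null));                 // interrupted at approve
        System.out.println(run("run-42", "approved by alice"));  // resumes and dispatches
    }
}
```

The second call rebuilds every node yet never invokes `plan` again, because the loop skips forward to the node named in the checkpoint.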

The three failure modes, handled

| Failure | Behaviour |
| --- | --- |
| Transient (network blip, rate limit) | `RetryPolicy` retries with backoff, e.g. `RetryPolicy.exponential(3, Duration.ofMillis(200))`; supports a per-node override and a predicate for which exceptions are worth retrying |
| Crash mid-run (JVM dies at step 3 of 5) | `resume(runId)` picks up at step 3; steps 1–2 don't re-run |
| Permanent (bad input, logic error) | `ErrorPolicy` decides: `FAIL_FAST` stops with the error surfaced, `SKIP_NODE` logs and continues, `RETRY_ONCE` tries once more |

Combine them: a flaky step retries, a crashed run resumes, an unrecoverable step stops cleanly — and the run log shows exactly which node failed and why.
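
To make the transient/permanent split concrete, here is a hand-rolled sketch of exponential backoff with a retry predicate. It is not AgentFlow4J's `RetryPolicy`: `BackoffRetrySketch` and its methods are hypothetical, and treating the first argument as the total number of attempts (rather than the number of retries) is an assumption of the sketch.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Illustration only: a hand-rolled helper, not AgentFlow4J's RetryPolicy.
class BackoffRetrySketch {
    // Calls the task up to maxAttempts times, doubling the delay after each retryable failure.
    static <T> T retry(Callable<T> task, int maxAttempts, Duration initialDelay,
                       Predicate<Exception> retryable) throws Exception {
        Duration delay = initialDelay;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                // Permanent errors and exhausted budgets are surfaced to the caller.
                if (attempt >= maxAttempts || !retryable.test(e)) throw e;
                Thread.sleep(delay.toMillis());
                delay = delay.multipliedBy(2);
            }
        }
    }

    // Runs a task that always fails with the given error and reports how many calls were made.
    static int attemptsFor(Exception error) {
        int[] calls = {0};
        try {
            retry(() -> { calls[0]++; throw error; },
                  3, Duration.ofMillis(1), e -> e instanceof IOException);
        } catch (Exception expected) {
            // the final failure is surfaced once retrying stops
        }
        return calls[0];
    }

    public static void main(String[] args) {
        System.out.println(attemptsFor(new IOException("network blip")));         // retried
        System.out.println(attemptsFor(new IllegalArgumentException("bad input"))); // not retried
    }
}
```

With these settings an `IOException` is attempted three times before it is surfaced, while an `IllegalArgumentException` is surfaced after the first call, since the predicate marks it as not worth retrying.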
