Fault tolerance
Every external call goes through a CircuitBreaker. Every retryable
error has a bounded retry policy with jittered exponential backoff.
Every event survives a NATS outage. A second process cannot race the
first onto the same bus.
This page collects all of those guardrails in one place.
CircuitBreaker
Source: crates/resilience/src/lib.rs.
A three-state machine wrapped around any fallible external call. Once a dependency is failing, the breaker fails fast instead of piling up calls against a dead endpoint; periodic probes let it recover without human intervention.
stateDiagram-v2
[*] --> Closed
Closed --> Open: 5 consecutive failures
Open --> HalfOpen: backoff elapsed
HalfOpen --> Closed: 2 consecutive successes
HalfOpen --> Open: any failure<br/>(backoff × 2, capped)
Defaults
| Field | Default | Meaning |
|---|---|---|
| failure_threshold | 5 | consecutive failures before opening |
| success_threshold | 2 | consecutive successes in HalfOpen before closing |
| initial_backoff | 10 s | wait time on first open |
| max_backoff | 120 s | cap on exponential backoff |
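As a minimal sketch of that machine, assuming the thresholds above (the real implementation is in crates/resilience/src/lib.rs and will differ in detail):

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open { opened_at: Instant },
    HalfOpen,
}

const FAILURE_THRESHOLD: u32 = 5;
const SUCCESS_THRESHOLD: u32 = 2;
const INITIAL_BACKOFF: Duration = Duration::from_secs(10);
const MAX_BACKOFF: Duration = Duration::from_secs(120);

pub struct CircuitBreaker {
    state: State,
    failures: u32,
    successes: u32,
    backoff: Duration, // current open window; doubles on re-open, capped
}

impl CircuitBreaker {
    pub fn new() -> Self {
        Self { state: State::Closed, failures: 0, successes: 0, backoff: INITIAL_BACKOFF }
    }

    /// Gate a call. Open flips to HalfOpen once the backoff window has
    /// elapsed, letting probe traffic through.
    pub fn allow(&mut self) -> bool {
        if let State::Open { opened_at } = self.state {
            if opened_at.elapsed() < self.backoff {
                return false; // fail fast instead of hitting a dead endpoint
            }
            self.state = State::HalfOpen;
            self.successes = 0;
        }
        true
    }

    pub fn on_success(&mut self) {
        self.failures = 0;
        if self.state == State::HalfOpen {
            self.successes += 1;
            if self.successes >= SUCCESS_THRESHOLD {
                self.state = State::Closed;
                self.backoff = INITIAL_BACKOFF; // recovered: reset the window
            }
        }
    }

    pub fn on_failure(&mut self) {
        match self.state {
            State::HalfOpen => {
                // Probe failed: re-open with a doubled, capped backoff.
                self.backoff = (self.backoff * 2).min(MAX_BACKOFF);
                self.state = State::Open { opened_at: Instant::now() };
            }
            State::Closed => {
                self.failures += 1;
                if self.failures >= FAILURE_THRESHOLD {
                    self.state = State::Open { opened_at: Instant::now() };
                }
            }
            State::Open { .. } => {}
        }
    }
}
```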
Where it wraps
- LLM calls — one circuit per provider (MiniMax, Anthropic, OpenAI-compat, Gemini). A provider outage doesn't cascade to others.
- NATS publish — one circuit over the broker. When it opens the disk queue absorbs writes.
- CDP commands — one circuit per browser session. A dead Chrome doesn't freeze the agent loop.
- Extension stdio — implicit via the StdioRuntime lifecycle (crashed child → respawn, bounded).
Signals
CircuitBreaker exposes the usual methods (allow(), on_success(),
on_failure()) plus two explicit overrides:
- trip() — force Open from outside (e.g. a health check decided the dep is down before a call fails)
- reset() — force Closed (e.g. the operator just restored the dep and doesn't want to wait for the probe window)
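Continuing the sketch above: only the method names come from the source, the bodies are one plausible reading.

```rust
impl CircuitBreaker {
    /// Force Open without waiting for failures (e.g. after a failed health check).
    pub fn trip(&mut self) {
        self.state = State::Open { opened_at: Instant::now() };
    }

    /// Force Closed and clear all counters (the operator knows the dep is back).
    pub fn reset(&mut self) {
        self.state = State::Closed;
        self.failures = 0;
        self.successes = 0;
        self.backoff = INITIAL_BACKOFF;
    }
}

/// Hypothetical call site tying the three core methods together.
fn guarded_call<T, E>(
    cb: &mut CircuitBreaker,
    call: impl FnOnce() -> Result<T, E>,
) -> Option<Result<T, E>> {
    if !cb.allow() {
        return None; // circuit open: fail fast without touching the dep
    }
    let result = call();
    match &result {
        Ok(_) => cb.on_success(),
        Err(_) => cb.on_failure(),
    }
    Some(result)
}
```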
Retry policies
Retries live at a layer above the circuit breaker — they handle transient failures (429, 5xx, network blips) that don't warrant flipping the breaker. Every retry policy uses jittered exponential backoff to avoid thundering-herd reconnection storms.
| Component | Max attempts | Backoff range |
|---|---|---|
| LLM 429 (rate limit) | 5 | 1 s → 60 s, jittered exponential |
| LLM 5xx (server error) | 3 | 1 s → 30 s, jittered exponential |
| NATS publish drain | 3 per event | disk queue drain cycle |
| CDP | via circuit only | backoff = circuit's open window |
These live in crates/llm/src/retry.rs (LLM) and
crates/broker/src/disk_queue.rs (NATS drain).
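A sketch of what a full-jitter policy of this shape looks like, plugged with the 429 row's numbers; the helper names are illustrative, not taken from crates/llm/src/retry.rs:

```rust
use rand::Rng; // rand = "0.8"
use std::time::Duration;

/// Full-jitter exponential backoff: pick a uniform random delay inside an
/// exponentially growing, capped window, so many clients retrying at once
/// don't reconnect in lockstep.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let window = base.saturating_mul(2u32.saturating_pow(attempt)).min(cap);
    let ms = rand::thread_rng().gen_range(0..=window.as_millis() as u64);
    Duration::from_millis(ms)
}

/// Bounded retry loop. For the LLM 429 row: max_attempts = 5, 1 s → 60 s.
fn retry<T, E>(
    max_attempts: u32,
    base: Duration,
    cap: Duration,
    mut call: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match call() {
            Ok(v) => return Ok(v),
            Err(e) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(e); // budget exhausted: surface the error
                }
                std::thread::sleep(backoff_delay(attempt, base, cap));
            }
        }
    }
}
```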
Error classification
Retries only trigger on retryable errors. A 4xx other than 429 — missing key, invalid model, malformed request — fails fast. The rationale: retrying a misconfigured call wastes budget and still fails. Fail loudly, fix the config.
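The whole rule fits in one predicate; this is a hypothetical helper mirroring the classification above, not a function from the codebase:

```rust
/// 429 and 5xx are transient and worth retrying; every other status
/// (401/403/404/422, ...) is a config bug that should surface immediately.
fn is_retryable(status: u16) -> bool {
    matches!(status, 429 | 500..=599)
}
```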
No message drop
The broker layer guarantees at-least-once delivery for publishes that reach the runtime:
flowchart LR
P[publisher] --> TRY{NATS healthy?}
TRY -->|yes| NATS[(NATS)]
TRY -->|no| DQ[(disk queue)]
DQ --> WAIT{reconnect?}
WAIT -->|yes| DRAIN[drain FIFO]
DRAIN --> NATS
DQ -->|3 failed attempts| DLQ[(dead letters)]
DLQ --> CLI[agent dlq replay]
In the absolute worst case — NATS down forever, disk full — the disk queue starts shedding the oldest events once it hits its hard cap, but the producer never crashes and never silently drops an event.
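A sketch of the publish path and drain cycle under that contract; `Nats` and `DiskQueue` here are illustrative stand-ins, not the broker crate's real types:

```rust
use std::collections::VecDeque;

struct Nats; // stand-in for the real broker client
impl Nats {
    fn publish(&self, _payload: &[u8]) -> Result<(), ()> { Ok(()) }
}

struct DiskQueue {
    fifo: VecDeque<Vec<u8>>,
    dead_letters: Vec<Vec<u8>>,
}

/// Happy path goes straight to NATS; on error the event is persisted to
/// the FIFO disk queue instead of being dropped.
fn publish_or_queue(nats: &Nats, dq: &mut DiskQueue, event: Vec<u8>) {
    if nats.publish(&event).is_err() {
        dq.fifo.push_back(event);
    }
}

/// Drain cycle after reconnect: replay in order, and move an event to the
/// dead-letter queue after 3 failed attempts (replayable via `agent dlq replay`).
fn drain(nats: &Nats, dq: &mut DiskQueue) {
    while let Some(event) = dq.fifo.pop_front() {
        let delivered = (0..3).any(|_| nats.publish(&event).is_ok());
        if !delivered {
            dq.dead_letters.push(event);
        }
    }
}
```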
Single-instance lockfile
A second agent process pointed at the same data directory would
double-subscribe every topic, delivering every message twice. To
prevent that, boot acquires a lockfile and kicks out any stale or
racing instance.
Source: src/main.rs::acquire_single_instance_lock.
flowchart TD
START[agent boot] --> READ[read data/agent.lock]
READ --> EXIST{file exists?}
EXIST -->|no| WRITE[write our PID]
EXIST -->|yes| PID[parse PID]
PID --> ALIVE{/proc/PID/ exists?}
ALIVE -->|no| WRITE
ALIVE -->|yes| SIGTERM[send SIGTERM]
SIGTERM --> WAIT[wait up to 5 s<br/>50 × 100 ms polls]
WAIT --> DEAD{process gone?}
DEAD -->|yes| WRITE
DEAD -->|no| SIGKILL[send SIGKILL]
SIGKILL --> WRITE
WRITE --> LOCK[RAII handle alive]
The SingleInstanceLock RAII struct stores our own PID. On drop it
only removes the lockfile if the stored PID still matches the current
one — so a takeover by a third process doesn't let the original
owner wipe the lock on its way out.
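A sketch of that acquire/drop pair: only the struct name and the PID-match rule come from the source, the rest is illustrative.

```rust
use std::{fs, path::PathBuf, process};

struct SingleInstanceLock {
    path: PathBuf,
    pid: u32, // our PID at acquisition time
}

impl SingleInstanceLock {
    fn acquire(path: PathBuf) -> std::io::Result<Self> {
        // The stale-PID / SIGTERM / SIGKILL dance from the flowchart
        // would happen here, before the write.
        let pid = process::id();
        fs::write(&path, pid.to_string())?;
        Ok(Self { path, pid })
    }
}

impl Drop for SingleInstanceLock {
    fn drop(&mut self) {
        // Delete the lockfile only if it still names us. If a third
        // process has taken over, the file holds its PID, not ours,
        // so we leave it alone.
        if let Ok(contents) = fs::read_to_string(&self.path) {
            if contents.trim().parse::<u32>().ok() == Some(self.pid) {
                let _ = fs::remove_file(&self.path);
            }
        }
    }
}
```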
Graceful shutdown
See Agent runtime — Graceful shutdown for the ordered teardown sequence. Key points from a fault-tolerance angle:
- Dream-sweep loops and MCP sessions get explicit grace windows so in-flight work doesn't produce partial state
- Plugin intake is stopped before agent runtimes — the runtimes drain anything already in their mailboxes before exiting
- If the disk queue has unflushed events on SIGTERM, they survive to the next boot
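A toy illustration of the second bullet using a plain std channel (all names illustrative): closing the intake side lets the runtime drain its mailbox naturally before exiting.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (intake, mailbox) = mpsc::channel::<String>();

    let runtime = thread::spawn(move || {
        // recv() keeps yielding queued events, then errors once the
        // intake side is gone: a natural drain-then-exit.
        while let Ok(event) = mailbox.recv() {
            println!("handled: {event}");
        }
    });

    intake.send("in-flight event".into()).unwrap();
    drop(intake);            // 1. stop intake before the runtimes
    runtime.join().unwrap(); // 2. runtime drains its mailbox, then exits
    // 3. (not shown) flush the disk queue so unflushed events survive
}
```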
Operator guardrails
Beyond the automatic mechanisms:
- Skill gating — an extension declaring requires.env = ["FOO"] is skipped at discovery when FOO is unset, instead of being registered and failing on every invocation. See Extensions — manifest.
- Inbound filter — events with neither text nor media (receipts, typing indicators, reactions-only) are dropped before they reach the LLM, saving cost and avoiding noisy turns; a sketch follows this list.
- Health endpoints — :8080/ready and :8080/live expose lifecycle state for k8s liveness / readiness probes.
- Metrics — :9090/metrics (Prometheus) exposes everything from inbound event counts to circuit breaker state; see Metrics.
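One plausible shape for the inbound filter's predicate (hypothetical types; only the drop rule comes from the source):

```rust
struct InboundEvent {
    text: Option<String>,
    media: Vec<Vec<u8>>, // attached payloads, if any
}

/// Only events carrying text or media reach the LLM; receipts, typing
/// indicators, and bare reactions fail this check and are dropped.
fn reaches_llm(ev: &InboundEvent) -> bool {
    let has_text = ev.text.as_deref().map_or(false, |t| !t.trim().is_empty());
    has_text || !ev.media.is_empty()
}
```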