Fault tolerance

Every external call goes through a CircuitBreaker. Every retryable error has a bounded retry policy with jittered exponential backoff. Every event survives a NATS outage. A second process cannot race the first onto the same bus.

This page collects all of those guardrails in one place.

CircuitBreaker

Source: crates/resilience/src/lib.rs.

A three-state machine wrapped around any fallible external call. Once a dependency is failing, the breaker fails fast instead of piling up calls against a dead endpoint; periodic probes let it recover without human intervention.

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: 5 consecutive failures
    Open --> HalfOpen: backoff elapsed
    HalfOpen --> Closed: 2 consecutive successes
    HalfOpen --> Open: any failure<br/>(backoff × 2, capped)

Defaults

| Field | Default | Meaning |
| --- | --- | --- |
| failure_threshold | 5 | consecutive failures before opening |
| success_threshold | 2 | consecutive successes in HalfOpen before closing |
| initial_backoff | 10 s | wait time on first open |
| max_backoff | 120 s | cap on exponential backoff |
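
For orientation, here is a minimal sketch of that state machine. Field and method names follow the table above and the signals listed below, but it is only an illustration; the real implementation lives in crates/resilience/src/lib.rs and may differ.

```rust
use std::time::{Duration, Instant};

// Sketch only; the real type lives in crates/resilience/src/lib.rs.
#[derive(Clone, Copy)]
enum State {
    Closed { failures: u32 },
    Open { until: Instant },
    HalfOpen { successes: u32 },
}

struct CircuitBreaker {
    state: State,
    backoff: Duration,         // current open window, doubled on repeated failure
    failure_threshold: u32,    // 5
    success_threshold: u32,    // 2
    initial_backoff: Duration, // 10 s
    max_backoff: Duration,     // 120 s
}

impl CircuitBreaker {
    /// Open -> HalfOpen once the backoff window has elapsed; otherwise fail fast.
    fn allow(&mut self) -> bool {
        match self.state {
            State::Open { until } if Instant::now() >= until => {
                self.state = State::HalfOpen { successes: 0 };
                true
            }
            State::Open { .. } => false,
            _ => true,
        }
    }

    /// HalfOpen -> Closed after `success_threshold` consecutive successes.
    fn on_success(&mut self) {
        self.state = match self.state {
            State::HalfOpen { successes } if successes + 1 >= self.success_threshold => {
                State::Closed { failures: 0 }
            }
            State::HalfOpen { successes } => State::HalfOpen { successes: successes + 1 },
            _ => State::Closed { failures: 0 },
        };
    }

    /// Closed -> Open after `failure_threshold` consecutive failures;
    /// HalfOpen -> Open on any failure, with the backoff doubled and capped.
    fn on_failure(&mut self) {
        match self.state {
            State::Closed { failures } if failures + 1 < self.failure_threshold => {
                self.state = State::Closed { failures: failures + 1 };
            }
            State::Closed { .. } => {
                self.backoff = self.initial_backoff;
                self.state = State::Open { until: Instant::now() + self.backoff };
            }
            _ => {
                self.backoff = (self.backoff * 2).min(self.max_backoff);
                self.state = State::Open { until: Instant::now() + self.backoff };
            }
        }
    }
}
```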

Where it wraps

  • LLM calls — one circuit per provider (MiniMax, Anthropic, OpenAI-compat, Gemini). A provider outage doesn't cascade to others.
  • NATS publish — one circuit over the broker. When it opens the disk queue absorbs writes.
  • CDP commands — one circuit per browser session. A dead Chrome doesn't freeze the agent loop.
  • Extension stdio — implicit via the StdioRuntime lifecycle (crashed child → respawn, bounded).

Signals

CircuitBreaker exposes the usual methods (allow(), on_success(), on_failure()) plus two explicit overrides:

  • trip() — force Open from outside (e.g. a health check decided the dep is down before a call fails)
  • reset() — force Closed (e.g. the operator just restored the dep and doesn't want to wait for the probe window)
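
To show how those signals compose at a call site, here is a hypothetical wrapper built on the CircuitBreaker sketch above; the function and GuardError type are illustrative, not part of the real API.

```rust
// Illustrative call-site pattern; the real wrappers sit next to each external
// client (LLM provider, NATS publisher, CDP session).
enum GuardError<E> {
    Open,     // circuit refused the call: fail fast
    Inner(E), // the call itself failed
}

async fn guarded_call<T, E, Fut>(
    breaker: &mut CircuitBreaker,
    call: Fut,
) -> Result<T, GuardError<E>>
where
    Fut: std::future::Future<Output = Result<T, E>>,
{
    if !breaker.allow() {
        return Err(GuardError::Open); // don't pile onto a dead dependency
    }
    match call.await {
        Ok(v) => {
            breaker.on_success();
            Ok(v)
        }
        Err(e) => {
            breaker.on_failure();
            Err(GuardError::Inner(e))
        }
    }
}
```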

Retry policies

Retries live at a layer above the circuit breaker — they handle transient failures (429, 5xx, network blips) that don't warrant flipping the breaker. Every retry policy uses jittered exponential backoff to avoid thundering-herd reconnection storms.

| Component | Max attempts | Backoff range |
| --- | --- | --- |
| LLM 429 (rate limit) | 5 | 1 s → 60 s, jittered exponential |
| LLM 5xx (server error) | 3 | 1 s → 30 s, jittered exponential |
| NATS publish drain | 3 per event | disk queue drain cycle |
| CDP | via circuit only | backoff = circuit's open window |

These live in crates/llm/src/retry.rs (LLM) and crates/broker/src/disk_queue.rs (NATS drain).
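
The general shape of such a policy is sketched below with illustrative names, assuming tokio for sleeping and the rand crate (0.8-style API) for jitter; the real code in crates/llm/src/retry.rs may be structured differently.

```rust
use rand::Rng;
use std::time::Duration;

// Jittered exponential backoff retry loop; constants in the comments follow
// the "LLM 429" row of the table above.
async fn retry_with_backoff<T, E, F, Fut>(
    max_attempts: u32, // e.g. 5 for 429s, 3 for 5xx
    base: Duration,    // 1 s
    cap: Duration,     // 60 s for 429s, 30 s for 5xx
    is_retryable: impl Fn(&E) -> bool,
    mut call: F,
) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut attempt = 0;
    loop {
        match call().await {
            Ok(v) => return Ok(v),
            Err(e) if attempt + 1 < max_attempts && is_retryable(&e) => {
                // Exponential backoff capped at `cap`, with full jitter so a
                // fleet of clients doesn't reconnect in lockstep.
                let exp = base.saturating_mul(2u32.saturating_pow(attempt)).min(cap);
                let jittered = exp.mul_f64(rand::thread_rng().gen_range(0.0..1.0));
                tokio::time::sleep(jittered).await;
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}
```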

Error classification

Retries only trigger on retryable errors. A 4xx other than 429 — missing key, invalid model, malformed request — fails fast. The rationale: retrying a misconfigured call wastes budget and still fails. Fail loudly, fix the config.
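
The classification itself can be as small as a status-code match; a sketch (name illustrative, the real mapping lives with each client):

```rust
// Illustrative: only rate limits and server errors are worth retrying.
fn is_retryable_status(status: u16) -> bool {
    match status {
        429 => true,        // rate limited: back off and retry
        500..=599 => true,  // server error: likely transient
        _ => false,         // other 4xx (bad key, bad model, bad request): fail fast
    }
}
```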

No message drop

The broker layer guarantees at-least-once delivery for publishes that reach the runtime:

flowchart LR
    P[publisher] --> TRY{NATS healthy?}
    TRY -->|yes| NATS[(NATS)]
    TRY -->|no| DQ[(disk queue)]
    DQ --> WAIT{reconnect?}
    WAIT -->|yes| DRAIN[drain FIFO]
    DRAIN --> NATS
    DQ -->|3 failed attempts| DLQ[(dead letters)]
    DLQ --> CLI[agent dlq replay]

In the absolute worst case — NATS down forever, disk full — the disk queue starts shedding oldest events at its hard cap, but the producer never crashes and never silently drops.
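
A sketch of the publish path in the flowchart, reusing the CircuitBreaker sketch from earlier on this page. The queue is kept in memory here purely to show the control flow; the real crates/broker/src/disk_queue.rs persists to disk, and the NATS call is stubbed out.

```rust
use std::collections::VecDeque;

struct QueuedEvent {
    subject: String,
    payload: Vec<u8>,
    drain_attempts: u32,
}

// Illustrative broker shell; the real types live in crates/broker.
struct Broker {
    breaker: CircuitBreaker,        // NATS circuit, sketched above
    queue: VecDeque<QueuedEvent>,   // stand-in for the on-disk FIFO
    dead_letters: Vec<QueuedEvent>, // replayable via `agent dlq replay`
}

impl Broker {
    // Placeholder for a real NATS publish attempt.
    fn try_nats(&self, _subject: &str, _payload: &[u8]) -> Result<(), ()> {
        Err(())
    }

    /// Try NATS first; if the circuit is open or the publish fails, absorb the
    /// event into the FIFO. The producer never drops and never crashes.
    fn publish(&mut self, subject: &str, payload: Vec<u8>) {
        if self.breaker.allow() {
            match self.try_nats(subject, &payload) {
                Ok(()) => {
                    self.breaker.on_success();
                    return;
                }
                Err(()) => self.breaker.on_failure(),
            }
        }
        self.queue.push_back(QueuedEvent {
            subject: subject.to_string(),
            payload,
            drain_attempts: 0,
        });
    }

    /// Called after reconnect: drain the FIFO in order; after 3 failed attempts
    /// an event moves to the dead-letter store instead of retrying forever.
    fn drain_one(&mut self) {
        let Some(mut ev) = self.queue.pop_front() else { return };
        if self.try_nats(&ev.subject, &ev.payload).is_ok() {
            return;
        }
        ev.drain_attempts += 1;
        if ev.drain_attempts >= 3 {
            self.dead_letters.push(ev);
        } else {
            self.queue.push_front(ev); // keep FIFO order for the next cycle
        }
    }
}
```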

Single-instance lockfile

A second agent process pointed at the same data directory would double-subscribe every topic, delivering every message twice. To prevent that, boot acquires a lockfile and kicks out any stale or racing instance.

Source: src/main.rs::acquire_single_instance_lock.

flowchart TD
    START[agent boot] --> READ[read data/agent.lock]
    READ --> EXIST{file exists?}
    EXIST -->|no| WRITE[write our PID]
    EXIST -->|yes| PID[parse PID]
    PID --> ALIVE{/proc/PID/ exists?}
    ALIVE -->|no| WRITE
    ALIVE -->|yes| SIGTERM[send SIGTERM]
    SIGTERM --> WAIT[wait up to 5 s<br/>50 × 100 ms polls]
    WAIT --> DEAD{process gone?}
    DEAD -->|yes| WRITE
    DEAD -->|no| SIGKILL[send SIGKILL]
    SIGKILL --> WRITE
    WRITE --> LOCK[RAII handle alive]

The SingleInstanceLock RAII struct stores our own PID. On drop it only removes the lockfile if the stored PID still matches the current one — so a takeover by a third process doesn't let the original owner wipe the lock on its way out.
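
A sketch of that drop-side check; field names and the lock path are illustrative, and the real code is in src/main.rs.

```rust
use std::fs;
use std::path::PathBuf;

// Sketch of the RAII guard: Drop removes the lockfile only while it still
// holds our PID, so a takeover by a newer instance survives our exit.
struct SingleInstanceLock {
    path: PathBuf, // e.g. data/agent.lock
    pid: u32,      // our own PID, written at acquisition
}

impl Drop for SingleInstanceLock {
    fn drop(&mut self) {
        let still_ours = fs::read_to_string(&self.path)
            .ok()
            .and_then(|s| s.trim().parse::<u32>().ok())
            .map_or(false, |owner| owner == self.pid);
        if still_ours {
            let _ = fs::remove_file(&self.path);
        }
    }
}
```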

Graceful shutdown

See Agent runtime — Graceful shutdown for the ordered teardown sequence. Key points from a fault-tolerance angle:

  • Dream-sweep loops and MCP sessions get explicit grace windows so in-flight work doesn't produce partial state
  • Plugin intake is stopped before agent runtimes — the runtimes drain anything already in their mailboxes before exiting
  • If the disk queue has unflushed events on SIGTERM, they survive to the next boot

Operator guardrails

Beyond the automatic mechanisms:

  • Skill gating — an extension declaring requires.env = ["FOO"] is skipped at discovery when FOO is unset, instead of being registered and failing on every invocation. See Extensions — manifest.
  • Inbound filter — events with neither text nor media (receipts, typing indicators, reactions-only) are dropped before they reach the LLM, saving cost and avoiding noisy turns; a sketch of the check follows this list.
  • Health endpoints — :8080/ready and :8080/live expose lifecycle state for k8s liveness / readiness probes.
  • Metrics — :9090/metrics (Prometheus) exposes everything from inbound event counts to circuit breaker state; see Metrics.
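
The inbound filter is essentially a one-line predicate; a sketch with an illustrative event type (the real inbound type has more fields):

```rust
// Illustrative event shape for the sketch only.
struct InboundEvent {
    text: Option<String>,
    media: Vec<Vec<u8>>, // attachments as opaque blobs here
}

// Receipts, typing indicators, and reaction-only events carry neither text
// nor media, so they are dropped before the LLM ever sees them.
fn should_reach_llm(event: &InboundEvent) -> bool {
    let has_text = event.text.as_deref().map_or(false, |t| !t.trim().is_empty());
    has_text || !event.media.is_empty()
}
```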