Config hot-reload

Operators rotate per-agent knobs (allowlists, model strings, prompts, rate limits, delegation gates) without restarting the daemon. Sessions currently handling a message finish their turn on the old snapshot; the next event picks up the new one (apply-on-next-message). Plugin configs (whatsapp.yaml, telegram.yaml, …) are not hot-reloadable yet — see limitations.

What triggers a reload

| Trigger | Source |
|---|---|
| File save under config/ | notify-based watcher, debounced 500 ms |
| agent reload CLI | Publishes control.reload on the broker |
| Direct broker publish | Any integration can emit control.reload |
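
To make the 500 ms debounce concrete, here is a minimal sketch of how a burst of file saves collapses into a single reload. It is a pure function over timestamps, not the notify-debouncer-full implementation; the name `coalesce` and the trailing-edge behavior (fire once the stream has been quiet for a full window) are assumptions for illustration.

```rust
/// Hypothetical helper: collapse a sorted list of file-event timestamps
/// (milliseconds) into the times at which a reload would fire, assuming a
/// trailing-edge debounce that fires after `window_ms` of quiet.
fn coalesce(events_ms: &[u64], window_ms: u64) -> Vec<u64> {
    let mut fires = Vec::new();
    let mut iter = events_ms.iter().copied().peekable();
    while let Some(t) = iter.next() {
        match iter.peek() {
            // The next event lands inside the quiet window: keep waiting.
            Some(&next) if next.saturating_sub(t) < window_ms => continue,
            // Quiet for a full window (or the stream ended): fire once.
            _ => fires.push(t + window_ms),
        }
    }
    fires
}
```

Under these assumptions, three saves within 300 ms followed by a later save produce two reloads rather than four.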

What's reloaded

Files watched by default (paths relative to the config dir):

  • agents.yaml
  • agents.d/ (recursive)
  • llm.yaml
  • runtime.yaml

Extra paths listed under runtime.reload.extra_watch_paths are appended to the list.
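
A small sketch of how the effective watch list could be assembled. The function name `watch_list` is hypothetical, and resolving relative extra paths against the config dir is an assumption; this page does not say how extras are resolved.

```rust
use std::path::{Path, PathBuf};

/// Built-in watch entries named on this page, relative to the config dir.
const BUILTIN_WATCH: [&str; 4] = ["agents.yaml", "agents.d", "llm.yaml", "runtime.yaml"];

/// Hypothetical helper: built-in entries first, then `extra_watch_paths`
/// appended in order. Relative extras are joined onto the config dir
/// (assumption); absolute extras are kept as-is by `Path::join`.
fn watch_list(config_dir: &Path, extra: &[PathBuf]) -> Vec<PathBuf> {
    let mut paths: Vec<PathBuf> = BUILTIN_WATCH.iter().map(|p| config_dir.join(p)).collect();
    paths.extend(extra.iter().map(|p| config_dir.join(p)));
    paths
}
```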

The fields that apply live without a restart:

| Field | Location | Effect |
|---|---|---|
| allowed_tools (agent + binding) | agents.d/*.yaml | Tool list visible to the LLM + per-call guard |
| outbound_allowlist | same | Defense-in-depth in whatsapp_send_* / telegram_send_* |
| skills | same | Skill blocks rendered into the system prompt |
| model.model (binding-level) | same | LLM model string on next turn |
| system_prompt + system_prompt_extra | same | System block composition |
| sender_rate_limit | same | Per-binding token bucket |
| allowed_delegates | same | Delegation ACL |
| providers.<name>.api_key | llm.yaml | Rotated via a fresh LlmClient on next turn |

Fields that require a restart (logged as warn during reload):

  • id, plugins, workspace, skills_dir, transcripts_dir
  • heartbeat.enabled, heartbeat.interval
  • config.debounce_ms, config.queue_cap
  • model.provider (binding-level provider must match agent provider — the LlmClient is wired once per agent)
  • broker.yaml, memory.yaml, mcp.yaml, extensions.yaml

Adding or removing an agent also requires a restart in this release; see limitations.
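
The split between live and restart-only fields can be pictured as a simple partition over the changed field paths. This is an illustrative sketch only: `classify` is a hypothetical name, and representing fields as dotted strings is an assumption. The restart-only names are the per-agent fields listed above (whole files such as broker.yaml are left out).

```rust
/// Field names this page lists as requiring a restart.
const RESTART_ONLY: [&str; 10] = [
    "id", "plugins", "workspace", "skills_dir", "transcripts_dir",
    "heartbeat.enabled", "heartbeat.interval",
    "config.debounce_ms", "config.queue_cap",
    "model.provider",
];

/// Hypothetical helper: split changed field paths into
/// (applies live, requires restart). A caller could log the second
/// list at warn level, as the reload does.
fn classify<'a>(changed: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    changed.iter().copied().partition(|f| !RESTART_ONLY.contains(f))
}
```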

Configuration

config/runtime.yaml is optional. Defaults:

reload:
  enabled: true           # master switch
  debounce_ms: 500        # notify-debouncer-full window
  extra_watch_paths: []   # appended to the built-in list

Set enabled: false to turn off the file watcher + the control.reload subscriber. The CLI agent reload still works — the daemon never opens a privileged socket, it just listens on the shared broker.
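
For readers who think in types, the reload block above maps naturally onto a struct with a `Default` impl. The struct name and field types below are assumptions; only the keys and their default values come from this page.

```rust
/// Mirrors the `reload:` block of config/runtime.yaml. The type itself is
/// a sketch; the daemon's real definition may differ.
#[derive(Debug, Clone, PartialEq)]
struct ReloadConfig {
    enabled: bool,                  // master switch
    debounce_ms: u64,               // debounce window in milliseconds
    extra_watch_paths: Vec<String>, // appended to the built-in list
}

impl Default for ReloadConfig {
    fn default() -> Self {
        Self { enabled: true, debounce_ms: 500, extra_watch_paths: Vec::new() }
    }
}
```

With this shape, turning the watcher off while keeping the other defaults is `ReloadConfig { enabled: false, ..ReloadConfig::default() }`.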

The reload pipeline

file save / CLI / broker
        │
        ▼
  debouncer (500 ms)
        │
        ▼
  AppConfig::load (YAML + env resolution)
        │
        ▼
  validate_agents_with_providers  ──fail──▶  log warn, bump
        │                                    config_reload_rejected_total,
        ▼                                    keep old snapshot
  RuntimeSnapshot::build (per agent)
        │
        ▼
  ArcSwap::store  (atomic per agent)
        │
        ▼
  events.runtime.config.reloaded

Validation failure never swaps. The daemon always serves a snapshot that passed its boot gauntlet.
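
The "validate first, swap only on success" rule can be sketched as a function that returns the snapshot to serve next. Everything here is a stand-in: the `Snapshot` fields, the single validation rule, and the assumption that the version increases by one per successful reload are invented for illustration and are not the daemon's actual types.

```rust
use std::sync::Arc;

#[derive(Debug)]
struct Snapshot {
    version: u64,
    allowed_tools: Vec<String>,
}

/// Stand-in for the validation step; the one rule here is invented.
fn validate(tools: &[String]) -> Result<(), String> {
    if tools.iter().any(|t| t.is_empty()) {
        return Err("empty tool name".to_string());
    }
    Ok(())
}

/// Returns the snapshot to serve next, plus the rejection reason if the
/// candidate failed validation. On failure the old Arc is handed back
/// unchanged, so readers never observe an unvalidated snapshot.
fn next_snapshot(current: &Arc<Snapshot>, tools: Vec<String>) -> (Arc<Snapshot>, Option<String>) {
    match validate(&tools) {
        Ok(()) => {
            let next = Snapshot { version: current.version + 1, allowed_tools: tools };
            (Arc::new(next), None)
        }
        Err(reason) => (Arc::clone(current), Some(reason)),
    }
}
```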

CLI

# Human-readable output
$ agent reload
reload v7: applied=2 rejected=0 elapsed=18ms
  ✓ ana
  ✓ bob

# Machine-readable
$ agent reload --json
{
  "version": 7,
  "applied": ["ana", "bob"],
  "rejected": [],
  "elapsed_ms": 18
}

Exit codes:

  • 0 — at least one agent reloaded.
  • 1 — no control.reload.ack within 5 s (daemon not running).
  • 2 — every agent rejected (partial-fail signal for CI).
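
A script wrapping the CLI might reason about these codes as follows. This is a hypothetical mapping, not the CLI's source: `None` models a missing ack, and the tuple carries (applied, rejected) counts. The case where both counts are zero is not covered by the list above; this sketch maps it to 2 as an assumption.

```rust
/// Hypothetical mapping from a reload result to the documented exit codes.
/// `None` stands for "no control.reload.ack within 5 s".
fn exit_code(outcome: Option<(usize, usize)>) -> i32 {
    match outcome {
        // No ack arrived: the daemon is not running.
        None => 1,
        // At least one agent reloaded.
        Some((applied, _rejected)) if applied > 0 => 0,
        // Nothing applied: every agent was rejected (or, by assumption,
        // there was nothing to apply).
        Some(_) => 2,
    }
}
```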

Broker contract

| Topic | Direction | Payload |
|---|---|---|
| control.reload | → daemon | {requested_by: string} |
| control.reload.ack | ← daemon | serialized ReloadOutcome |

ReloadOutcome JSON shape:

{
  "version": 7,
  "applied": ["bob"],
  "rejected": [
    {"agent_id": "ana", "reason": "snapshot build: ..."}
  ],
  "elapsed_ms": 18
}
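
A consumer of control.reload.ack could model the payload with plain structs. The field names follow the JSON shape above; the Rust types (`u64`, `String`) and the `summary` helper are assumptions. The helper reproduces the first line of the human-readable CLI output shown earlier.

```rust
#[derive(Debug)]
struct Rejection {
    agent_id: String,
    reason: String,
}

#[derive(Debug)]
struct ReloadOutcome {
    version: u64,
    applied: Vec<String>,
    rejected: Vec<Rejection>,
    elapsed_ms: u64,
}

impl ReloadOutcome {
    /// Renders a one-line summary in the style of `agent reload`.
    fn summary(&self) -> String {
        format!(
            "reload v{}: applied={} rejected={} elapsed={}ms",
            self.version,
            self.applied.len(),
            self.rejected.len(),
            self.elapsed_ms
        )
    }
}
```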

Telemetry

| Metric | Type | Labels |
|---|---|---|
| config_reload_applied_total | counter | |
| config_reload_rejected_total | counter | |
| config_reload_latency_ms | histogram | |
| runtime_config_version | gauge | agent_id |

Scrape via the metrics endpoint (ops/metrics).

Apply-on-next-message semantics

A reload does not interrupt sessions that are currently handling a message. Specifically:

  • The LLM turn in flight keeps its captured Arc<RuntimeSnapshot> for the life of the turn — tool calls inside that turn all see the same policy, even if several reloads land during the turn.
  • The next event delivered to the agent reads the latest snapshot via snapshot.load() on the intake hot path.

If you need a "force-apply now" semantic (terminate in-flight sessions, respawn), the planned agent reload --kick-sessions flag is not implemented yet; it is tracked in Phase 19.
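
The semantics in this section can be demonstrated with the standard library alone. In this sketch `RwLock<Arc<_>>` is a std-only stand-in for the ArcSwap mentioned in the pipeline, and `Agent`, `begin_turn`, and `reload` are hypothetical names. The point is only that a turn holding its own `Arc` is unaffected by a later swap.

```rust
use std::sync::{Arc, RwLock};

struct RuntimeSnapshot {
    version: u64,
}

/// Stand-in for the per-agent slot; the daemon uses ArcSwap instead.
struct Agent {
    snapshot: RwLock<Arc<RuntimeSnapshot>>,
}

impl Agent {
    /// Intake path: each new event captures the latest snapshot once and
    /// keeps that Arc for the life of the turn.
    fn begin_turn(&self) -> Arc<RuntimeSnapshot> {
        let guard = self.snapshot.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Reload path: publish a new snapshot. Turns already in flight still
    /// hold their previously captured Arc.
    fn reload(&self, version: u64) {
        *self.snapshot.write().unwrap() = Arc::new(RuntimeSnapshot { version });
    }
}
```

A turn that started on version 6 keeps reading version 6 even after `reload(7)`; the next call to `begin_turn` observes version 7.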

Security model

  • control.reload topic has no application-level auth. Anyone with broker publish rights can trigger a reload. In production with NATS, restrict the control.> subject pattern via NATS account permissions; see NATS with TLS + auth. The local-broker fallback is in-process only — no remote attack surface.
  • File-watcher trust = filesystem write. Whoever can edit config/agents.d/*.yaml can change the capability surface. Treat the config dir as a privileged resource: 0600 on YAML files, 0700 on the directory.
  • events.runtime.config.reloaded payload includes agent ids and rejection reasons. Subscribers see them. Single-process deployments are fine; in multi-tenant setups, gate the events.runtime.> pattern in NATS auth.
  • Outbound allowlist scope. The Phase 16 outbound allowlist governs WhatsApp + Telegram tools only. Google tools are gated by the OAuth scopes granted at credential creation (see Per-agent credentials) — there is no per-recipient list for Google.
  • Apply-on-next-message and tightening reloads. A reload that narrows an allowlist for security reasons does not affect in-flight sessions until they next receive an event. If you need the change to take effect immediately, restart the daemon (or wait for the upcoming agent reload --kick-sessions flag in Phase 19).
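
The 0600 / 0700 recommendation above boils down to "no group or other permission bits". A deployment check could look like the following sketch. It is Unix-only, the function names are hypothetical, and it is not part of the daemon.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// True when no group/other bits are set, which holds for both
/// 0600 (files) and 0700 (directories).
fn is_owner_only(mode: u32) -> bool {
    (mode & 0o077) == 0
}

/// Reads the mode of `path` from the filesystem and applies the check.
fn path_is_owner_only(path: &Path) -> io::Result<bool> {
    Ok(is_owner_only(fs::metadata(path)?.permissions().mode()))
}
```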

Failure modes

  • Bad YAML: AppConfig::load fails. Old snapshot keeps serving. config_reload_rejected_total bumps. The warn log names the file + line.
  • Validation errors: aggregated. Every problem across every agent appears in one warn block, so you can fix them all in one edit instead of restart-and-repeat.
  • Unknown provider: rejected at boot + at reload by KnownProviders check. Boot validation lists what's registered.
  • Missing tool in binding's allowed_tools: caught by the post-registry validation pass during reload.
  • Agent added / removed: Phase 18 rejects these with a clear message; restart the daemon to reshape the fleet.
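
The aggregate behavior described above (report every problem, not just the first) can be sketched as a validator that returns a list instead of short-circuiting. The `AgentCfg` shape, the function name, and the message wording are all invented for illustration; the two checks loosely mirror the "unknown provider" and "missing tool" cases.

```rust
struct AgentCfg {
    id: String,
    provider: String,
    allowed_tools: Vec<String>,
}

/// Collects every problem across every agent rather than stopping at the
/// first one, so a single warn block can list them all.
fn validate_all(agents: &[AgentCfg], providers: &[&str], tools: &[&str]) -> Vec<String> {
    let mut problems = Vec::new();
    for a in agents {
        if !providers.contains(&a.provider.as_str()) {
            problems.push(format!("{}: unknown provider '{}'", a.id, a.provider));
        }
        for t in a.allowed_tools.iter().filter(|t| !tools.contains(&t.as_str())) {
            problems.push(format!("{}: unknown tool '{}'", a.id, t));
        }
    }
    problems
}
```

An empty result means the candidate config is acceptable; a non-empty one would be logged in full and the old snapshot kept.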

Limitations

Intentional scope gaps for Phase 18, tracked for Phase 19:

  • Add / remove agent at runtime. The coordinator rejects new ids and left-over registered handles with an actionable message. Restart needed.
  • Plugin config hot-reload (whatsapp.yaml, telegram.yaml, browser.yaml, email.yaml). Plugin daemons own I/O (QR pairing, long-polling). Reshaping them live requires a dedicated lifecycle refactor.
  • config_reloaded hook for extensions to react. Pending.
  • SIGHUP trigger as an extra UX path. Deferred — use the broker topic or the CLI.

See also