Config hot-reload

Operators rotate per-agent knobs (allowlists, model strings, prompts, rate limits, delegation gates) without restarting the daemon. Sessions currently handling a message finish their turn on the old snapshot; the next event picks up the new one (apply-on-next-message). Plugin configs (whatsapp.yaml, telegram.yaml, …) are not hot-reloadable yet — see limitations.

What triggers a reload

Trigger	Source
File save under `config/`	`notify`-based watcher, debounced 500 ms
`agent reload` CLI	Publishes `control.reload` on the broker
Direct broker publish	Any integration can emit `control.reload`

What's reloaded

Files watched by default (paths relative to the config dir):

agents.yaml
agents.d/ (recursive)
llm.yaml
runtime.yaml

Extra paths listed under runtime.reload.extra_watch_paths are appended to the list.

The fields that apply live without a restart:

Field	Location	Effect
`allowed_tools` (agent + binding)	`agents.d/*.yaml`	Tool list visible to the LLM + per-call guard
`outbound_allowlist`	same	Defense-in-depth in `whatsapp_send_` / `telegram_send_`
`skills`	same	Skill blocks rendered into the system prompt
`model.model` (binding-level)	same	LLM model string on next turn
`system_prompt` + `system_prompt_extra`	same	System block composition
`sender_rate_limit`	same	Per-binding token bucket
`allowed_delegates`	same	Delegation ACL
`providers.<name>.api_key`	`llm.yaml`	Rotated via a fresh `LlmClient` on next turn

Fields that require a restart (logged as warn during reload):

id, plugins, workspace, skills_dir, transcripts_dir
heartbeat.enabled, heartbeat.interval
config.debounce_ms, config.queue_cap
model.provider (binding-level provider must match agent provider — the LlmClient is wired once per agent)
broker.yaml, memory.yaml, mcp.yaml, extensions.yaml

Adding or removing an agent also requires a restart in this release; see limitations.

Configuration

config/runtime.yaml is optional. Defaults:

reload:
  enabled: true           # master switch
  debounce_ms: 500        # notify-debouncer-full window
  extra_watch_paths: []   # appended to the built-in list

Set enabled: false to turn off the file watcher + the control.reload subscriber. The CLI agent reload still works — the daemon never opens a privileged socket, it just listens on the shared broker.

The reload pipeline

file save / CLI / broker
        │
        ▼
  debouncer (500 ms)
        │
        ▼
  AppConfig::load (YAML + env resolution)
        │
        ▼
  validate_agents_with_providers  ──fail──▶  log warn, bump
        │                                    config_reload_rejected_total,
        ▼                                    keep old snapshot
  RuntimeSnapshot::build (per agent)
        │
        ▼
  ArcSwap::store  (atomic per agent)
        │
        ▼
  events.runtime.config.reloaded

Validation failure never swaps. The daemon always serves a snapshot that passed its boot gauntlet.

CLI

# Human-readable output
$ agent reload
reload v7: applied=2 rejected=0 elapsed=18ms
  ✓ ana
  ✓ bob

# Machine-readable
$ agent reload --json
{
  "version": 7,
  "applied": ["ana", "bob"],
  "rejected": [],
  "elapsed_ms": 18
}

Exit codes:

0 — at least one agent reloaded.
1 — no control.reload.ack within 5 s (daemon not running).
2 — every agent rejected (partial-fail signal for CI).

Broker contract

Topic	Direction	Payload
`control.reload`	→ daemon	`{requested_by: string}`
`control.reload.ack`	← daemon	serialized `ReloadOutcome`

ReloadOutcome JSON shape:

{
  "version": 7,
  "applied": ["ana", "bob"],
  "rejected": [
    {"agent_id": "ana", "reason": "snapshot build: ..."}
  ],
  "elapsed_ms": 18
}

Telemetry

Metric	Type	Labels
`config_reload_applied_total`	counter	—
`config_reload_rejected_total`	counter	—
`config_reload_latency_ms`	histogram	—
`runtime_config_version`	gauge	`agent_id`

Scrape via the metrics endpoint (ops/metrics).

Apply-on-next-message semantics

A reload does not interrupt sessions that are currently handling a message. Specifically:

The LLM turn in flight keeps its captured Arc<RuntimeSnapshot> for the life of the turn — tool calls inside that turn all see the same policy, even if several reloads land during the turn.
The next event delivered to the agent reads the latest snapshot via snapshot.load() on the intake hot path.

If you need a "force-apply now" semantic (terminate in-flight sessions, respawn), use agent reload --kick-sessions — not implemented yet, tracked in Phase 19.

Security model

control.reload topic has no application-level auth. Anyone with broker publish rights can trigger a reload. In production with NATS, restrict the control.> subject pattern via NATS account permissions; see NATS with TLS + auth. The local-broker fallback is in-process only — no remote attack surface.
File-watcher trust = filesystem write. Whoever can edit config/agents.d/*.yaml can change capability surface. Treat the config dir as a privileged resource: 0600 on YAML files, 0700 on the directory.
events.runtime.config.reloaded payload includes agent ids and rejection reasons. Subscribers see them. Single-process deployments are fine; in multi-tenant setups, gate the events.runtime.> pattern in NATS auth.
Outbound allowlist scope. The Phase 16 outbound allowlist governs WhatsApp + Telegram tools only. Google tools are gated by the OAuth scopes granted at credential creation (see Per-agent credentials) — there is no per-recipient list for Google.
Apply-on-next-message and tightening reloads. A reload that narrows an allowlist for security reasons does not affect in-flight sessions until they next receive an event. If you need the change to take effect immediately, restart the daemon (or wait for the upcoming agent reload --kick-sessions flag in Phase 19).

Failure modes

Bad YAML: AppConfig::load fails. Old snapshot keeps serving. config_reload_rejected_total bumps. The warn log names the file + line.
Validation errors: aggregate — every problem across every agent shows in one warn block. Fix them in one edit instead of restart-and-repeat.
Unknown provider: rejected at boot + at reload by KnownProviders check. Boot validation lists what's registered.
Missing tool in binding's allowed_tools: caught by the post-registry validation pass during reload.
Agent added / removed: Phase 18 rejects these with a clear message; restart the daemon to reshape the fleet.

Limitations

Intentional scope gaps for Phase 18, tracked for Phase 19:

Add / remove agent at runtime. The coordinator rejects new ids and left-over registered handles with an actionable message. Restart needed.
Plugin config hot-reload (whatsapp.yaml, telegram.yaml, browser.yaml, email.yaml). Plugin daemons own I/O (QR pairing, long-polling). Reshaping them live requires a dedicated lifecycle refactor.
config_reloaded hook for extensions to react. Pending.
SIGHUP trigger as an extra UX path. Deferred — use the broker topic or the CLI.

Nexo-rs