Metrics & health
Prometheus metrics on :9090/metrics, health/readiness on :8080,
admin console on 127.0.0.1:9091. Everything an operator or
orchestrator needs to decide "is the agent healthy?" without reading
logs.
Source: crates/core/src/telemetry.rs, src/main.rs.
Ports at a glance
| Port | Binding | Purpose |
|---|---|---|
:9090 | 0.0.0.0 | Prometheus /metrics scrape |
:8080 | 0.0.0.0 | Health /health, readiness /ready, WhatsApp pairing pages |
:9091 | 127.0.0.1 | Admin console (loopback only) |
Ports are not configurable yet — if you need to remap, port-forward outside the agent (Docker, k8s service).
/metrics (Prometheus)
Exposed metrics:
| Name | Type | Labels | What |
|---|---|---|---|
llm_requests_total | counter | agent, provider, model | Every LLM completion request |
llm_latency_ms | histogram | agent, provider, model | Buckets 50, 100, 250, 500, 1000, 2500, 5000, 10000 ms |
messages_processed_total | counter | agent | Inbound messages that reached an agent |
nexo_extensions_discovered | counter | status={ok,disabled,invalid} | Emitted on every discovery sweep |
nexo_tool_calls_total | counter | agent, outcome={ok,error,blocked,unknown}, tool | Tool invocations |
nexo_tool_cache_events_total | counter | agent, event={hit,miss,put,evict}, tool | Tool-level memoization |
nexo_tool_latency_ms | histogram | agent, tool | Per-tool latency |
circuit_breaker_state | gauge | breaker | 0 = Closed, 1 = Open; always includes nats |
credentials_accounts_total | gauge | channel | Per-channel labelled instance count (Phase 17) |
credentials_bindings_total | gauge | agent, channel | 1 when the agent has a credential bound, 0 otherwise |
channel_account_usage_total | counter | agent, channel, direction={inbound,outbound}, instance | Every credential use |
channel_acl_denied_total | counter | agent, channel, instance | Outbound calls rejected by allow_agents |
credentials_resolve_errors_total | counter | channel, reason | Resolver failures (unbound, not_found, not_permitted) |
credentials_breaker_state | gauge | channel, instance | 0=closed, 1=half-open, 2=open. Per-(channel, instance) circuit breaker — a 429 from one number cannot trip the breaker for a sibling account. |
credentials_boot_validation_errors_total | counter | kind | Gauntlet errors by kind at boot |
credentials_insecure_paths_total | gauge | — | Credential files with lax permissions at boot |
credentials_google_token_refresh_total | counter | account_fp, outcome={ok,err} | Google OAuth refresh attempts (fp = sha256[..8], not raw email) |
pairing_inbound_challenged_total | counter | channel, result={delivered_via_adapter,delivered_via_broker,publish_failed,no_adapter_no_broker_topic} | DM-challenge dispatch attempts (Phase 26.x) |
pairing_approvals_total | counter | channel, result={ok,expired,not_found} | nexo pair approve outcomes (Phase 26.y) |
pairing_codes_expired_total | counter | — | Setup codes pruned past TTL or rejected as expired on approve |
pairing_bootstrap_tokens_issued_total | counter | profile | Bootstrap tokens minted by BootstrapTokenIssuer::issue |
pairing_requests_pending | gauge | channel | Pending pairing requests (push-tracked; PairingStore::refresh_pending_gauge exposed for drift recovery after a daemon restart) |
Circuit-breaker state for the nats breaker is sampled at scrape
time from broker readiness, so a stalled publish path shows up in
the next scrape without needing an eager push.
The credentials_* and channel_* series are documented with full
schema examples in config/credentials.md.
account_fp is always an 8-byte sha256 fingerprint of the account id,
never the raw JID or email, so scraped metrics stay safe to share.
Useful alerts
LLM provider flapping
- alert: LlmError5xxHigh
expr: sum(rate(llm_requests_total{outcome="error"}[5m])) by (provider) > 0.1
for: 5m
NATS circuit open
- alert: NatsBreakerOpen
expr: circuit_breaker_state{breaker="nats"} == 1
for: 1m
Tool call failures
- alert: ToolErrorSpike
expr: |
sum(rate(nexo_tool_calls_total{outcome="error"}[5m])) by (tool) > 0.5
for: 10m
Health endpoints
flowchart LR
GET1[GET /health] --> OK[200 OK<br/>always<br/>{status:ok}]
GET2[GET /ready] --> CHK{broker ready<br/>AND agents > 0?}
CHK -->|yes| RDY[200 OK<br/>{status:ready,<br/>agents_running:N}]
CHK -->|no| NOT[503 Service Unavailable<br/>{status:not_ready,<br/>broker_ready,<br/>agents_running}]
GET /health— liveness probe. Returns 200 as long as the process is accepting connections. Don't use this as a traffic gate.GET /ready— readiness probe. Returns 200 only when the broker is ready and at least one agent runtime is attached to inbound topics. Returns 503 during boot, shutdown, or broker outage.GET /whatsapp/*— QR pairing pages and the/whatsapp/pairtunnel endpoint; see WhatsApp plugin.
Kubernetes probes
livenessProbe:
httpGet: { path: /health, port: 8080 }
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
httpGet: { path: /ready, port: 8080 }
initialDelaySeconds: 30
periodSeconds: 5
initialDelaySeconds: 30 for readiness covers extension discovery
and every agent runtime attaching its subscriptions.
Admin console (:9091)
Loopback-only. Exposes:
| Path | Purpose |
|---|---|
/admin/agents | Agent directory with live status, session counts |
/admin/tool-policy | Query the tool-policy registry |
The agent status [--endpoint URL] [--agent-id ID] [--json] CLI
subcommand hits this endpoint and prints a table or JSON; good for
scripting ops without grepping logs.
Remote access requires an explicit tunnel — the port is never exposed publicly by default.
Scrape config sample
# prometheus.yml
scrape_configs:
- job_name: nexo-rs
scrape_interval: 15s
static_configs:
- targets: ['agent:9090']
For Docker compose: the service name is agent. For k8s: use the
service DNS.
Gotchas
circuit_breaker_stateonly labels per-breaker, not per-provider. Multiple LLM providers each have their own breaker instance, but they surface as distinctbreakerlabel values. If you expected{provider="anthropic"}you'll need a label rename in your Prometheus relabel config.- Histograms are non-configurable. Buckets are compiled in. If your SLO requires fine-grained buckets below 50 ms, it is worth opening an issue.
/ready503 during shutdown is expected. Don't alert on 5 s of 503 bursts — alert onrate(> 30 s).