Metrics & health

Prometheus metrics on :9090/metrics, health/readiness on :8080, admin console on 127.0.0.1:9091. Everything an operator or orchestrator needs to decide "is the agent healthy?" without reading logs.

Source: crates/core/src/telemetry.rs, src/main.rs.

Ports at a glance

PortBindingPurpose
:90900.0.0.0Prometheus /metrics scrape
:80800.0.0.0Health /health, readiness /ready, WhatsApp pairing pages
:9091127.0.0.1Admin console (loopback only)

Ports are not configurable yet — if you need to remap, port-forward outside the agent (Docker, k8s service).

/metrics (Prometheus)

Exposed metrics:

NameTypeLabelsWhat
llm_requests_totalcounteragent, provider, modelEvery LLM completion request
llm_latency_mshistogramagent, provider, modelBuckets 50, 100, 250, 500, 1000, 2500, 5000, 10000 ms
messages_processed_totalcounteragentInbound messages that reached an agent
nexo_extensions_discoveredcounterstatus={ok,disabled,invalid}Emitted on every discovery sweep
nexo_tool_calls_totalcounteragent, outcome={ok,error,blocked,unknown}, toolTool invocations
nexo_tool_cache_events_totalcounteragent, event={hit,miss,put,evict}, toolTool-level memoization
nexo_tool_latency_mshistogramagent, toolPer-tool latency
circuit_breaker_stategaugebreaker0 = Closed, 1 = Open; always includes nats
credentials_accounts_totalgaugechannelPer-channel labelled instance count (Phase 17)
credentials_bindings_totalgaugeagent, channel1 when the agent has a credential bound, 0 otherwise
channel_account_usage_totalcounteragent, channel, direction={inbound,outbound}, instanceEvery credential use
channel_acl_denied_totalcounteragent, channel, instanceOutbound calls rejected by allow_agents
credentials_resolve_errors_totalcounterchannel, reasonResolver failures (unbound, not_found, not_permitted)
credentials_breaker_stategaugechannel, instance0=closed, 1=half-open, 2=open. Per-(channel, instance) circuit breaker — a 429 from one number cannot trip the breaker for a sibling account.
credentials_boot_validation_errors_totalcounterkindGauntlet errors by kind at boot
credentials_insecure_paths_totalgaugeCredential files with lax permissions at boot
credentials_google_token_refresh_totalcounteraccount_fp, outcome={ok,err}Google OAuth refresh attempts (fp = sha256[..8], not raw email)
pairing_inbound_challenged_totalcounterchannel, result={delivered_via_adapter,delivered_via_broker,publish_failed,no_adapter_no_broker_topic}DM-challenge dispatch attempts (Phase 26.x)
pairing_approvals_totalcounterchannel, result={ok,expired,not_found}nexo pair approve outcomes (Phase 26.y)
pairing_codes_expired_totalcounterSetup codes pruned past TTL or rejected as expired on approve
pairing_bootstrap_tokens_issued_totalcounterprofileBootstrap tokens minted by BootstrapTokenIssuer::issue
pairing_requests_pendinggaugechannelPending pairing requests (push-tracked; PairingStore::refresh_pending_gauge exposed for drift recovery after a daemon restart)

Circuit-breaker state for the nats breaker is sampled at scrape time from broker readiness, so a stalled publish path shows up in the next scrape without needing an eager push.

The credentials_* and channel_* series are documented with full schema examples in config/credentials.md. account_fp is always an 8-byte sha256 fingerprint of the account id, never the raw JID or email, so scraped metrics stay safe to share.

Useful alerts

LLM provider flapping

- alert: LlmError5xxHigh
  expr: sum(rate(llm_requests_total{outcome="error"}[5m])) by (provider) > 0.1
  for: 5m

NATS circuit open

- alert: NatsBreakerOpen
  expr: circuit_breaker_state{breaker="nats"} == 1
  for: 1m

Tool call failures

- alert: ToolErrorSpike
  expr: |
    sum(rate(nexo_tool_calls_total{outcome="error"}[5m])) by (tool) > 0.5
  for: 10m

Health endpoints

flowchart LR
    GET1[GET /health] --> OK[200 OK<br/>always<br/>{status:ok}]
    GET2[GET /ready] --> CHK{broker ready<br/>AND agents > 0?}
    CHK -->|yes| RDY[200 OK<br/>{status:ready,<br/>agents_running:N}]
    CHK -->|no| NOT[503 Service Unavailable<br/>{status:not_ready,<br/>broker_ready,<br/>agents_running}]
  • GET /health — liveness probe. Returns 200 as long as the process is accepting connections. Don't use this as a traffic gate.
  • GET /ready — readiness probe. Returns 200 only when the broker is ready and at least one agent runtime is attached to inbound topics. Returns 503 during boot, shutdown, or broker outage.
  • GET /whatsapp/* — QR pairing pages and the /whatsapp/pair tunnel endpoint; see WhatsApp plugin.

Kubernetes probes

livenessProbe:
  httpGet: { path: /health, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /ready, port: 8080 }
  initialDelaySeconds: 30
  periodSeconds: 5

initialDelaySeconds: 30 for readiness covers extension discovery and every agent runtime attaching its subscriptions.

Admin console (:9091)

Loopback-only. Exposes:

PathPurpose
/admin/agentsAgent directory with live status, session counts
/admin/tool-policyQuery the tool-policy registry

The agent status [--endpoint URL] [--agent-id ID] [--json] CLI subcommand hits this endpoint and prints a table or JSON; good for scripting ops without grepping logs.

Remote access requires an explicit tunnel — the port is never exposed publicly by default.

Scrape config sample

# prometheus.yml
scrape_configs:
  - job_name: nexo-rs
    scrape_interval: 15s
    static_configs:
      - targets: ['agent:9090']

For Docker compose: the service name is agent. For k8s: use the service DNS.

Gotchas

  • circuit_breaker_state only labels per-breaker, not per-provider. Multiple LLM providers each have their own breaker instance, but they surface as distinct breaker label values. If you expected {provider="anthropic"} you'll need a label rename in your Prometheus relabel config.
  • Histograms are non-configurable. Buckets are compiled in. If your SLO requires fine-grained buckets below 50 ms, it is worth opening an issue.
  • /ready 503 during shutdown is expected. Don't alert on 5 s of 503 bursts — alert on rate(> 30 s).