Health checks
Three layers of health probes for a Nexo deployment, each tuned for a different consumer:
- /health: liveness. Cheap (atomic flag check). HTTP 200 means the process is up; it doesn't guarantee the process can serve work.
- /ready: readiness. Expensive (verifies broker connection, agents loaded, snapshot warm). HTTP 200 means the runtime can accept inbound traffic. Use this for load-balancer health checks.
- scripts/nexo-health.sh: operator + monitoring. JSON summary with counter snapshots. Bridge until nexo doctor health (Phase 44) ships.
Liveness — /health
Returns HTTP 200 + ok body when the agent process is alive.
The runtime sets a RUNNING flag at startup and clears it on
graceful shutdown. Does not verify any subsystem — useful
for "is the daemon there at all" probes.
curl -fsSL http://127.0.0.1:8080/health
# ok
Kubernetes liveness probe:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
A failing liveness probe should restart the container. Be
generous on initialDelaySeconds — first-boot extension
discovery + memory open + agent runtime spin-up can take 15-25s.
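To sanity-check the numbers above, the following sketch works through the arithmetic. It assumes the usual Kubernetes behavior that a container is restarted after failureThreshold consecutive failed probes spaced periodSeconds apart, each allowed timeoutSeconds to fail. Both function names are hypothetical helpers, and the result is a rough upper bound rather than an exact kubelet guarantee.

```python
def seconds_until_restart(period_seconds, timeout_seconds, failure_threshold):
    """Rough upper bound on the time between a hang and the restart decision.

    Assumes the process hangs just after a successful probe, so the kubelet
    needs failure_threshold further probes, and the last one must time out.
    """
    return failure_threshold * period_seconds + timeout_seconds


def startup_headroom(initial_delay_seconds, slowest_boot_seconds):
    """Seconds of slack between the slowest expected boot and the first probe."""
    return initial_delay_seconds - slowest_boot_seconds
```

With the probe settings shown above, seconds_until_restart(10, 3, 3) gives 33 and startup_headroom(30, 25) gives 5, so a 25-second boot still finishes before the first probe fires.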
Readiness — /ready
Returns 200 only when all of:
- Broker (NATS or local) is reachable
- Every configured agent has loaded its tool registry
- The hot-reload snapshot has been warmed (Phase 18)
- Pairing store is open (if pairing_policy.auto_challenge is on)
Otherwise returns 503 with a JSON body listing each failing subsystem:
{
  "ready": false,
  "reasons": [
    {"subsystem": "broker", "detail": "nats://localhost:4222: connection refused"}
  ]
}
Use this for load-balancer / service-mesh routing decisions.
A node that's live but not ready should not receive
traffic.
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 1
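A client that polls /ready itself can turn the 503 body described above into a short failure list. This is a minimal sketch, not part of Nexo: interpret_ready is a hypothetical helper, and it assumes the body follows the ready/reasons shape from the example, falling back to the bare status code when it does not.

```python
import json


def interpret_ready(status, body):
    """Return (ready, failures) for a /ready response.

    failures is a list of "subsystem: detail" strings built from the
    "reasons" array; it is empty when the endpoint answered 200.
    """
    if status == 200:
        return True, []
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A proxy error page or a truncated body lands here.
        return False, [f"HTTP {status}: unparseable body"]
    reasons = payload.get("reasons", []) if isinstance(payload, dict) else []
    failures = [
        f"{r.get('subsystem', 'unknown')}: {r.get('detail', '')}"
        for r in reasons
        if isinstance(r, dict)
    ]
    # Never report "not ready" with an empty explanation.
    return False, failures or [f"HTTP {status}"]
```

Keeping the interpretation separate from the HTTP call means any client can feed it a status code and body, whether they come from urllib, a service mesh sidecar, or a recorded response.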
Operator one-shot — scripts/nexo-health.sh
Single-shot summary (human-readable by default, JSON with --json) intended for watch -n 5 scripts/nexo-health.sh during ops, cron health mailers, and uptime
monitors that want one structured payload covering everything.
# Default — pretty human output
scripts/nexo-health.sh
# JSON only (cron, monitoring scrapers)
scripts/nexo-health.sh --json
# Custom hosts (e.g., probing through a service mesh)
scripts/nexo-health.sh --host nexo.internal:8080 \
--metrics-host nexo.internal:9090
# Strict mode — open circuit breaker counts as unhealthy.
# Default mode tolerates breaker-open (degraded-but-up).
scripts/nexo-health.sh --strict
Pretty output:
============================================================
nexo-rs health · 2026-04-26T15:30:00Z
============================================================
overall: ok
admin: 127.0.0.1:8080
metrics: 127.0.0.1:9090
probes:
✓ live ok
✓ ready ok
✓ metrics ok
counters:
tool_calls_total 4711
llm_stream_chunks_total 28391
web_search_breaker_open_total 0
JSON shape (for monitoring scrapers):
{
  "overall": "ok",
  "timestamp": "2026-04-26T15:30:00Z",
  "endpoints": { "admin": "127.0.0.1:8080", "metrics": "127.0.0.1:9090" },
  "probes": [
    {"name": "live", "status": "ok", "detail": "ok"},
    {"name": "ready", "status": "ok", "detail": "{...}"},
    {"name": "metrics", "status": "ok", "detail": "# HELP nexo_..."}
  ],
  "counters": {
    "tool_calls_total": 4711,
    "llm_stream_chunks_total": 28391,
    "web_search_breaker_open_total": 0
  }
}
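A scraper consuming this payload might reduce it to the fields an alert needs. The sketch below is illustrative only: summarize is a hypothetical name, and it assumes any probe status other than "ok" means failure, since the example above shows only the healthy case.

```python
import json


def summarize(raw):
    """Reduce a nexo-health.sh --json payload to alert-relevant fields."""
    doc = json.loads(raw)
    # Assumption: any status other than "ok" marks a failed probe.
    failing = [
        probe.get("name", "unknown")
        for probe in doc.get("probes", [])
        if probe.get("status") != "ok"
    ]
    return {
        "healthy": doc.get("overall") == "ok" and not failing,
        "failing_probes": failing,
        "counters": dict(doc.get("counters", {})),
    }
```

Feeding it the payload shown above yields healthy set to True, an empty failing_probes list, and the three counters unchanged.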
Exit codes:
- 0: overall healthy
- 1: at least one probe failed (or --strict and a breaker is open)
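A wrapper that calls the script from another tool can branch on these exit codes. The following is a sketch under stated assumptions: health_verdict is a hypothetical helper, and because only 0 and 1 are documented, it treats any other code as an error in the checker itself rather than a health result.

```python
import subprocess


def health_verdict(command):
    """Run a health command and map its exit code to (verdict, output)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode == 0:
        return "healthy", completed.stdout
    if completed.returncode == 1:
        return "unhealthy", completed.stdout
    # Assumption: codes other than 0 and 1 are undocumented, so report
    # them as a checker error and surface stderr for diagnosis.
    return "error", completed.stderr
```

A typical call would be health_verdict(["scripts/nexo-health.sh", "--json", "--strict"]). Note that subprocess.run raises FileNotFoundError if the script path does not exist, which callers may want to catch separately.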
Cron health mailer
# /etc/cron.d/nexo-health
# cron does not support backslash line continuation, so the entry must stay on one line.
*/5 * * * * nexo /opt/nexo-rs/scripts/nexo-health.sh --json --strict >> /var/log/nexo-rs/health.jsonl 2>&1 || (tail -1 /var/log/nexo-rs/health.jsonl | mail -s "nexo unhealthy" ops@yourorg)
Five-minute resolution, one line of JSONL per check, mail on failure.
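Because each check appends one record, the log doubles as a coarse time series. The sketch below computes per-interval increases of one counter across consecutive records. It is a hypothetical helper (counter_deltas is not part of Nexo) and assumes each JSON line has the same shape as the --json output; lines that are not JSON are skipped, since stderr is redirected into the same file.

```python
import json


def counter_deltas(jsonl_text, counter):
    """Return the increase of `counter` between consecutive JSONL records."""
    values = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # stderr output mixed into the log by 2>&1
        counters = record.get("counters") if isinstance(record, dict) else None
        if isinstance(counters, dict) and counter in counters:
            values.append(counters[counter])
    # Pair each value with its successor and subtract.
    return [later - earlier for earlier, later in zip(values, values[1:])]
```

A negative delta most likely means the process restarted between two checks and its counters started again from zero, which is itself worth alerting on.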
Uptime monitor integration
UptimeRobot / BetterStack / Pingdom:
URL: https://nexo.example.com/ready
Interval: 60s
Timeout: 5s
Expected: HTTP 200
That's all most monitors need. The JSON body of /ready
explains the failure when the alert fires.
What nexo-health.sh adds beyond /ready
| Signal | /ready | nexo-health.sh |
|---|---|---|
| Process up + accepting traffic | ✅ | ✅ |
| Counter snapshot (tool calls, LLM chunks) | ❌ | ✅ |
| Web-search breaker state | ❌ | ✅ |
| Single JSON payload | ❌ (HTTP 200/503) | ✅ |
| Suitable for HTTP probe | ✅ | ❌ (shells out) |
Use /ready for the orchestrator. Use nexo-health.sh for the
operator's eyeballs and the alerting pipeline.
Status
Tracked as Phase 44 — Auxiliary observability surfaces.
| Capability | Status |
|---|---|
| /health liveness endpoint | ✅ shipped (Phase 9) |
| /ready readiness endpoint | ✅ shipped (Phase 9) |
| scripts/nexo-health.sh operator one-shot | ✅ shipped |
| Operator runbook (this page) | ✅ shipped |
| nexo doctor health aggregating subcommand | ⬜ deferred |
| nexo inspect <session_id> state-transition pretty-print | ⬜ deferred |
| Per-session structured event log under data/events/ | ⬜ deferred |