Dead-letter queue operations
The DLQ is where events end up when they exhaust their retry budget or fail to deserialize at all. The runtime never silently drops an event — if it can't be delivered, it lands here for an operator to inspect or replay.
Source: `crates/broker/src/disk_queue.rs`, `src/main.rs` (`agent dlq ...` subcommands).
When items land there
```mermaid
flowchart LR
    PUB[publish event] --> NATS{NATS up?}
    NATS -->|yes| OK[delivered]
    NATS -->|no| DQ[pending_events]
    DQ --> DRAIN[disk queue drain]
    DRAIN -->|attempts < 3| DQ
    DRAIN -->|attempts >= 3| DLQ[dead_letters]
    DQ -.->|deserialization error| DLQ
```
- 3 attempts (`DEFAULT_MAX_ATTEMPTS`) without success → the row moves to `dead_letters`
- Unparseable payload → moves immediately (a poison pill is not worth retrying)
- Circuit-breaker-open on publish counts as an attempt; if the breaker stays open, the queue will eventually flush into the DLQ
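To see how close queued events are to that three-attempt threshold, you can look at the pending queue directly. A minimal sketch, assuming the default DB path and that `pending_events` carries the `attempts` counter implied above:

```sh
# Distribution of retry counts across queued events (rows hit the DLQ at 3).
# Assumes pending_events has an `attempts` column, as implied by the flow above.
sqlite3 ./data/queue/broker.db \
  'SELECT attempts, COUNT(*) FROM pending_events GROUP BY attempts ORDER BY attempts;'
```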
See Fault tolerance for the full retry flow.
The DeadLetter row
```rust
struct DeadLetter {
    id: String,        // UUID
    topic: String,     // NATS subject
    payload: String,   // JSON event body
    failed_at: i64,    // unix timestamp (ms)
    reason: String,    // error text
}
```
Storage: SQLite table `dead_letters` in the broker DB (typically `./data/queue/broker.db`).
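For ad-hoc inspection outside the CLI you can query the table directly. A sketch, assuming the column names mirror the struct fields and the default DB path:

```sh
# Ten most recent dead letters, newest first.
sqlite3 -header -column ./data/queue/broker.db \
  'SELECT id, topic, failed_at, reason FROM dead_letters ORDER BY failed_at DESC LIMIT 10;'
```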
CLI
```sh
agent dlq list          # list up to 1000 entries
agent dlq replay <id>   # move one entry back to pending_events
agent dlq purge         # delete every entry
```
list output
Columns: id | topic | failed_at | reason. Plain text, one entry per
line, suitable for grep / awk piping.
```
2f9c2e4a-...  plugin.inbound.whatsapp   2026-04-24T17:22:13Z  circuit breaker open
b1a3a9f5-...  plugin.outbound.telegram  2026-04-24T17:23:01Z  deserialization error: unexpected field `...`
```
replay
Moves the row back to pending_events with attempts = 0:
```
$ agent dlq replay 2f9c2e4a-...
replayed 2f9c2e4a-... → pending_events (next daemon drain will retry it)
```
The retry happens on the next `drain()` cycle of the running agent; replay itself does not attempt delivery. This lets an agent running in a different shell pick the event up, while a stopped agent leaves it safely in `pending_events` until its next startup drain.
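Because replay takes one id at a time, replaying a whole topic is just a filter over list output. A sketch, assuming the four-column layout shown earlier (the topic name is only an example):

```sh
# Replay every dead letter for a single topic.
topic=plugin.inbound.whatsapp   # example topic; substitute your own
agent dlq list | awk -v t="$topic" '$2 == t {print $1}' | while read -r id; do
  agent dlq replay "$id"
done
```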
purge
Destructive. Drops every row in dead_letters:
```
$ agent dlq purge
purged 42 dead-letter entries
```
Use with care — there is no per-topic filter. If you need a scoped
purge, inspect with list, selectively replay what you want to
keep, then purge the rest.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Failure (event not found for replay, DB access error, etc.) |
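The single failure code keeps scripting simple: treat any non-zero exit as a reason to stop and inspect. A sketch of a replay wrapper that does so (passing the id as `$1` is just one way to wire it up):

```sh
# replay-or-abort: replay one dead letter, fail loudly if the CLI fails.
id="$1"
if ! agent dlq replay "$id"; then
  echo "replay failed for $id (not found or DB error)" >&2
  exit 1
fi
```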
Common workflows
Post-outage triage
```sh
# See what piled up during the NATS outage
agent dlq list | wc -l

# Spot-check
agent dlq list | head
agent dlq list | awk '{print $2}' | sort | uniq -c

# If reasons look transient (circuit open, timeouts):
agent dlq list | awk '{print $1}' | while read id; do
  agent dlq replay "$id"
done
```
Poison-pill cleanup
If reason mentions deserialization errors, the payload is malformed
— no amount of retry will help. Collect the offenders, fix the
producer side, then:
```sh
agent dlq list | grep deserialization | awk '{print $1}' > /tmp/poison.txt
# ... verify they're truly poison ...
agent dlq purge
```
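Keep in mind that purge removes everything, not just the collected offenders. If transient entries are still mixed in, replay those first so the purge only drops the poison rows; a sketch building on the file above:

```sh
# Replay every entry NOT in the poison list, then purge what remains.
agent dlq list | awk '{print $1}' | grep -v -F -x -f /tmp/poison.txt | while read -r id; do
  agent dlq replay "$id"
done
agent dlq purge
```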
Preview without modifying
The CLI has no --dry-run flag today. Use agent dlq list to preview
first; the DB rows are stable until you explicitly replay or
purge.
Monitoring
There is no dedicated DLQ metric yet. Approximations:
- A spike in time spent at `circuit_breaker_state{breaker="nats"} == 1` strongly predicts DLQ growth; alert on it.
- If you want a direct gauge, wrap `agent dlq list | wc -l` in a cron job that pushes the count to Prometheus via the textfile collector (a sketch follows below).
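A minimal sketch of that cron approach, assuming node_exporter's textfile collector is enabled and reads /var/lib/node_exporter/textfile (the script name and paths are placeholders):

```sh
#!/bin/sh
# dlq_gauge.sh - hypothetical cron helper, run every few minutes.
# Pushes the current DLQ depth to Prometheus via the textfile collector.
dir=/var/lib/node_exporter/textfile
count=$(agent dlq list | wc -l)
tmp=$(mktemp "$dir/dlq.prom.XXXXXX")
{
  echo '# HELP agent_dlq_entries Entries currently in the dead-letter queue'
  echo '# TYPE agent_dlq_entries gauge'
  echo "agent_dlq_entries $count"
} > "$tmp"
mv "$tmp" "$dir/dlq.prom"   # rename within the same dir so a scrape never sees a partial file
```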
Gotchas
- `replay` doesn't wake a stopped agent. If no agent is running against the same data directory, the row just moves back to `pending_events` and waits for the next startup drain.
- No replay deduplication. Replaying an event that was already successfully delivered later will deliver it again. If your consumer isn't idempotent, spot-check downstream state before replaying.
- `purge` is global. Scope it with `list` and selective `replay` if you need to preserve a subset.