Rate limiting & retry

Every LLM provider client sits behind a token bucket and a bounded retry policy with decorrelated jittered backoff. This page is the definitive reference for those two mechanisms.

Source: crates/llm/src/retry.rs, crates/llm/src/rate_limiter.rs, crates/llm/src/quota_tracker.rs.

Rate limiter

Token bucket, acquired before every outbound request.

  • interval = 1 / requests_per_second
  • One token per request
  • Each slot refills one interval after it is consumed
  • Per-provider, per-agent — each client has its own bucket, so one noisy agent can't starve another even when they share a provider

Configured in the rate_limit block:
rate_limit:
  requests_per_second: 2.0
  quota_alert_threshold: 100000   # optional

At 2.0 rps, the bucket tops up a slot every 500 ms. A burst of 3 requests will wait briefly on the third.
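
A minimal sketch of the acquire path, assuming a fractional-token bucket. TokenBucket, capacity, and acquire are illustrative names, not the real rate_limiter.rs API:

use std::time::{Duration, Instant};

/// Illustrative token bucket: `capacity` tokens, one consumed per request,
/// refilled continuously at `requests_per_second`.
struct TokenBucket {
    interval: Duration, // 1 / requests_per_second
    capacity: f64,
    tokens: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(requests_per_second: f64, capacity: f64) -> Self {
        Self {
            interval: Duration::from_secs_f64(1.0 / requests_per_second),
            capacity,
            tokens: capacity,
            last_refill: Instant::now(),
        }
    }

    /// Block until one token is available, then consume it.
    fn acquire(&mut self) {
        loop {
            // Credit tokens for the time elapsed since the last refill.
            let elapsed = self.last_refill.elapsed().as_secs_f64();
            self.tokens = (self.tokens + elapsed / self.interval.as_secs_f64())
                .min(self.capacity);
            self.last_refill = Instant::now();

            if self.tokens >= 1.0 {
                self.tokens -= 1.0;
                return;
            }
            // Sleep just long enough for the missing fraction of a token.
            std::thread::sleep(self.interval.mul_f64(1.0 - self.tokens));
        }
    }
}

With requests_per_second = 2.0 and an assumed burst capacity of 2, the first two requests in a burst pass immediately and the third waits roughly 500 ms, matching the example above.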

Quota tracker

Optional. When a provider returns remaining-quota info (in a response header or body), the quota_tracker records it via record_usage() on the token response. If the remaining quota drops below quota_alert_threshold, a structured warn log is emitted:

WARN quota threshold crossed  provider=minimax remaining=99500 threshold=100000

Pair this with a Prometheus log-scraping rule to turn the warning into an alert.
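
A sketch of the threshold check; the struct shape and the tracing call are illustrative, not the real quota_tracker.rs API:

use tracing::warn;

/// Illustrative tracker: remembers the provider-reported remaining quota
/// and emits the structured warning shown above once it dips below the
/// configured threshold.
struct QuotaTracker {
    alert_threshold: u64,
}

impl QuotaTracker {
    /// Called with whatever remaining-quota figure the provider returned
    /// (response header or body).
    fn record_usage(&self, provider: &str, remaining: u64) {
        if remaining < self.alert_threshold {
            warn!(provider, remaining, threshold = self.alert_threshold,
                  "quota threshold crossed");
        }
    }
}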

Retry policy

Retries live above the circuit breaker. They handle transient failures that don't warrant flipping the breaker.

| Error class | Max attempts | Backoff curve |
| --- | --- | --- |
| 429 (rate limit) | 5 | max(retry-after, jittered_backoff) |
| 5xx (server) | 3 | jittered_backoff |
| 401 (auth) | 1 refresh + 1 retry | (internal to the client) |
| Other 4xx | 0 (fail fast) | n/a |

Decorrelated jittered backoff

Not simple exponential — the next backoff is a uniform random draw from a range that grows with each attempt, capped at max_backoff_ms:

next = uniform(base, max(base, last × multiplier))

Defaults from llm.yaml retry block:

| Field | Default |
| --- | --- |
| initial_backoff_ms | 1000 |
| max_backoff_ms | 60000 |
| backoff_multiplier | 2.0 |

Why decorrelated jitter: multiple clients hitting the same 429 don't re-fire in lockstep. Desynchronization is built-in.
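
A sketch of one step of that draw, assuming the rand crate; the function name and signature are illustrative, and retry.rs may implement it differently:

use rand::Rng;

/// One decorrelated-jitter step, in milliseconds: uniform between `base`
/// and `last * multiplier`, with the upper bound capped at `max_backoff`
/// (assumes base <= max_backoff).
fn next_backoff_ms(base: u64, max_backoff: u64, multiplier: f64, last: u64) -> u64 {
    let upper = ((last as f64 * multiplier) as u64)
        .max(base)
        .min(max_backoff);
    rand::thread_rng().gen_range(base..=upper)
}

// With the defaults above, the first call (last = 1000) draws from
// [1000, 2000]; if it lands on 1500, the next call draws from
// [1000, 3000], and so on until the cap at 60000.

The full retry flow: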

flowchart LR
    REQ[request] --> API{API response}
    API -->|200| OK[return ChatResponse]
    API -->|429| RL[RateLimit]
    API -->|5xx| SE[ServerError]
    API -->|401| AU[CredentialInvalid]
    API -->|4xx| F[Other fail fast]

    RL --> D1{"attempts<br/>< 5?"}
    SE --> D2{"attempts<br/>< 3?"}
    AU --> REF[auth refresh<br/>+ single retry]
    D1 -->|yes| BO1["wait max(retry_after,<br/>jittered_backoff)"]
    D1 -->|no| F
    D2 -->|yes| BO2[wait jittered_backoff]
    D2 -->|no| F
    BO1 --> REQ
    BO2 --> REQ
    REF --> REQ

Error classification per provider

The providers classify HTTP responses into a shared LlmError so the retry layer can be common code:

| HTTP | LlmError variant | Retried? |
| --- | --- | --- |
| 200 | Ok(ChatResponse) | n/a |
| 429 | RateLimit { retry_after_ms } | ✅ up to 5 |
| 5xx | ServerError { status, body } | ✅ up to 3 |
| 401 / 403 | CredentialInvalid | ❌ (client handles refresh internally) |
| Other 4xx | Other | ❌ (fail fast) |
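
A sketch tying that classification to the retry decision. The variant fields follow the table, but the Option, the helper name, and its signature are assumptions:

/// Shared error classification (shape as in the table above).
enum LlmError {
    RateLimit { retry_after_ms: Option<u64> },
    ServerError { status: u16, body: String },
    CredentialInvalid,
    Other,
}

/// How long to wait before the next attempt, or None to fail fast.
/// `backoff_ms` is the current jittered backoff; `attempts` counts tries so far.
fn retry_delay_ms(err: &LlmError, attempts: u32, backoff_ms: u64) -> Option<u64> {
    match err {
        // 429: up to 5 attempts, honouring retry-after when it is longer.
        LlmError::RateLimit { retry_after_ms } if attempts < 5 => {
            Some(retry_after_ms.unwrap_or(0).max(backoff_ms))
        }
        // 5xx: up to 3 attempts with plain jittered backoff.
        LlmError::ServerError { .. } if attempts < 3 => Some(backoff_ms),
        // 401/403 refresh happens inside the client; other 4xx fail fast.
        _ => None,
    }
}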

Tuning

  • Bursty workloads: bump requests_per_second cautiously; the upstream's own rate limits won't move, so you'll just pay more 429s to find the ceiling.
  • Flaky networks: raise max_attempts for 5xx; keep max_backoff_ms bounded so slow agents don't spiral.
  • Subscription plans: lower requests_per_second to keep daily usage under caps; pair with quota_alert_threshold.
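
For instance, a profile for a capped subscription plan might look like this (the rate_limit keys appear on this page; the retry-block layout and the placement of max_attempts are assumptions):

rate_limit:
  requests_per_second: 0.5          # stay well under the daily cap
  quota_alert_threshold: 50000      # warn while there is headroom left
retry:
  max_attempts: 5                   # more 5xx retries on a flaky network
  initial_backoff_ms: 1000
  max_backoff_ms: 30000             # bounded, so slow agents don't spiral
  backoff_multiplier: 2.0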

See also