Context optimization
Four independent mechanisms reduce the number of tokens sent to the LLM
on every request, without changing the agent's behavior. They live
under `llm.context_optimization` in `llm.yaml` and can be flipped per
agent under `agents.<id>.context_optimization`.
```yaml
# config/llm.yaml
context_optimization:
  prompt_cache:
    enabled: true                # default
    long_ttl_providers: [anthropic, vertex]
  compaction:
    enabled: false               # default off — opt in per agent
    compact_at_pct: 0.75
    tail_keep_tokens: 20000
    tool_result_max_pct: 0.30
    summarizer_model: ""         # empty = reuse the agent's main model
    lock_ttl_seconds: 300
  token_counter:
    enabled: true                # default
    backend: auto                # auto | anthropic_api | tiktoken
    cache_capacity: 1024
  workspace_cache:
    enabled: true                # default
    watch_debounce_ms: 500
    max_age_seconds: 0           # 0 = never force refresh (notify is authoritative)
```
1. Prompt caching
Materializes the system prompt as a list of `cache_control` blocks on
the Anthropic wire, so the stable prefix (workspace + skills + tool
catalog + binding glue) is billed at 0.1× input cost on every cache
hit. OpenAI / DeepSeek paths surface their automatic
`prompt_tokens_details.cached_tokens` field through the same
`CacheUsage` struct. Gemini and MiniMax flatten the blocks into the
legacy `system` slot today (warned once per process).
Block layout (4 cache breakpoints, the Anthropic max):
- `workspace` — IDENTITY / SOUL / USER / AGENTS / MEMORY (`Ephemeral1h`)
- `skills` — per-binding skill catalog (`Ephemeral1h`)
- `binding_glue` — peer directory + per-binding system prompt + language directive (`Ephemeral1h`)
- `channel_meta` — sender id + per-turn context (`Ephemeral5m`)
The tools array is sorted alphabetically by name (the registry iterates a
non-deterministic `DashMap`), and the last tool gets a 1h
`cache_control` marker when `cache_tools=true`.
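A minimal sketch of why the deterministic ordering matters, using illustrative type names rather than the framework's own: provider-side prompt caching matches on an exact byte prefix, so any turn-to-turn reordering of the tool catalog would produce a different serialization and a cache miss.

```rust
// Illustrative sketch, not the framework's actual code: sorting the tool
// catalog keeps the serialized prefix byte-stable so the provider cache hits.
#[derive(Debug, Clone)]
struct ToolSpec {
    name: String,
    description: String,
}

fn stable_tool_catalog(mut tools: Vec<ToolSpec>) -> Vec<ToolSpec> {
    // A DashMap-style registry iterates in arbitrary order; sorting by name
    // restores a deterministic order before the tools are serialized.
    tools.sort_by(|a, b| a.name.cmp(&b.name));
    tools
}

fn main() {
    let tools = vec![
        ToolSpec { name: "write_file".into(), description: "write a file".into() },
        ToolSpec { name: "read_file".into(), description: "read a file".into() },
    ];
    for t in stable_tool_catalog(tools) {
        println!("{}: {}", t.name, t.description);
    }
}
```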
What to watch
- `llm_cache_read_tokens_total{agent, provider, model}` — should dominate `llm_cache_creation_tokens_total` after the first turn of a warm session.
- `llm_cache_hit_ratio{agent}` — target >0.7 on multi-turn agents; <0.3 means you're paying the write premium without the discount.
When to flip off
- Provider rejects the request with a 400 mentioning `cache_control` (very old model). Mitigation: the framework already strips markers for `claude-2.x`; if Anthropic adds another exception, override `ANTHROPIC_CACHE_BETA="..."` to disable the beta header.
- A custom-built LLM gateway in front of Anthropic doesn't pass the `cache_control` field through.
2. Compaction (online history folding)
When the pre-flight token estimate crosses `compact_at_pct * effective_window`, the agent runs a secondary LLM call to fold
`history[..tail_start]` into a single summary string. The summary
replaces the head; the last `tail_keep_tokens` worth of turns ride
forward verbatim. Subsequent turns prepend the summary as a synthetic
`user`/`assistant` pair so Anthropic's role-alternation rule stays valid.
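The trigger arithmetic, as a minimal sketch under assumed names (the real config struct and token accounting differ in detail): the threshold is `compact_at_pct * effective_window`, and everything outside the verbatim tail is a folding candidate.

```rust
// Sketch only: field and function names are illustrative, not the framework's.
struct CompactionCfg {
    compact_at_pct: f64,     // default 0.75
    tail_keep_tokens: usize, // default 20_000
}

/// Returns how many head tokens are candidates for folding, or None if the
/// pre-flight estimate is still below the threshold.
fn fold_budget(cfg: &CompactionCfg, estimated: usize, effective_window: usize) -> Option<usize> {
    let threshold = (cfg.compact_at_pct * effective_window as f64) as usize;
    if estimated < threshold {
        return None; // under budget: send history unchanged
    }
    // Everything except the verbatim tail is eligible for summarization.
    Some(estimated.saturating_sub(cfg.tail_keep_tokens))
}

fn main() {
    let cfg = CompactionCfg { compact_at_pct: 0.75, tail_keep_tokens: 20_000 };
    // 200k-token window, 180k-token estimate: compaction triggers.
    assert_eq!(fold_budget(&cfg, 180_000, 200_000), Some(160_000));
    // A 100k-token estimate stays under the 150k threshold, so nothing is folded.
    assert_eq!(fold_budget(&cfg, 100_000, 200_000), None);
}
```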
The defaults are intentionally conservative: compaction is off. Roll it out
per agent via `agents.<id>.context_optimization.compaction: true`.
```yaml
agents:
  - id: ana
    context_optimization:
      compaction: true   # ana opts in early, others stay off
```
What to watch
- `llm_compaction_triggered_total{agent, outcome}` — outcomes are `ok`, `failed`, `lock_held`, `no_boundary`, `tool_result_truncated`.
- `llm_compaction_duration_seconds{agent, outcome="ok"|"failed"}` — a rising p99 means the summarizer model is overloaded; lower `compact_at_pct` so triggers are smaller (cheaper) and more frequent.
When to flip off
- Quality regression in long sessions — the summary may be losing active-task state. Inspect `compactions_v1` rows in the SQLite store to see what was folded; bump `tail_keep_tokens` so more verbatim context survives.
- Lock contention spikes — multiple processes (NATS multi-node) racing on the same session. The lock is per-session, so this only happens with sticky-session misrouting; fix at the broker level rather than disabling compaction.
Safety nets
- `compaction_locks_v1` carries a TTL (`lock_ttl_seconds`) — a crashed compactor doesn't deadlock the session; the next acquire after the TTL wins automatically (sketched after this list).
- Audit log: every successful compaction inserts a row in `compactions_v1` with the summary text + token cost. Inspect with `sqlite3 memory.db "SELECT * FROM compactions_v1 WHERE session_id = ? ORDER BY compacted_at DESC"`.
- Failure path: 3 retries with backoff; on total failure the original history goes to the LLM unchanged (graceful degradation, never silent data loss).
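A sketch of the TTL takeover from the first bullet, with illustrative names rather than the real schema: a lock whose TTL has elapsed can be taken by the next compactor, so a crashed holder never blocks the session for good.

```rust
// Sketch only: the real lock lives in SQLite; this just shows the takeover rule.
use std::time::{Duration, SystemTime};

struct CompactionLock {
    holder: String,
    expires_at: SystemTime,
}

/// A new compactor may take the lock when none exists or the previous
/// holder's TTL has elapsed (for example, it crashed mid-compaction).
fn try_acquire(
    existing: Option<&CompactionLock>,
    me: &str,
    now: SystemTime,
    ttl: Duration,
) -> Option<CompactionLock> {
    match existing {
        Some(lock) if now < lock.expires_at => None, // still held, back off
        _ => Some(CompactionLock { holder: me.to_string(), expires_at: now + ttl }),
    }
}

fn main() {
    let now = SystemTime::now();
    let ttl = Duration::from_secs(300); // lock_ttl_seconds
    let stale = CompactionLock { holder: "crashed-node".into(), expires_at: now - ttl };
    // The stale lock's TTL has elapsed, so the next acquire wins automatically.
    let taken = try_acquire(Some(&stale), "node-b", now, ttl).expect("takeover should succeed");
    assert_eq!(taken.holder, "node-b");
}
```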
3. Token counting (pre-flight sizing)
`TokenCounter` trait with two backends:

- `AnthropicTokenCounter` — calls `POST /v1/messages/count_tokens`. Exact (matches billing). LRU-cached on `blake3(payload)`: the stable tools+identity prefix hashes the same on every turn, so the network round-trip happens ~once per process lifetime.
- `TiktokenCounter` — offline `cl100k_base` approximation. Drift vs Anthropic billing measured at 5–15%. Fine for budget gating, not for hard limits.
The cascade wraps the primary in a `CircuitBreaker`
(`failure_threshold=3`, 30s→300s backoff): on a `count_tokens` outage the
agent loop falls back to tiktoken so the request still goes through.
Once the breaker has opened at least once, `is_exact()` flips to `false`
for the rest of the process so dashboards don't conflate sample
populations.
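A simplified sketch of that cascade, assuming trait and type names that are not the framework's, and collapsing the breaker to a single sticky failure flag instead of `failure_threshold=3` with backoff:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative trait; the real TokenCounter API may differ.
trait Counter {
    fn count(&self, payload: &str) -> Result<usize, String>;
}

struct Cascade<P: Counter, F: Counter> {
    primary: P,                 // exact backend, e.g. a count_tokens API client
    fallback: F,                // approximate backend, e.g. a tiktoken-style estimator
    primary_failed: AtomicBool, // sticky flag: once set, is_exact() stays false
}

impl<P: Counter, F: Counter> Cascade<P, F> {
    fn count(&self, payload: &str) -> usize {
        match self.primary.count(payload) {
            Ok(n) => n,
            Err(_) => {
                // Fall back so the request still goes through, and remember
                // that later counts are approximations.
                self.primary_failed.store(true, Ordering::Relaxed);
                self.fallback.count(payload).unwrap_or(0)
            }
        }
    }

    // Dashboards use this to avoid mixing exact and estimated samples.
    fn is_exact(&self) -> bool {
        !self.primary_failed.load(Ordering::Relaxed)
    }
}

// Toy backends so the sketch runs on its own.
struct AlwaysFails;
impl Counter for AlwaysFails {
    fn count(&self, _: &str) -> Result<usize, String> {
        Err("count_tokens outage".into())
    }
}

struct WordCount;
impl Counter for WordCount {
    fn count(&self, s: &str) -> Result<usize, String> {
        Ok(s.split_whitespace().count())
    }
}

fn main() {
    let cascade = Cascade {
        primary: AlwaysFails,
        fallback: WordCount,
        primary_failed: AtomicBool::new(false),
    };
    assert_eq!(cascade.count("three word payload"), 3);
    assert!(!cascade.is_exact()); // exactness is lost for the rest of the process
}
```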
What to watch
- `llm_prompt_tokens_estimated{agent, provider, model}` — compare against `llm_prompt_tokens_drift{...}` (histogram, in percent).
- A drift p99 climbing past 20% means the active backend is wrong for your model — switch from `tiktoken` to `anthropic_api` (or vice versa for non-Anthropic providers).
When to flip off
- The agent runs against a self-hosted gateway that doesn't honor `count_tokens`. Set `backend: tiktoken` to skip the round-trip.
4. Workspace bundle cache
Reads of the IDENTITY / SOUL / USER / AGENTS / MEMORY MDs go through an
in-memory `Arc<WorkspaceBundle>` cache keyed by `(root, scope, sorted extras)`. A `notify-debouncer-full` watcher (default 500ms) drops
every entry under a workspace root when any `*.md` changes. Non-MD
file changes are ignored.
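A sketch of the key and the root-wide invalidation, under assumed type names (the real `WorkspaceBundle` and cache are richer than this): extras are sorted before they enter the key, and a watcher event for any Markdown file under a root drops every bundle built from that root.

```rust
// Sketch only: illustrative types, not the framework's actual cache.
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::Arc;

#[derive(Clone, PartialEq, Eq, Hash)]
struct BundleKey {
    root: PathBuf,
    scope: String,
    extras: Vec<String>, // kept sorted so argument order can't split the cache
}

struct WorkspaceBundle {
    merged_markdown: String, // IDENTITY / SOUL / USER / AGENTS / MEMORY contents
}

#[derive(Default)]
struct BundleCache {
    entries: HashMap<BundleKey, Arc<WorkspaceBundle>>,
}

impl BundleCache {
    fn key(root: &Path, scope: &str, mut extras: Vec<String>) -> BundleKey {
        extras.sort();
        BundleKey { root: root.to_path_buf(), scope: scope.to_string(), extras }
    }

    /// Called from the debounced watcher: any *.md change under a root drops
    /// every bundle built from that root.
    fn invalidate_root(&mut self, root: &Path) {
        self.entries.retain(|k, _| k.root.as_path() != root);
    }
}

fn main() {
    let mut cache = BundleCache::default();
    let key = BundleCache::key(Path::new("/srv/ws/ana"), "chat", vec!["AGENTS".into()]);
    cache.entries.insert(key, Arc::new(WorkspaceBundle { merged_markdown: String::new() }));
    cache.invalidate_root(Path::new("/srv/ws/ana")); // watcher saw an *.md edit
    assert!(cache.entries.is_empty());
}
```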
What to watch
- `workspace_cache_hits_total{path}` should dominate `workspace_cache_misses_total{path}` once the cache is warm.
- `workspace_cache_invalidations_total{path}` rising without operator edits points to a tool that writes to the workspace too aggressively.
When to flip off
- NFS / FUSE filesystems where `notify(7)` drops events. Set `workspace_cache.max_age_seconds: 60` (or similar) to force a refresh after the absolute TTL even without a watch event.
Per-agent overrides
The four enables — and only the enables — can be flipped per agent in
`agents.yaml`. The numeric knobs (`compact_at_pct`, `tail_keep_tokens`,
`watch_debounce_ms`, …) stay global to keep the surface narrow.
```yaml
agents:
  - id: ana
    context_optimization:
      prompt_cache: true
      compaction: true
      token_counter: true
      workspace_cache: true
  - id: bob
    context_optimization:
      prompt_cache: false   # bob runs against a gateway that strips cache_control
```
Hot-reload behavior
Changing global knobs (`llm.yaml`) takes effect on the next request
once the reload coordinator picks up the file change (Phase 18). For
per-agent enables, the override rides on `Arc<AgentConfig>` inside
`RuntimeSnapshot` and is observed on the next
`policy_for(...)` lookup. The `LlmAgentBehavior` struct itself still
caches its `compactor` / `prompt_cache_enabled` fields at construction —
toggling those without a process restart requires the future
`ArcSwap<CompactionRuntime>` refactor noted in `proyecto/FOLLOWUPS.md`.
Rollout playbook
- Deploy with everything at defaults — `prompt_cache=true`, `compaction=false`, `token_counter=true`, `workspace_cache=true`.
- Watch `llm_cache_hit_ratio` for 24h. Expect it to climb to >0.7 on chatty agents; if it stays low, check that the workspace bundle is stable across turns (no MD writes mid-session).
- Pick one agent, opt it into compaction (`agents.<id>.context_optimization.compaction: true`), reload config, and watch for a week.
- If `llm_compaction_triggered_total{outcome="ok"}` > 0 and quality feedback is positive, roll compaction out to the rest of the fleet.
- If drift on `llm_prompt_tokens_drift` is consistently <10%, leave `token_counter.backend: auto`. If higher, consider `backend: tiktoken` for non-Anthropic providers — it saves the round-trip without losing accuracy you didn't have anyway.