🗜️ Prompt Compression Guide — OmniRoute
Save 15-95% on eligible context automatically. For a quick overview, see the README Compression section.
Overview
OmniRoute implements a modular prompt compression pipeline that runs proactively before requests hit upstream providers. This means your token savings happen transparently — no changes needed to your workflow.
Client Request
→ Compression Strategy Selector
→ Combo override? → Use combo setting
→ Auto-trigger threshold? → Use auto mode
→ Default mode? → Use global setting
→ Off? → Skip compression
→ Selected Compression Mode
→ Off: No compression
→ Lite: Safe whitespace/formatting cleanup (~15%)
→ Standard: Caveman-speak filler removal (~30%)
→ Aggressive: History aging + summarization (~50%)
→ Ultra: Heuristic pruning + code-block thinning (~75%)
→ RTK: Command-aware terminal/tool-output filtering (60-90% upstream range)
→ Stacked: Ordered multi-engine pipeline, usually RTK then Caveman (78-95% eligible range)
→ Compressed Request → ProviderCompression Modes
Off
No compression applied. All messages pass through unchanged.
Lite Mode (~15% savings, <1ms latency)
The safest mode — zero semantic change, only formatting cleanup:
| Technique | Description |
|---|---|
collapseWhitespace | Merge consecutive blank lines and trailing spaces |
dedupSystemPrompt | Remove duplicate system messages |
compressToolResults | Compress verbose tool/function outputs |
removeRedundantContent | Strip repeated instructions |
replaceImageUrls | Shorten base64 image data URIs |
Best for: Always-on usage, safety-critical workflows.
Standard Mode (~30% savings)
Inspired by Caveman — removes filler words and verbose phrasing while preserving meaning:
- Removes filler words ("please", "I think", "basically", "actually")
- Condenses verbose phrases ("in order to" → "to", "as a result of" → "because")
- Strips polite hedging ("Would you mind...", "If you could possibly...")
- 30+ regex rules tuned for coding prompts
Best for: Daily coding workflows, cost-conscious teams.
Aggressive Mode (~50% savings)
Smart history management for long sessions:
- Message Aging — older messages get progressively compressed
- Tool Result Summarization — long tool outputs replaced with summaries
- Structural Integrity Guards — ensures
tool_use+tool_resultpairs stay consistent - Context Window Awareness — respects per-model token limits
Best for: Extended debugging sessions, large codebases.
Ultra Mode (~75% savings)
Maximum compression for token-critical scenarios:
- Heuristic Pruning — removes messages below relevance threshold
- Code Block Thinning — compresses repetitive code examples
- Binary Search Truncation — finds optimal cut point for context window
- All Aggressive mode features included
Best for: When you're hitting context limits repeatedly.
RTK Mode (60-90% upstream range)
RTK mode is optimized for verbose tool outputs that appear in coding-agent sessions:
- Detects command/output classes such as
git status,git diff,git log, test runners, TypeScript/Vite/Webpack builds, ESLint/Biome/Prettier, npm audit/installs, Docker logs, infra output, and generic shell output - Applies JSON filter packs from
open-sse/services/compression/engines/rtk/filters/ - Ships 49 built-in filters with inline verify samples
- Removes ANSI control sequences, progress bars, repeated lines, and non-actionable noise
- Preserves failures, errors, warnings, changed files, summaries, and the tail of long output
- Supports trust-gated project filters, global filters, and optional redacted raw-output recovery
Best for: Agent sessions with shell, build, test, git, grep, and file-output transcripts.
Stacked Mode (78-95% eligible range)
Stacked mode runs multiple compression engines in a deterministic order. The default pipeline is:
RTK -> CavemanThat order keeps terminal/tool output compact first, then applies Caveman semantic condensation to the remaining natural-language prompt. Stacked pipelines can be configured globally or through compression combos assigned to routing combos.
Best for: Mixed context with large tool logs plus human instructions or assistant summaries.
Upstream Savings Math
OmniRoute documents compression savings from two sources: upstream project benchmarks and OmniRoute's own engine composition.
| Source | Upstream README number used here |
|---|---|
| Caveman | ~75% fewer output tokens, 65% benchmark average output savings, 22-87% range, and ~46% input compression tool |
| RTK | 60-90% command-output savings; sample session ~118,000 -> ~23,900 tokens, or 79.7% saved (~80%) |
For overlapping tool/context payloads, the default OmniRoute combo stacks the engines:
RTK -> CavemanThe combined savings are multiplicative, not additive:
combined = 1 - (1 - RTK savings) * (1 - Caveman input savings)
average = 1 - (1 - 0.80) * (1 - 0.46) = 89.2%
range = 1 - (1 - 0.60..0.90) * (1 - 0.46) = 78.4-94.6%That 78-95% number applies when both RTK and Caveman can reduce the same input/context payload.
Caveman response output mode is separate: when enabled, use Caveman's own output savings (65%
average, ~75% headline, 22-87% range). Total billing savings depend on your prompt/output mix.
Token Savings Visualization
Without compression: 47K tokens sent to LLM
With Lite: 40K tokens sent (15% saved — safe, always-on)
With Standard: 33K tokens sent (30% saved — caveman-speak rules)
With Aggressive: 24K tokens sent (50% saved — aging + summarization)
With Ultra: 12K tokens sent (75% saved — heuristic pruning)
With RTK: 19K-5K tokens sent (60-90% saved on command/tool output)
With Stacked: 10K-2.5K tokens sent (78-95% eligible RTK+Caveman range)Configuration
Dashboard
Navigate to Dashboard → Context & Cache:
- Caveman — mode selection, language packs, preview, and global defaults
- RTK — command-filter preview, RTK safety settings, and filter catalog
- Compression Combos — named engine pipelines assigned to routing combos
- Auto-Trigger Threshold — automatically engage compression when token count exceeds threshold
Per-Combo Override
In Dashboard → Context & Cache → Compression Combos, assign a compression combo to a routing
combo:
Combo: "free-forever"
Compression Combo: "coding-agent-stack"
Pipeline: RTK -> Caveman
Targets:
1. gc/gemini-3-flash
2. if/kimi-k2-thinkingThis lets you use stacked compression on free/coding providers while keeping lite mode on paid subscriptions.
API
# Get compression settings
curl http://localhost:20128/api/settings/compression
# Update compression settings
curl -X PUT http://localhost:20128/api/settings/compression \
-H "Content-Type: application/json" \
-d '{"defaultMode":"stacked","autoTriggerMode":"stacked","autoTriggerTokens":32000}'
# Preview a specific RTK/stacked payload
curl -X POST http://localhost:20128/api/compression/preview \
-H "Content-Type: application/json" \
-d '{"mode":"rtk","messages":[{"role":"tool","content":"npm test output here"}]}'
# List RTK filter packs
curl http://localhost:20128/api/context/rtk/filters
# Test RTK directly with optional command metadata
curl -X POST http://localhost:20128/api/context/rtk/test \
-H "Content-Type: application/json" \
-d '{"command":"npm test","text":"FAIL tests/example.test.ts\nError: boom"}'What Gets Protected
The compression engine always preserves:
- ✅ Code blocks (fenced and inline)
- ✅ URLs and file paths
- ✅ JSON structures and structured data
- ✅ Identifiers and protected technical tokens
- ✅ Mathematical expressions
- ✅ Tool/function call definitions
- ✅ System prompts (in lite mode)
RTK raw-output recovery redacts common API keys, bearer tokens, Slack tokens, AWS access keys, passwords, tokens, and secrets before anything is persisted.
Compression Stats
Every compressed request includes stats in the server logs:
{
"originalTokens": 47200,
"compressedTokens": 40120,
"savingsPercent": 15.0,
"techniquesUsed": ["collapseWhitespace", "dedupSystemPrompt"],
"mode": "lite",
"engine": "caveman",
"compressionComboId": "coding-agent-stack",
"durationMs": 0.8,
"rtkRawOutputPointers": []
}Phase Roadmap
| Phase | Modes | Status |
|---|---|---|
| Phase 1 | Off, Lite | ✅ Shipped |
| Phase 2 | Standard, Aggressive, Ultra | ✅ Shipped |
| Phase 3 | RTK, Stacked, Compression Combos | ✅ Shipped |
| Phase 4 | Per-model adaptive, ML-based pruning | 🗓️ Planned |
Acknowledgments
Standard mode compression rules are inspired by Caveman by JuliusBrussee (⭐ 51K+) — the viral "why use many token when few token do trick" project. Caveman reports ~75% fewer output tokens, 65% benchmark average output savings, a 22-87% output range, and a ~46% input-compression tool.
RTK mode is inspired by RTK - Rust Token Killer by RTK AI — the high-performance command-output compression project for terminal, build, test, git, and tool-output filtering. RTK reports 60-90% savings, with its README sample session showing ~80% saved.
Advanced Compression Systems
Beyond the 7 standard modes, OmniRoute includes several advanced compression systems that work automatically based on context.
Cache-Aware Compression
Some providers (like Anthropic with prompt caching) support prompt caching, which lets them cache parts of the prompt to reduce costs and latency. When caching is enabled, aggressive compression can actually hurt performance because it changes the cached tokens, invalidating the cache.
The cachingAware.ts module solves this by detecting caching context and
adjusting the compression strategy accordingly.
How it works
- Detect caching context — Scans the request body for
cache_controlmarkers - Identify caching providers — Checks if the target provider supports caching
- Adjust strategy — Downgrades
aggressive/ultratostandardfor caching providers - Skip system prompt — System prompts are usually cached, so don't compress them
- Use deterministic transformations — Only use transformations that produce consistent output
Code example
import { detectCachingContext, getCacheAwareStrategy } from "@omniroute/open-sse/services/compression/cachingAware";
const body = {
model: "anthropic/claude-sonnet-4.5",
messages: [{ role: "user", content: "Hello" }],
cache_control: { type: "ephemeral" }, // ← Cache marker
};
const ctx = detectCachingContext(body, { provider: "anthropic" });
// → { hasCacheControl: true, provider: "anthropic", isCachingProvider: true }
const strategy = getCacheAwareStrategy("aggressive", ctx);
// → { strategy: "standard", skipSystemPrompt: true, deterministicOnly: true }When to use
Cache-aware compression is always on — no configuration needed. It only kicks in when:
- The request has
cache_controlmarkers - The target provider supports prompt caching (Anthropic, OpenAI, etc.)
Progressive Aging
Long conversations accumulate many message turns, but older turns become less
relevant. The progressiveAging.ts module degrades messages by turn distance:
- Recent turns (0-3): Kept verbatim (full detail)
- Medium turns (4-8): Lite compression (whitespace, formatting cleanup)
- Old turns (9+): Caveman compression (filler removal, summarization)
- Very old turns (20+): Heavily summarized or dropped
Code example
import { applyAging } from "@omniroute/open-sse/services/compression/progressiveAging";
const messages = [
{ role: "system", content: "You are a helpful assistant" },
{ role: "user", content: "What is 2+2?" },
{ role: "assistant", content: "4" },
// ... 50 more turns ...
];
const { messages: aged, saved } = applyAging(messages, {
verbatim: 3, // First 3 turns: verbatim
light: 8, // Turns 4-8: lite compression
moderate: 20, // Turns 9-20: caveman compression
// Turns 21+: heavy summarization
});
// saved = number of tokens savedWhen to use
Progressive aging is always on for aggressive and ultra modes. It's
particularly effective for:
- Long-running coding sessions
- Multi-day conversations
- Agentic workflows with many tool calls
Caveman Output Mode
The outputMode.ts module injects system prompt instructions to make the
model itself produce compressed, terse output (a "caveman" style).
How it works
Instead of compressing the input, this mode adds a system prompt like:
"Reply in minimal words. Skip pleasantries. Use short sentences."
This works particularly well for:
- Code generation (terser output = fewer tokens)
- Quick Q&A (no need for elaborate explanations)
- Batch processing (maximize throughput)
When to use
Caveman output mode is opt-in — set it via the combo config:
{
"strategy": "auto",
"config": {
"auto": {
"outputMode": "caveman"
}
}
}Tool Result Compression
The toolResultCompressor.ts module provides 5 specialized compression strategies
for tool results (function calls, agent outputs, search results, etc.):
- Search result compression — Removes redundant results, keeps top-N
- File read compression — Truncates large files, preserves headers/imports
- Code execution compression — Keeps only essential stdout/stderr
- Database query compression — Limits rows, removes verbose metadata
- API response compression — Strips null fields, condenses arrays
When to use
Tool result compression is always on when tool calls are present. No configuration needed.
Stacked Pipeline
The stacked mode runs multiple engines in sequence — usually RTK first (60-90% savings on tool output), then Caveman (30% additional savings on the remaining text). This achieves 78-95% total savings.
How it works
Input (1000 tokens)
→ RTK (command-aware filter) → 200 tokens
→ Caveman (filler removal) → 140 tokens
→ Output (140 tokens, 86% savings)When to use
Use stacked mode for:
- Tool-heavy workflows (agentic coding, research)
- Cost-sensitive batch processing
- When you need maximum token savings
Configure via combo:
{
"strategy": "auto",
"config": {
"auto": {
"modePack": "stacked"
}
}
}Compression Combo Overrides
You can override the global compression mode per combo to fine-tune behavior for different use cases:
{
"id": "coding-combo",
"strategy": "priority",
"config": {
"auto": {
"weights": { "taskFit": 0.5 },
"modePack": "quality-first"
}
},
"compressionOverride": {
"mode": "aggressive",
"stackedPipelines": ["rtk", "caveman"],
"preserveToolDefinitions": true
}
}This is useful for:
- Coding combos: Use
aggressivemode for long sessions - Quick Q&A combos: Use
litemode for fast responses - Tool-heavy combos: Use
stackedmode for max savings - Production combos: Use
cache-awaremode for caching providers
See Also
- Environment Config — Compression environment variables
- Architecture Guide — Compression pipeline internals
- User Guide — Getting started with compression
- RTK Compression — RTK filters, trust model, verify gate, raw-output recovery
- Compression Engines — Caveman, RTK, stacked, APIs, MCP, dashboard
- Compression Rules Format — JSON rule-pack format
- Compression Language Packs — Language-specific Caveman rules