Compression Language Packs
Caveman compression can load language-specific rule packs in addition to the built-in English rules. This keeps the core engine stable while allowing Portuguese, Spanish, German, French, Japanese, and future language packs to evolve independently.
Location
Language packs live under:
open-sse/services/compression/rules/<language>/
Current shipped packs (verified against rules/ directory contents):
| Language | Directory | Rule categories present |
|---|---|---|
| English | rules/en/ | context, dedup, filler, structural, ultra |
| Spanish | rules/es/ | context, dedup, filler, structural, ultra |
| Portuguese (Brazil) | rules/pt-BR/ | context, dedup, filler, structural, ultra |
| German | rules/de/ | context, filler, structural |
| French | rules/fr/ | context, filler, structural |
| Japanese | rules/ja/ | context, filler, structural |
Parity note:
The en, es, and pt-BR packs have the full 5 categories; de, fr, and ja ship 3 categories. The missing dedup and ultra categories silently fall back to the English built-ins. Contributions are welcome to add dedup.json and ultra.json for the smaller packs.
The pt-BR pack is based on Troglodita by Lenine Júnior, a compression system designed from scratch for Brazilian Portuguese grammar (pleonasm reduction, PT-BR filler removal, technical abbreviations for the Brazilian developer community).
The canonical category list and per-category schema live in open-sse/services/compression/rules/_schema.json (JSON Schema draft 2020-12).
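The per-category fallback described in the parity note can be sketched as follows. This is a minimal illustration, not the engine's actual loader: the `SHIPPED` map simply mirrors the table above, and `resolveCategorySources` is a hypothetical helper name.

```typescript
// Hypothetical shapes for illustration; the real schema lives in rules/_schema.json.
type Category = "context" | "dedup" | "filler" | "structural" | "ultra";

const CATEGORIES: Category[] = ["context", "dedup", "filler", "structural", "ultra"];
const PARTIAL: Category[] = ["context", "filler", "structural"];

// Which categories each pack ships, copied from the table above.
const SHIPPED: Record<string, Category[]> = {
  en: CATEGORIES,
  es: CATEGORIES,
  "pt-BR": CATEGORIES,
  de: PARTIAL,
  fr: PARTIAL,
  ja: PARTIAL,
};

// For every category, report which pack supplies the rules.
// Categories the requested pack does not ship fall back to English.
function resolveCategorySources(language: string): Record<Category, string> {
  const shipped = SHIPPED[language] ?? [];
  const sources = {} as Record<Category, string>;
  for (const category of CATEGORIES) {
    sources[category] = shipped.includes(category) ? language : "en";
  }
  return sources;
}

console.log(resolveCategorySources("de"));
```

For `de`, the `dedup` and `ultra` entries resolve to `en` while the other three stay on the German pack, which is the silent fallback the note warns about.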
Language Detection
languageDetector.ts uses lightweight heuristics to infer the language from prompt text. The
configured default language is still respected, and detection can be disabled in config (the
autoDetect and autoDetectLanguage flags shown under Config Shape) when exact control is required.
Detection output is used only to choose rule packs. It does not change provider routing, locale selection, or UI language.
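To make "lightweight heuristics" concrete, here is one possible approach: count language-specific marker words and use a script check for Japanese. This is only a sketch under stated assumptions. The marker lists, the kana check, and the `detectLanguage` signature are hypothetical and are not taken from languageDetector.ts.

```typescript
// Hypothetical marker words; the real detector may use entirely different signals.
const MARKERS: Record<string, string[]> = {
  en: ["the", "and", "please", "would", "this", "is"],
  "pt-BR": ["você", "voce", "não", "isso", "gostaria", "eu", "uma", "obrigado"],
  es: ["usted", "esto", "gracias", "quiero", "pero", "también", "el", "los"],
  de: ["der", "die", "das", "und", "nicht", "bitte", "ich"],
  fr: ["le", "les", "est", "vous", "merci", "je", "une"],
};

// Hiragana and katakana are a cheap, fairly reliable signal for Japanese.
const KANA = /[\u3040-\u30ff]/;

function detectLanguage(text: string, defaultLanguage: string, autoDetect = true): string {
  // Detection disabled: the configured default always wins.
  if (!autoDetect) return defaultLanguage;
  if (KANA.test(text)) return "ja";

  const tokens = text.toLowerCase().split(/[\s.,;:!?¿¡()"']+/).filter(Boolean);
  let best = defaultLanguage;
  let bestScore = 0;
  for (const [language, words] of Object.entries(MARKERS)) {
    const score = tokens.filter((token) => words.includes(token)).length;
    if (score > bestScore) {
      best = language;
      bestScore = score;
    }
  }
  // With no marker hits, bestScore stays 0 and the default is returned.
  return best;
}

console.log(detectLanguage("Por favor, eu gostaria que voce basicamente resumisse isso.", "en"));
```

The result only selects a rule pack. A heuristic like this will misfire on short or mixed-language prompts, which is one reason to keep the default-language override available.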
Config Shape
Compression settings can include:
{
  "languageConfig": {
    "enabled": true,
    "defaultLanguage": "en",
    "autoDetect": true,
    "enabledPacks": ["en", "pt-BR", "es", "de", "fr", "ja"]
  },
  "cavemanConfig": {
    "language": "en",
    "autoDetectLanguage": true,
    "enabledLanguagePacks": ["en", "pt-BR", "es", "de", "fr", "ja"]
  }
}
languageConfig controls dashboard/preview defaults. cavemanConfig is the runtime engine config
used when Caveman compresses message text.
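As a rough sketch of how the runtime fields might interact, the snippet below picks a pack from the cavemanConfig fields shown above. The field names come from the config example. The `selectPack` helper and its precedence order (detected language, then configured language, then English) are assumptions for illustration, not the engine's documented behavior.

```typescript
// Field names mirror the cavemanConfig example; the interface name is hypothetical.
interface CavemanLanguageSettings {
  language: string;
  autoDetectLanguage: boolean;
  enabledLanguagePacks: string[];
}

// Assumed precedence: detected language (if auto-detect is on and the pack is enabled),
// then the configured language, then the built-in English rules.
function selectPack(config: CavemanLanguageSettings, detected: string | null): string {
  const candidate = config.autoDetectLanguage && detected ? detected : config.language;
  if (config.enabledLanguagePacks.includes(candidate)) return candidate;
  return config.enabledLanguagePacks.includes(config.language) ? config.language : "en";
}

const settings: CavemanLanguageSettings = {
  language: "en",
  autoDetectLanguage: true,
  enabledLanguagePacks: ["en", "pt-BR", "es"],
};

console.log(selectPack(settings, "pt-BR")); // detected pack is enabled
console.log(selectPack(settings, "de"));    // detected pack is not enabled
```

Setting autoDetectLanguage to false makes the configured language the only candidate, which is the "exact control" case.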
Adding a Language Pack
- Create open-sse/services/compression/rules/<language>/<pack>.json.
- Use the Caveman rule format from docs/compression/COMPRESSION_RULES_FORMAT.md.
- Keep replacements conservative and avoid changing code, identifiers, URLs, or JSON.
- Add or update tests for language selection and replacement behavior.
- Expose new dashboard/i18n labels if the language appears in UI selectors.
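One way to test the "keep replacements conservative" guideline is to check that applying a pack's rules to a sample leaves inline code and URLs byte-for-byte unchanged. The sketch below assumes a simplified rule shape (`pattern`, `flags`, `replacement`) purely for illustration. The real format is defined in COMPRESSION_RULES_FORMAT.md and may differ, and both helpers are hypothetical.

```typescript
// Simplified, hypothetical rule shape; not the documented Caveman rule format.
interface SketchRule {
  pattern: string;
  flags: string;
  replacement: string;
}

function applyRules(text: string, rules: SketchRule[]): string {
  return rules.reduce(
    (acc, rule) => acc.replace(new RegExp(rule.pattern, rule.flags), rule.replacement),
    text,
  );
}

// Collect the spans a pack should never alter: inline code and URLs.
function protectedSpans(text: string): string[] {
  return text.match(/`[^`]+`|https?:\/\/\S+/g) ?? [];
}

// True when every protected span survives rule application unchanged.
function leavesProtectedSpansIntact(rules: SketchRule[], sample: string): boolean {
  const before = protectedSpans(sample);
  const after = protectedSpans(applyRules(sample, rules));
  return before.length === after.length && before.every((span, i) => span === after[i]);
}

const sample = "Por favor, veja `por_favor_flag` em https://example.com/por-favor agora.";
const safeRule: SketchRule = { pattern: "\\bPor favor,\\s*", flags: "g", replacement: "" };
const unsafeRule: SketchRule = { pattern: "por[-_ ]favor", flags: "gi", replacement: "" };

console.log(leavesProtectedSpansIntact([safeRule], sample));   // true
console.log(leavesProtectedSpansIntact([unsafeRule], sample)); // false
```

The unsafe rule fails because its loose pattern also matches inside the identifier and the URL path, which is exactly the kind of collateral edit a language pack should avoid.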
API
Available packs can be queried with:
curl http://localhost:20128/api/compression/language-packs
The preview endpoint accepts language config overrides:
curl -X POST http://localhost:20128/api/compression/preview \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "standard",
    "text": "Por favor, eu gostaria que voce basicamente resumisse isso.",
    "config": {
      "languageConfig": {
        "defaultLanguage": "pt-BR",
        "autoDetect": true
      }
    }
  }'
SHARED_BOUNDARIES (v3.8.0)
All 6 language packs received a SHARED_BOUNDARIES clause in v3.8.0 that is applied at every
Caveman intensity (LITE, FULL, ULTRA). It instructs the engine to preserve these patterns verbatim,
regardless of surrounding filler removal:
| Pattern type | Example |
|---|---|
| Fenced code blocks | ```python\n...\n``` |
| Inline code | `my_var` |
| URLs | https://example.com/path |
| File paths (absolute + relative) | /etc/hosts, ./src/index.ts |
| Error headers | Error:, TypeError:, SyntaxError: |
| Stack trace lines | at functionName (file.ts:12:3) |
These patterns are populated in DEFAULT_CAVEMAN_CONFIG.preservePatterns (previously []). The
constant lives in open-sse/services/compression/types.ts.
Why this matters
Without SHARED_BOUNDARIES, aggressive Caveman modes could strip content that looked like repetitive prose but was actually a code snippet, file path, or error stack. SHARED_BOUNDARIES acts as a language-agnostic safety net applied before filler rules run.
Customizing preservePatterns
Additional patterns can be added at runtime via compression settings:
{
  "cavemanConfig": {
    "preservePatterns": [
      "```[\\s\\S]*?```",
      "`[^`]+`",
      "https?://\\S+",
      "(?:/|\\./)[^\\s]+",
      "\\b(?:Error|TypeError|SyntaxError|RangeError):",
      "\\s+at\\s+\\S+\\s+\\(\\S+:\\d+:\\d+\\)"
    ]
  }
}
Custom patterns extend (not replace) the 6 defaults.
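The "preserve before filler removal" idea can be illustrated with a mask-and-restore pass: protected spans are swapped for placeholders, the filler rules run, and the originals are put back. This is a sketch, not the engine's implementation. `mergePatterns`, `compressWithBoundaries`, the placeholder scheme, and the toy filler rule are all hypothetical. The pattern strings are copied from the JSON above and are assumed to match the six defaults.

```typescript
// Build the fence marker programmatically so this sample stays easy to embed in docs.
const FENCE = "`".repeat(3);

// Patterns copied from the JSON example above (assumed to be the six defaults).
const DEFAULT_PATTERNS: string[] = [
  FENCE + "[\\s\\S]*?" + FENCE,
  "`[^`]+`",
  "https?://\\S+",
  "(?:/|\\./)[^\\s]+",
  "\\b(?:Error|TypeError|SyntaxError|RangeError):",
  "\\s+at\\s+\\S+\\s+\\(\\S+:\\d+:\\d+\\)",
];

// Custom patterns extend the defaults; exact duplicates are dropped.
function mergePatterns(custom: string[]): string[] {
  const all = [...DEFAULT_PATTERNS, ...custom];
  return all.filter((pattern, index) => all.indexOf(pattern) === index);
}

// Mask protected spans, run the compressor, then restore the spans verbatim.
function compressWithBoundaries(
  text: string,
  patterns: string[],
  compress: (input: string) => string,
): string {
  const combined = new RegExp(patterns.map((p) => "(?:" + p + ")").join("|"), "g");
  const saved: string[] = [];
  const masked = text.replace(combined, (match) => {
    saved.push(match);
    return "\u0000" + (saved.length - 1) + "\u0000"; // NUL-delimited placeholder
  });
  return compress(masked).replace(/\u0000(\d+)\u0000/g, (_m: string, i: string) => saved[Number(i)]);
}

// Toy filler rule standing in for a real language pack.
const dropFiller = (input: string): string => input.replace(/\b(?:please|basically)\s*/gi, "");

const text = "Please basically check `basically_flag` at https://example.com/basically now.";
console.log(compressWithBoundaries(text, mergePatterns([]), dropFiller));
// prints: check `basically_flag` at https://example.com/basically now.
```

Run without masking, the same toy rule would also rewrite the identifier and the URL path. The sketch assumes the compressor never emits NUL characters itself.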
Operational Notes
- English built-in rules remain the fallback when a language pack is missing.
- Invalid built-in JSON packs fail validation so release assets do not silently degrade.
- Rule packs are data-only and should not import code or run arbitrary logic.
- The compression analytics layer records the selected mode and engine, not full prompt text.