Troubleshooting

For Users: Looking for quick fixes? See the Quick Reference below.

Common problems and solutions for OmniRoute.

Quick Reference

New to OmniRoute? Start here — these solve 90% of problems:

I see this	What it means	What to do
"Can't connect"	OmniRoute isn't running	Run `omniroute` or `docker restart omniroute`
"Invalid API key"	Your key is wrong or expired	Re-copy the key from the provider's website
"Rate limit exceeded"	You're sending too many requests	Wait 1 minute, or use `model: "auto"` for automatic fallback
"Quota exceeded"	You've used up your free/paid quota	Connect more providers, or use free providers (Kiro, Pollinations)
"Slow responses"	Provider is busy or far away	Use `model: "auto/fast"` or connect a faster provider (Groq, Cerebras)
"Wrong provider used"	`auto` picked a different provider	That's normal! `auto` picks the best one. Force a specific provider with `model: "openai/gpt-4o"`
"502 Bad Gateway"	Provider is down	Wait and retry, or use `model: "auto"` to switch providers
"401 Unauthorized"	Your credentials are wrong	Check your API key or re-authenticate with OAuth
"429 Too Many Requests"	Rate limited	Wait 1 minute, or connect more providers

Still stuck? See the detailed troubleshooting below, or ask on Discord.

Quick Fixes

Problem	Solution
First login not working	Set `INITIAL_PASSWORD` in `.env` (no hardcoded default)
Dashboard opens on wrong port	Set `PORT=20128` and `NEXT_PUBLIC_BASE_URL=http://localhost:20128`
No logs written to disk	Set `APP_LOG_TO_FILE=true` and verify call log capture is enabled
EACCES: permission denied	Set `DATA_DIR=/path/to/writable/dir` to override `~/.omniroute`
Routing strategy not saving	Update to the latest v3.x release (Zod schema fix for settings persistence shipped in earlier versions)
Login crash / blank page	Check Node.js version — see Node.js Compatibility below
`dlopen` / `slice is not valid mach-o file` (macOS)	Run `cd $(npm root -g)/omniroute/app && npm rebuild better-sqlite3 && omniroute` — see macOS native module rebuild below
Proxy "fetch failed"	Ensure proxy config is set at the correct level — see Proxy Issues below

Cause: You are running a Node.js version outside OmniRoute's approved secure runtime floor. The most common case is running an older Node 20, 22, or 24 patch level that falls below the patched security floor OmniRoute requires.

Symptoms:

Login page shows a blank screen or a server error
Console shows Error: Module did not self-register or similar native binding errors
The login page shows an orange warning banner with your Node version if the runtime is outside the supported secure policy

Fix:

Install a supported Node.js LTS release (recommended: Node.js 24.x):
```
nvm install 24
nvm use 24
```
Verify your version: node --version should show v24.0.0 or newer on the 24.x LTS line
Reinstall OmniRoute: npm install -g omniroute
Restart: omniroute

Supported secure versions: >=20.20.2 <21, >=22.22.2 <23, or >=24.0.0 <27. Node.js 24.x LTS (Krypton) and Node.js 26 are fully supported.

macOS: `dlopen` / "slice is not valid mach-o file"

Cause: After a global npm install -g omniroute, the better-sqlite3 native binary inside the package may have been compiled for a different architecture or Node.js ABI than what is running locally. This is common on macOS (both Apple Silicon and Intel) when the pre-built binary does not match your environment.

Symptoms:

Server fails immediately on startup with a dlopen error
Error contains slice is not valid mach-o file
Full example:

dlopen(/Users/<user>/.nvm/versions/node/v24.14.1/lib/node_modules/omniroute/app/node_modules/better-sqlite3/build/Release/better_sqlite3.node, 0x0001): tried: '...' (slice is not valid mach-o file)

Fix — rebuild for your local environment (no Node.js downgrade required):

cd $(npm root -g)/omniroute/app
npm rebuild better-sqlite3
omniroute

Note: This recompiles the native binding against your local Node.js version and CPU architecture, resolving the binary mismatch. The officially supported range is >=20.20.2 <21, >=22.22.2 <23, or >=24.0.0 <27 (engines field in package.json). Node.js 24.x LTS (Krypton) and Node.js 26 are fully supported with better-sqlite3 v12.x.

Proxy Issues

Provider validation shows "fetch failed"

Cause: The API key validation endpoint (POST /api/providers/validate) was previously bypassing proxy configuration, causing failures in environments that require proxy routing.

Fix (v3.5.5+): This is now fixed. Provider validation routes through runWithProxyContext, honoring provider-level and global proxy settings automatically.

Token health check fails with "fetch failed"

Cause: Background OAuth token refresh was not resolving proxy configuration per connection.

Fix (v3.5.5+): The token health check scheduler now resolves proxy config per connection before attempting refresh. Update to v3.5.5+.

SOCKS5 proxy returns "invalid onRequestStart method"

Cause: On Node.js 22, the undici@8 dispatcher is incompatible with Node's built-in fetch() implementation.

Fix (v3.5.5+): OmniRoute now uses undici's own fetch() function when a proxy dispatcher is active, ensuring consistent behavior. Update to v3.5.5+.

Provider Issues

"Language model did not provide messages"

Cause: Provider quota exhausted.

Fix:

Check dashboard quota tracker
Use a combo with fallback tiers
Switch to cheaper/free tier

Rate Limiting

Cause: Subscription quota exhausted.

Fix:

Add fallback: cc/claude-opus-4-6 → glm/glm-4.7 → if/kimi-k2-thinking
Use GLM/MiniMax as cheap backup

OAuth Token Expired

OmniRoute auto-refreshes tokens. If issues persist:

Dashboard → Provider → Reconnect
Delete and re-add the provider connection

Kiro multi-account: second account invalidates the first

Cause: Kiro's backend enforces a single active session per OIDC client registration. When two accounts share the same registered client (connections imported before v3.8.0), refreshing one account's token invalidates the other's refresh token.

Fix (v3.8.0+): Re-import affected connections. Starting with v3.8.0, every new Kiro connection created via Import Token, Google/GitHub social login, or Auto-Import automatically registers its own dedicated OIDC client. The connection is therefore fully isolated and refreshing one account has no effect on any other account.

Connections that were imported before v3.8.0 do not carry a per-connection client registration. Those connections continue to use the shared social-auth refresh endpoint. To gain isolation, delete the old connection from Dashboard → Providers and re-add it via any of the three import flows.

For full details and step-by-step instructions for adding two Kiro accounts side by side, see docs/guides/KIRO_SETUP.md.

Cloud Issues

Cloud Sync Errors

Verify BASE_URL points to your running instance (e.g., http://localhost:20128)
Verify CLOUD_URL points to your cloud endpoint (e.g., https://omniroute.dev)
Keep NEXT_PUBLIC_* values aligned with server-side values

Cloud `stream=false` Returns 500

Symptom: Unexpected token 'd'... on cloud endpoint for non-streaming calls.

Cause: Upstream returns SSE payload while client expects JSON.

Workaround: Use stream=true for cloud direct calls. Local runtime includes SSE→JSON fallback.

Cloud Says Connected but "Invalid API key"

Create a fresh key from local dashboard (/api/keys)
Run cloud sync: Enable Cloud → Sync Now
Old/non-synced keys can still return 401 on cloud

Docker Issues

CLI Tool Shows Not Installed

Check runtime fields: curl http://localhost:20128/api/cli-tools/runtime/codex | jq
For portable mode: use image target runner-cli (bundled CLIs)
For host mount mode: set CLI_EXTRA_PATHS and mount host bin directory as read-only
If installed=true and runnable=false: binary was found but failed healthcheck

Quick Runtime Validation

curl -s http://localhost:20128/api/cli-tools/codex-settings | jq '{installed,runnable,commandPath,runtimeMode,reason}'
curl -s http://localhost:20128/api/cli-tools/claude-settings | jq '{installed,runnable,commandPath,runtimeMode,reason}'
curl -s http://localhost:20128/api/cli-tools/openclaw-settings | jq '{installed,runnable,commandPath,runtimeMode,reason}'

Cost Issues

High Costs

Check usage stats in Dashboard → Usage
Switch primary model to GLM/MiniMax
Use free tier (Gemini CLI, Qoder) for non-critical tasks
Set cost budgets per API key: Dashboard → API Keys → Budget

Debugging

Enable Log Files

Set APP_LOG_TO_FILE=true in your .env file. Application logs are written under logs/. Request artifacts are stored under ${DATA_DIR}/call_logs/ when the call log pipeline is enabled in settings. When pipeline capture is enabled, set CALL_LOG_PIPELINE_CAPTURE_STREAM_CHUNKS=false to omit stream chunk payloads, or tune CALL_LOG_PIPELINE_MAX_SIZE_KB to change the artifact cap in KB.

Check Provider Health

# Health dashboard
http://localhost:20128/dashboard/health

# API health check
curl http://localhost:20128/api/monitoring/health

Runtime Storage

Main state: ${DATA_DIR}/storage.sqlite (providers, combos, aliases, keys, settings)
Usage: SQLite tables in storage.sqlite (usage_history, call_logs, proxy_logs) + optional ${DATA_DIR}/call_logs/
Application logs: <repo>/logs/... (when APP_LOG_TO_FILE=true)
Call log artifacts: ${DATA_DIR}/call_logs/YYYY-MM-DD/... when the call log pipeline is enabled

Circuit Breaker Issues

Provider stuck in OPEN state

When a provider's circuit breaker is OPEN, requests are blocked until the cooldown expires.

Fix:

Go to Dashboard → Settings → Resilience
Check the circuit breaker card for the affected provider
Click Reset All to clear all breakers, or wait for the cooldown to expire
Verify the provider is actually available before resetting

Provider keeps tripping the circuit breaker

If a provider repeatedly enters OPEN state:

Check Dashboard → Health → Provider Health for the failure pattern
Go to Settings → Resilience → Provider Profiles and increase the failure threshold
Check if the provider has changed API limits or requires re-authentication
Review latency telemetry — high latency may cause timeout-based failures

Audio Transcription Issues

"Unsupported model" error

Ensure you're using the correct prefix: deepgram/nova-3 or assemblyai/best
Verify the provider is connected in Dashboard → Providers

Transcription returns empty or fails

Check supported audio formats: mp3, wav, m4a, flac, ogg, webm
Verify file size is within provider limits (typically < 25MB)
Check provider API key validity in the provider card

Translator Debugging

Use Dashboard → Translator to debug format translation issues:

Mode	When to Use
Playground	Compare input/output formats side by side — paste a failing request to see how it translates
Chat Tester	Send live messages and inspect the full request/response payload including headers
Test Bench	Run batch tests across format combinations to find which translations are broken
Live Monitor	Watch real-time request flow to catch intermittent translation issues

Common format issues

Thinking tags not appearing — Check if the target provider supports thinking and the thinking budget setting
Tool calls dropping — Some format translations may strip unsupported fields; verify in Playground mode
System prompt missing — Claude and Gemini handle system prompts differently; check translation output
SDK returns raw string instead of object — Resolved in v1.x; response sanitizer strips non-standard fields (x_groq, usage_breakdown, etc.) that cause OpenAI SDK Pydantic validation failures. If you still see this on v3.x+, please file an issue.
GLM/ERNIE rejects system role — Resolved in v1.x; role normalizer automatically merges system messages into user messages for incompatible models. If you still see this on v3.x+, please file an issue.
developer role not recognized — Resolved in v1.x; automatically converted to system for non-OpenAI providers. If you still see this on v3.x+, please file an issue.
json_schema not working with Gemini — Resolved in v1.x; response_format is now converted to Gemini's responseMimeType + responseSchema. If you still see this on v3.x+, please file an issue.

Resilience Settings

Auto rate-limit not triggering

Auto rate-limit only applies to API key providers (not OAuth/subscription)
Verify Settings → Resilience → Provider Profiles has auto-rate-limit enabled
Check if the provider returns 429 status codes or Retry-After headers

Tuning exponential backoff

Provider profiles support these settings:

Base delay — Initial wait time after first failure (default: 1s)
Max delay — Maximum wait time cap (default: 30s)
Multiplier — How much to increase delay per consecutive failure (default: 2x)

Anti-thundering herd

When many concurrent requests hit a rate-limited provider, OmniRoute uses mutex + auto rate-limiting to serialize requests and prevent cascading failures. This is automatic for API key providers.

Optional RAG / LLM failure taxonomy (16 problems)

Some OmniRoute users place the gateway in front of RAG or agent stacks. In those setups it is common to see a strange pattern: OmniRoute looks healthy (providers up, routing profiles ok, no rate limit alerts) but the final answer is still wrong.

In practice these incidents usually come from the downstream RAG pipeline, not from the gateway itself.

If you want a shared vocabulary to describe those failures you can use the WFGY ProblemMap, an external MIT license text resource that defines sixteen recurring RAG / LLM failure patterns. At a high level it covers:

retrieval drift and broken context boundaries
empty or stale indexes and vector stores
embedding versus semantic mismatch
prompt assembly and context window issues
logic collapse and overconfident answers
long chain and agent coordination failures
multi agent memory and role drift
deployment and bootstrap ordering problems

The idea is simple:

When you investigate a bad response, capture:
- user task and request
- route or provider combo in OmniRoute
- any RAG context used downstream (retrieved documents, tool calls, etc)
Map the incident to one or two WFGY ProblemMap numbers (No.1 … No.16).
Store the number in your own dashboard, runbook, or incident tracker next to the OmniRoute logs.
Use the corresponding WFGY page to decide whether you need to change your RAG stack, retriever, or routing strategy.

Full text and concrete recipes live here (MIT license, text only):

WFGY ProblemMap README

You can ignore this section if you do not run RAG or agent pipelines behind OmniRoute.

v3.8.0 Known Issues

Issues specific to the v3.8.0 release and their current workarounds. If a fix lands in a later patch, the entry will be updated or removed.

Windsurf OAuth flow fails with 401

Symptoms:

"401 unauthorized" while completing the Windsurf OAuth flow from the dashboard
Windsurf provider card stays in "needs reconnection" state after the callback

Causes:

WINDSURF_FIREBASE_API_KEY env var missing or empty
WINDSURF_API_KEY misconfigured or pointing at a stale token
Local firewall/proxy blocking the OAuth callback

Fix:

Verify both WINDSURF_FIREBASE_API_KEY and WINDSURF_API_KEY are set in .env
Restart OmniRoute so the new env values are picked up
Re-run the OAuth flow from Dashboard → Providers → Windsurf → Reconnect

Devin CLI auth failures

Symptoms:

"Devin CLI not found" or "auth failed" when invoking Devin-backed tools
CLI runtime check reports installed=false

Causes:

CLI_DEVIN_BIN points to a path that does not exist
Devin CLI is not installed on the host

Fix:

Install the Devin CLI for your platform
Set CLI_DEVIN_BIN=/usr/local/bin/devin (or the real path) in .env
Restart OmniRoute and re-test from Dashboard → CLI Tools

Model cooldown stuck (manual reset)

Symptoms:

A model stays listed in cooldown even after the expiration time has passed
Requests still skip the model in combo routing despite the timestamp being in the past

Manual reset:

Dashboard: Settings → Model Cooldowns → click Re-enable on the affected card
API: DELETE /api/resilience/model-cooldowns with management auth headers

Command Code provider connection fails with 403

Symptoms:

403 when testing the Command Code provider connection
The provider card shows "unauthorized" after a fresh add

Cause: The OAuth flow did not complete (callback not received or token not persisted).

Fix:

Run omniroute providers from the CLI to re-trigger the OAuth flow, or
Re-run OAuth from Dashboard → Providers → Command Code → Reconnect

ModelScope returns aggressive 429 cooldowns

Symptoms:

Very short or immediate cooldowns on ModelScope after a small burst of requests
Combo routing skips ModelScope earlier than expected

Cause: ModelScope emits provider-specific Retry-After headers. v3.8.0 ships dedicated handling for those headers, so older versions misread them as generic rate-limit hints.

Fix:

Ensure you are on v3.8.0 or later
Verify the useUpstream429BreakerHints toggle is enabled under Settings → Resilience

OMNIROUTE_WS_BRIDGE_SECRET missing in production

Symptoms:

401 on every Codex/Responses WebSocket bridge request when running on a remote production host
WebSocket bridge handshake closes immediately after connect

Cause: The OMNIROUTE_WS_BRIDGE_SECRET env var is missing from the production environment.

Fix:

Generate a random secret: openssl rand -hex 32
Set OMNIROUTE_WS_BRIDGE_SECRET=<random-secret> in the production server env (and any client that talks to the bridge)
Restart OmniRoute

Responses API: background mode degraded to synchronous

Symptoms:

Warning logged: background mode degraded to synchronous
A background: true request returns a normal synchronous response instead of a background job handle

Cause: v3.8.0 intentionally degrades background: true on the Responses API to synchronous execution while emitting a warning. Full async background execution is a future deliverable.

Fix:

Adjust the client to call without background, or
Wait for a later release that ships full async background mode (track the changelog)

Still Stuck?

GitHub Issues: github.com/diegosouzapw/OmniRoute/issues
Architecture: See docs/architecture/ARCHITECTURE.md for internal details
API Reference: See docs/reference/API_REFERENCE.md for all endpoints
Health Dashboard: Check Dashboard → Health for real-time system status
Translator: Use Dashboard → Translator to debug format issues

Troubleshooting

For Users: Looking for quick fixes? See the Quick Reference below.

Common problems and solutions for OmniRoute.

Quick Reference

New to OmniRoute? Start here — these solve 90% of problems:

I see this	What it means	What to do
"Can't connect"	OmniRoute isn't running	Run `omniroute` or `docker restart omniroute`
"Invalid API key"	Your key is wrong or expired	Re-copy the key from the provider's website
"Rate limit exceeded"	You're sending too many requests	Wait 1 minute, or use `model: "auto"` for automatic fallback
"Quota exceeded"	You've used up your free/paid quota	Connect more providers, or use free providers (Kiro, Pollinations)
"Slow responses"	Provider is busy or far away	Use `model: "auto/fast"` or connect a faster provider (Groq, Cerebras)
"Wrong provider used"	`auto` picked a different provider	That's normal! `auto` picks the best one. Force a specific provider with `model: "openai/gpt-4o"`
"502 Bad Gateway"	Provider is down	Wait and retry, or use `model: "auto"` to switch providers
"401 Unauthorized"	Your credentials are wrong	Check your API key or re-authenticate with OAuth
"429 Too Many Requests"	Rate limited	Wait 1 minute, or connect more providers

Still stuck? See the detailed troubleshooting below, or ask on Discord.

Detailed Troubleshooting

Quick Fixes

Problem	Solution
First login not working	Set `INITIAL_PASSWORD` in `.env` (no hardcoded default)
Dashboard opens on wrong port	Set `PORT=20128` and `NEXT_PUBLIC_BASE_URL=http://localhost:20128`
No logs written to disk	Set `APP_LOG_TO_FILE=true` and verify call log capture is enabled
EACCES: permission denied	Set `DATA_DIR=/path/to/writable/dir` to override `~/.omniroute`
Routing strategy not saving	Update to the latest v3.x release (Zod schema fix for settings persistence shipped in earlier versions)
Login crash / blank page	Check Node.js version — see Node.js Compatibility below
`dlopen` / `slice is not valid mach-o file` (macOS)	Run `cd $(npm root -g)/omniroute/app && npm rebuild better-sqlite3 && omniroute` — see macOS native module rebuild below
Proxy "fetch failed"	Ensure proxy config is set at the correct level — see Proxy Issues below

Node.js Compatibility

Symptoms:

Login page shows a blank screen or a server error
Console shows Error: Module did not self-register or similar native binding errors
The login page shows an orange warning banner with your Node version if the runtime is outside the supported secure policy

Fix:

Install a supported Node.js LTS release (recommended: Node.js 24.x):
```
nvm install 24
nvm use 24
```
Verify your version: node --version should show v24.0.0 or newer on the 24.x LTS line
Reinstall OmniRoute: npm install -g omniroute
Restart: omniroute

Supported secure versions: >=20.20.2 <21, >=22.22.2 <23, or >=24.0.0 <27. Node.js 24.x LTS (Krypton) and Node.js 26 are fully supported.

macOS: `dlopen` / "slice is not valid mach-o file"

Symptoms:

Server fails immediately on startup with a dlopen error
Error contains slice is not valid mach-o file
Full example:

dlopen(/Users/<user>/.nvm/versions/node/v24.14.1/lib/node_modules/omniroute/app/node_modules/better-sqlite3/build/Release/better_sqlite3.node, 0x0001): tried: '...' (slice is not valid mach-o file)

Fix — rebuild for your local environment (no Node.js downgrade required):

cd $(npm root -g)/omniroute/app
npm rebuild better-sqlite3
omniroute

Note: This recompiles the native binding against your local Node.js version and CPU architecture, resolving the binary mismatch. The officially supported range is >=20.20.2 <21, >=22.22.2 <23, or >=24.0.0 <27 (engines field in package.json). Node.js 24.x LTS (Krypton) and Node.js 26 are fully supported with better-sqlite3 v12.x.

Proxy Issues

Provider validation shows "fetch failed"

Cause: The API key validation endpoint (POST /api/providers/validate) was previously bypassing proxy configuration, causing failures in environments that require proxy routing.

Fix (v3.5.5+): This is now fixed. Provider validation routes through runWithProxyContext, honoring provider-level and global proxy settings automatically.

Token health check fails with "fetch failed"

Cause: Background OAuth token refresh was not resolving proxy configuration per connection.

Fix (v3.5.5+): The token health check scheduler now resolves proxy config per connection before attempting refresh. Update to v3.5.5+.

SOCKS5 proxy returns "invalid onRequestStart method"

Cause: On Node.js 22, the undici@8 dispatcher is incompatible with Node's built-in fetch() implementation.

Fix (v3.5.5+): OmniRoute now uses undici's own fetch() function when a proxy dispatcher is active, ensuring consistent behavior. Update to v3.5.5+.

Provider Issues

"Language model did not provide messages"

Cause: Provider quota exhausted.

Fix:

Check dashboard quota tracker
Use a combo with fallback tiers
Switch to cheaper/free tier

Rate Limiting

Cause: Subscription quota exhausted.

Fix:

Add fallback: cc/claude-opus-4-6 → glm/glm-4.7 → if/kimi-k2-thinking
Use GLM/MiniMax as cheap backup

OAuth Token Expired

OmniRoute auto-refreshes tokens. If issues persist:

Dashboard → Provider → Reconnect
Delete and re-add the provider connection

Kiro multi-account: second account invalidates the first

For full details and step-by-step instructions for adding two Kiro accounts side by side, see docs/guides/KIRO_SETUP.md.

Cloud Issues

Cloud Sync Errors

Verify BASE_URL points to your running instance (e.g., http://localhost:20128)
Verify CLOUD_URL points to your cloud endpoint (e.g., https://omniroute.dev)
Keep NEXT_PUBLIC_* values aligned with server-side values

Cloud `stream=false` Returns 500

Symptom: Unexpected token 'd'... on cloud endpoint for non-streaming calls.

Cause: Upstream returns SSE payload while client expects JSON.

Workaround: Use stream=true for cloud direct calls. Local runtime includes SSE→JSON fallback.

Cloud Says Connected but "Invalid API key"

Create a fresh key from local dashboard (/api/keys)
Run cloud sync: Enable Cloud → Sync Now
Old/non-synced keys can still return 401 on cloud

Docker Issues

CLI Tool Shows Not Installed

Check runtime fields: curl http://localhost:20128/api/cli-tools/runtime/codex | jq
For portable mode: use image target runner-cli (bundled CLIs)
For host mount mode: set CLI_EXTRA_PATHS and mount host bin directory as read-only
If installed=true and runnable=false: binary was found but failed healthcheck

Quick Runtime Validation

curl -s http://localhost:20128/api/cli-tools/codex-settings | jq '{installed,runnable,commandPath,runtimeMode,reason}'
curl -s http://localhost:20128/api/cli-tools/claude-settings | jq '{installed,runnable,commandPath,runtimeMode,reason}'
curl -s http://localhost:20128/api/cli-tools/openclaw-settings | jq '{installed,runnable,commandPath,runtimeMode,reason}'

Cost Issues

High Costs

Check usage stats in Dashboard → Usage
Switch primary model to GLM/MiniMax
Use free tier (Gemini CLI, Qoder) for non-critical tasks
Set cost budgets per API key: Dashboard → API Keys → Budget

Debugging

Enable Log Files

Check Provider Health

# Health dashboard
http://localhost:20128/dashboard/health

# API health check
curl http://localhost:20128/api/monitoring/health

Runtime Storage

Main state: ${DATA_DIR}/storage.sqlite (providers, combos, aliases, keys, settings)
Usage: SQLite tables in storage.sqlite (usage_history, call_logs, proxy_logs) + optional ${DATA_DIR}/call_logs/
Application logs: <repo>/logs/... (when APP_LOG_TO_FILE=true)
Call log artifacts: ${DATA_DIR}/call_logs/YYYY-MM-DD/... when the call log pipeline is enabled

Circuit Breaker Issues

Provider stuck in OPEN state

When a provider's circuit breaker is OPEN, requests are blocked until the cooldown expires.

Fix:

Go to Dashboard → Settings → Resilience
Check the circuit breaker card for the affected provider
Click Reset All to clear all breakers, or wait for the cooldown to expire
Verify the provider is actually available before resetting

Provider keeps tripping the circuit breaker

If a provider repeatedly enters OPEN state:

Check Dashboard → Health → Provider Health for the failure pattern
Go to Settings → Resilience → Provider Profiles and increase the failure threshold
Check if the provider has changed API limits or requires re-authentication
Review latency telemetry — high latency may cause timeout-based failures

Audio Transcription Issues

"Unsupported model" error

Ensure you're using the correct prefix: deepgram/nova-3 or assemblyai/best
Verify the provider is connected in Dashboard → Providers

Transcription returns empty or fails

Check supported audio formats: mp3, wav, m4a, flac, ogg, webm
Verify file size is within provider limits (typically < 25MB)
Check provider API key validity in the provider card

Translator Debugging

Use Dashboard → Translator to debug format translation issues:

Mode	When to Use
Playground	Compare input/output formats side by side — paste a failing request to see how it translates
Chat Tester	Send live messages and inspect the full request/response payload including headers
Test Bench	Run batch tests across format combinations to find which translations are broken
Live Monitor	Watch real-time request flow to catch intermittent translation issues

Common format issues

Thinking tags not appearing — Check if the target provider supports thinking and the thinking budget setting
Tool calls dropping — Some format translations may strip unsupported fields; verify in Playground mode
System prompt missing — Claude and Gemini handle system prompts differently; check translation output
SDK returns raw string instead of object — Resolved in v1.x; response sanitizer strips non-standard fields (x_groq, usage_breakdown, etc.) that cause OpenAI SDK Pydantic validation failures. If you still see this on v3.x+, please file an issue.
GLM/ERNIE rejects system role — Resolved in v1.x; role normalizer automatically merges system messages into user messages for incompatible models. If you still see this on v3.x+, please file an issue.
developer role not recognized — Resolved in v1.x; automatically converted to system for non-OpenAI providers. If you still see this on v3.x+, please file an issue.
json_schema not working with Gemini — Resolved in v1.x; response_format is now converted to Gemini's responseMimeType + responseSchema. If you still see this on v3.x+, please file an issue.

Resilience Settings

Auto rate-limit not triggering

Auto rate-limit only applies to API key providers (not OAuth/subscription)
Verify Settings → Resilience → Provider Profiles has auto-rate-limit enabled
Check if the provider returns 429 status codes or Retry-After headers

Tuning exponential backoff

Provider profiles support these settings:

Base delay — Initial wait time after first failure (default: 1s)
Max delay — Maximum wait time cap (default: 30s)
Multiplier — How much to increase delay per consecutive failure (default: 2x)

Anti-thundering herd

When many concurrent requests hit a rate-limited provider, OmniRoute uses mutex + auto rate-limiting to serialize requests and prevent cascading failures. This is automatic for API key providers.

Optional RAG / LLM failure taxonomy (16 problems)

In practice these incidents usually come from the downstream RAG pipeline, not from the gateway itself.

retrieval drift and broken context boundaries
empty or stale indexes and vector stores
embedding versus semantic mismatch
prompt assembly and context window issues
logic collapse and overconfident answers
long chain and agent coordination failures
multi agent memory and role drift
deployment and bootstrap ordering problems

The idea is simple:

When you investigate a bad response, capture:
- user task and request
- route or provider combo in OmniRoute
- any RAG context used downstream (retrieved documents, tool calls, etc)
Map the incident to one or two WFGY ProblemMap numbers (No.1 … No.16).
Store the number in your own dashboard, runbook, or incident tracker next to the OmniRoute logs.
Use the corresponding WFGY page to decide whether you need to change your RAG stack, retriever, or routing strategy.

Full text and concrete recipes live here (MIT license, text only):

WFGY ProblemMap README

You can ignore this section if you do not run RAG or agent pipelines behind OmniRoute.

v3.8.0 Known Issues

Issues specific to the v3.8.0 release and their current workarounds. If a fix lands in a later patch, the entry will be updated or removed.

Windsurf OAuth flow fails with 401

Symptoms:

"401 unauthorized" while completing the Windsurf OAuth flow from the dashboard
Windsurf provider card stays in "needs reconnection" state after the callback

Causes:

WINDSURF_FIREBASE_API_KEY env var missing or empty
WINDSURF_API_KEY misconfigured or pointing at a stale token
Local firewall/proxy blocking the OAuth callback

Fix:

Verify both WINDSURF_FIREBASE_API_KEY and WINDSURF_API_KEY are set in .env
Restart OmniRoute so the new env values are picked up
Re-run the OAuth flow from Dashboard → Providers → Windsurf → Reconnect

Devin CLI auth failures

Symptoms:

"Devin CLI not found" or "auth failed" when invoking Devin-backed tools
CLI runtime check reports installed=false

Causes:

CLI_DEVIN_BIN points to a path that does not exist
Devin CLI is not installed on the host

Fix:

Install the Devin CLI for your platform
Set CLI_DEVIN_BIN=/usr/local/bin/devin (or the real path) in .env
Restart OmniRoute and re-test from Dashboard → CLI Tools

Model cooldown stuck (manual reset)

Symptoms:

A model stays listed in cooldown even after the expiration time has passed
Requests still skip the model in combo routing despite the timestamp being in the past

Manual reset:

Dashboard: Settings → Model Cooldowns → click Re-enable on the affected card
API: DELETE /api/resilience/model-cooldowns with management auth headers

Command Code provider connection fails with 403

Symptoms:

403 when testing the Command Code provider connection
The provider card shows "unauthorized" after a fresh add

Cause: The OAuth flow did not complete (callback not received or token not persisted).

Fix:

Run omniroute providers from the CLI to re-trigger the OAuth flow, or
Re-run OAuth from Dashboard → Providers → Command Code → Reconnect

ModelScope returns aggressive 429 cooldowns

Symptoms:

Very short or immediate cooldowns on ModelScope after a small burst of requests
Combo routing skips ModelScope earlier than expected

Cause: ModelScope emits provider-specific Retry-After headers. v3.8.0 ships dedicated handling for those headers, so older versions misread them as generic rate-limit hints.

Fix:

Ensure you are on v3.8.0 or later
Verify the useUpstream429BreakerHints toggle is enabled under Settings → Resilience

OMNIROUTE_WS_BRIDGE_SECRET missing in production

Symptoms:

401 on every Codex/Responses WebSocket bridge request when running on a remote production host
WebSocket bridge handshake closes immediately after connect

Cause: The OMNIROUTE_WS_BRIDGE_SECRET env var is missing from the production environment.

Fix:

Generate a random secret: openssl rand -hex 32
Set OMNIROUTE_WS_BRIDGE_SECRET=<random-secret> in the production server env (and any client that talks to the bridge)
Restart OmniRoute

Responses API: background mode degraded to synchronous

Symptoms:

Warning logged: background mode degraded to synchronous
A background: true request returns a normal synchronous response instead of a background job handle

Cause: v3.8.0 intentionally degrades background: true on the Responses API to synchronous execution while emitting a warning. Full async background execution is a future deliverable.

Fix:

Adjust the client to call without background, or
Wait for a later release that ships full async background mode (track the changelog)

Still Stuck?

GitHub Issues: github.com/diegosouzapw/OmniRoute/issues
Architecture: See docs/architecture/ARCHITECTURE.md for internal details
API Reference: See docs/reference/API_REFERENCE.md for all endpoints
Health Dashboard: Check Dashboard → Health for real-time system status
Translator: Use Dashboard → Translator to debug format issues

On this page

On this page