
Route Configuration

Routes define which provider handles which type of request. Admin Bud-E supports four service types:

  • LLM — Large Language Models (text generation)
  • VLM — Vision-Language Models (text + image understanding)
  • TTS — Text-to-Speech (audio generation)
  • ASR — Automatic Speech Recognition (audio transcription)

Each route specifies:

  • Service type (LLM/VLM/TTS/ASR)
  • Provider (which provider to use)
  • Model (specific model identifier)
  • Priority (for failover: lower numbers = higher priority)
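A route can be modeled as a small record. The following is a minimal sketch for illustration only; the field names are ours, not Admin Bud-E's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    service: str   # "LLM", "VLM", "TTS", or "ASR"
    provider: str  # name of a configured provider, e.g. "vertex"
    model: str     # provider-specific model identifier
    priority: int  # lower number = tried first

# Two LLM routes: Vertex primary, Together as failover
routes = [
    Route("LLM", "vertex", "gemini-1.5-flash", 1),
    Route("LLM", "together", "meta-llama/Llama-3-70b-chat-hf", 2),
]
```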

How Routes Work

When a request comes in:

  1. Admin Bud-E identifies the service type (LLM, VLM, TTS, or ASR)
  2. Finds all routes for that type, sorted by priority (ascending)
  3. Tries the highest priority route (lowest number) first
  4. If that fails with a retryable error (429, 5xx), tries the next priority
  5. Continues until success or all routes exhausted

This gives you automatic failover without client-side changes.
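The steps above can be sketched as a dispatch loop. This is a simplified model of the behavior described, not Admin Bud-E's actual code; `send(route)` is a hypothetical helper that performs one provider call and returns an HTTP status plus body:

```python
RETRYABLE = {429, 500, 502, 503, 504}

def dispatch(service, routes, send):
    """Try routes for `service` in priority order; fail over on retryable errors."""
    # Step 2: all routes for this service type, sorted by priority (ascending)
    candidates = sorted(
        (r for r in routes if r["service"] == service),
        key=lambda r: r["priority"],
    )
    if not candidates:
        raise LookupError(f"No route found for service type {service}")
    last = None
    for route in candidates:  # Steps 3-5: lowest priority number first
        status, body = send(route)
        if status == 200:
            return body
        last = (route, status)
        if status not in RETRYABLE:
            break  # 400/401/404 etc.: fail immediately, no failover
    route, status = last
    raise RuntimeError(f"{route['provider']} failed with HTTP {status}")
```

For example, if the priority-1 provider returns 429, the loop falls through to priority 2; a 401 from the first provider ends the request immediately.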

Adding a Route

  1. Navigate to Routes in the Admin UI
  2. Click Add Route
  3. Fill in:
    • Type: Select service type (LLM/VLM/TTS/ASR)
    • Provider: Select from your configured providers
    • Model: Model identifier (provider-specific)
    • Priority: Number (lower = higher priority; 1 is tried first)
  4. Click Save

Priority System

Priority determines the order routes are tried.

Rules:

  • Lower numbers = higher priority = tried first
  • If multiple routes have the same priority, order is undefined
  • Gaps are allowed (you can use priorities 1, 5, 10)
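These rules reduce to a simple sort on the priority number; a small illustration (provider names are just placeholders):

```python
# Priorities may have gaps (1, 5, 10); only the relative order matters.
routes = [("mistral", 10), ("vertex", 1), ("together", 5)]
try_order = [name for name, prio in sorted(routes, key=lambda r: r[1])]
# → ["vertex", "together", "mistral"]
```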

Common patterns:

Single Provider (No Failover)

LLM Routes:

  • Route 1: Vertex Gemini, priority 1

All LLM requests go to Vertex. If it fails, the request fails.

Two Providers (Simple Failover)

LLM Routes:

  • Route 1: Vertex Gemini, priority 1
  • Route 2: Together Llama, priority 2

Tries Vertex first. On rate-limit or server error, falls back to Together.

Three Providers (Full Redundancy)

LLM Routes:

  • Route 1: Vertex Gemini, priority 1
  • Route 2: Together Llama, priority 2
  • Route 3: Mistral Large, priority 3

Maximum availability. Request succeeds unless all three fail.

Failover Behavior

Retryable Errors

Admin Bud-E automatically retries the next route when it sees:

  • 429 Too Many Requests (rate limit)
  • 5xx Server Errors (503, 502, 500, etc.)
  • Network timeouts

Non-Retryable Errors

These errors do not trigger failover:

  • 400 Bad Request (malformed input)
  • 401 Unauthorized (invalid API key)
  • 404 Not Found (model doesn't exist)
  • Any other 4xx client error (everything except 429)

The request fails immediately with the error message.
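The retry decision boils down to one predicate. A sketch mirroring the rules above (this is our restatement, not Admin Bud-E's source):

```python
def is_retryable(status: int) -> bool:
    """Whether a failed attempt should fall through to the next route.

    429 and any 5xx trigger failover; all other 4xx fail immediately.
    (Network timeouts never produce a status and are treated as
    retryable by the caller.)
    """
    return status == 429 or 500 <= status <= 599
```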

TIP

Non-retryable errors usually indicate configuration problems. Check your provider settings and model names.

Model Identifiers

Model names are provider-specific. Check each provider's documentation:

Vertex AI (Google)

LLM/VLM:

  • gemini-1.5-flash (fast, cost-effective)
  • gemini-1.5-pro (most capable)
  • gemini-1.0-pro (legacy)

TTS:

  • en-US-Neural2-C (the Neural2 voice used in the examples below; see Google Cloud's voice list for the full set)
ASR:

  • default (standard model)
  • latest_long (optimized for longer audio)

Together AI

Example models:

  • meta-llama/Llama-3-70b-chat-hf
  • mistralai/Mixtral-8x7B-Instruct-v0.1
  • togethercomputer/CodeLlama-34b-Instruct

See Together's model list.

Mistral AI

Models:

  • mistral-large-latest (most capable)
  • mistral-medium-latest (balanced)
  • mistral-small-latest (fast, efficient)

OpenAI

Models:

  • gpt-4-turbo
  • gpt-4
  • gpt-3.5-turbo

Route Examples

Example 1: Vertex-Only Setup

Goal: Use only Google Vertex AI for everything.

Routes:

  • LLM: Vertex, gemini-1.5-flash, priority 1
  • VLM: Vertex, gemini-1.5-pro, priority 1
  • TTS: Vertex, en-US-Neural2-C, priority 1
  • ASR: Vertex, default, priority 1

Example 2: Cost-Optimized with Failover

Goal: Use cheapest provider first, fall back to premium on rate limits.

LLM Routes:

  • Together, meta-llama/Llama-3-70b-chat-hf, priority 1 (cheap, fast)
  • Vertex, gemini-1.5-flash, priority 2 (fallback)
  • Mistral, mistral-large-latest, priority 3 (last resort)

VLM Routes:

  • Vertex, gemini-1.5-flash, priority 1 (multimodal)

Example 3: Geographic Redundancy

Goal: Try EU provider first, fall back to US if needed.

LLM Routes:

  • Mistral, mistral-large-latest, priority 1 (EU)
  • Vertex EU, gemini-1.5-pro, priority 2 (EU)
  • Together, meta-llama/Llama-3-70b-chat-hf, priority 3 (may be US)

Editing Routes

To change an existing route:

  1. Navigate to Routes
  2. Click Edit on the route
  3. Update fields
  4. Click Save

Changes take effect immediately.

Deleting Routes

To remove a route:

  1. Navigate to Routes
  2. Click Delete on the route
  3. Confirm deletion

WARNING

If you delete all routes for a service type (e.g., all LLM routes), requests of that type will fail.

Debugging Routes

Problem: "No route found for service type LLM"

Cause: No LLM routes exist.

Solution: Add at least one LLM route.

Problem: "Provider 'vertex' not found"

Cause: Route references a provider that doesn't exist or has a typo.

Solution:

  1. Check provider name exactly matches (case-sensitive)
  2. For Vertex, ensure name is exactly vertex (lowercase)
  3. Verify provider exists in Providers page

Problem: All requests fail with 429

Cause: Rate limit on primary provider, no failover configured.

Solution: Add a second provider with priority 2.

Problem: Wrong model being used

Cause: Route priority is incorrect.

Solution: Check priorities — lower number = higher priority.

Testing Routes

After configuring routes:

  1. Make a test request from the frontend
  2. Check Usage → detailed logs
  3. Verify:
    • Correct provider was used
    • Correct model was invoked
    • Credits match expected pricing

If failover occurred, you'll see multiple entries (one per attempt).

Performance Considerations

Latency

Each failover attempt adds latency:

  • Primary failure: ~5-30 seconds (depends on timeout)
  • Secondary attempt: additional ~5-30 seconds

Best practices:

  • Use fast providers as primary
  • Set reasonable timeouts
  • Monitor failure rates

Cost

Failover attempts may be billed by the provider even if they fail:

  • 429 errors usually don't incur charges
  • Partial responses (timeouts) might incur charges
  • Check provider billing for details

Rate Limits

If you're hitting rate limits often:

  1. Request quota increases from provider
  2. Add more providers to spread load
  3. Implement client-side rate limiting
  4. Use slower/cheaper models with higher quotas

Advanced Patterns

Load Balancing

For high traffic, alternate between providers:

LLM Routes:

  • Vertex, gemini-1.5-flash, priority 1
  • Together, meta-llama/Llama-3-70b-chat-hf, priority 1

Both have the same priority, so the middleware picks one (implementation-defined, typically round-robin or random).

WARNING

Load balancing behavior with equal priorities is not guaranteed. For explicit control, use different priorities.
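One way to picture the equal-priority case is round-robin over the tied routes. This is a hypothetical sketch of one possible tie-breaking strategy, not guaranteed Admin Bud-E behavior:

```python
import itertools

def round_robin(routes):
    """Cycle over the routes sharing the lowest (best) priority."""
    top = min(r["priority"] for r in routes)
    peers = [r for r in routes if r["priority"] == top]
    return itertools.cycle(peers)

routes = [
    {"provider": "vertex", "priority": 1},
    {"provider": "together", "priority": 1},
]
picker = round_robin(routes)
first = next(picker)["provider"]   # "vertex"
second = next(picker)["provider"]  # "together"
```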

Cost vs. Quality Tiers

LLM Routes:

  • Together, meta-llama/Llama-3-8b-chat-hf, priority 1 (cheap, fast, decent)
  • Vertex, gemini-1.5-flash, priority 2 (more capable, more expensive)
  • Vertex, gemini-1.5-pro, priority 3 (best quality, most expensive)

Most requests use the cheap model. On rate limits or retryable server errors, requests step up to the more capable models.

Model-Specific Routing

Different models for different use cases:

Text-only (LLM):

  • Vertex, gemini-1.5-flash, priority 1

Multimodal/Images (VLM):

  • Vertex, gemini-1.5-pro, priority 1 (better vision understanding)

Long context:

  • Together, togethercomputer/CodeLlama-34b-Instruct, priority 1

Next Steps