
02 — ARCHITECTURE

Model Routing: Choosing the Right Brain

15 min read

You've got your agent running Claude Opus. It's incredible. Handles everything you throw at it. Then you check your API bill at the end of the week: $300. You scan through the logs. Half the requests? "Fix this typo." "Change button color to blue." "Add a console.log here."

You just paid premium prices to have a staff engineer fix typos.

That's the model routing problem. You have powerful, expensive models that can architect entire systems. And you have fast, cheap models that excel at simple edits. The trick is knowing which to use when—automatically, without thinking about it every time.

Why This Actually Matters

Let's talk real numbers. Claude Opus costs $15 per million input tokens. Gemini Flash costs $0.075. That's a 200x difference. If you're routing every request to Opus, you're burning $200 on tasks that Flash could handle for $1.

But here's the thing: you can't just route everything to the cheapest model either. I tried that. Flash is great at UI tweaks, but ask it to implement OAuth with webhook handling and database schema changes? It misses edge cases. Forgets to update related files. You end up spending three retry requests fixing what Opus would've gotten right the first time.

The answer isn't picking one model. It's matching the task to the right specialist.

The Mental Model: Your Agent Team

Think of your available models like specialists on an engineering team:

Gemini Flash is your junior developer. Enthusiastic, fast, handles 80% of daily tasks without breaking a sweat. Sometimes misses nuance on complex architecture, but for straightforward work? Unbeatable speed and cost.

Claude Haiku is your mid-level engineer. Faster than the seniors, more capable than the juniors. Great for setup work, scaffolding, quick fixes. Doesn't have the context window or reasoning depth for complex features, but you don't always need that.

Claude Sonnet is your senior engineer. Takes time to think through problems. Coordinates multi-file changes. Understands dependencies. When you're building a payment system or refactoring auth, this is who you want.

Claude Opus is your staff engineer. The one you bring in for the hardest problems. Expensive, slow, but worth it when you need architectural insights or complex debugging across a large codebase.

GPT-5.1 Codex is your specialist with photographic memory. Massive 200k context window means it can understand sprawling codebases in one shot. Handles large-scale refactoring across dozens of files.

Your router's job? Match the task to the right specialist, automatically.

How Routing Actually Works

Here's the flow when a user sends a request:

User: "Add dark mode toggle to settings"
         ↓
[Check Rule-Based Overrides]
  • Has video attachment? → Gemini Vertex
  • Fixing errors? → Flash
  • Initial setup? → Flash
         ↓
[No Override: Ask Router]
  • Analyze complexity
  • Check scope (single-file vs multi-file)
  • Detect domain (UI, backend, database)
         ↓
[Router Decision]
  • Simple UI work → Gemini Flash
  • Full-stack feature → Claude Sonnet
         ↓
[Instantiate Agent with Selected Model]
         ↓
[Execute and Track Cost]

The Router Implementation

You use a small, fast model to make routing decisions. Here's the actual implementation from the codebase:

// server/src/agents/coding/router.ts
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

// (Attachment and ModelName are shared types defined elsewhere in the codebase)
 
const routingSchema = z.object({
  modelSelection: z.enum(['claude-sonnet-4-5', 'gemini-3-flash']),
  reasoning: z.string(),
});
 
const ROUTER_PROMPT = `Select the optimal model for this coding task.
 
**claude-sonnet-4-5**: Full-stack work (frontend + backend, database schema,
auth flows, payment integration, multi-component features, architecture decisions)
 
**gemini-3-flash**: Everything else (UI, styling, single-file edits, bug fixes,
simple additions, refactoring)
 
When uncertain, default to gemini-3-flash.`;
 
export async function routeRequest(
  prompt: string,
  attachments: Attachment[],
  errorContext?: string
): Promise<ModelName> {
  // Use cheap model to make routing decision
  const routerModel = openai.languageModel('gpt-4o-mini');
 
  const result = await generateObject({
    model: routerModel,
    schema: routingSchema,
    prompt: `Task: ${prompt}\n\n${errorContext ? `Error context: ${errorContext}` : ''}`,
    system: ROUTER_PROMPT,
  });
 
  return result.object.modelSelection;
}

Notice what's happening here:

  1. Meta-routing: Use a cheap model (GPT-4o Mini, roughly $0.0001 per request) to decide which expensive model to use

  2. Structured output: Zod schema ensures you get back a valid model name, not freeform text

  3. Reasoning included: The router explains its choice, useful for debugging

  4. Context-aware: Sees error context from previous attempts

This adds ~150ms latency and $0.0001 cost, but can save you $0.40+ by routing correctly.
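
Here's a minimal sketch of what a call site might look like, assuming routeRequest is imported from the router module; the timing wrapper and log format are illustrative, not taken from the codebase:

// Hypothetical call site: time the routing decision and log it for later analysis
import { routeRequest } from './router';

const start = performance.now();
const model = await routeRequest(
  'Add a dark mode toggle to settings',
  [],        // no attachments
  undefined  // no error context from a previous attempt
);
console.log(
  `[Router] selected ${model} in ${Math.round(performance.now() - start)}ms`
);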

Rule-Based Overrides

Some decisions don't need a classifier. They're obvious from the request attributes:

// server/src/agents/coding/agent.ts
async processRequest(options: ProcessRequestOptions) {
  let selectedModel: ModelName;
 
  // Rule 1: Video attachments require Gemini Vertex
  const hasVideo = options.attachments?.some(a => a.type === 'video');
  if (hasVideo) {
    selectedModel = 'gemini-3-flash-vertex';
  }
  // Rule 2: Error fixing uses fast model for quick iteration
  else if (options.isFixingErrors) {
    selectedModel = 'gemini-3-flash';
  }
  // Rule 3: Initial project state uses fast model
  else if (!this.context.beyondInitial) {
    selectedModel = 'gemini-3-flash';
  }
  // Rule 4: User manual override
  else if (options.agentMode !== 'auto') {
    selectedModel = options.agentMode;
  }
  // Rule 5: Let router decide
  else {
    selectedModel = await routeRequest(
      options.userPrompt,
      options.attachments,
      this.context.errorContext
    );
  }
 
  return selectedModel;
}

Why rule-based overrides work:

  • Certainty: Video attachments require Gemini. No need to ask a classifier.
  • Speed: Skip the 150ms router call
  • Predictability: Developers can reason about the behavior
  • Cost: Save the $0.0001 router fee (adds up at scale)

The Error Fixing Pattern

This one's important. When your agent makes a mistake and needs to fix it, use the fast model:

if (previousAttemptFailed) {
  // Error context is clear, fix is usually simple
  // Use fast model for quick iteration
  selectedModel = 'gemini-3-flash';
  isFixingErrors = true;
}

Why? Error context makes the problem obvious. "TypeError: Cannot read property 'name' of undefined at line 42." The fix is usually adding a null check or fixing a typo. Don't pay premium prices for that.
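
To make that concrete, here's a rough sketch of what the retry path might look like. The runBuild helper, projectDir, and the exact option names are assumptions for illustration, not the codebase's API:

// Sketch only: capture the failure output, trim it, and flag the retry so
// Rule 2 above routes it straight to the fast model.
const build = await runBuild(projectDir); // hypothetical build helper
if (!build.ok) {
  await agent.processRequest({
    userPrompt: 'Fix the errors from the last change',
    attachments: [],
    isFixingErrors: true,                       // triggers the Flash override
    errorContext: build.stderr.slice(0, 2_000), // keep stack traces short
  });
}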

Model Configuration

Different models need different provider setups. Here's how that's managed:

// server/src/agents/coding/agent.ts
private initializeModelMapping() {
  // AWS Bedrock for Claude models (lower latency in certain regions)
  const bedrock = createAmazonBedrock({
    region: process.env.AWS_REGION_NAME,
    accessKeyId: process.env.AWS_ACCESS_KEY_ID,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
  });
 
  // OpenRouter for unified API across providers
  const openrouter = createOpenRouter({
    apiKey: process.env.OPENROUTER_API_KEY,
    // Custom fetch wrapper for Anthropic caching
    fetch: createAnthropicOpenRouterFetch({
      headers: {
        'x-anthropic-beta': 'fine-grained-tool-streaming-2025-05-14',
      },
    }),
  });
 
  // Google Vertex AI for Gemini (required for video support)
  const vertex = createVertex({
    project: process.env.GOOGLE_VERTEX_PROJECT,
    location: 'us-central1',
  });
 
  // OpenAI direct (better rate limits)
  const openaiProvider = createOpenAI({
    apiKey: process.env.OPENAI_API_KEY,
  });
 
  return {
    'claude-opus-4.5': bedrock('global.anthropic.claude-opus-4-5-20251101-v1:0'),
    'claude-sonnet-4-5': openrouter('anthropic/claude-sonnet-4.5'),
    'claude-haiku-4-5': bedrock('us.anthropic.claude-haiku-4-5-20251001-v1:0'),
    'gemini-3-flash': openrouter('google/gemini-3-flash-preview'),
    'gemini-3-flash-vertex': vertex('gemini-3-flash-preview'),
    'gpt-5.1-codex-max': openaiProvider('gpt-5.1-codex-max'),
  };
}

Why multiple providers?

  • Bedrock: Lower latency for Claude in certain regions, better availability
  • OpenRouter: Unified API, route between providers with one integration
  • Vertex: Only way to get Gemini video support with good quotas
  • OpenAI Direct: Better rate limits and features for GPT models
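
Each of those providers needs its own credentials, and a missing key is much easier to diagnose at boot than mid-request. This is a small startup check, sketched with zod and the environment variable names from the mapping above; the file location is hypothetical:

// server/src/config/env.ts (hypothetical location)
import { z } from 'zod';

const envSchema = z.object({
  AWS_REGION_NAME: z.string(),
  AWS_ACCESS_KEY_ID: z.string(),
  AWS_SECRET_ACCESS_KEY: z.string(),
  OPENROUTER_API_KEY: z.string(),
  GOOGLE_VERTEX_PROJECT: z.string(),
  OPENAI_API_KEY: z.string(),
});

// Throws at boot with a readable list of missing variables
export const env = envSchema.parse(process.env);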

Dynamic Agent Initialization

Once you know which model to use, initialize an agent configured for it:

// server/src/agents/coding/agent.ts
private initializeAgent(forcedModel?: string): Agent<AgentContext> {
  const effectiveModel = forcedModel || this.resolveModel();
 
  // GPT models need different prompts
  const systemPrompt = effectiveModel.startsWith('gpt')
    ? GPT_CODING_AGENT_PROMPT
    : CODING_AGENT_PROMPT;
 
  // Thinking only works on certain models
  const thinkingEnabled = [
    'gpt-5.1-codex-max',
    'claude-opus-4.5',
    'claude-sonnet-4-5',
    'gemini-3-flash',
  ].includes(effectiveModel);
 
  // Codex has 4x context window
  const maxTokens = effectiveModel === 'gpt-5.1-codex-max' ? 200_000 : 50_000;
 
  // Codex uses patch tool, others use diff tool
  const editTool = effectiveModel === 'gpt-5.1-codex-max'
    ? applyPatchTool
    : editDiffTool;
 
  return new Agent({
    name: 'Coding Agent',
    instructions: systemPrompt,
    tools: this.getToolsForMode({ ...this.context, editTool }),
    model: aisdk(this.modelMapping[effectiveModel], {
      enableCaching: true,
    }),
    maxTokens,
    ...(thinkingEnabled && {
      thinking: {
        budget: 4000, // Limit reasoning tokens to control cost
      },
    }),
  });
}

Key adaptations by model:

  • System prompts: GPT responds better to different instruction styles than Claude
  • Thinking: Extended reasoning only supported on newer models, limited to 4k tokens
  • Context windows: Codex handles 200k vs 50k for others
  • Tool formats: Codex uses apply_patch while Claude uses edit_diff, two different editing paradigms (sketched below)
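
For a sense of why those tools differ, here's an illustrative sketch of plausible input schemas; the real applyPatchTool and editDiffTool definitions live in the codebase and may differ:

// Illustrative shapes only, not the codebase's actual schemas
import { z } from 'zod';

// edit_diff: targeted search-and-replace within a single file
const editDiffInput = z.object({
  filePath: z.string(),
  search: z.string(),   // exact text to locate
  replace: z.string(),  // text to substitute
});

// apply_patch: one patch that can span many files at once
const applyPatchInput = z.object({
  patch: z.string(),    // unified-diff style patch body
});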

Cost Tracking with Token Multipliers

Here's how you handle different user tiers and model costs:

// server/src/agents/coding/agent.ts
private getTokenMultiplier(model: ModelName) {
  const isFreeUser = !this.context.user?.subscriptionStatus;
 
  const multipliers: Record<ModelName, {
    input: number;
    cachedInput: number;
    output: number;
  }> = {
    'claude-opus-4.5': isFreeUser
      ? { input: 2.0, cachedInput: 0.2, output: 4.0 }
      : { input: 1.0, cachedInput: 0.1, output: 3.0 },
    'claude-sonnet-4-5': isFreeUser
      ? { input: 1.25, cachedInput: 0.125, output: 3.5 }
      : { input: 1.0, cachedInput: 0.1, output: 2.5 },
    'gemini-3-flash': isFreeUser
      ? { input: 0.06, cachedInput: 0.006, output: 0.4 }
      : { input: 0.05, cachedInput: 0.005, output: 0.3 },
  };
 
  return multipliers[model];
}
 
private async calculateAndDeductTokens(
  usage: TokenUsage,
  model: ModelName
) {
  const multiplier = this.getTokenMultiplier(model);
 
  const weightedTokens =
    (usage.inputTokens * multiplier.input) +
    (usage.cachedInputTokens * multiplier.cachedInput) +
    (usage.outputTokens * multiplier.output);
 
  // Cap at 40k to prevent single huge request from draining all credits
  const tokensToDeduct = Math.min(Math.ceil(weightedTokens), 40_000);
 
  // Atomic RPC function prevents race conditions
  await supabase.rpc('decrement_credits', {
    user_id: this.context.user.id,
    tokens: tokensToDeduct,
  });
}

This accomplishes several things:

Free users pay more: Encourages upgrades without blocking experimentation

Cached tokens are drastically cheaper: Incentivizes prompt caching (which we'll cover)

Output costs more than input: Reflects actual API pricing and generation slowness

Capped deduction: Single runaway request can't drain all credits

Atomic updates: RPC function prevents race conditions when multiple requests run in parallel
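
To see the weighting in action, here's the arithmetic for a hypothetical paid-tier request on claude-sonnet-4-5, using the multipliers above:

// 12,000 input + 8,000 cached + 4,500 output, paid tier, claude-sonnet-4-5
const weightedTokens =
  12_000 * 1.0 +  // input       = 12,000
  8_000 * 0.1 +   // cachedInput =    800
  4_500 * 2.5;    // output      = 11,250

// weightedTokens = 24,050 → below the 40,000 cap, so 24,050 credits are deducted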

Prompt Caching: The Secret Weapon

This is where you really save money. Prompt caching can reduce input token costs by 90%:

// server/src/lib/aisdk.ts
export function aisdk(
  model: LanguageModelV1,
  options: AisdkOptions = {}
): LanguageModelV1 {
  return experimental_wrapLanguageModel({
    model,
    middleware: experimental_createModelMiddleware({
      getCachedMessageIndices: (messages) => {
        const indices: number[] = [];
 
        // Cache last system message
        for (let i = messages.length - 1; i >= 0; i--) {
          if (messages[i].role === 'system') {
            indices.push(i);
            break;
          }
        }
 
        // Cache second-to-last user message
        const userIndices = messages
          .map((m, i) => (m.role === 'user' ? i : -1))
          .filter((i) => i !== -1);
 
        if (userIndices.length >= 2) {
          indices.push(userIndices[userIndices.length - 2]);
 
          // Cache message before second-to-last user message
          const secondLastIdx = userIndices[userIndices.length - 2];
          if (secondLastIdx > 0) {
            indices.push(secondLastIdx - 1);
          }
        }
 
        // Cache last message
        if (messages.length > 0) {
          indices.push(messages.length - 1);
        }
 
        return indices;
      },
    }),
  });
}

What gets cached:

  1. System prompt: Your agent instructions (5k-10k tokens), never changes
  2. Previous user message: When iterating ("now make it blue"), cache the previous exchange
  3. Tool definitions: List of available tools, stays constant
  4. Last message: Current conversation context

The savings:

  • First request: 10,000 input tokens → Pay $0.03 (at Sonnet's $3/M input rate)
  • Second request: 2,000 new + 8,000 cached → Pay $0.006 + $0.0024 = $0.0084
  • Third request: 1,500 new + 8,500 cached → Pay $0.0045 + $0.00255 = $0.00705

That's 72% cheaper on the second request, 76% cheaper on the third. Over a conversation, caching cuts your bill by 60-80%.

Real-World Routing Example

Let's trace an actual request:

User asks: "Implement Stripe checkout with subscription plans"

Step 1: Pre-flight Checks

hasVideoAttachment: false  // Would force Gemini Vertex
isFixingErrors: false      // Would force Flash
beyondInitial: true        // Would use Flash for initial setup
agentMode: 'auto'          // No user override
 
// No overrides, proceed to router

Step 2: Router Analyzes

const result = await routeRequest(
  "Implement Stripe checkout with subscription plans",
  [],
  undefined
);
 
// Router thinks:
// - "Stripe checkout" = payment integration
// - "subscription plans" = recurring billing logic
// - Requires: database schema, API routes, webhook handling
// - This is multi-file, full-stack work
// Decision: claude-sonnet-4-5
 
selectedModel = 'claude-sonnet-4-5'
reasoning = 'Complex payment integration requiring database, API routes, and webhooks'

Step 3: Agent Initialization

this.agent = this.initializeAgent('claude-sonnet-4-5');
 
// Configured with:
// - Model: Claude Sonnet via OpenRouter
// - System prompt: CODING_AGENT_PROMPT (not GPT variant)
// - Thinking: enabled with 4k token budget
// - Max tokens: 50,000
// - Tools: editDiffTool (not applyPatch)
// - Caching: enabled

Step 4: Execution and Cost

// Agent executes:
// 1. Searches for Stripe best practices
// 2. Reads existing code (db schema, API routes)
// 3. Edits schema to add subscriptions table
// 4. Creates checkout API route
// 5. Creates webhook handler
// 6. Tests webhook signature verification
 
// Final usage:
inputTokens: 12,000
cachedInputTokens: 8,000  // System prompt, tool definitions
outputTokens: 4,500
 
// Cost calculation:
(12000 / 1M × $3.00) + (8000 / 1M × $0.30) + (4500 / 1M × $15.00)
= $0.036 + $0.0024 + $0.0675
= $0.1059 (~11 cents)
 
// If we'd used Opus:
= $0.18 + $0.012 + $0.3375
= $0.5295 (~53 cents) — 5x more expensive
 
// If we'd used Flash:
= ~$0.002, but would fail or need retries
= Total cost after retries: potentially higher than Sonnet

What If We'd Routed Wrong?

Scenario A: Used Flash

  • Cost: $0.002
  • Result: Partial implementation, missed webhook signature verification
  • User retries: "The webhook isn't working"
  • Second request on Sonnet: $0.10
  • Total: $0.102, but wasted time and bad UX

Scenario B: Used Opus

  • Cost: $0.53
  • Result: Perfect implementation
  • 5x more expensive for same outcome

Good routing found the sweet spot: capable model, reasonable cost, first-try success.

When Routing Gets It Wrong

Your router won't be perfect. Build in recovery:

async processRequest(userPrompt: string) {
  let selectedModel = await routeRequest(userPrompt, [], undefined);
  let attempt = 0;
 
  while (attempt < 2) {
    try {
      this.agent = this.initializeAgent(selectedModel);
      const result = await run(this.agent, messages);
 
      // Success! Break out
      if (result.success) {
        return result;
      }
 
      // Failed with weak model, escalate to stronger
      if (attempt === 0 && selectedModel === 'gemini-3-flash') {
        console.log('[Router] Flash failed, escalating to Sonnet');
        selectedModel = 'claude-sonnet-4-5';
      }
 
      attempt++;
    } catch (error) {
      if (attempt >= 1) throw error;
      attempt++;
    }
  }

  // Both attempts exhausted without success: surface an explicit error
  throw new Error('Request failed after model escalation');
}

Escalation strategy:

  1. Try router's choice
  2. If it fails and was cheap model, escalate to Sonnet
  3. Return result or error

This costs more when routing is wrong, but ensures quality. Over time, use these escalations to improve your router prompt.
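
One way to close that loop, sketched below, is to log every escalation to a table you can review when tuning the router prompt. The table name and helper are assumptions:

// Hypothetical: persist escalations for later analysis of router misses
async function logEscalation(prompt: string, from: ModelName, to: ModelName) {
  await supabase.from('routing_escalations').insert({
    prompt,
    routed_to: from,
    escalated_to: to,
    created_at: new Date().toISOString(),
  });
}

// In the escalation branch above:
// await logEscalation(userPrompt, 'gemini-3-flash', 'claude-sonnet-4-5');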

Testing Your Router

Build test cases to catch regressions:

const routingTests = [
  {
    prompt: 'Change button color to blue',
    expected: 'gemini-3-flash',
    reason: 'Simple UI change',
  },
  {
    prompt: 'Implement OAuth with Google and GitHub',
    expected: 'claude-sonnet-4-5',
    reason: 'Complex auth integration',
  },
  {
    prompt: 'Add console.log to line 42',
    expected: 'gemini-3-flash',
    reason: 'Trivial edit',
  },
  {
    prompt: 'Create database tables for blog with comments',
    expected: 'claude-sonnet-4-5',
    reason: 'Database schema design',
  },
];
 
for (const test of routingTests) {
  const result = await routeRequest(test.prompt, [], undefined);
  console.assert(
    result === test.expected,
    `Failed: ${test.prompt}\nExpected ${test.expected}, got ${result}\n${test.reason}`
  );
}

Run these on every router prompt change. Your routing logic will drift as you tweak it. Tests catch the drift.
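
If hard asserts feel too brittle, a variant is to track accuracy across the suite and fail only below a threshold; the 90% cutoff here is an arbitrary assumption:

// Reuses routingTests and routeRequest from above
let correct = 0;
for (const test of routingTests) {
  const result = await routeRequest(test.prompt, [], undefined);
  if (result === test.expected) correct++;
  else console.warn(`Routing miss: "${test.prompt}" → got ${result} (${test.reason})`);
}

const accuracy = correct / routingTests.length;
if (accuracy < 0.9) {
  throw new Error(`Router accuracy ${(accuracy * 100).toFixed(0)}% is below 90%`);
}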

Real-World Results

Here's what good routing achieves:

Before routing (everything on Sonnet):

  • Average: $0.12 per request
  • Monthly for 10k requests: $1,200
  • Average latency: 8.5s

After routing:

  • 70% on Flash (simple): $0.02 × 7,000 = $140
  • 30% on Sonnet (complex): $0.12 × 3,000 = $360
  • Total: $500 per month (58% reduction)
  • Average latency: 4.2s (50% faster)
  • Retry rate: unchanged (quality maintained)

You're not sacrificing quality. You're matching capability to need.

Common Mistakes

Using the expensive model as default: If your fallback is Opus, every router failure costs you money. Default to Flash.

Not caching aggressively: Caching gives you 10x ROI. Enable it on every model.

Ignoring model capabilities: Flash can't handle video. Sonnet can't handle 150k token context. Check capabilities before routing.

Static routing rules: User behavior changes. Model capabilities improve. Update your router monthly based on real usage data.

Over-engineering: Don't create 10 routing tiers. Simple vs complex is usually enough.

The Current Reality

Here's a secret: in the actual codebase, the router is currently disabled. Everything goes to Gemini Flash by default:

// server/src/agents/coding/router.ts (lines 88-92)
// Router is temporarily disabled - always return Flash
export async function routeRequest(...): Promise<ModelName> {
  return 'gemini-3-flash';
}

Why? The team is still tuning the routing logic. But the infrastructure is there. When enabled, it will use the classifier approach we've discussed.

This is actually a good reminder: ship first, optimize later. Start with a single good model. Add routing when you understand your usage patterns. Don't over-engineer upfront.

What's Next

You understand how to route requests to optimal models. But routing is only part of the cost equation. In the next guide, we'll explore the tool system—how to build tools that agents can use, how to structure tool interfaces, and how to handle tool failures gracefully.

Tools are where your agent goes from "chat about code" to "actually write code."