Back to home

01 — CORE ARCHITECTURE

Model mapping & max tokens

5 min read

Every model has limits. Claude handles 200K tokens. GPT-5 pushes 1M, and Gemini's latest models match it. Ignore these limits and your requests fail. Respect them and your agent stays reliable.

But it's not just about staying under the cap. Different models need different configurations. Some excel with long context. Others are fast but constrained. Your agent needs to know which model it's talking to and set limits accordingly.

We'll build a system that maps models to their token limits, estimates usage before making requests, and gracefully handles when you're over budget. This keeps your agent working even when context gets large.

Why Token Limits Matter

Two limits control what your agent can do:

Context window: How much total context (input + output) the model can handle in one request. Claude Opus 4.5 has a 200K token context window. That's your ceiling.

Max output tokens: How many tokens the model can generate in its response. You set this per request. Go too high and you waste time. Go too low and responses get cut off.

Your agent needs both. Context window determines if your request fits. Max output tokens controls how long the response can be.

Get either wrong and things break. Request too large? API error. Output limit too small? Incomplete code. Output limit too large? Slow responses and wasted tokens.
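
A small type keeps the two numbers together so you never confuse them. This is just an illustrative shape, not a type from any SDK; the values come from the mappings later in this guide:

// Illustrative shape for per-model limits (not from any SDK)
interface ModelLimits {
  contextWindow: number;    // total budget: input + output, per request
  maxOutputTokens: number;  // cap we set on each response
}

const EXAMPLE_LIMITS: Record<string, ModelLimits> = {
  'claude-opus-4.5': { contextWindow: 200000, maxOutputTokens: 50000 },
  'gpt-5.1-codex-max': { contextWindow: 1000000, maxOutputTokens: 200000 },
};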

Setting Up a Max Tokens Mapping

Create a mapping from model slugs to their ideal max output tokens:

export class CodingAgent {
  private maxTokensMapping: Record<string, number> = {
    'claude-sonnet-4-5': 50000,
    'claude-opus-4.5': 50000,
    'claude-haiku-4-5': 50000,
    'gpt-5.1-codex-max': 200000,  // Large output budget for big operations
    'gemini-3-flash': 50000,
    'grok-4-fast': 25000,         // Faster but smaller output
  };
 
  private getMaxTokens(modelSlug: string): number {
    return this.maxTokensMapping[modelSlug] || 8192;  // Safe default
  }
}

When building your agent, look up the max tokens:

private buildAgent(modelSlug: string) {
  const model = this.modelMapping[modelSlug];  // slug -> provider model instance, set up elsewhere in the class
  const maxTokens = this.getMaxTokens(modelSlug);
 
  return new Agent({
    name: 'Coding Agent',
    instructions: this.createPrompt(),
    tools: this.loadTools(modelSlug),
    model: model,
    modelSettings: {
      maxTokens,
      truncation: 'auto',
    }
  });
}

Now every model gets the right output limit. Codex Max can generate 200K tokens for massive refactors. Grok stays at 25K for quick responses.

Understanding Context Windows

Context window is different from max output tokens. It's the total budget for the entire conversation.

Here's a practical mapping:

const MODEL_CONTEXT_WINDOWS: Record<string, number> = {
  // Claude models
  'claude-opus-4.5': 200000,
  'claude-sonnet-4-5': 200000,
  'claude-haiku-4-5': 200000,
 
  // GPT models
  'gpt-5.1-codex-max': 1000000,
  'gpt-5.2': 1000000,
 
  // Gemini models
  'gemini-3-flash': 1000000,
  'gemini-2.5-pro': 1000000,
 
  // Default fallback
  'default': 128000,
};
 
function getContextWindow(modelSlug: string): number {
  const modelLower = modelSlug.toLowerCase();
 
  // Try exact match
  if (MODEL_CONTEXT_WINDOWS[modelLower]) {
    return MODEL_CONTEXT_WINDOWS[modelLower];
  }
 
  // Try partial match
  for (const [key, value] of Object.entries(MODEL_CONTEXT_WINDOWS)) {
    if (key !== 'default' && modelLower.includes(key)) {
      return value;
    }
  }
 
  return MODEL_CONTEXT_WINDOWS['default'];
}

Use this when deciding if your conversation history fits. If you're at 180K tokens and the context window is 200K, you're close to the limit. Time to truncate.
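
For example, with Opus 4.5's 200K window and the 50K output cap from the mapping above, 180K of history leaves no headroom for a full response:

// Illustrative: the 180K-in-a-200K-window case described above
const contextWindow = getContextWindow('claude-opus-4.5');  // 200000
const historyTokens = 180000;                               // your running estimate
const maxOutput = 50000;                                    // from the max tokens mapping

if (historyTokens + maxOutput > contextWindow) {
  console.log('Not enough headroom for a full response, truncate first');
}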

Token Counting Strategies

You need to estimate tokens before making requests. Two approaches work:

Character-Based Estimation (Fast)

Rough but fast. Use ~4 characters per token:

function estimateTokens(text: string): number {
  // Add overhead for role/structure
  const overhead = 16;  // ~4 tokens
  return Math.ceil((text.length + overhead) / 4);
}
 
function estimateMessageTokens(message: any): number {
  let charCount = 16;  // Role overhead
 
  if (typeof message.content === 'string') {
    charCount += message.content.length;
  } else if (Array.isArray(message.content)) {
    for (const part of message.content) {
      if (part.type === 'text' && part.text) {
        charCount += part.text.length;
      } else if (part.type === 'image') {
        charCount += 4000;  // Images are expensive (~1000 tokens)
      }
    }
  }
 
  return Math.ceil(charCount / 4);
}

Good enough for most cases. Images get a fixed estimate. Text scales with length.
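
The truncation code later in this guide also assumes a helper that sums these per-message estimates. A minimal version:

// Sum per-message estimates across the whole conversation
function estimateTotalTokens(messages: any[]): number {
  return messages.reduce(
    (total, message) => total + estimateMessageTokens(message),
    0
  );
}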

Tiktoken-Based Counting (Accurate)

For precise counting, use tiktoken with the o200k_base encoding:

import { Tiktoken } from 'js-tiktoken/lite';
import o200k_base from 'js-tiktoken/ranks/o200k_base';
 
function countTokensAccurate(messages: any[]): number {
  // Constructing the encoder is relatively expensive; reuse one instance if you call this often
  const encoding = new Tiktoken(o200k_base);
  let totalTokens = 0;
 
  for (const message of messages) {
    // Count role
    totalTokens += encoding.encode(message.role).length;
 
    // Count content
    if (typeof message.content === 'string') {
      totalTokens += encoding.encode(message.content).length;
    } else if (Array.isArray(message.content)) {
      for (const item of message.content) {
        if (item.type === 'text' && item.text) {
          totalTokens += encoding.encode(item.text).length;
        } else if (item.type === 'image') {
          totalTokens += 85;  // Base cost for images
        }
      }
    }
  }
 
  // js-tiktoken is pure JS, so there's no encoder handle to free
  return totalTokens;
}

Accurate but slower. Use this when you need exact counts or you're close to the limit.
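
One practical compromise: use the cheap character estimate by default and only pay for an exact count when the estimate lands near the limit. A sketch, built on the two approaches above:

// Hybrid: fast estimate first, exact count only when it matters
function countTokensHybrid(
  messages: any[],
  contextWindow: number,
  threshold: number = 0.85
): number {
  const estimate = estimateTotalTokens(messages);

  // Comfortably under budget: the rough estimate is good enough
  if (estimate < contextWindow * threshold * 0.9) {
    return estimate;
  }

  // Close to the limit: spend the time on an exact count
  return countTokensAccurate(messages);
}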

Handling Context Overflow

When your conversation history exceeds the context window, you need to truncate. Don't just drop everything—be strategic.

Set a Threshold

Don't wait until you hit the limit. Truncate at 80-85% of the context window:

const CONTEXT_THRESHOLD = 0.85;
 
function shouldTruncate(
  currentTokens: number,
  contextWindow: number
): boolean {
  const maxAllowed = Math.floor(contextWindow * CONTEXT_THRESHOLD);
  return currentTokens > maxAllowed;
}

This gives you room for the response without hitting the ceiling.
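
Wired to the context window lookup from earlier, the check becomes one small wrapper (assuming the helpers defined above):

// Decide up front whether the history needs trimming for this model
function needsTrimming(messages: any[], modelSlug: string): boolean {
  const contextWindow = getContextWindow(modelSlug);
  const currentTokens = estimateTotalTokens(messages);
  return shouldTruncate(currentTokens, contextWindow);
}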

Truncation Strategy

Always preserve:

  • System messages (they define behavior)
  • The current user message (it's what you're responding to)

For everything else, drop in this order:

  1. Compress tool results first: Summarize or drop old tool outputs
  2. Compress assistant messages: Shorten reasoning or older responses
  3. Drop entire conversation turns: Remove old assistant + tool result pairs

Here's the pattern:

function truncateMessages(
  messages: any[],
  contextWindow: number,
  threshold: number = 0.85
): any[] {
  const maxTokens = Math.floor(contextWindow * threshold);
  let currentTokens = estimateTotalTokens(messages);
 
  if (currentTokens <= maxTokens) {
    return messages;  // Already fits
  }
 
  const result: any[] = [];
  let systemMessage: any = null;
  let userMessage: any = null;
 
  // Preserve system and last user message
  for (const msg of messages) {
    if (msg.role === 'system') {
      systemMessage = msg;
    } else if (msg.role === 'user') {
      userMessage = msg;  // Keep overwriting to get the last one
    }
  }
 
  // Add system message first
  if (systemMessage) {
    result.push(systemMessage);
  }
 
  // Add recent history (skip system and last user)
  const historyMessages = messages.filter(
    msg => msg !== systemMessage && msg !== userMessage
  );
 
  // Take most recent messages until we hit the budget
  let historyTokens = 0;
  const allowedHistoryTokens = maxTokens -
    (systemMessage ? estimateMessageTokens(systemMessage) : 0) -
    (userMessage ? estimateMessageTokens(userMessage) : 0);
 
  // Walk backwards from the newest history message. Inserting each kept
  // message at a fixed position right after the system message preserves
  // chronological order.
  const insertAt = systemMessage ? 1 : 0;
  for (let i = historyMessages.length - 1; i >= 0; i--) {
    const msg = historyMessages[i];
    const msgTokens = estimateMessageTokens(msg);
 
    if (historyTokens + msgTokens <= allowedHistoryTokens) {
      result.splice(insertAt, 0, msg);
      historyTokens += msgTokens;
    } else {
      break;  // No more room
    }
  }
 
  // Add current user message last
  if (userMessage) {
    result.push(userMessage);
  }
 
  return result;
}

Recent context matters more than old context. Keep the latest, drop the oldest.
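
The pattern above only covers step 3, dropping whole messages. Step 1, compressing old tool results, can run as a cheaper first pass before anything gets dropped. A minimal sketch, assuming tool results arrive as messages with role 'tool' and string content (adjust to however your SDK represents them):

// First pass: clip old tool outputs before dropping whole turns
// Assumes tool results are messages with role 'tool' and string content
function compressToolResults(
  messages: any[],
  keepRecent: number = 2,    // leave the newest tool results untouched
  maxChars: number = 2000    // clip older ones to roughly 500 tokens
): any[] {
  const toolIndexes = messages
    .map((msg, i) => (msg.role === 'tool' ? i : -1))
    .filter(i => i !== -1);

  const cutoff = toolIndexes.length - keepRecent;

  return messages.map((msg, i) => {
    const position = toolIndexes.indexOf(i);
    if (position === -1 || position >= cutoff) {
      return msg;  // not a tool result, or recent enough to keep whole
    }
    if (typeof msg.content !== 'string' || msg.content.length <= maxChars) {
      return msg;
    }
    return {
      ...msg,
      content: msg.content.slice(0, maxChars) + '\n[...tool output truncated]',
    };
  });
}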

Model-Specific Considerations

Different models need different handling:

Extended Thinking Models

Some models support "thinking" or "reasoning" tokens. Reserve budget for them:

const THINKING_ENABLED_MODELS = [
  'claude-opus-4.5',
  'claude-sonnet-4-5',
  'gpt-5.2',
];
 
private buildModelSettings(modelSlug: string) {
  const maxTokens = this.getMaxTokens(modelSlug);
  const enableThinking = THINKING_ENABLED_MODELS.includes(modelSlug);
 
  return {
    maxTokens,
    truncation: 'auto',
    ...(enableThinking && {
      thinkingBudgetTokens: 4000,  // Reserve 4K for reasoning
    }),
  };
}
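
If you fold this into buildAgent from earlier, the helper simply replaces the inline settings object (a sketch, using the same Agent constructor as before):

private buildAgent(modelSlug: string) {
  const model = this.modelMapping[modelSlug];

  return new Agent({
    name: 'Coding Agent',
    instructions: this.createPrompt(),
    tools: this.loadTools(modelSlug),
    model: model,
    modelSettings: this.buildModelSettings(modelSlug),
  });
}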

High-Context Models

Models like GPT-5.1 Codex Max can handle massive context. Use them for big refactors:

function selectModel(taskType: string): string {
  if (taskType === 'large-refactor' || taskType === 'codebase-analysis') {
    return 'gpt-5.1-codex-max';  // 1M context window
  }
 
  if (taskType === 'quick-fix') {
    return 'claude-haiku-4-5';  // Fast and cheap
  }
 
  return 'claude-sonnet-4-5';  // Balanced default
}

Fast but Limited Models

Some models trade output length or context for speed:

const FAST_MODELS = {
  'grok-4-fast': { maxTokens: 25000, contextWindow: 131072 },
  'gemini-3-flash': { maxTokens: 50000, contextWindow: 1000000 },
};

Use these for quick operations where you don't need full history.
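
For those models it's often enough to send the system prompt plus the last few turns instead of running the full truncation pass. A rough sketch:

// Keep the system prompt plus only the most recent turns for fast models
function trimForFastModel(messages: any[], keepTurns: number = 6): any[] {
  const system = messages.filter(msg => msg.role === 'system');
  const rest = messages.filter(msg => msg.role !== 'system');
  return [...system, ...rest.slice(-keepTurns)];
}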

Putting It Together

Here's how it flows in practice:

async processRequest(userPrompt: string, conversationHistory: any[]) {
  await this.initialize();
 
  // Build messages
  const messages = [
    { role: 'system', content: this.createPrompt() },
    ...conversationHistory,
    { role: 'user', content: this.buildUserMessage(userPrompt) }
  ];
 
  // Get model info
  const contextWindow = getContextWindow(this.agentMode);
  const maxTokens = this.getMaxTokens(this.agentMode);
 
  // Truncate if needed
  const truncatedMessages = truncateMessages(
    messages,
    contextWindow,
    0.85
  );
 
  // Build agent with token limits
  const agent = this.buildAgent(this.agentMode);
 
  // Run with streaming
  const stream = await run(agent, truncatedMessages, {
    context: this.buildContext(),
    maxTurns: 100,
    stream: true,
  });
 
  return stream;
}

Check context window. Truncate if over budget. Set max output tokens. Everything stays within bounds.

What We're Skipping

There's more you might add—dynamic token budgets based on task complexity, smart compression algorithms, token usage tracking for billing. We'll cover these in later guides.

Right now, focus on the basics. Map models to limits, estimate tokens, truncate when needed. Get this working first, then optimize.

What's Next

You have models configured with proper token limits. In the next guide, we'll build the actual tools your agent can call—reading files, searching code, executing commands.

That's where your agent becomes truly capable.