LLM Router
Definition
An LLM router is an intelligent dispatch layer that decides which model (or model configuration) should handle each incoming query. Rather than sending all queries to a single expensive frontier model, a router classifies queries by complexity, domain, or required capability and dispatches them to the appropriate model: fast, cheap models (GPT-4o-mini, Claude Haiku, local Llama-3-8B) for simple queries; capable frontier models (GPT-4o, Claude 3.5 Sonnet) for complex or high-stakes queries. Routing logic can be rule-based (keyword detection, query length thresholds), ML-based (a small classifier), or LLM-based (asking a small model 'how complex is this query?'). Tools like RouteLLM and LiteLLM provide routing infrastructure.
Why It Matters
LLM routing is one of the highest-ROI optimizations for production AI deployments. In most support chatbot deployments, 60-70% of queries are simple factual questions that a capable small model answers just as well as a frontier model. Routing these to a cheaper model while reserving the frontier model for genuinely complex queries can reduce total LLM costs by 40-60% with no perceptible quality degradation. For 99helpers customers at scale, with thousands of queries per day, this represents substantial savings. Routing also enables quality tiers: enterprise customers could get routing that prioritizes quality, while free-tier users get efficient routing.
How It Works
Router implementation options: (1) rule-based—query length < 50 tokens → simple model; contains technical jargon → complex model; (2) classifier—train a lightweight model to predict required capability; (3) LLM-based—use a fast small model to assess complexity before routing: 'Rate this query complexity 1-5: [query]'; (4) cascading—try cheap model first; if confidence is low (logprob below threshold), escalate to expensive model. RouteLLM implements several routing algorithms trained on human preference data, enabling automatic quality-cost tradeoff optimization. LiteLLM provides a unified API for routing across providers with fallback chains.
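Option (1) can be sketched in a few lines. The thresholds, keyword list, and model names below are illustrative assumptions, not part of RouteLLM or LiteLLM:

```python
import re

# Cheap and frontier tiers (model names are illustrative placeholders).
SIMPLE_MODEL = "gpt-4o-mini"
COMPLEX_MODEL = "gpt-4o"

# Crude jargon detector standing in for "contains technical jargon".
COMPLEX_SIGNALS = re.compile(
    r"\b(debug|refactor|prove|architecture|compare|analy[sz]e|step[- ]by[- ]step)\b",
    re.IGNORECASE,
)

def route(query: str, max_simple_tokens: int = 50) -> str:
    """Pick a model for a query using cheap heuristics.

    Rule 1: technical jargon -> frontier model.
    Rule 2: short query -> cheap model.
    Default: frontier model (fail safe toward quality).
    """
    approx_tokens = len(query.split())  # rough word-count proxy for tokens
    if COMPLEX_SIGNALS.search(query):
        return COMPLEX_MODEL
    if approx_tokens < max_simple_tokens:
        return SIMPLE_MODEL
    return COMPLEX_MODEL

print(route("What is the capital of France?"))        # short, factual -> cheap tier
print(route("Debug this race condition in my code"))  # jargon -> frontier tier
```

Note the default direction: an unmatched query falls through to the expensive model, trading a little cost for safety against misrouting.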
Diagram: LLM Router — Complexity-Based Model Selection
An incoming query (e.g. "What is the capital of France?") passes through a router/classifier (a fast model or rule-based scoring on complexity, task type, and cost budget), which dispatches it to one of three tiers:
- GPT-4o: multi-step reasoning, code, analysis ($10 / 1M output tokens)
- Claude Haiku: summarization, translation, Q&A ($1.25 / 1M output tokens)
- Llama 3 8B: intent classification, slot filling ($0.06 / 1M output tokens)
Routing simple queries to cheaper models can reduce LLM spend by 60-80% with minimal quality degradation; the router itself is a tiny, fast model or rule set.
Real-World Example
A 99helpers platform implements a three-tier routing system: (1) Llama-3-8B (self-hosted, ~$0.00005/query) for simple FAQ queries (identified by embedding similarity to a library of simple questions); (2) Claude 3.5 Haiku ($0.00025/query) for moderate complexity queries; (3) Claude 3.5 Sonnet ($0.003/query) for complex queries (multi-step reasoning, escalation requests, sentiment indicating frustration). Distribution: 60% tier 1, 30% tier 2, 10% tier 3. Blended cost: 0.6×$0.00005 + 0.3×$0.00025 + 0.1×$0.003 = $0.000405/query. Without routing (all Sonnet): $0.003/query. 86% cost reduction with equivalent user-perceived quality.
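The blended-cost arithmetic above can be checked directly (per-query costs and the traffic split are taken from the example):

```python
# Three-tier routing example: (traffic share, $/query) per tier.
tiers = {
    "llama-3-8b":        (0.60, 0.00005),
    "claude-3.5-haiku":  (0.30, 0.00025),
    "claude-3.5-sonnet": (0.10, 0.00300),
}

# Blended cost is the traffic-weighted average of per-query costs.
blended = sum(share * cost for share, cost in tiers.values())
baseline = 0.003  # everything routed to Sonnet, no router
savings = 1 - blended / baseline

print(f"blended cost: ${blended:.6f}/query")      # $0.000405/query
print(f"savings vs. all-Sonnet: {savings:.1%}")   # 86.5%
```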
Common Mistakes
- ✕ Building a complex router before profiling query distribution—start by analyzing what fraction of your actual queries are genuinely complex before building a multi-tier routing system.
- ✕ Setting hard routing rules without a fallback—if the router misclassifies a complex query as simple, the cheap model produces a poor response; monitor quality by tier.
- ✕ Ignoring routing latency overhead—a routing decision that adds 200ms of latency is not worth a 5% cost savings for latency-sensitive applications.
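The fallback point is the cascading pattern from "How It Works": try the cheap model, escalate when confidence is low. A minimal sketch, where `call_model` is a stub standing in for a real provider call that returns an answer plus a mean token logprob (the threshold and model names are assumptions):

```python
def call_model(model: str, query: str) -> tuple[str, float]:
    """Hypothetical LLM call returning (answer, mean token logprob).

    Stubbed for illustration: pretend the cheap model is unsure about
    open-ended "why" questions and confident about everything else.
    """
    confidence = -2.5 if "why" in query.lower() else -0.2
    return f"[{model} answer]", confidence

def cascade(query: str, threshold: float = -1.0) -> str:
    """Cheap model first; escalate to the expensive model on low logprob."""
    answer, logprob = call_model("claude-haiku", query)
    if logprob >= threshold:  # confident enough: keep the cheap answer
        return answer
    answer, _ = call_model("claude-sonnet", query)  # escalate
    return answer

print(cascade("What is the capital of France?"))
print(cascade("Why does my distributed lock deadlock?"))
```

In production the escalation path doubles latency and cost for escalated queries, so the threshold is worth calibrating against the quality-by-tier monitoring the bullets above recommend.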
Related Terms
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
Model Evaluation
Model evaluation is the systematic process of measuring an LLM's performance on relevant tasks and quality dimensions, guiding decisions about model selection, fine-tuning, and deployment readiness.
Adaptive RAG
Adaptive RAG dynamically selects the retrieval strategy—no retrieval, single-step retrieval, or multi-step iterative retrieval—based on the complexity of each query, optimizing cost and latency without sacrificing answer quality.
Guardrails
Guardrails are input and output validation mechanisms layered around LLM calls to detect and block unsafe, off-topic, or non-compliant content, providing application-level safety beyond the model's built-in alignment.