LLM Router
Definition
An LLM router is an intelligent dispatch layer that decides which model (or model configuration) should handle each incoming query. Rather than sending all queries to a single expensive frontier model, a router classifies queries by complexity, domain, or required capability and dispatches them to the appropriate model: fast, cheap models (GPT-4o-mini, Claude Haiku, local Llama-3-8B) for simple queries; capable frontier models (GPT-4o, Claude 3.5 Sonnet) for complex or high-stakes queries. Routing logic can be rule-based (keyword detection, query length thresholds), ML-based (a small classifier), or LLM-based (asking a small model 'how complex is this query?'). Tools like RouteLLM and LiteLLM provide routing infrastructure.
Why It Matters
LLM routing is one of the highest-ROI optimizations for production AI deployments. In most support chatbot deployments, 60-70% of queries are simple factual questions that a capable small model answers just as well as a frontier model. Routing these to a cheaper model while reserving the frontier model for genuinely complex queries can reduce total LLM costs by 40-60% with no perceptible quality degradation. For 99helpers customers at scale, with thousands of queries per day, this represents substantial savings. Routing also enables quality tiers: enterprise customers could get routing that prioritizes quality, while free-tier users get efficient routing.
How It Works
Router implementation options: (1) rule-based—query length < 50 tokens → simple model; contains technical jargon → complex model; (2) classifier—train a lightweight model to predict required capability; (3) LLM-based—use a fast small model to assess complexity before routing: 'Rate this query complexity 1-5: [query]'; (4) cascading—try cheap model first; if confidence is low (logprob below threshold), escalate to expensive model. RouteLLM implements several routing algorithms trained on human preference data, enabling automatic quality-cost tradeoff optimization. LiteLLM provides a unified API for routing across providers with fallback chains.
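Option (1) can be sketched in a few lines. The thresholds, keyword list, and model names below are illustrative assumptions, not part of RouteLLM or LiteLLM:

```python
import re

# Cheap and frontier tiers (model names are illustrative placeholders).
SIMPLE_MODEL = "gpt-4o-mini"
COMPLEX_MODEL = "gpt-4o"

# Crude jargon detector standing in for "contains technical jargon".
COMPLEX_SIGNALS = re.compile(
    r"\b(debug|refactor|prove|architecture|compare|analy[sz]e|step[- ]by[- ]step)\b",
    re.IGNORECASE,
)

def route(query: str, max_simple_tokens: int = 50) -> str:
    """Pick a model for a query using cheap heuristics.

    Rule 1: technical jargon -> frontier model.
    Rule 2: short query -> cheap model.
    Default: frontier model (fail safe toward quality).
    """
    approx_tokens = len(query.split())  # rough word-count proxy for tokens
    if COMPLEX_SIGNALS.search(query):
        return COMPLEX_MODEL
    if approx_tokens < max_simple_tokens:
        return SIMPLE_MODEL
    return COMPLEX_MODEL

print(route("What is the capital of France?"))        # short, factual -> cheap tier
print(route("Debug this race condition in my code"))  # jargon -> frontier tier
```

Note the default direction: an unmatched query falls through to the expensive model, trading a little cost for safety against misrouting.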
Diagram: LLM Router — Complexity-Based Model Selection
An incoming query (e.g. "What is the capital of France?") passes through a router/classifier (a fast model or rule-based scoring on complexity, task type, and cost budget), which dispatches it to one of three tiers:
- GPT-4o: multi-step reasoning, code, analysis ($10 / 1M output tokens)
- Claude Haiku: summarization, translation, Q&A ($1.25 / 1M output tokens)
- Llama 3 8B: intent classification, slot filling ($0.06 / 1M output tokens)
Routing simple queries to cheaper models can reduce LLM spend by 60-80% with minimal quality degradation; the router itself is a tiny, fast model or rule set.
Real-World Example
A 99helpers platform implements a three-tier routing system: (1) Llama-3-8B (self-hosted, ~$0.00005/query) for simple FAQ queries (identified by embedding similarity to a library of simple questions); (2) Claude 3.5 Haiku ($0.00025/query) for moderate complexity queries; (3) Claude 3.5 Sonnet ($0.003/query) for complex queries (multi-step reasoning, escalation requests, sentiment indicating frustration). Distribution: 60% tier 1, 30% tier 2, 10% tier 3. Blended cost: 0.6×$0.00005 + 0.3×$0.00025 + 0.1×$0.003 = $0.000405/query. Without routing (all Sonnet): $0.003/query. 86% cost reduction with equivalent user-perceived quality.
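The blended-cost arithmetic above can be checked directly (per-query costs and the traffic split are taken from the example):

```python
# Three-tier routing example: (traffic share, $/query) per tier.
tiers = {
    "llama-3-8b":        (0.60, 0.00005),
    "claude-3.5-haiku":  (0.30, 0.00025),
    "claude-3.5-sonnet": (0.10, 0.00300),
}

# Blended cost is the traffic-weighted average of per-query costs.
blended = sum(share * cost for share, cost in tiers.values())
baseline = 0.003  # everything routed to Sonnet, no router
savings = 1 - blended / baseline

print(f"blended cost: ${blended:.6f}/query")      # $0.000405/query
print(f"savings vs. all-Sonnet: {savings:.1%}")   # 86.5%
```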
Common Mistakes
- ✕ Building a complex router before profiling query distribution—start by analyzing what fraction of your actual queries are genuinely complex before building a multi-tier routing system.
- ✕ Setting hard routing rules without a fallback—if the router misclassifies a complex query as simple, the cheap model produces a poor response; monitor quality by tier.
- ✕ Ignoring routing latency overhead—a routing decision that adds 200ms of latency is not worth a 5% cost savings for latency-sensitive applications.
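The fallback point is the cascading pattern from "How It Works": try the cheap model, escalate when confidence is low. A minimal sketch, where `call_model` is a stub standing in for a real provider call that returns an answer plus a mean token logprob (the threshold and model names are assumptions):

```python
def call_model(model: str, query: str) -> tuple[str, float]:
    """Hypothetical LLM call returning (answer, mean token logprob).

    Stubbed for illustration: pretend the cheap model is unsure about
    open-ended "why" questions and confident about everything else.
    """
    confidence = -2.5 if "why" in query.lower() else -0.2
    return f"[{model} answer]", confidence

def cascade(query: str, threshold: float = -1.0) -> str:
    """Cheap model first; escalate to the expensive model on low logprob."""
    answer, logprob = call_model("claude-haiku", query)
    if logprob >= threshold:  # confident enough: keep the cheap answer
        return answer
    answer, _ = call_model("claude-sonnet", query)  # escalate
    return answer

print(cascade("What is the capital of France?"))
print(cascade("Why does my distributed lock deadlock?"))
```

In production the escalation path doubles latency and cost for escalated queries, so the threshold is worth calibrating against the quality-by-tier monitoring the bullets above recommend.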
Related Terms
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
Model Evaluation
Model evaluation is the systematic process of measuring an LLM's performance on relevant tasks and quality dimensions, guiding decisions about model selection, fine-tuning, and deployment readiness.
Adaptive RAG
Adaptive RAG dynamically selects the retrieval strategy—no retrieval, single-step retrieval, or multi-step iterative retrieval—based on the complexity of each query, optimizing cost and latency without sacrificing answer quality.
Guardrails
Guardrails are input and output validation mechanisms layered around LLM calls to detect and block unsafe, off-topic, or non-compliant content, providing application-level safety beyond the model's built-in alignment.