Semantic Caching
Definition
Traditional caching returns stored results only for exact duplicate inputs, which makes it useless for natural language queries where users phrase the same question in countless ways. Semantic caching converts queries to embeddings, stores them in a vector database alongside their model responses, and checks incoming queries for vector similarity to cached entries. Queries whose cosine similarity to a cached query exceeds a threshold return the cached response without invoking the model. Frameworks like GPTCache, LangChain's semantic cache, and Redis's RedisVL implement semantic caching patterns.
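The similarity check at the heart of this pattern fits in a few lines of plain Python. The three-dimensional vectors and the 0.95 threshold below are made-up stand-ins for real embedding output, which typically has hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.95  # illustrative; real systems tune this per workload

# Invented vectors: two paraphrases point in nearly the same direction,
# an unrelated query does not.
q_cached  = [0.12, 0.80, 0.55]  # "How do I cancel my subscription?"
q_similar = [0.10, 0.82, 0.53]  # "Can I end my plan early?"
q_other   = [0.90, 0.05, 0.10]  # "What are your business hours?"

print(cosine_similarity(q_cached, q_similar) >= THRESHOLD)  # True  -> serve from cache
print(cosine_similarity(q_cached, q_other) >= THRESHOLD)    # False -> call the model
```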
Why It Matters
Semantic caching can eliminate 20-40% of LLM API calls for customer-facing applications where users frequently ask similar questions. For a customer support chatbot, hundreds of users may ask variations of 'how do I cancel my subscription?' — semantic caching serves all these variations from a single cached response, drastically reducing LLM inference costs and response latency. Caching is especially valuable for knowledge base Q&A where the answer space is bounded and questions repeat across user sessions.
How It Works
When a query arrives, it is converted to an embedding using the same encoder model used to populate the cache. A vector similarity search finds the nearest cached entry. If the similarity score exceeds the threshold (typically 0.92-0.97 cosine similarity), the cached response is returned and the result is marked as a cache hit for telemetry. Cache misses proceed to the LLM; their responses are stored in the cache with the query embedding. Cache invalidation removes entries when the underlying knowledge base changes.
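The lookup/store loop above can be sketched as a minimal in-memory cache. The class and method names here are illustrative, not any library's API, and the character-frequency `embed` function is a toy stand-in for a real encoder such as a sentence-transformer:

```python
import math

def embed(text):
    """Toy stand-in for a real encoder: a 26-dim character-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs
        self.hits = 0      # telemetry counters
        self.misses = 0

    def get(self, query):
        """Return the cached response if the nearest entry clears the threshold."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.threshold:
            self.hits += 1  # cache hit: no model call needed
            return best_response
        self.misses += 1    # cache miss: caller invokes the LLM, then put()
        return None

    def put(self, query, response):
        """Store a model response alongside the query's embedding."""
        self.entries.append((embed(query), response))
```

With a real encoder, a paraphrase like "Can I end my plan early?" would land above the threshold; the toy embedding only guarantees hits for near-identical strings. A production version would also replace the linear scan with a vector index.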
Semantic Caching — Similar Queries Hit Cache

- Query 1 (first time): "How do I cancel my subscription?" → LLM call, 450ms → store embedding + response
- Query 2 (similar): "Can I end my plan early?" → cache HIT (cosine sim > 0.95), 8ms — no LLM call
Real-World Example
A help center chatbot implements semantic caching with a 0.94 cosine similarity threshold. Over one week, cache analytics show that 38% of all queries hit the cache; variations of the same 50 core questions account for the majority of traffic. Average response latency drops from 1,800ms to 45ms for cache hits, and LLM API costs fall by $1,200/month. The similarity threshold is tuned based on customer feedback: set too low, it returns wrong cached answers; set too high, it yields few cache hits.
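That tuning trade-off can be made concrete with a small threshold sweep over cache analytics. The scores and correctness labels below are invented for illustration; in practice they would come from logged nearest-entry similarities and human review of whether the cached answer was right:

```python
# Hypothetical analytics: (similarity of a query to its nearest cached
# entry, whether the cached answer would actually have been correct).
observed = [
    (0.99, True), (0.97, True), (0.96, True), (0.95, True),
    (0.93, True), (0.91, False), (0.89, False), (0.85, False),
]

def evaluate(threshold):
    """Hit rate and number of wrong answers served at a given threshold."""
    hits = [ok for score, ok in observed if score >= threshold]
    hit_rate = len(hits) / len(observed)
    false_hits = sum(1 for ok in hits if not ok)
    return hit_rate, false_hits

for t in (0.90, 0.94, 0.97):
    print(t, evaluate(t))
```

In this toy data, 0.90 maximizes hit rate but serves a wrong cached answer, while 0.97 is safe but wastes most repeat traffic; 0.94 captures the correct hits with no false ones, which mirrors how the help center example lands on its threshold.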
Common Mistakes
- ✕ Setting the similarity threshold too low, returning semantically related but factually different answers to questions that require distinct responses
- ✕ Not implementing cache invalidation when source documents change, serving stale cached answers after knowledge base updates
- ✕ Caching responses to personalized queries (account-specific questions) that should never be shared across users
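One way to guard against the last two mistakes is to tag each cached entry with the knowledge-base documents its answer was grounded in, and to refuse to cache personalized answers at all. A sketch, with illustrative names rather than any library's API:

```python
class TaggedCache:
    """Cache entries record their source documents so that a knowledge-base
    update can evict exactly the entries it made stale."""

    def __init__(self):
        self.entries = {}  # query -> (response, set of source doc ids)

    def put(self, query, response, source_ids, personalized=False):
        if personalized:
            return  # never share account-specific answers across users
        self.entries[query] = (response, set(source_ids))

    def invalidate(self, doc_id):
        """Evict every entry grounded in the changed document; return count."""
        stale = [q for q, (_, ids) in self.entries.items() if doc_id in ids]
        for q in stale:
            del self.entries[q]
        return len(stale)
```

A fuller implementation would combine this with the similarity lookup; the point here is only that invalidation needs a mapping from documents to cached answers, and that the personalization check belongs at write time.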
Related Terms
Inference Latency
Inference latency is the time between submitting an input to a deployed AI model and receiving the complete output — typically measured in milliseconds for classification models and seconds for large language models — directly impacting user experience and system design.
AI Cost Optimization
AI cost optimization encompasses techniques to reduce the compute, storage, and API expenses of AI systems—through model selection, caching, batching, quantization, and architecture decisions—making AI economically sustainable at scale.
Model Serving
Model serving is the infrastructure that hosts trained ML models and exposes them as APIs, handling prediction requests in production with the latency, throughput, and reliability requirements of real applications.
API Gateway
An API gateway is a managed entry point that sits in front of AI model serving endpoints, handling authentication, rate limiting, request routing, load balancing, and monitoring for all incoming API traffic.
Rate Limiting
Rate limiting is a technique for controlling how many API requests a client can make within a given time window, preventing abuse, ensuring fair resource distribution, and protecting AI model serving infrastructure from being overwhelmed.