Semantic Caching
Definition
Traditional caching returns stored results only for exact duplicate inputs, which makes it useless for natural language queries where users phrase the same question in countless ways. Semantic caching converts queries to embeddings, stores them in a vector database alongside their model responses, and checks incoming queries for vector similarity to cached entries. Queries whose cosine similarity to a cached query exceeds a threshold return the cached response without invoking the model. Frameworks like GPTCache, LangChain's semantic cache, and Redis's RedisVL implement semantic caching patterns.
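The similarity check at the heart of this pattern fits in a few lines of plain Python. The three-dimensional vectors and the 0.95 threshold below are made-up stand-ins for real embedding output, which typically has hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.95  # illustrative; real systems tune this per workload

# Invented vectors: two paraphrases point in nearly the same direction,
# an unrelated query does not.
q_cached  = [0.12, 0.80, 0.55]  # "How do I cancel my subscription?"
q_similar = [0.10, 0.82, 0.53]  # "Can I end my plan early?"
q_other   = [0.90, 0.05, 0.10]  # "What are your business hours?"

print(cosine_similarity(q_cached, q_similar) >= THRESHOLD)  # True  -> serve from cache
print(cosine_similarity(q_cached, q_other) >= THRESHOLD)    # False -> call the model
```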
Why It Matters
Semantic caching can eliminate 20-40% of LLM API calls for customer-facing applications where users frequently ask similar questions. For a customer support chatbot, hundreds of users may ask variations of 'how do I cancel my subscription?' — semantic caching serves all these variations from a single cached response, drastically reducing LLM inference costs and response latency. Caching is especially valuable for knowledge base Q&A where the answer space is bounded and questions repeat across user sessions.
How It Works
When a query arrives, it is converted to an embedding using the same encoder model used to populate the cache. A vector similarity search finds the nearest cached entry. If the similarity score exceeds the threshold (typically 0.92-0.97 cosine similarity), the cached response is returned and the result is marked as a cache hit for telemetry. Cache misses proceed to the LLM; their responses are stored in the cache with the query embedding. Cache invalidation removes entries when the underlying knowledge base changes.
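The lookup/store loop above can be sketched as a minimal in-memory cache. The class and method names here are illustrative, not any library's API, and the character-frequency `embed` function is a toy stand-in for a real encoder such as a sentence-transformer:

```python
import math

def embed(text):
    """Toy stand-in for a real encoder: a 26-dim character-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs
        self.hits = 0      # telemetry counters
        self.misses = 0

    def get(self, query):
        """Return the cached response if the nearest entry clears the threshold."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.threshold:
            self.hits += 1  # cache hit: no model call needed
            return best_response
        self.misses += 1    # cache miss: caller invokes the LLM, then put()
        return None

    def put(self, query, response):
        """Store a model response alongside the query's embedding."""
        self.entries.append((embed(query), response))
```

With a real encoder, a paraphrase like "Can I end my plan early?" would land above the threshold; the toy embedding only guarantees hits for near-identical strings. A production version would also replace the linear scan with a vector index.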
Semantic Caching — Similar Queries Hit Cache

- Query 1 (first time): "How do I cancel my subscription?" → LLM call, 450ms → store embedding + response
- Query 2 (similar): "Can I end my plan early?" → cache HIT (cosine sim > 0.95), 8ms — no LLM call
Real-World Example
A help center chatbot implements semantic caching with a 0.94 cosine similarity threshold. Over one week, cache analytics show that 38% of all queries hit the cache; variations of the same 50 core questions account for the majority of traffic. Average response latency drops from 1,800ms to 45ms for cache hits, and LLM API costs fall by $1,200/month. The similarity threshold is tuned based on customer feedback: set too low, it returns wrong cached answers; set too high, it yields few cache hits.
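That tuning trade-off can be made concrete with a small threshold sweep over cache analytics. The scores and correctness labels below are invented for illustration; in practice they would come from logged nearest-entry similarities and human review of whether the cached answer was right:

```python
# Hypothetical analytics: (similarity of a query to its nearest cached
# entry, whether the cached answer would actually have been correct).
observed = [
    (0.99, True), (0.97, True), (0.96, True), (0.95, True),
    (0.93, True), (0.91, False), (0.89, False), (0.85, False),
]

def evaluate(threshold):
    """Hit rate and number of wrong answers served at a given threshold."""
    hits = [ok for score, ok in observed if score >= threshold]
    hit_rate = len(hits) / len(observed)
    false_hits = sum(1 for ok in hits if not ok)
    return hit_rate, false_hits

for t in (0.90, 0.94, 0.97):
    print(t, evaluate(t))
```

In this toy data, 0.90 maximizes hit rate but serves a wrong cached answer, while 0.97 is safe but wastes most repeat traffic; 0.94 captures the correct hits with no false ones, which mirrors how the help center example lands on its threshold.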
Common Mistakes
- ✕ Setting the similarity threshold too low, returning semantically related but factually different answers to questions that require distinct responses
- ✕ Not implementing cache invalidation when source documents change, serving stale cached answers after knowledge base updates
- ✕ Caching responses to personalized queries (account-specific questions) that should never be shared across users
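One way to guard against the last two mistakes is to tag each cached entry with the knowledge-base documents its answer was grounded in, and to refuse to cache personalized answers at all. A sketch, with illustrative names rather than any library's API:

```python
class TaggedCache:
    """Cache entries record their source documents so that a knowledge-base
    update can evict exactly the entries it made stale."""

    def __init__(self):
        self.entries = {}  # query -> (response, set of source doc ids)

    def put(self, query, response, source_ids, personalized=False):
        if personalized:
            return  # never share account-specific answers across users
        self.entries[query] = (response, set(source_ids))

    def invalidate(self, doc_id):
        """Evict every entry grounded in the changed document; return count."""
        stale = [q for q, (_, ids) in self.entries.items() if doc_id in ids]
        for q in stale:
            del self.entries[q]
        return len(stale)
```

A fuller implementation would combine this with the similarity lookup; the point here is only that invalidation needs a mapping from documents to cached answers, and that the personalization check belongs at write time.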
Related Terms
Inference Latency
Inference latency is the time between submitting an input to a deployed AI model and receiving the complete output — typically measured in milliseconds for classification models and seconds for large language models — directly impacting user experience and system design.
AI Cost Optimization
AI cost optimization encompasses techniques to reduce the compute, storage, and API expenses of AI systems—through model selection, caching, batching, quantization, and architecture decisions—making AI economically sustainable at scale.
Model Serving
Model serving is the infrastructure that hosts trained ML models and exposes them as APIs, handling prediction requests in production with the latency, throughput, and reliability requirements of real applications.
API Gateway
An API gateway is a managed entry point that sits in front of AI model serving endpoints, handling authentication, rate limiting, request routing, load balancing, and monitoring for all incoming API traffic.
Rate Limiting
Rate limiting is a technique for controlling how many API requests a client can make within a given time window, preventing abuse, ensuring fair resource distribution, and protecting AI model serving infrastructure from being overwhelmed.