Embedding Cache
Definition
Embedding caches exploit the determinism of embedding models: the same text always produces the same vector. By storing a mapping from text (or its hash) to embedding vector, a cache avoids redundant API calls when the same document chunk or query is encountered again. Two types of caching are common: exact-match caching (store hash → vector, hit only on identical text) and semantic caching (store vector → response, hit when a new query is close in embedding space to a cached query). Embedding caches are particularly valuable at index time, where large document corpora may contain duplicate or near-duplicate passages, and at query time for popular queries that many users ask repeatedly.
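The exact-match variant can be sketched in a few lines. This is a minimal illustration, not a production design: `embed_fn` is a stand-in for a real embeddings API client, and the in-memory dict would be Redis or a similar shared store in a real deployment.

```python
import hashlib

def text_key(text: str) -> str:
    # Normalize lightly so trivial whitespace differences still hit the cache.
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

class ExactMatchEmbeddingCache:
    """Exact-match cache: SHA-256(text) -> embedding vector."""

    def __init__(self, embed_fn):
        self._embed = embed_fn   # stand-in for an embeddings API call
        self._store = {}         # in-memory; use Redis/Memcached in production
        self.hits = 0
        self.misses = 0

    def get_embedding(self, text: str):
        key = text_key(text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed(text)   # only misses pay for an API call
        self._store[key] = vector
        return vector
```

Identical text hits on the second request, so the (paid) embedding call runs exactly once per distinct input.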
Why It Matters
Embedding API calls represent a significant fraction of RAG system costs in high-volume deployments. Paying to embed the same frequently asked question hundreds of times per day is pure waste. For 99helpers chatbots deployed across thousands of businesses, an embedding cache can cut embedding costs by 30-60% on typical support workloads where a small set of common queries dominates traffic. Caches also reduce latency for popular queries, since a cache hit avoids an API round-trip entirely. At indexing time, caching prevents re-embedding unchanged document chunks during incremental updates, making large knowledge base refreshes significantly cheaper.
How It Works
At index time: hash each text chunk, check the cache for existing embeddings, embed only cache misses, then store new embeddings in the cache. The cache can be an in-memory dict for small deployments, Redis for production, or a dedicated caching layer like LangChain's CacheBackedEmbeddings. At query time, hash the query and check the cache before calling the embedding API. For semantic caching, use a secondary vector index storing recent query embeddings; when a new query is within a cosine similarity threshold of a cached query, return the cached response directly. GPTCache is an open-source library purpose-built for LLM and embedding response caching.
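The semantic-caching side can be sketched as follows. This is a simplification: it uses a brute-force cosine-similarity scan where a real system would use the secondary vector index mentioned above, and the 0.95 threshold is an illustrative choice, not a recommendation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Return a stored response when a new query's embedding is close
    enough to a previously cached query's embedding."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response) pairs;
                            # a real system would use a vector index here

    def lookup(self, query_embedding):
        best, best_sim = None, 0.0
        for emb, response in self._entries:
            sim = cosine(query_embedding, emb)
            if sim > best_sim:
                best, best_sim = response, sim
        if best_sim >= self.threshold:
            return best      # semantic hit: reuse the cached response
        return None          # miss: caller computes fresh, then calls add()

    def add(self, query_embedding, response):
        self._entries.append((query_embedding, response))
```

Threshold tuning matters: set it too low and distinct questions get each other's answers; too high and near-duplicate phrasings never hit.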
Embedding Cache — Hit vs Miss Flow
Cache hit (~0 ms): query arrives → hash lookup (SHA-256 of the query) → found in cache → return the cached vector instantly → continue to retrieval.
Cache miss (~150 ms): query arrives → hash lookup (SHA-256 of the query) → not in cache → call the embedding model (~150 ms latency) → store the result in the cache for future hits.
Cache performance metrics: 73% hit rate, 110 ms average latency saved per query, 150 ms miss latency.
At a 73% hit rate and 1,000 queries/hour, the cache saves ~730 embedding model calls per hour and reduces average query latency by ~110 ms.
Real-World Example
A 99helpers customer's chatbot receives 10,000 queries per day, but analysis shows that the top 200 distinct queries account for 60% of traffic (typical power-law distribution for support queries). Without an embedding cache, every query hits the OpenAI Embeddings API. With an embedding cache using Redis, the 200 most common queries are cached after their first occurrence. The next day, 6,000 of the 10,000 queries hit the cache, reducing embedding API calls by 60% and cutting embedding costs from $12/day to $4.80/day.
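The arithmetic behind this example, written out with the figures from the scenario above:

```python
# Figures from the example: 10,000 queries/day, top 200 queries = 60% of
# traffic, $12/day in embedding API costs without a cache.
queries_per_day = 10_000
cached_traffic_share = 0.60
cost_per_day_uncached = 12.00  # USD

cache_hits = int(queries_per_day * cached_traffic_share)   # queries served from cache
api_calls = queries_per_day - cache_hits                   # queries still embedded via API
cost_with_cache = cost_per_day_uncached * (1 - cached_traffic_share)
```

With the top 200 queries cached, 6,000 of the 10,000 daily queries skip the API, leaving $4.80/day in embedding costs.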
Common Mistakes
- ✕ Caching embeddings without versioning the embedding model—when the model changes, all cached embeddings are stale and must be invalidated.
- ✕ Using an in-memory cache in a multi-process or distributed deployment—cache misses are not shared across instances without a shared cache (Redis, Memcached).
- ✕ Forgetting to set TTL (time-to-live) on cached query embeddings—stale query caches can return irrelevant results if the knowledge base has been updated.
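One way to guard against the first and third mistakes is to fold the model identifier into the cache key and attach a TTL to each entry. The sketch below is illustrative: the model name and one-week TTL are assumptions, not prescriptions.

```python
import hashlib
import time

MODEL_VERSION = "text-embedding-3-small"  # assumed model name; version the key!
TTL_SECONDS = 7 * 24 * 3600               # illustrative one-week expiry

def cache_key(text: str, model: str = MODEL_VERSION) -> str:
    # Including the model identifier means a model upgrade naturally
    # invalidates old entries instead of silently serving stale vectors.
    payload = f"{model}:{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

class TTLCache:
    """Dict-backed cache whose entries expire after ttl seconds."""

    def __init__(self, ttl: float = TTL_SECONDS):
        self._ttl = ttl
        self._store = {}  # key -> (stored_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self._ttl:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic(), value)
```

With Redis, the same ideas map to prefixing keys with the model name and setting a per-key expiry, so a shared cache stays correct across instances.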
Related Terms
Embedding Model
An embedding model is a machine learning model that converts text (or other data) into dense numerical vectors that capture semantic meaning, enabling similarity search and serving as the foundation of RAG retrieval systems.
Retrieval Pipeline
A retrieval pipeline is the online query-time workflow that transforms a user question into a ranked set of relevant document chunks, serving as the information retrieval stage of a RAG system.
Indexing Pipeline
An indexing pipeline is the offline data processing workflow that transforms raw documents into searchable vector embeddings, running during knowledge base setup and when content is updated.
Vector Database
A vector database is a purpose-built data store optimized for storing, indexing, and querying high-dimensional numerical vectors (embeddings), enabling fast similarity search across large collections of embedded documents.
RAG Pipeline
A RAG pipeline is the end-to-end sequence of components—ingestion, chunking, embedding, storage, retrieval, and generation—that transforms raw documents into AI-generated answers grounded in a knowledge base.