Retrieval-Augmented Generation (RAG)

Embedding Cache

Definition

Embedding caches exploit the determinism of embedding models: the same text always produces the same vector. By storing a mapping from text (or its hash) to embedding vector, a cache avoids redundant API calls when the same document chunk or query is encountered again. Two types of caching are common: exact-match caching (store hash → vector, hit only on identical text) and semantic caching (store vector → response, hit when a new query is close in embedding space to a cached query). Embedding caches are particularly valuable at index time, where large document corpora may contain duplicate or near-duplicate passages, and at query time for popular queries that many users ask repeatedly.
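The exact-match variant can be sketched in a few lines. This is an illustrative, minimal implementation, not code from any particular library; the `embed_fn` callable stands in for whatever embeddings API the deployment uses:

```python
import hashlib

class ExactMatchEmbeddingCache:
    """Maps SHA-256(text) -> embedding vector; hits only on identical text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # e.g. a call to an embeddings API
        self.store = {}           # in production this would be Redis or similar

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            # Cache miss: call the embedding model exactly once for this text
            self.store[key] = self.embed_fn(text)
        return self.store[key]
```

Because embedding models are deterministic, returning the stored vector on a hit is exactly equivalent to re-calling the API, minus the cost and latency.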

Why It Matters

Embedding API calls represent a significant fraction of RAG system costs in high-volume deployments. Paying to embed the same frequently asked question hundreds of times per day is pure waste. For 99helpers chatbots deployed across thousands of businesses, an embedding cache can cut embedding costs by 30-60% on typical support workloads where a small set of common queries dominates traffic. Caches also reduce latency for popular queries, since a cache hit avoids an API round-trip entirely. At indexing time, caching prevents re-embedding unchanged document chunks during incremental updates, making large knowledge base refreshes significantly cheaper.

How It Works

At index time: hash each text chunk, check the cache for existing embeddings, embed only cache misses, then store new embeddings in the cache. The cache can be an in-memory dict for small deployments, Redis for production, or a dedicated caching layer like LangChain's CacheBackedEmbeddings. At query time, hash the query and check the cache before calling the embedding API. For semantic caching, use a secondary vector index storing recent query embeddings; when a new query is within a cosine similarity threshold of a cached query, return the cached response directly. GPTCache is an open-source library purpose-built for LLM and embedding response caching.
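The semantic-caching step described above can be sketched as a linear scan over recent query embeddings with a cosine-similarity threshold. This is a simplified illustration (a real deployment would use a vector index rather than a list, and the 0.95 threshold is an assumed example value):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticQueryCache:
    """Stores (query embedding, response) pairs; a lookup hits when a new
    query's embedding is within `threshold` cosine similarity of a cached one."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def lookup(self, query_emb):
        best, best_sim = None, self.threshold
        for emb, resp in self.entries:
            sim = cosine(query_emb, emb)
            if sim >= best_sim:       # keep the closest cached query
                best, best_sim = resp, sim
        return best                   # None on a cache miss

    def store(self, query_emb, response):
        self.entries.append((query_emb, response))
```

The threshold is the key tuning knob: too low and semantically different queries return each other's answers; too high and near-duplicate phrasings miss the cache.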

Embedding Cache — Hit vs Miss Flow

Cache hit (~0 ms):
  • Query arrives
  • Hash lookup — SHA-256(query)
  • Found in cache → return cached vector instantly
  • Continue to retrieval

Cache miss (~150 ms):
  • Query arrives
  • Hash lookup — SHA-256(query)
  • Not in cache → call embedding model (~150 ms latency)
  • Store result in cache for future hits
  • Continue to retrieval

Illustrative cache performance metrics: 73% hit rate, 27% miss rate, 150 ms miss latency, ~110 ms average latency saved per query.

At a 73% hit rate and 1,000 queries/hour, the cache saves ~730 embedding model calls per hour and reduces average query latency by ~110 ms.
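The arithmetic behind those figures is straightforward (assuming a hash lookup is effectively free compared to a ~150 ms API call):

```python
hit_rate = 0.73
miss_latency_ms = 150.0
hit_latency_ms = 0.0   # hash lookup treated as ~instant

# Average latency with the cache vs. without it
avg_latency = hit_rate * hit_latency_ms + (1 - hit_rate) * miss_latency_ms
saved_per_query = miss_latency_ms - avg_latency   # ~110 ms

# API calls avoided per hour at 1,000 queries/hour
queries_per_hour = 1000
calls_saved = hit_rate * queries_per_hour         # 730 calls/hour
```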

Real-World Example

A 99helpers customer's chatbot receives 10,000 queries per day, but analysis shows that the top 200 distinct queries account for 60% of traffic (typical power-law distribution for support queries). Without an embedding cache, every query hits the OpenAI Embeddings API. With an embedding cache using Redis, the 200 most common queries are cached after their first occurrence. The next day, 6,000 of the 10,000 queries hit the cache, reducing embedding API calls by 60% and cutting embedding costs from $12/day to $4.80/day.

Common Mistakes

  • Caching embeddings without versioning the embedding model—when the model changes, all cached embeddings are stale and must be invalidated.
  • Using an in-memory cache in a multi-process or distributed deployment—cache misses are not shared across instances without a shared cache (Redis, Memcached).
  • Forgetting to set TTL (time-to-live) on cached query embeddings—stale query caches can return irrelevant results if the knowledge base has been updated.
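The first mistake, missing model versioning, has a simple fix: fold the model identifier into the cache key, so upgrading the model automatically invalidates every old entry. A minimal sketch (the model name and version string are illustrative examples):

```python
import hashlib

EMBEDDING_MODEL = "text-embedding-3-small"  # example model identifier
MODEL_VERSION = "v1"                        # bump on any model change

def cache_key(text: str) -> str:
    # Keys embed the model id and version, so entries cached under an
    # older model can never be returned for the new one.
    raw = f"{EMBEDDING_MODEL}:{MODEL_VERSION}:{text}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()
```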
