Vector Quantization
Definition
Vector quantization reduces the memory footprint of embedding vectors, which are typically stored as 32-bit floats (4 bytes per dimension × 1536 dimensions = 6KB per vector). At scale—millions of vectors—storage becomes expensive. Quantization approximates each vector using a codebook of discrete representations. Product Quantization (PQ) splits the vector into sub-vectors and maps each to the nearest centroid in a pre-trained codebook, reducing storage by 16-64x. Scalar Quantization (SQ8) stores each dimension as an 8-bit integer instead of 32-bit float, reducing storage by 4x with minimal accuracy loss. Binary Quantization converts each dimension to a single bit (0/1), achieving 32x compression with moderate accuracy loss for models that support it.
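The int8 scalar quantization described above can be sketched in a few lines of numpy: each dimension's float32 range is mapped onto the 256 values of an unsigned byte, and dequantization reverses the mapping with a small rounding error. This is a minimal illustration of the idea, not any particular database's implementation.

```python
import numpy as np

def sq8_quantize(vecs: np.ndarray):
    """Scalar-quantize float32 vectors to uint8 using per-dimension min/max ranges."""
    lo = vecs.min(axis=0)
    hi = vecs.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)  # guard against constant dims
    codes = np.round((vecs - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def sq8_dequantize(codes, lo, scale):
    """Reconstruct approximate float32 vectors from the uint8 codes."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
vecs = rng.standard_normal((1000, 1536)).astype(np.float32)
codes, lo, scale = sq8_quantize(vecs)
approx = sq8_dequantize(codes, lo, scale)

print(codes.nbytes / vecs.nbytes)          # 0.25, i.e. the 4x compression of SQ8
print(float(np.abs(vecs - approx).max()))  # worst-case per-dimension rounding error
```

The 4x figure follows directly from storing 1 byte per dimension instead of 4; the reconstruction error is bounded by half a quantization step per dimension.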
Why It Matters
Vector database costs scale with the number of stored vectors and their dimensionality. A 1-million-vector index at 1536 dimensions using float32 requires ~6GB of memory; at 3072 dimensions (text-embedding-3-large), ~12GB. For large 99helpers deployments indexing tens of millions of knowledge base chunks, raw float storage becomes prohibitively expensive. Vector quantization cuts memory requirements by 4-64x, directly reducing vector database infrastructure costs while letting larger indexes fit in RAM for fast retrieval. The tradeoff, slightly lower retrieval accuracy, is typically acceptable when a reranker refines the approximate nearest neighbor results.
How It Works
Pinecone supports int8 scalar quantization natively, and Weaviate supports PQ and SQ compression on its HNSW indexes. For FAISS (used locally), IndexIVFPQ combines inverted-file indexing with product quantization: `index = faiss.index_factory(1536, 'IVF1024,PQ48')` creates an index with 1024 Voronoi cells and 48 product-quantization sub-vectors, shrinking each vector from 6,144 bytes to ~48 bytes (128x compression). Evaluate the quality tradeoff by comparing recall@10 with and without quantization on a representative query set: a recall drop from 0.95 to 0.90 at 16x compression is often acceptable for large-scale deployments.
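The recall@10 evaluation just described can be sketched with brute-force search in numpy, no vector database required: compute the true top-10 neighbors on the float32 corpus, recompute them on a quantized copy, and measure the overlap. The corpus size, dimensionality, and quantization scheme here are toy assumptions chosen so the example runs quickly.

```python
import numpy as np

rng = np.random.default_rng(1)
db = rng.standard_normal((2000, 64)).astype(np.float32)      # toy corpus (small dims for speed)
queries = rng.standard_normal((100, 64)).astype(np.float32)

# Per-dimension int8 scalar quantization of the corpus, then dequantize for search.
lo, hi = db.min(axis=0), db.max(axis=0)
scale = (hi - lo) / 255.0
db_q = np.round((db - lo) / scale).astype(np.uint8).astype(np.float32) * scale + lo

def top_k(index: np.ndarray, q: np.ndarray, k: int = 10) -> np.ndarray:
    """Brute-force k nearest neighbors by squared L2 distance."""
    d = ((index[None, :, :] - q[:, None, :]) ** 2).sum(-1)
    return np.argsort(d, axis=1)[:, :k]

exact = top_k(db, queries)     # ground-truth neighbors on float32 vectors
approx = top_k(db_q, queries)  # neighbors under quantized storage

# recall@10: fraction of the true top-10 ids recovered by the quantized index.
recall = np.mean([len(set(e) & set(a)) / 10 for e, a in zip(exact, approx)])
print(f"recall@10 with int8 quantization: {recall:.3f}")
```

The same harness works for any compression scheme: swap in a PQ or binary reconstruction of the corpus and compare the resulting recall against your accuracy budget.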
Vector Quantization — Precision vs Size vs Speed

| Format  | Precision        | Storage for 1M vectors (1,536 dims) | Search speed vs float32     | Accuracy     |
|---------|------------------|-------------------------------------|-----------------------------|--------------|
| float32 | Full precision   | ~6.1 GB                             | Baseline                    | Baseline     |
| int8    | 8-bit quantized  | ~1.5 GB (4x smaller)                | Faster (integer SIMD)       | Minimal loss |
| binary  | Binary quantized | ~192 MB (32x smaller)               | Fastest (Hamming distance)  | ~10% drop    |

Binary quantization shrinks storage 32x with only about a 10% accuracy drop, often an acceptable trade.
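The storage figures above follow directly from bytes-per-dimension arithmetic, which a few lines verify:

```python
DIMS = 1536
N = 1_000_000

bytes_per_vec = {
    "float32": DIMS * 4,   # 4 bytes per dimension
    "int8":    DIMS * 1,   # 1 byte per dimension
    "binary":  DIMS // 8,  # 1 bit per dimension
}

for fmt, b in bytes_per_vec.items():
    print(f"{fmt:>7}: {b:>5} B/vector, {N * b / 1e9:.3f} GB for 1M vectors")
# float32: 6144 B/vector, 6.144 GB; int8: 1536 B, 1.536 GB; binary: 192 B, 0.192 GB
```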
Real-World Example
A 99helpers deployment indexes 50 million knowledge base chunks at 1536 dimensions. Without quantization, the vector index requires 300GB of RAM—impractical for a single server. With int8 scalar quantization (4x compression), memory drops to 75GB, fitting in a cost-effective high-memory instance. Retrieval recall drops from 0.94 to 0.91, but since a cross-encoder reranker refines the top-50 approximate results to top-5, the final answer quality impact is negligible. Monthly infrastructure cost drops from $8,400 (3x 100GB RAM servers) to $2,200 (1x 100GB RAM server).
Common Mistakes
- ✕ Applying aggressive quantization (binary, or PQ with few bytes per vector) without measuring recall@K before and after — quality impact varies significantly by dataset and embedding model.
- ✕ Quantizing without a reranker — quantization's storage and speed benefits come at the cost of lower recall, and a reranker compensates by refining the approximate candidate set.
- ✕ Forgetting that PQ requires training a codebook on representative data — a codebook trained on unrepresentative vectors yields poor reconstructions and lower recall.
Related Terms
Vector Database
A vector database is a purpose-built data store optimized for storing, indexing, and querying high-dimensional numerical vectors (embeddings), enabling fast similarity search across large collections of embedded documents.
Approximate Nearest Neighbor
Approximate Nearest Neighbor (ANN) search finds vectors that are close to a query vector with high probability but without guaranteeing exactness, enabling fast similarity search across millions of vectors at the cost of small accuracy tradeoffs.
Embedding Model
An embedding model is a machine learning model that converts text (or other data) into dense numerical vectors that capture semantic meaning, enabling similarity search and serving as the foundation of RAG retrieval systems.
Retrieval Precision
Retrieval precision measures the fraction of retrieved documents that are actually relevant to the query. In RAG systems, high precision means the context passed to the LLM contains mostly useful information rather than noise.
Pinecone
Pinecone is a fully managed vector database service designed for production machine learning applications, providing high-performance similarity search with simple APIs and automatic scaling for RAG and semantic search systems.