Vector Database
Definition
A vector database stores embedding vectors — the numerical representations of text, images, or other data produced by machine learning models — alongside the original content and associated metadata. Its defining capability is approximate nearest neighbor (ANN) search: given a query vector, it efficiently finds the k stored vectors most similar to it, using metrics like cosine similarity or dot product. Vector databases are the retrieval engine in RAG architectures, enabling AI systems to semantically search millions of document chunks in milliseconds. Popular vector databases include Pinecone, Weaviate, Chroma, Milvus, Qdrant, and the pgvector extension for PostgreSQL.
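The core operation — top-k search by cosine similarity — can be sketched as an exact, brute-force scan, which is exactly what ANN indexes exist to approximate. A minimal pure-Python sketch; the store contents and IDs are illustrative:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=5):
    # Exact (brute-force) nearest neighbors: score every stored vector.
    # Real vector databases avoid this full scan with ANN indexes.
    scored = [(vec_id, cosine_similarity(query, vec))
              for vec_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "store"; real embedding vectors have hundreds
# or thousands of dimensions.
store = {
    "doc-1": [1.0, 0.0, 0.0],
    "doc-2": [0.9, 0.1, 0.0],
    "doc-3": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], store, k=2))  # doc-1 and doc-2 rank highest
```

The brute-force scan is O(n) per query, which is why production databases replace it with index structures like HNSW.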
Why It Matters
Vector databases are the infrastructure that makes semantic search and RAG possible at production scale. Traditional databases (SQL, NoSQL) are optimized for exact lookup and keyword search — they cannot find semantically similar content across millions of documents in milliseconds. Vector databases solve this with specialized indexing algorithms (HNSW, IVF, LSH) that enable approximate nearest neighbor search at scale. For AI chatbot deployments, the vector database is where the knowledge base is stored in searchable form — its performance (latency, recall, cost) directly impacts chatbot response quality and speed.
How It Works
Vector databases work by organizing embedding vectors in specialized index structures that enable fast similarity search without comparing every vector against the query. The HNSW (Hierarchical Navigable Small World) algorithm, used by most modern vector databases, organizes vectors in a graph structure that allows O(log n) search complexity rather than O(n). When a query vector arrives, the index traverses this graph to find approximate nearest neighbors efficiently. Metadata filters can be applied alongside vector search to restrict results (e.g., 'find semantically similar chunks, but only from articles in the billing category'). Vector databases also handle upserts (adding or updating vectors), deletions, and namespace management for multi-tenant applications.
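The metadata-filter step can be illustrated with a brute-force sketch (a real database applies the filter inside or alongside the ANN index; the record layout and the billing/shipping categories here are illustrative):

```python
def filtered_query(query, records, k=5, category=None):
    # records: list of dicts with "id", "vector", "metadata".
    # Pre-filter by metadata, then rank the survivors by cosine similarity.
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    candidates = [r for r in records
                  if category is None or r["metadata"].get("category") == category]
    ranked = sorted(candidates, key=lambda r: cosine(query, r["vector"]),
                    reverse=True)
    return [r["id"] for r in ranked[:k]]

records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"category": "billing"}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"category": "shipping"}},
    {"id": "c", "vector": [0.0, 1.0], "metadata": {"category": "billing"}},
]
# "b" is the second-closest vector overall, but the filter excludes it.
print(filtered_query([1.0, 0.0], records, k=2, category="billing"))  # ['a', 'c']
```

Note how the filter changes the result set: without it, the nearest two vectors would be "a" and "b"; restricted to the billing category, "c" is returned instead.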
Vector DB vs SQL DB — Storage and Query Model
- SQL database: exact lookup by key, e.g. SELECT * FROM docs WHERE id = 42, which returns one matching row or none.
- Vector database: similarity search: embed the input into a query vector, then return the top-5 stored vectors by cosine similarity. Example: across 1M document vectors, a query completes in under 10 ms and returns the top-5 results.
Core operations
- upsert: insert a new vector or update an existing one
- query: ANN similarity search
- delete: remove a vector by ID
- filter: pre-filter candidates by metadata before similarity ranking
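The four operations can be exercised end to end with a toy in-memory store — a sketch only, not a production index (a real database backs query with an ANN structure rather than a full scan):

```python
class ToyVectorStore:
    """Minimal in-memory store exercising upsert, query, delete, and filter."""

    def __init__(self):
        self._rows = {}  # id -> (vector, metadata)

    def upsert(self, vec_id, vector, metadata=None):
        # Insert a new vector, or overwrite an existing one with the same ID.
        self._rows[vec_id] = (vector, metadata or {})

    def delete(self, vec_id):
        # Remove a vector by ID (no-op if the ID is absent).
        self._rows.pop(vec_id, None)

    def query(self, query_vector, k=5, where=None):
        # `where` is a metadata pre-filter: {"key": required_value}.
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(y * y for y in b) ** 0.5
            return dot / (na * nb)

        candidates = [
            (vec_id, cosine(query_vector, vec))
            for vec_id, (vec, meta) in self._rows.items()
            if not where or all(meta.get(key) == val for key, val in where.items())
        ]
        return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:k]


store = ToyVectorStore()
store.upsert("a", [1.0, 0.0], {"category": "billing"})
store.upsert("b", [0.0, 1.0], {"category": "shipping"})
store.upsert("a", [0.8, 0.6], {"category": "billing"})  # upsert overwrites "a"
store.delete("b")
print(store.query([1.0, 0.0], k=1))  # "a" is the only remaining vector
```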
Real-World Example
A 99helpers customer builds their AI chatbot knowledge base in a vector database with 15,000 document chunks across 500 knowledge base articles. When a user asks a question, the system embeds the query and searches the vector database for the 5 most semantically similar chunks — completing the search in under 20 milliseconds. The retrieved chunks are passed to the LLM as context. The entire retrieval-to-response latency is under 2 seconds, meeting the real-time chat experience requirement.
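The retrieve-then-generate flow in that example can be sketched end to end. Here embed is a crude bag-of-words stand-in for a real embedding model, and the chunks and question are illustrative:

```python
import re
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    na = sum(c * c for c in a.values()) ** 0.5
    nb = sum(c * c for c in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k=2):
    # Embed the question, rank chunks by similarity, return the top k.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
]
question = "How long do refunds take?"
context = "\n".join(retrieve(question, chunks, k=1))
# The retrieved context is then passed to the LLM as part of the prompt.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In production, the same shape holds at 15,000 chunks: only the embedding model and the ANN-indexed store change.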
Common Mistakes
- ✕ Choosing a vector database based on benchmark performance alone without considering operational factors (managed vs. self-hosted, cost, developer experience)
- ✕ Not implementing metadata filtering — filtering by category, date, or document source dramatically improves retrieval precision by reducing the candidate set
- ✕ Embedding full documents as single vectors instead of chunked passages — long documents lose granular semantic meaning; chunk before embedding
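The chunk-before-embedding point can be illustrated with a simple fixed-size splitter with overlap. The window and overlap sizes below are arbitrary; production pipelines often split on sentence or section boundaries instead:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    # Split a long document into overlapping fixed-size character windows,
    # so each chunk keeps enough local context to embed on its own.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "A" * 500  # placeholder for a real document
chunks = chunk_text(doc, chunk_size=200, overlap=50)
print(len(chunks))  # 4 overlapping windows cover the 500-character document
```

Each chunk — not the whole document — is then embedded and upserted, so retrieval can surface the specific passage that matches a query.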
Related Terms
Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language model responses by first retrieving relevant documents from an external knowledge base and then using that retrieved content as context when generating an answer.
Embedding Model
An embedding model is a machine learning model that converts text (or other data) into dense numerical vectors that capture semantic meaning, enabling similarity search and serving as the foundation of RAG retrieval systems.
Approximate Nearest Neighbor
Approximate Nearest Neighbor (ANN) search finds vectors that are close to a query vector with high probability but without guaranteeing exactness, enabling fast similarity search across millions of vectors at the cost of small accuracy tradeoffs.
Cosine Similarity
Cosine similarity is a mathematical metric that measures the similarity between two vectors by calculating the cosine of the angle between them. It produces a score from -1 to 1, where 1 indicates identical direction, and is widely used in RAG and semantic search.
Indexing Pipeline
An indexing pipeline is the offline data processing workflow that transforms raw documents into searchable vector embeddings, running during knowledge base setup and when content is updated.