Sentence Similarity
Definition
Sentence similarity is the NLP task of quantifying the semantic closeness between two text passages. Approaches range from lexical overlap metrics (Jaccard similarity, BLEU) to embedding-based similarity (cosine similarity between sentence vectors from sentence transformers) to cross-encoder models that score pairs directly. The Semantic Textual Similarity (STS) benchmark evaluates models against human-annotated similarity scores from 0 (unrelated) to 5 (equivalent meaning). Strong sentence-transformer models reach Pearson correlations of roughly 0.85–0.90 with human judgments on STS benchmarks. Asymmetric semantic search (a short query against a long document) requires different models than symmetric sentence similarity.
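The contrast between lexical overlap and embedding-based similarity can be sketched in a few lines. This is a minimal illustration, not a production implementation: the 4-dimensional vectors below are hypothetical stand-ins for real sentence-transformer embeddings (which typically have hundreds of dimensions).

```python
import math

def jaccard(a: str, b: str) -> float:
    """Lexical overlap: |intersection| / |union| of the two word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cosine(u, v) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical low-dimensional embeddings standing in for model output
vec_a = [0.8, 0.1, 0.3, 0.5]
vec_b = [0.7, 0.2, 0.4, 0.4]

print(jaccard("the cat sat", "a cat sat down"))  # 0.4 (2 shared words of 5 total)
print(round(cosine(vec_a, vec_b), 3))
```

Jaccard sees only surface word overlap; cosine similarity over embeddings is what lets two sentences with no shared words still score as close in meaning.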
Why It Matters
Sentence similarity is the core primitive underlying semantic search, duplicate question detection, answer relevance scoring, and chatbot response evaluation. In RAG systems, the relevance of retrieved chunks is measured by their semantic similarity to the query. Knowledge base deduplication uses sentence similarity to identify redundant articles covering the same topic. For AI response quality evaluation, sentence similarity between model outputs and reference answers provides an automated relevance metric. High-quality sentence similarity models are among the most practically useful NLP components.
How It Works
Bi-encoder sentence similarity uses separate encoder passes for each sentence, producing independent vector representations. Cosine similarity between these vectors provides a similarity score scalable to millions of pairs via vector databases and ANN search. Cross-encoder similarity processes both sentences jointly, producing higher-quality scores but at the cost of O(n) forward passes per query over n candidates, since pair scores cannot be precomputed or indexed. The sentence-transformers library provides bi-encoder models (all-MiniLM-L6-v2, all-mpnet-base-v2) optimized for both quality and speed. Training uses natural language inference data (entailment pairs as positive examples) and hard negative mining for contrastive learning.
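The bi-encoder retrieval pattern above can be sketched with NumPy. The random vectors here are toy stand-ins for precomputed document embeddings (the 384 dimensions match all-MiniLM-L6-v2's output size); the point is the mechanics: normalize once, then a single matrix-vector product scores every document against the query.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for bi-encoder output: 1,000 precomputed document vectors
# plus one query vector. Real systems would store these in a vector database.
docs = rng.normal(size=(1000, 384))
query = rng.normal(size=384)

# Normalize once so the dot product equals cosine similarity.
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# One matrix-vector product scores the query against all documents --
# this is what makes bi-encoders scalable, unlike cross-encoders,
# which would need 1,000 separate forward passes here.
scores = docs @ query
top5 = np.argsort(scores)[::-1][:5]
print(top5, scores[top5])
```

In production the brute-force `argsort` is replaced by an ANN index, but the precompute-then-dot-product structure is the same.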
[Figure: Sentence Similarity — Cosine Similarity Scores]
Real-World Example
A job matching platform uses sentence similarity to match job seeker profiles to job descriptions. Each resume bullet and job requirement is embedded with a sentence transformer; matching scores between all (resume item, job requirement) pairs produce a compatibility matrix. Job postings with average similarity above 0.75 across requirements are surfaced as strong matches. This semantic matching correctly identifies that 'led team of 8 engineers' is highly similar to 'people management experience required' (0.81 cosine similarity) despite zero keyword overlap—recovering matches that keyword-based systems miss.
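A sketch of the compatibility-matrix step, assuming one plausible aggregation (best-matching resume bullet per requirement, averaged across requirements); the similarity values are hypothetical and would come from a sentence-transformer model in the real system.

```python
import numpy as np

# Hypothetical cosine similarities between 3 resume bullets (rows) and
# 4 job requirements (columns) -- the compatibility matrix.
sim = np.array([
    [0.81, 0.42, 0.67, 0.79],   # e.g. "led team of 8 engineers"
    [0.35, 0.88, 0.71, 0.60],
    [0.55, 0.49, 0.83, 0.77],
])

# For each requirement, take its best-matching bullet, then average
# across requirements to score the (candidate, job) pair.
per_requirement = sim.max(axis=0)
job_score = per_requirement.mean()

print(round(job_score, 3), job_score > 0.75)  # surfaced as a strong match
```

Other aggregations (mean over all pairs, top-k per requirement) are equally valid; the threshold and the pooling rule are tuning choices.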
Common Mistakes
- ✕Using symmetric similarity for asymmetric tasks—short query vs. long document similarity requires asymmetric models, not symmetric STS models
- ✕Comparing embeddings from different models—cosine similarity is only meaningful between vectors from the same embedding space
- ✕Treating high similarity as guaranteed semantic equivalence—'not guilty' and 'guilty' have high surface similarity but opposite meanings; similarity is not entailment
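The third mistake is easy to demonstrate with a lexical sketch: 'not guilty' and 'guilty' share almost all their surface tokens, so overlap-based scores are high even though the meanings are opposite. (Embedding models are better at negation but can still assign such pairs moderate similarity — similarity is not entailment.)

```python
def jaccard(a: str, b: str) -> float:
    """Lexical overlap: |intersection| / |union| of the two word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# 4 shared words out of 5 distinct words -> 0.8, despite opposite meanings.
print(jaccard("the verdict was not guilty", "the verdict was guilty"))  # 0.8
```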
Related Terms
Sentence Transformers
Sentence transformers are neural models that produce fixed-size semantic embeddings for entire sentences, enabling efficient semantic similarity search, clustering, and retrieval by representing meaning as comparable vectors.
Paraphrase Detection
Paraphrase detection determines whether two text passages express the same meaning using different words, enabling duplicate question detection, semantic search deduplication, and FAQ consolidation.
Word Embeddings
Word embeddings are dense vector representations of words in a continuous numerical space where semantically similar words are positioned close together, enabling machines to understand word meaning through geometry.
Semantic Search
Semantic search finds knowledge base articles based on the meaning of a query — not just the words used. By converting both queries and documents into vector embeddings, it identifies conceptually similar content even when a query uses different terminology than the articles, enabling more natural and accurate information retrieval.
Natural Language Processing (NLP)
Natural Language Processing (NLP) is the field of AI focused on enabling computers to understand, interpret, and generate human language—powering applications from chatbots and search engines to translation and sentiment analysis.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →