Natural Language Processing (NLP)

Sentence Similarity

Definition

Sentence similarity is the NLP task of quantifying the semantic closeness between two text passages. Approaches range from lexical overlap metrics (Jaccard similarity, BLEU) to embedding-based similarity (cosine similarity between sentence vectors from sentence transformers) to cross-encoder models that score pairs directly. The Semantic Textual Similarity (STS) benchmark evaluates models against human-annotated similarity scores from 0 (unrelated) to 5 (equivalent meaning). Strong sentence transformers reach Pearson correlations around 0.90 with human judgments on STS benchmarks. Asymmetric semantic search (short query vs. long document) requires different models than symmetric sentence similarity.
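The embedding-based approach reduces to one formula: the cosine of the angle between two sentence vectors. A minimal sketch, using tiny hand-made vectors as stand-ins for real sentence embeddings (which typically have 384–1024 dimensions):

```python
from math import sqrt

def cosine_similarity(u, v):
    """cos(u, v) = dot(u, v) / (|u| * |v|) -- direction, not magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional "embeddings" for illustration only.
v1 = [0.9, 0.1, 0.3, 0.0]   # e.g., "How do I reset my password?"
v2 = [0.8, 0.2, 0.4, 0.1]   # e.g., "How can I change my login password?"
v3 = [0.0, 0.9, 0.1, 0.8]   # e.g., "What is the refund policy?"

print(cosine_similarity(v1, v2))  # near 1.0: vectors point the same way
print(cosine_similarity(v1, v3))  # much lower: different directions
```

Because cosine similarity depends only on vector direction, sentences of very different lengths can still score as highly similar if their embeddings point the same way.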

Why It Matters

Sentence similarity is the core primitive underlying semantic search, duplicate question detection, answer relevance scoring, and chatbot response evaluation. In RAG systems, the relevance of retrieved chunks is measured by their semantic similarity to the query. Knowledge base deduplication uses sentence similarity to identify redundant articles covering the same topic. For AI response quality evaluation, sentence similarity between model outputs and reference answers provides an automated relevance metric. High-quality sentence similarity models are among the most practically useful NLP components.

How It Works

Bi-encoder sentence similarity encodes each sentence independently, producing standalone vector representations. Cosine similarity between these vectors provides a similarity score scalable to millions of pairs via vector databases and approximate nearest-neighbor (ANN) search. Cross-encoder similarity processes both sentences jointly, producing higher-quality scores but at O(n) inference cost per query, since pair scores cannot be pre-computed. The sentence-transformers library provides bi-encoder models (all-MiniLM-L6-v2, all-mpnet-base-v2) optimized for both quality and speed. Training uses natural language inference data (entailment pairs as positive examples) and hard negative mining for contrastive learning.
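The bi-encoder retrieval pattern described above can be sketched with NumPy. The vectors here are hand-made stand-ins; in practice they would come from a bi-encoder such as all-MiniLM-L6-v2 via sentence-transformers' `model.encode()`:

```python
import numpy as np

# Stand-in corpus "embeddings": encoded ONCE and stored. This is the key
# bi-encoder advantage -- a cross-encoder would need n joint forward passes
# per query instead.
corpus_vecs = np.array([
    [0.9, 0.1, 0.2],   # "reset my password"
    [0.1, 0.9, 0.3],   # "cancel my subscription"
    [0.2, 0.2, 0.9],   # "export my data"
])
query_vec = np.array([0.8, 0.2, 0.1])  # "change my login password"

def normalize(x):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# One query encode + cheap dot products (replaced by ANN search at scale).
scores = normalize(corpus_vecs) @ normalize(query_vec)
ranking = np.argsort(-scores)  # best match first

print(ranking[0])  # index of the most similar corpus sentence
```

At production scale the exhaustive dot product is replaced by an ANN index (e.g., in a vector database), trading a small amount of recall for sub-linear query time.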

Sentence Similarity — Cosine Similarity Scores

Each sentence is passed through the encoder independently, producing a vector (v₁, v₂ ∈ ℝ⁷⁶⁸ for a base-size model); the score is cosine(v₁, v₂), ranging from 0.0 (unrelated) to 1.0 (identical). Example scores:

  • Very similar (0.93): "How do I reset my password?" vs. "What steps are needed to change my login password?"
  • Similar (0.81): "Cancel my subscription" vs. "I want to end my membership plan"
  • Dissimilar (0.18): "What is the refund policy?" vs. "How do I export my data?"

Real-World Example

A job matching platform uses sentence similarity to match job seeker profiles to job descriptions. Each resume bullet and job requirement is embedded with a sentence transformer; matching scores between all (resume item, job requirement) pairs produce a compatibility matrix. Job postings with average similarity above 0.75 across requirements are surfaced as strong matches. This semantic matching correctly identifies that 'led team of 8 engineers' is highly similar to 'people management experience required' (0.81 cosine similarity) despite zero keyword overlap—recovering matches that keyword-based systems miss.
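The compatibility-matrix step can be sketched as follows. The embeddings and the aggregation rule (best-matching resume item per requirement, then the mean) are illustrative assumptions; the doc only specifies an average-similarity threshold of 0.75:

```python
import numpy as np

# Hypothetical embeddings for illustration; a real system would produce
# these with a sentence transformer.
resume_items = np.array([
    [0.95, 0.20, 0.24],   # "led team of 8 engineers"
    [0.10, 0.98, 0.17],   # "built Python ETL pipelines"
])
job_requirements = np.array([
    [0.90, 0.30, 0.32],   # "people management experience required"
    [0.15, 0.95, 0.27],   # "data engineering skills"
])

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Compatibility matrix: cosine similarity for every (resume item, requirement) pair.
compat = normalize(resume_items) @ normalize(job_requirements).T

# One plausible aggregation: best resume match per requirement, averaged.
avg_score = compat.max(axis=0).mean()
is_strong_match = avg_score > 0.75
```

Here each requirement finds a strong counterpart among the resume items even when the pairings share no keywords, which is exactly the case keyword matching misses.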

Common Mistakes

  • Using symmetric similarity for asymmetric tasks—short query vs. long document similarity requires asymmetric models, not symmetric STS models
  • Comparing embeddings from different models—cosine similarity is only meaningful between vectors from the same embedding space
  • Treating high similarity as guaranteed semantic equivalence—'not guilty' and 'guilty' have high surface similarity but opposite meanings; similarity is not entailment
