Sentence Similarity
Definition
Sentence similarity is the NLP task of quantifying the semantic closeness between two text passages. Approaches range from lexical overlap metrics (Jaccard similarity, BLEU) to embedding-based similarity (cosine similarity between sentence vectors from sentence transformers) to cross-encoder models that score pairs directly. The Semantic Textual Similarity (STS) benchmark evaluates models against human-annotated similarity scores from 0 (unrelated) to 5 (equivalent meaning). Strong sentence-transformer models reach Pearson correlations of roughly 0.85–0.90 with human judgments on STS benchmarks. Asymmetric semantic search (a short query against a long document) requires different models than symmetric sentence similarity.
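The contrast between lexical overlap and embedding-based similarity can be sketched in a few lines. This is a minimal illustration, not a production implementation: the 4-dimensional vectors below are hypothetical stand-ins for real sentence-transformer embeddings (which typically have hundreds of dimensions).

```python
import math

def jaccard(a: str, b: str) -> float:
    """Lexical overlap: |intersection| / |union| of the two word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cosine(u, v) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical low-dimensional embeddings standing in for model output
vec_a = [0.8, 0.1, 0.3, 0.5]
vec_b = [0.7, 0.2, 0.4, 0.4]

print(jaccard("the cat sat", "a cat sat down"))  # 0.4 (2 shared words of 5 total)
print(round(cosine(vec_a, vec_b), 3))
```

Jaccard sees only surface word overlap; cosine similarity over embeddings is what lets two sentences with no shared words still score as close in meaning.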
Why It Matters
Sentence similarity is the core primitive underlying semantic search, duplicate question detection, answer relevance scoring, and chatbot response evaluation. In RAG systems, the relevance of retrieved chunks is measured by their semantic similarity to the query. Knowledge base deduplication uses sentence similarity to identify redundant articles covering the same topic. For AI response quality evaluation, sentence similarity between model outputs and reference answers provides an automated relevance metric. High-quality sentence similarity models are among the most practically useful NLP components.
How It Works
Bi-encoder sentence similarity uses separate encoder passes for each sentence, producing independent vector representations. Cosine similarity between these vectors provides a similarity score scalable to millions of pairs via vector databases and ANN search. Cross-encoder similarity processes both sentences jointly, producing higher-quality scores but at the cost of O(n) forward passes per query over n candidates, since pair scores cannot be precomputed or indexed. The sentence-transformers library provides bi-encoder models (all-MiniLM-L6-v2, all-mpnet-base-v2) optimized for both quality and speed. Training uses natural language inference data (entailment pairs as positive examples) and hard negative mining for contrastive learning.
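The bi-encoder retrieval pattern above can be sketched with NumPy. The random vectors here are toy stand-ins for precomputed document embeddings (the 384 dimensions match all-MiniLM-L6-v2's output size); the point is the mechanics: normalize once, then a single matrix-vector product scores every document against the query.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for bi-encoder output: 1,000 precomputed document vectors
# plus one query vector. Real systems would store these in a vector database.
docs = rng.normal(size=(1000, 384))
query = rng.normal(size=384)

# Normalize once so the dot product equals cosine similarity.
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# One matrix-vector product scores the query against all documents --
# this is what makes bi-encoders scalable, unlike cross-encoders,
# which would need 1,000 separate forward passes here.
scores = docs @ query
top5 = np.argsort(scores)[::-1][:5]
print(top5, scores[top5])
```

In production the brute-force `argsort` is replaced by an ANN index, but the precompute-then-dot-product structure is the same.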
[Figure: Sentence Similarity — Cosine Similarity Scores]
Real-World Example
A job matching platform uses sentence similarity to match job seeker profiles to job descriptions. Each resume bullet and job requirement is embedded with a sentence transformer; matching scores between all (resume item, job requirement) pairs produce a compatibility matrix. Job postings with average similarity above 0.75 across requirements are surfaced as strong matches. This semantic matching correctly identifies that 'led team of 8 engineers' is highly similar to 'people management experience required' (0.81 cosine similarity) despite zero keyword overlap—recovering matches that keyword-based systems miss.
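A sketch of the compatibility-matrix step, assuming one plausible aggregation (best-matching resume bullet per requirement, averaged across requirements); the similarity values are hypothetical and would come from a sentence-transformer model in the real system.

```python
import numpy as np

# Hypothetical cosine similarities between 3 resume bullets (rows) and
# 4 job requirements (columns) -- the compatibility matrix.
sim = np.array([
    [0.81, 0.42, 0.67, 0.79],   # e.g. "led team of 8 engineers"
    [0.35, 0.88, 0.71, 0.60],
    [0.55, 0.49, 0.83, 0.77],
])

# For each requirement, take its best-matching bullet, then average
# across requirements to score the (candidate, job) pair.
per_requirement = sim.max(axis=0)
job_score = per_requirement.mean()

print(round(job_score, 3), job_score > 0.75)  # surfaced as a strong match
```

Other aggregations (mean over all pairs, top-k per requirement) are equally valid; the threshold and the pooling rule are tuning choices.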
Common Mistakes
- ✕Using symmetric similarity for asymmetric tasks—short query vs. long document similarity requires asymmetric models, not symmetric STS models
- ✕Comparing embeddings from different models—cosine similarity is only meaningful between vectors from the same embedding space
- ✕Treating high similarity as guaranteed semantic equivalence—'not guilty' and 'guilty' have high surface similarity but opposite meanings; similarity is not entailment
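The third mistake is easy to demonstrate with a lexical sketch: 'not guilty' and 'guilty' share almost all their surface tokens, so overlap-based scores are high even though the meanings are opposite. (Embedding models are better at negation but can still assign such pairs moderate similarity — similarity is not entailment.)

```python
def jaccard(a: str, b: str) -> float:
    """Lexical overlap: |intersection| / |union| of the two word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# 4 shared words out of 5 distinct words -> 0.8, despite opposite meanings.
print(jaccard("the verdict was not guilty", "the verdict was guilty"))  # 0.8
```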
Related Terms
Sentence Transformers
Sentence transformers are neural models that produce fixed-size semantic embeddings for entire sentences, enabling efficient semantic similarity search, clustering, and retrieval by representing meaning as comparable vectors.
Paraphrase Detection
Paraphrase detection determines whether two text passages express the same meaning using different words, enabling duplicate question detection, semantic search deduplication, and FAQ consolidation.
Word Embeddings
Word embeddings are dense vector representations of words in a continuous numerical space where semantically similar words are positioned close together, enabling machines to understand word meaning through geometry.
Semantic Search
Semantic search finds knowledge base articles based on the meaning of a query — not just the words used. By converting both queries and documents into vector embeddings, it identifies conceptually similar content even when a query uses different terminology than the articles, enabling more natural and accurate information retrieval.
Natural Language Processing (NLP)
Natural Language Processing (NLP) is the field of AI focused on enabling computers to understand, interpret, and generate human language—powering applications from chatbots and search engines to translation and sentiment analysis.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →