Natural Language Processing (NLP)
Natural Language Processing (NLP) is the discipline that bridges human communication and machine intelligence. This category spans classical NLP techniques — tokenization, POS tagging, named entity recognition — as well as modern neural approaches used in today's LLMs. Understanding NLP fundamentals helps you design better prompts, interpret model outputs, and troubleshoot language understanding issues in your AI applications.
54 terms in this category
Aspect-Based Sentiment Analysis
Aspect-based sentiment analysis extracts fine-grained sentiment about specific product features or topics within a review—revealing that a customer loves the interface but hates the pricing—rather than returning a single overall sentiment score.
Bag of Words
Bag of words is a text representation model that describes documents by their word frequencies, ignoring grammar and word order, producing fixed-length vectors suitable for classical machine learning algorithms.
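The idea can be sketched in a few lines: count word occurrences against a fixed vocabulary and emit the counts as a vector. This is a minimal illustration, not a production vectorizer.

```python
from collections import Counter

def bag_of_words(doc: str, vocabulary: list[str]) -> list[int]:
    """Represent a document as word counts over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    # Grammar and word order are discarded; only frequencies survive.
    return [counts[word] for word in vocabulary]

vocab = ["the", "cat", "dog", "sat"]
vec = bag_of_words("The cat sat on the mat", vocab)  # [2, 1, 0, 1]
```

Every document maps to a vector of the same length as the vocabulary, which is what makes the representation usable by classical classifiers.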
BERT
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model pre-trained on massive text corpora that revolutionized NLP by providing rich contextual word representations that dramatically improved nearly every language task.
Constituency Parsing
Constituency parsing breaks a sentence into nested hierarchical phrases—noun phrases, verb phrases, clauses—producing a tree structure that reveals the grammatical constituents of a sentence.
Coreference Resolution
Coreference resolution identifies all expressions in a text that refer to the same real-world entity—linking 'Sarah,' 'she,' and 'the manager' to the same person—enabling coherent multi-sentence understanding.
Corpus
A corpus is a large, structured collection of text used to train, evaluate, and study NLP models—the foundational data resource that determines what language patterns and knowledge a model can learn.
Cross-Lingual Transfer
Cross-lingual transfer is the ability of a model trained on labeled data in one language to perform well on the same task in a different language, enabling low-resource language NLP without collecting large labeled datasets for each language.
Dependency Parsing
Dependency parsing analyzes sentence structure by identifying grammatical relationships between words—subject, object, modifier—forming a tree that reveals who did what to whom in any given sentence.
Dialogue Act Classification
Dialogue act classification labels the communicative function of each utterance—question, statement, request, acknowledgment, greeting—enabling chatbot systems to understand the conversational role of each message and respond appropriately.
Dialogue Management
Dialogue management controls the flow of a multi-turn conversation—deciding what the system should say or do next based on conversation history, current context, and user goals—forming the brain of task-oriented chatbots.
Discourse Analysis
Discourse analysis examines how sentences relate to each other across multi-sentence texts—identifying coherence relations, rhetorical structure, and information flow—enabling understanding of documents as unified communicative acts rather than isolated sentences.
Encoder Model
Encoder models are transformer architectures that process input text bidirectionally to produce rich contextual representations, excelling at understanding tasks like classification, NER, and semantic search rather than text generation.
Entity Extraction
Entity extraction identifies and pulls structured information—like names, dates, locations, and product identifiers—from unstructured text, converting free-form language into queryable data fields.
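For rigidly formatted entities like ISO dates and email addresses, even simple patterns work; the sketch below uses hypothetical regexes to show the input/output shape, while real systems rely on trained NER models for names and locations.

```python
import re

# Illustrative high-precision patterns; a trained model would replace these.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def extract_entities(text: str) -> dict[str, list[str]]:
    """Return all pattern matches, grouped by entity type."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

found = extract_entities("Contact ana@example.com before 2024-11-30.")
```

The result, `{"date": ["2024-11-30"], "email": ["ana@example.com"]}`, is the queryable structured output the definition describes.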
Information Extraction
Information extraction automatically identifies and structures specific facts from unstructured text—who did what, when, and where—transforming free-form documents into queryable databases.
Intent Detection
Intent detection classifies user messages into predefined categories representing the user's goal—such as 'check order status' or 'report a bug'—enabling chatbots to route queries to the appropriate responses or workflows.
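A keyword-overlap rule illustrates the task shape; the intent names and keyword sets below are made up, and production systems use trained text classifiers instead.

```python
# Toy rule-based intent detector (illustrative categories and keywords).
INTENT_KEYWORDS = {
    "check_order_status": {"order", "status", "shipped", "tracking"},
    "report_bug": {"bug", "crash", "error", "broken"},
}

def detect_intent(message: str) -> str:
    words = set(message.lower().split())
    # Score each intent by keyword overlap; fall back to a catch-all.
    scores = {intent: len(words & kw) for intent, kw in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

For example, `detect_intent("where is my order status")` routes to `check_order_status`, while an unmatched message falls through to `unknown`.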
Language Detection
Language detection automatically identifies which human language a text is written in—enabling multilingual systems to route inputs to the correct processing pipeline, translation service, or localized response.
Lemmatization
Lemmatization reduces words to their dictionary base form—their lemma—using morphological analysis and vocabulary lookups, producing linguistically correct roots that improve NLP model accuracy compared to stemming.
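At its core, lemmatization is a dictionary lookup informed by morphology. The tiny table below stands in for the full morphological dictionaries real lemmatizers use (e.g. WordNet-based ones); its coverage is purely illustrative.

```python
# Illustrative lookup table; real lemmatizers cover the whole language
# and also use part-of-speech context to disambiguate.
LEMMA_TABLE = {
    "ran": "run", "running": "run", "better": "good",
    "mice": "mouse", "was": "be",
}

def lemmatize(word: str) -> str:
    # Unlike stemming, irregular forms map to true dictionary headwords.
    return LEMMA_TABLE.get(word.lower(), word.lower())
```

Note that `lemmatize("mice")` yields `"mouse"` and `lemmatize("better")` yields `"good"`, mappings no suffix-stripping stemmer can produce.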
Linguistic Annotation
Linguistic annotation is the process of manually or automatically labeling text with linguistic information—such as POS tags, parse trees, named entities, or coreference chains—creating training data for supervised NLP models.
Machine Translation
Machine translation automatically converts text from one natural language to another, enabling multilingual products to serve global users without human translators for every language pair.
Multilingual NLP
Multilingual NLP extends language models and processing pipelines to handle multiple human languages, enabling a single AI system to understand and generate text across languages without building separate models for each.
N-gram
An n-gram is a contiguous sequence of n items—words, characters, or subwords—extracted from text, forming the building block of language models, search indexes, and text similarity algorithms.
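Extracting n-grams is a sliding window over the sequence, as this minimal sketch shows:

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Slide a window of size n over the token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("the quick brown fox".split(), 2)
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

The same function produces character n-grams when given a string's characters instead of word tokens.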
Named Entity Recognition (NER)
Named Entity Recognition (NER) is an NLP task that identifies and classifies named entities in text—people, organizations, locations, dates, product names, and other specific items—enabling structured extraction from unstructured text.
Natural Language Generation (NLG)
Natural Language Generation (NLG) is the NLP subfield concerned with automatically producing coherent, fluent, and contextually appropriate text from data, structured inputs, or internal representations.
Natural Language Processing (NLP)
Natural Language Processing (NLP) is the field of AI focused on enabling computers to understand, interpret, and generate human language—powering applications from chatbots and search engines to translation and sentiment analysis.
Natural Language Understanding (NLU)
Natural Language Understanding (NLU) is the NLP subfield focused on machine comprehension of text—determining meaning, intent, entities, and relationships—enabling AI systems to interpret what humans actually mean.
Out-of-Vocabulary
Out-of-vocabulary (OOV) refers to words or tokens that appear at inference time but were absent from the model's training vocabulary, causing the model to fail to represent them properly and degrading prediction accuracy.
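The classic word-level mitigation is to reserve a shared unknown token: anything outside the vocabulary collapses onto `<unk>`, losing its identity. A minimal sketch with a toy vocabulary:

```python
# Toy word-level vocabulary; modern subword tokenizers largely avoid
# OOV by splitting unseen words into known pieces instead.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def encode(tokens: list[str]) -> list[int]:
    # Every unknown word maps to the same <unk> id.
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]

ids = encode(["the", "axolotl", "sat"])  # [1, 0, 3]
```

The model sees identical input for "axolotl" and any other unseen word, which is exactly the representational failure the definition describes.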
Paraphrase Detection
Paraphrase detection determines whether two text passages express the same meaning using different words, enabling duplicate question detection, semantic search deduplication, and FAQ consolidation.
Part-of-Speech Tagging
Part-of-speech (POS) tagging assigns grammatical labels—noun, verb, adjective, preposition—to each word in a sentence, providing syntactic context that downstream NLP tasks use for deeper language understanding.
Question Answering
Question answering is the NLP task of automatically producing accurate answers to natural language questions, either by extracting spans from documents or generating responses from model knowledge.
Reading Comprehension
Reading comprehension is the NLP task of answering questions about a given passage by locating or generating the answer from within the text, serving as the core capability behind document-grounded chatbots and RAG systems.
Relation Extraction
Relation extraction identifies semantic relationships between entities in text—such as 'founded-by,' 'located-in,' or 'treats'—automatically populating knowledge graphs from unstructured documents.
Semantic Parsing
Semantic parsing converts natural language sentences into formal logical representations—such as SQL queries, executable programs, or knowledge graph queries—enabling AI systems to understand and act on user requests precisely.
Semantic Role Labeling
Semantic role labeling identifies 'who did what to whom, when, where, and why' in a sentence—assigning predicate-argument structure roles that capture the meaning of actions and events in text.
Sentence Similarity
Sentence similarity measures how semantically alike two sentences are—typically producing a score from 0 to 1—enabling duplicate detection, semantic search, paraphrase identification, and answer relevance evaluation.
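A common way to compute the score is cosine similarity between sentence vectors. The sketch below uses raw word counts as the vectors for simplicity; embedding models replace the counts with learned vectors but apply the same comparison.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of bag-of-words count vectors (embeddings use the same math)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in set(ca) & set(cb))
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0
```

Identical sentences score 1.0, sentences with no shared words score 0.0, and partial overlap lands in between; word-count vectors miss paraphrases, which is the gap embedding models close.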
Sentence Transformers
Sentence transformers are neural models that produce fixed-size semantic embeddings for entire sentences, enabling efficient semantic similarity search, clustering, and retrieval by representing meaning as comparable vectors.
Sentiment Analysis
Sentiment analysis automatically classifies the emotional tone of text—positive, negative, or neutral—enabling businesses to monitor brand perception, triage support tickets, and understand customer satisfaction at scale.
Sequence Labeling
Sequence labeling assigns a label to each token in an input sequence—such as part-of-speech tags, named entity types, or slot values—enabling fine-grained structural analysis of text at the token level.
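Sequence-labeling outputs commonly use the BIO scheme (B- begins a span, I- continues it, O is outside). A small decoder shows how per-token labels become labeled spans:

```python
def bio_to_spans(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group BIO-tagged tokens (B-TYPE / I-TYPE / O) into labeled spans."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # 'O' tag closes any open span
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans
```

For `["Sarah", "works", "at", "Acme", "Corp"]` with tags `["B-PER", "O", "O", "B-ORG", "I-ORG"]`, the decoder yields `[("Sarah", "PER"), ("Acme Corp", "ORG")]`.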
Slot Filling
Slot filling extracts specific parameter values from user utterances—such as destination, date, and passenger count from a travel booking query—completing the structured form that a task-oriented chatbot needs to fulfill a request.
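A pattern-based sketch for a hypothetical travel domain shows the input/output contract; the slot names and regexes are invented for illustration, and neural sequence labelers do this far more robustly.

```python
import re

# Hypothetical travel-booking slots; real systems learn these from data.
SLOT_PATTERNS = {
    "destination": re.compile(r"\bto (\w+)"),
    "date": re.compile(r"\bon (\w+ \d{1,2})"),
    "passengers": re.compile(r"\b(\d+) (?:passengers?|people|tickets?)"),
}

def fill_slots(utterance: str) -> dict[str, str]:
    """Extract whichever slot values appear in the utterance."""
    slots = {}
    for name, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            slots[name] = match.group(1)
    return slots

slots = fill_slots("Book 2 tickets to Paris on June 5")
```

The filled form `{"destination": "Paris", "date": "June 5", "passengers": "2"}` is what the dialogue manager hands to the booking backend.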
Spell Checking
Spell checking automatically detects and corrects misspelled words in text input, improving NLP pipeline accuracy by normalizing noisy user-generated content before it reaches intent classifiers and entity extractors.
Stemming
Stemming reduces words to their root form by stripping suffixes—converting 'running' and 'runs' to 'run,' though irregular forms like 'ran' escape it—enabling search and retrieval systems to match documents regardless of word inflection.
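A naive suffix stripper in the spirit of (but far simpler than) the Porter stemmer shows the mechanism; the suffix list and the consonant-collapsing rule are illustrative.

```python
# Illustrative suffix list; real stemmers apply ordered, condition-guarded
# rewrite rules rather than a flat lookup.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")

def stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Collapse a doubled final consonant ('runn' -> 'run').
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            return word
    return word
```

`stem("running")` and `stem("runs")` both yield `"run"`, but `stem("ran")` stays `"ran"`: irregular forms need the dictionary lookup that lemmatization provides.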
Stop Words
Stop words are high-frequency function words—such as 'the,' 'is,' 'at,' and 'which'—that are filtered out during text preprocessing to reduce noise and focus NLP models on content-bearing words.
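Filtering is a simple set-membership test; the stop list below is a small illustrative sample, while real lists (such as those shipped with NLTK or spaCy) contain a few hundred language-specific entries.

```python
# Small illustrative stop list; production lists are longer and per-language.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an", "and"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only content-bearing tokens."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

content = remove_stop_words("the cat is on the mat".split())  # ['cat', 'mat']
```

Note that modern transformer pipelines usually skip this step, since attention can learn to down-weight function words on its own.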
Subword Segmentation
Subword segmentation splits words into meaningful sub-units—like 'unbelievable' into 'un', '##believ', '##able'—balancing vocabulary coverage with manageability so NLP models handle rare and unseen words without an explicit unknown token.
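Greedy longest-match against a subword vocabulary, as in BERT's WordPiece tokenizer, can be sketched with a toy vocabulary (the vocabulary contents here are invented for illustration):

```python
# Toy WordPiece-style vocabulary; '##' marks pieces that continue a word.
VOCAB = {"un", "##believ", "##able", "believ", "cat", "##s"}

def segment(word: str) -> list[str]:
    """Greedily match the longest known piece at each position."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the marker
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched at this position
        start = end
    return pieces

pieces = segment("unbelievable")  # ['un', '##believ', '##able']
```

Only words with no matching piece at some position fall back to `[UNK]`; real tokenizers keep every single character in the vocabulary precisely so this almost never happens.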
Text Classification
Text classification automatically assigns predefined labels to text documents—such as topic, urgency, language, or intent—enabling large-scale categorization of unstructured content without manual review.
Text Normalization
Text normalization standardizes raw text into a consistent format—lowercasing, expanding contractions, removing special characters, and resolving abbreviations—ensuring NLP pipelines receive clean, uniform input.
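One possible pipeline, sketched below; the right steps depend on the task (case carries signal for NER but is usually noise for topic classification), and the contraction table is a tiny illustrative sample.

```python
import re

# Illustrative contraction table; real tables cover many more forms.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def normalize(text: str) -> str:
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^a-z0-9\s]", "", text)   # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

clean = normalize("It's   WORKING, isn't it?!")
```

Here the output is `"it is working isnt it"`: the known contraction expands, the unknown one merely loses its apostrophe, showing why table coverage matters.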
Text Preprocessing
Text preprocessing is the collection of transformations applied to raw text before NLP model training or inference—including tokenization, normalization, and filtering—determining the quality and consistency of model inputs.
Text Segmentation
Text segmentation divides continuous text into meaningful units—sentences, paragraphs, or topical sections—enabling downstream NLP tasks to process coherent chunks rather than arbitrary character sequences.
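For sentence-level segmentation of clean prose, a regex split after terminal punctuation is a workable sketch; abbreviations like 'Dr.' or 'e.g.' defeat it, which is why trained segmenters exist.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split after ., !, or ? followed by whitespace (naive heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = split_sentences("It works. Does it scale? We think so!")
```

This yields `["It works.", "Does it scale?", "We think so!"]`; topical or paragraph-level segmentation requires semantic signals rather than punctuation.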
Text Summarization
Text summarization automatically condenses long documents into shorter versions that preserve the most important information, enabling rapid review of support tickets, articles, and conversations at scale.
Textual Entailment
Textual entailment determines whether a hypothesis logically follows from a premise—classifying pairs as entailment, contradiction, or neutral—enabling AI systems to reason about logical relationships between statements.
Topic Modeling
Topic modeling is an unsupervised technique that discovers hidden thematic structure in large document collections, automatically grouping documents by abstract topics without requiring labeled training data.
Transformer Encoder
The transformer encoder is a neural network architecture that processes entire input sequences bidirectionally using self-attention, producing rich contextual representations of each token that power state-of-the-art NLP models.
Vocabulary Size
Vocabulary size is the number of unique tokens a language model or NLP system recognizes, determining the trade-off between model expressiveness, memory requirements, and the handling of unseen words.
Word Embeddings
Word embeddings are dense vector representations of words in a continuous numerical space where semantically similar words are positioned close together, enabling machines to understand word meaning through geometry.
Word2Vec
Word2Vec is a landmark neural network model that learns dense word representations from text by predicting words from their context, producing vectors where semantic relationships are encoded as geometric directions.
Zero-Shot Classification
Zero-shot classification assigns labels to text using only natural language descriptions of the categories—requiring no labeled training examples—enabling flexible, rapid deployment of text classifiers for novel categories.