Large Language Models (LLMs)
Large Language Models (LLMs) are the engine behind modern AI assistants, from GPT-4 to Gemini to Claude. This category covers LLM fundamentals — transformer architecture, pre-training, fine-tuning, quantization, and inference optimization — as well as practical deployment concepts like model serving, cost management, and output evaluation. Whether you're a developer integrating an LLM API or a business leader evaluating AI vendors, these terms provide the vocabulary to make informed decisions.
75 terms in this category
Attention Mechanism
The attention mechanism allows neural networks to dynamically focus on relevant parts of the input sequence when processing each token, enabling LLMs to capture long-range relationships and contextual meaning.
Base Model
A base model is a pre-trained LLM that has learned language from massive text data but has not yet been instruction-tuned or aligned—capable of text completion but not reliably following instructions or behaving as an assistant.
Beam Search
Beam search is a decoding algorithm that maintains multiple candidate sequences (beams) in parallel during generation, approximating the globally most probable complete sequence rather than committing to the locally best token at each step.
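A minimal sketch of beam search over a tiny hand-written next-token model (the probability table is an illustrative assumption, not a real LLM). Note how the full-sequence winner can differ from the greedy first choice.

```python
import math

# Toy next-token model: maps a token prefix to next-token probabilities.
MODEL = {
    (): {"a": 0.4, "b": 0.6},
    ("a",): {"<eos>": 1.0},
    ("b",): {"x": 0.5, "<eos>": 0.5},
    ("b", "x"): {"<eos>": 1.0},
}

def beam_search(width=2, max_len=3):
    beams = [(0.0, ())]          # (cumulative log-prob, token tuple)
    finished = []
    for _ in range(max_len):
        candidates = []
        for lp, seq in beams:
            for tok, p in MODEL[seq].items():
                new_lp = lp + math.log(p)
                if tok == "<eos>":
                    finished.append((new_lp, seq))
                else:
                    candidates.append((new_lp, seq + (tok,)))
        beams = sorted(candidates, reverse=True)[:width]  # prune to top beams
        if not beams:
            break
    finished.sort(reverse=True)
    return list(finished[0][1])
```

Here greedy decoding would start with "b" (probability 0.6), but beam search finds that the complete sequence "a" (0.4 × 1.0 = 0.4) outscores "b" followed by end-of-sequence (0.6 × 0.5 = 0.3).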
LLM Benchmark
An LLM benchmark is a standardized evaluation dataset and scoring methodology used to compare model capabilities across tasks like reasoning, knowledge, coding, and language understanding.
Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE) is the subword tokenization algorithm used by most LLMs to build their vocabulary by iteratively merging the most frequent adjacent byte or character pairs in training text.
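A minimal sketch of the BPE training loop: repeatedly merge the most frequent adjacent symbol pair. Real tokenizers (e.g. tiktoken, SentencePiece) operate on bytes and are far more optimized; the toy corpus is illustrative.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    # Represent each word as a list of single-character symbols.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym in corpus:
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        for sym in corpus:                 # apply the merge everywhere
            i = 0
            while i < len(sym) - 1:
                if (sym[i], sym[i + 1]) == best:
                    sym[i:i + 2] = [merged]
                else:
                    i += 1
    return merges, corpus

merges, tokenized = learn_bpe(["low", "lower", "lowest"], num_merges=2)
```

After two merges the shared stem "low" has become a single subword token, while the suffixes remain split into characters.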
Catastrophic Forgetting
Catastrophic forgetting is the tendency of neural networks to lose previously learned knowledge when fine-tuned on new data, as the weight updates for the new task overwrite the patterns learned during pre-training.
Chain-of-Thought Prompting
Chain-of-thought prompting instructs an LLM to show its reasoning step by step before giving a final answer, significantly improving accuracy on complex reasoning, math, and multi-step problems.
Constitutional AI
Constitutional AI is Anthropic's alignment technique that trains Claude to evaluate and revise its own responses against a set of principles (a 'constitution'), reducing reliance on human labelers for safety training.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.
Direct Preference Optimization (DPO)
DPO is an alignment training technique that achieves RLHF-like improvements in model behavior from human preference data without requiring a separate reward model or reinforcement learning, making alignment training simpler and more stable.
Emergent Abilities
Emergent abilities are capabilities that appear in large language models at certain scale thresholds but are absent in smaller models, seeming to arise abruptly and unpredictably as model size and training compute increase.
Few-Shot Learning
Few-shot learning provides an LLM with a small number of input-output examples within the prompt, demonstrating the desired task format and behavior without updating model weights.
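A small sketch of how a few-shot prompt is typically assembled: the examples demonstrate the task format, and the model infers the pattern at inference time. The sentiment-labeling task and formatting are illustrative assumptions.

```python
def build_few_shot_prompt(examples, query):
    # Each example shows the desired input/output format in-context.
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")  # model completes from here
    return "\n\n".join(lines)

prompt = build_few_shot_prompt(
    [("great movie!", "positive"), ("terrible plot", "negative")],
    "loved every minute",
)
```

The prompt ends right after "Output:", so the model's continuation is the label itself, with no weight updates involved.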
Fine-Tuning
Fine-tuning adapts a pre-trained LLM to a specific task or domain by continuing training on a smaller, curated dataset, improving performance on targeted use cases while preserving general language capabilities.
Foundation Model
A foundation model is a large AI model trained on broad, diverse data that can be adapted to a wide range of downstream tasks through fine-tuning or prompting, serving as a base for many applications.
Function Calling
Function calling enables LLMs to request the execution of predefined functions with structured arguments, allowing AI systems to interact with external APIs, databases, and tools rather than just generating text.
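A sketch of the function-calling loop: the application declares a tool schema, the model (mocked here) returns a structured call, and the application executes it. The schema shape loosely follows common LLM APIs; the tool name, mock response, and result are illustrative assumptions.

```python
import json

WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def mock_model_response(prompt, tools):
    # Stand-in for a real LLM deciding to call a tool with structured args.
    return {"tool": "get_weather", "arguments": json.dumps({"city": "Paris"})}

def run_turn(prompt):
    call = mock_model_response(prompt, [WEATHER_TOOL])
    args = json.loads(call["arguments"])
    if call["tool"] == "get_weather":
        return f"Weather in {args['city']}: 18°C"  # fake execution result
    raise ValueError("unknown tool")

result = run_turn("What's the weather in Paris?")
```

In a real system the execution result would be sent back to the model as a tool message so it can compose a final natural-language answer.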
GPU Inference
GPU inference is the use of graphics processing units to run LLM predictions, leveraging their massive parallel compute capabilities to achieve the high throughput and low latency required for production AI applications.
Greedy Decoding
Greedy decoding selects the single highest-probability token at each generation step, producing deterministic, locally optimal output without exploring alternative sequences.
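A toy illustration of greedy decoding: at each step, take the argmax token from the next-token distribution. The hard-coded probability table standing in for a real model is an illustrative assumption.

```python
def step(prefix):
    # Pretend model: deterministic toy probabilities over a tiny vocab.
    table = {
        (): {"the": 0.5, "a": 0.3, "cat": 0.1, "<eos>": 0.1},
        ("the",): {"cat": 0.6, "a": 0.1, "the": 0.1, "<eos>": 0.2},
        ("the", "cat"): {"<eos>": 0.9, "cat": 0.05, "a": 0.03, "the": 0.02},
    }
    return table[tuple(prefix)]

def greedy_decode(max_steps=10):
    out = []
    for _ in range(max_steps):
        probs = step(out)
        token = max(probs, key=probs.get)  # locally optimal choice
        if token == "<eos>":
            break
        out.append(token)
    return out
```

Because the argmax is taken independently at each step, running this twice always yields the same output, which is why greedy decoding is deterministic.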
Guardrails
Guardrails are input and output validation mechanisms layered around LLM calls to detect and block unsafe, off-topic, or non-compliant content, providing application-level safety beyond the model's built-in alignment.
Hallucination
Hallucination is when an LLM generates confident-sounding but factually incorrect or entirely fabricated information, including false citations, nonexistent facts, or plausible-sounding but wrong technical details.
In-Context Learning
In-context learning is the LLM phenomenon of adapting to new tasks purely from examples or instructions provided in the prompt, without updating model weights—including zero-shot, one-shot, and few-shot scenarios.
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
Instruction Tuning
Instruction tuning fine-tunes a pre-trained language model on diverse (instruction, response) pairs, transforming a text-completion model into an assistant that reliably follows human directives.
JSON Mode
JSON mode is an LLM API feature that constrains the model's output to syntactically valid JSON, enabling reliable programmatic parsing without prose surrounding the JSON object (though valid syntax alone does not guarantee conformance to a particular schema).
KV Cache
The KV cache stores the key and value attention tensors computed during the prefill phase, allowing subsequent token generation to reuse these computations rather than recomputing them for every new token.
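A toy illustration of why the KV cache helps: attention for a new query runs over previously cached key/value vectors, so each token's K and V are computed once and reused. The 2-dimensional vectors and the "model" are illustrative stand-ins.

```python
import math

def attend(query, kv_cache):
    # Scaled dot-product attention over all cached (key, value) pairs.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
              for key, _ in kv_cache]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]   # stable softmax
    z = sum(weights)
    dim = len(kv_cache[0][1])
    return [sum(w * val[i] for w, (_, val) in zip(weights, kv_cache)) / z
            for i in range(dim)]

kv_cache = []
for step_key, step_value in [([1.0, 0.0], [2.0, 0.0]),
                             ([0.0, 1.0], [0.0, 3.0])]:
    kv_cache.append((step_key, step_value))  # compute K,V once, reuse forever
    out = attend([1.0, 0.0], kv_cache)
```

Without the cache, every generation step would recompute keys and values for the entire prefix, making per-token cost grow with sequence length much faster.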
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
LLM Agent
An LLM agent is an AI system that uses a language model as its reasoning core, autonomously planning and executing multi-step tasks by calling tools, observing results, and iterating until the goal is achieved.
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
Fine-Tuning Dataset
A fine-tuning dataset is a curated collection of (prompt, response) pairs used to adapt an LLM's behavior for a specific domain, task, or style—the quality and quantity of which directly determines fine-tuning success.
LLM Leaderboard
An LLM leaderboard is a public ranking of language models by benchmark performance or human preference, enabling model comparison and tracking progress in the field.
LLM Observability
LLM observability is the practice of monitoring, logging, and analyzing LLM application behavior in production—tracking quality metrics, latency, costs, errors, and user interactions to maintain and improve system performance.
LLM Router
An LLM router dynamically selects which language model to use for each query based on complexity, cost requirements, or domain, routing simple queries to cheaper models and complex queries to more capable ones.
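A minimal routing-heuristic sketch: short, simple queries go to a cheap model, while long or reasoning-heavy queries go to a stronger one. The model names, length threshold, and keyword markers are illustrative assumptions; production routers often use a classifier model instead.

```python
CHEAP, STRONG = "small-model", "large-model"

def route(query):
    # Crude complexity heuristic: length plus reasoning-related keywords.
    complex_markers = ("code", "prove", "analyze", "step by step")
    words = query.split()
    if len(words) > 30 or any(m in query.lower() for m in complex_markers):
        return STRONG
    return CHEAP

easy = route("What is the capital of France?")
hard = route("Analyze this code and prove it terminates.")
```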
Logit Bias
Logit bias allows manual adjustment of token probabilities in LLM generation, enabling developers to increase or decrease the likelihood of specific tokens being generated—useful for content filtering and output control.
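A sketch of how logit bias works mechanically: per-token offsets are added to the raw logits before the softmax, so a large negative bias (e.g. -100) effectively bans a token. The token names and logit values are illustrative.

```python
import math

def apply_logit_bias(logits, bias):
    # Add the bias (default 0) to each token's raw logit.
    return {tok: lp + bias.get(tok, 0.0) for tok, lp in logits.items()}

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

logits = {"yes": 1.0, "no": 1.0, "maybe": 0.5}
probs = softmax(apply_logit_bias(logits, {"maybe": -100.0}))
```

After the -100 bias, "maybe" has essentially zero probability, and the remaining mass is split between "yes" and "no".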
Log Probabilities (Logprobs)
Logprobs are the log-probabilities the LLM assigns to each token in its output, enabling applications to measure generation confidence, detect low-certainty completions, and implement custom sampling strategies.
LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning technique that injects small trainable low-rank matrices into LLM layers, updating less than 1% of parameters while achieving quality comparable to full fine-tuning.
Max Tokens
Max tokens is an LLM API parameter that limits the maximum number of tokens the model can generate in a single response, controlling response length, cost, and latency.
Mixture of Experts (MoE)
Mixture of Experts is an LLM architecture where a router dynamically activates only a subset of specialized 'expert' neural network layers for each token, achieving high model capacity while keeping per-token compute cost manageable.
Model Alignment
Model alignment is the process of training LLMs to behave in ways that are helpful, harmless, and honest, ensuring outputs match human values and intentions rather than just optimizing for text prediction.
Model Card
A model card is a structured documentation artifact that describes an LLM's training data, intended uses, limitations, biases, evaluation results, and ethical considerations, enabling informed deployment decisions.
Model Compression
Model compression reduces LLM size through techniques like pruning (removing unimportant weights), quantization (reducing weight precision), and distillation (training smaller models), enabling deployment on resource-constrained hardware.
Model Distillation
Model distillation trains a smaller 'student' model to mimic a larger 'teacher' model's outputs, producing a compact model that approximates the teacher's capabilities at a fraction of the compute cost.
Model Evaluation
Model evaluation is the systematic process of measuring an LLM's performance on relevant tasks and quality dimensions, guiding decisions about model selection, fine-tuning, and deployment readiness.
Model Parameters
Model parameters are the learned numerical weights of an LLM—billions of floating-point values encoding the model's knowledge and capabilities—whose count (e.g., 7B, 70B, 405B) serves as a rough indicator of model capacity.
Model Provider
A model provider is a company that trains and serves large language models through APIs—including OpenAI, Anthropic, Google, Mistral, and Meta—offering different models with varying capability, cost, and privacy characteristics.
Model Quantization
Model quantization reduces the numerical precision of LLM weights from 32-bit or 16-bit floats to 8-bit or 4-bit integers, dramatically reducing memory requirements and inference costs with minimal quality loss.
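A toy symmetric int8 quantization of a weight vector: scale floats into the [-127, 127] integer range, then dequantize. Real schemes (per-channel scales, group-wise quantization, 4-bit formats) are more sophisticated; the weight values are illustrative.

```python
def quantize_int8(weights):
    # One scale for the whole vector, chosen so the largest weight maps to 127.
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight now needs one byte instead of four (fp32), at the cost of small rounding error visible in the restored values.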
Multi-Head Attention
Multi-head attention runs multiple independent self-attention operations ('heads') in parallel, allowing the transformer to simultaneously capture different types of relationships between tokens from different representation subspaces.
Multi-Turn Conversation
A multi-turn conversation is a dialogue with an LLM that spans multiple message exchanges, where the model maintains context across turns to produce coherent, contextually aware responses throughout the session.
Multimodal LLM
A multimodal LLM can process and reason over multiple input types—text, images, audio, video, or documents—extending language model capabilities beyond pure text to enable vision, document understanding, and cross-modal reasoning.
Open-Source LLM
An open-source LLM is a language model with publicly available weights that anyone can download, run locally, fine-tune, and deploy without per-query licensing fees, enabling private deployment and customization.
Parameter-Efficient Fine-Tuning (PEFT)
PEFT encompasses techniques like LoRA, prefix tuning, and adapters that fine-tune only a small fraction of LLM parameters, achieving comparable quality to full fine-tuning at dramatically reduced compute and memory cost.
Perplexity
Perplexity measures how well a language model predicts a text sample: lower perplexity means the model assigns higher probability to the text, reflecting better language modeling quality.
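Concretely, perplexity is the exponential of the average negative log-likelihood per token. The per-token probabilities below are illustrative, as if assigned by a model.

```python
import math

def perplexity(token_probs):
    # Average negative log-likelihood per token, then exponentiate.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = perplexity([0.9, 0.8, 0.95])   # model fits the text well
uncertain = perplexity([0.1, 0.2, 0.05])   # model is surprised by the text
```

A useful sanity check: a model that assigns probability 0.5 to every token has perplexity exactly 2, as if choosing uniformly between two options at each step.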
Positional Encoding
Positional encoding provides transformers with information about token order, since self-attention is order-agnostic by default—enabling LLMs to understand that 'dog bites man' differs from 'man bites dog'.
Pre-Training
Pre-training is the foundational phase of LLM development where the model learns language understanding and world knowledge by predicting the next token across vast text corpora, before any task-specific optimization.
Prompt Caching
Prompt caching is an LLM API feature that stores the computed KV cache state of a common prompt prefix server-side, so repeated requests sharing that prefix can skip its processing—reducing latency and input token costs.
Prompt Injection
Prompt injection is a security vulnerability where malicious content in user input or retrieved data overrides an LLM's instructions, potentially causing it to bypass safety measures, leak confidential information, or perform unintended actions.
QLoRA
QLoRA (Quantized Low-Rank Adaptation) combines 4-bit model quantization with LoRA fine-tuning, enabling fine-tuning of large LLMs on consumer-grade hardware by dramatically reducing memory requirements.
Reasoning Model
A reasoning model is an LLM that explicitly 'thinks' through problems in an extended internal reasoning process before producing a final answer, trading inference speed for dramatically improved accuracy on complex tasks.
Red-Teaming
Red-teaming for LLMs is the practice of adversarially probing a model to discover safety failures, harmful behaviors, and alignment gaps before deployment by simulating malicious or misuse-oriented user inputs.
Reinforcement Learning from Human Feedback (RLHF)
RLHF is a training technique that improves LLM alignment with human preferences by training a reward model on human preference data, then using reinforcement learning to update the LLM to maximize this reward.
Safety Training
Safety training is the process of fine-tuning LLMs to refuse harmful requests, avoid dangerous content generation, and behave safely across adversarial inputs while maintaining helpfulness for legitimate use cases.
Scaling Laws
Scaling laws describe predictable mathematical relationships between LLM performance and scale—model size, training data, and compute—enabling researchers to forecast model capability improvements before building larger models.
Self-Attention
Self-attention is the core operation in transformer models where each token computes a weighted representation of all other tokens in the sequence, enabling every position to directly access information from every other position.
Speculative Decoding
Speculative decoding uses a small 'draft' model to generate multiple candidate tokens quickly, then verifies them in parallel with the large target model, typically achieving 2-3x inference speedup while preserving the target model's output distribution.
Stop Sequence
A stop sequence is a string or list of strings that signals the LLM to stop generating when encountered, enabling precise control over where the response ends—useful for constrained generation and multi-turn dialogue management.
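A sketch of client-side stop-sequence handling in a streaming loop: accumulate generated text and truncate at the first occurrence of a stop string. The hard-coded chunk stream is a stand-in for a real API stream; servers typically do this check themselves.

```python
def generate_with_stop(token_stream, stop_sequences):
    text = ""
    for chunk in token_stream:
        text += chunk
        for stop in stop_sequences:
            idx = text.find(stop)
            if idx != -1:
                return text[:idx]   # cut before the stop sequence
    return text

# Illustrative stream: the model starts generating a fake next user turn,
# which the stop sequence cleanly truncates.
stream = ["Answer", ": 42", "\nUser:", " next question"]
result = generate_with_stop(stream, ["\nUser:"])
```

Checking the accumulated text (not just the latest chunk) matters because a stop sequence can straddle a chunk boundary.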
Structured Output
Structured output constrains LLM responses to follow a specific format—typically JSON with defined fields—enabling reliable parsing and integration with downstream systems rather than free-form text generation.
Sycophancy
Sycophancy is an LLM alignment failure where the model tells users what they want to hear rather than what is accurate, changing its stated positions to agree with user preferences even when the user is wrong.
System Prompt
A system prompt is a privileged instruction set provided to an LLM before the conversation begins, establishing the assistant's role, behavior, constraints, and capabilities for the entire session.
Temperature
Temperature is an LLM sampling parameter (typically ranging from 0 to 2, depending on the API) that controls output randomness: low values produce focused, near-deterministic responses while high values produce more varied, creative outputs.
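Mechanically, temperature divides the logits before the softmax: T < 1 sharpens the distribution toward the top token, T > 1 flattens it. The toy logits are illustrative values, not from a real model.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

cool = softmax_with_temperature([2.0, 1.0, 0.1], 0.5)  # sharper distribution
warm = softmax_with_temperature([2.0, 1.0, 0.1], 1.5)  # flatter distribution
```

At T = 0.5 the top token takes most of the probability mass; at T = 1.5 the mass spreads across alternatives, which is where the extra variety comes from.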
Token Streaming
Token streaming delivers LLM responses to the user progressively as each token is generated, rather than waiting for the complete response, dramatically improving perceived responsiveness and enabling real-time interaction.
Token
A token is the basic unit of text an LLM processes—roughly 4 characters or 3/4 of an English word. LLM APIs charge per token, context windows are measured in tokens, and generation speed is measured in tokens per second.
Tokenization
Tokenization converts raw text into a sequence of tokens—the basic units an LLM processes—using algorithms like byte-pair encoding that split text into subword pieces rather than whole words or individual characters.
Tool Use
Tool use is the broader capability of LLMs to interact with external systems—executing code, browsing the web, querying databases, reading files—by calling tools during generation to retrieve information or take actions.
Top-K Sampling
Top-K sampling restricts token generation to the K most probable next tokens at each step, preventing the model from selecting rare or unlikely tokens while maintaining diversity within the top-K candidates.
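A top-k sampling sketch: keep only the k most probable tokens, renormalize, and sample from them. The token probabilities are illustrative.

```python
import random

def top_k_sample(probs, k, rng):
    # probs: dict mapping token -> probability.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)        # renormalize over the top-k set
    r = rng.random() * total
    acc = 0.0
    for tok, p in top:
        acc += p
        if r <= acc:
            return tok
    return top[-1][0]                     # guard against float rounding

rng = random.Random(0)
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
samples = {top_k_sample(probs, 2, rng) for _ in range(200)}
```

With k = 2, "cat" and "zzz" can never be sampled, no matter how many draws are taken, while "the" and "a" still vary in proportion to their renormalized probabilities.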
Top-P Sampling (Nucleus Sampling)
Top-p sampling (nucleus sampling) restricts token generation to the smallest set of tokens whose cumulative probability exceeds p, dynamically adapting the candidate pool size based on the probability distribution.
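A sketch of nucleus selection: keep the smallest prefix of tokens (in probability order) whose cumulative mass reaches p. The distributions are illustrative; it is the contrast between them that shows the dynamic pool size.

```python
def nucleus(probs, p):
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, prob in ranked:
        kept.append(tok)
        cum += prob
        if cum >= p:                      # smallest set reaching mass p
            break
    return kept

# Peaked distribution: a single token already holds 0.9 of the mass.
peaked = nucleus({"a": 0.9, "b": 0.05, "c": 0.05}, 0.9)
# Flat distribution: four tokens are needed to reach the same mass.
flat = nucleus({"a": 0.3, "b": 0.3, "c": 0.2, "d": 0.2}, 0.9)
```

This is the key difference from top-k: the candidate pool shrinks to one token when the model is confident and widens when the distribution is flat.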
Transformer
The transformer is the neural network architecture underlying virtually all modern LLMs, using self-attention mechanisms to process entire input sequences in parallel and capture long-range dependencies between words.
Zero-Shot Learning
Zero-shot learning is the ability of LLMs to perform tasks from natural language instructions alone, without any task-specific examples, by generalizing from pre-training knowledge to new task types.