Large Language Models (LLMs)

Large Language Models (LLMs) are the engine behind modern AI assistants, from GPT-4 to Gemini to Claude. This category covers LLM fundamentals — transformer architecture, pre-training, fine-tuning, quantization, and inference optimization — as well as practical deployment concepts like model serving, cost management, and output evaluation. Whether you're a developer integrating an LLM API or a business leader evaluating AI vendors, these terms provide the vocabulary to make informed decisions.

75 terms in this category

Attention Mechanism

The attention mechanism allows neural networks to dynamically focus on relevant parts of the input sequence when processing each token, enabling LLMs to capture long-range relationships and contextual meaning.

Base Model

A base model is a pre-trained LLM that has learned language from massive text data but has not yet been instruction-tuned or aligned—capable of text completion but not reliably following instructions or behaving as an assistant.

Beam Search

Beam search is a decoding algorithm that maintains multiple candidate sequences (beams) in parallel during generation, selecting the overall most probable complete sequence rather than the locally optimal token at each step.
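A minimal sketch of the idea over a toy model (the transition table is illustrative, not a real LLM). Greedy decoding would commit to "a" (probability 0.6) and end with a sequence of total probability 0.30; beam search keeps both prefixes alive and recovers the globally better "b c" path (0.40):

```python
import math

# Toy conditional distributions: P(next | last token). "</s>" ends a sequence.
MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a":   {"x": 0.5, "y": 0.5},   # greedy path: <s> a x </s>, P = 0.30
    "b":   {"c": 1.0},             # <s> b c </s>, P = 0.40
    "x":   {"</s>": 1.0},
    "y":   {"</s>": 1.0},
    "c":   {"</s>": 1.0},
}

def beam_search(beam_width=2, max_len=5):
    """Keep the beam_width highest-probability partial sequences at each step."""
    beams = [(["<s>"], 0.0)]  # (tokens, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for nxt, p in MODEL[tokens[-1]].items():
                new = (tokens + [nxt], score + math.log(p))
                (finished if nxt == "</s>" else candidates).append(new)
        if not candidates:
            break
        candidates.sort(key=lambda b: b[1], reverse=True)
        beams = candidates[:beam_width]
    finished.sort(key=lambda b: b[1], reverse=True)
    return finished[0][0]  # most probable complete sequence

best = beam_search()  # ["<s>", "b", "c", "</s>"]
```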

LLM Benchmark

An LLM benchmark is a standardized evaluation dataset and scoring methodology used to compare model capabilities across tasks like reasoning, knowledge, coding, and language understanding.

Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE) is the subword tokenization algorithm used by most LLMs to build their vocabulary by iteratively merging the most frequent adjacent byte or character pairs in training text.
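One merge iteration can be sketched in a few lines (a simplified illustration on a tiny corpus, not a production tokenizer):

```python
from collections import Counter

def bpe_merge_step(corpus):
    """One BPE iteration: find the most frequent adjacent symbol pair
    across all words and merge it into a single new symbol."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    best = max(pairs, key=pairs.get)
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])  # merge the pair
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return best, merged

corpus = [["l", "o", "w"], ["l", "o", "w", "e", "r"], ["l", "o", "w", "e", "s", "t"]]
pair, corpus = bpe_merge_step(corpus)
# ("l", "o") appears most often, so it merges into the new symbol "lo".
```

Repeating this step thousands of times on real training text produces the model's subword vocabulary.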

Catastrophic Forgetting

Catastrophic forgetting is the tendency of neural networks to lose previously learned knowledge when fine-tuned on new data, as the weight updates for the new task overwrite the patterns learned during pre-training.

Chain-of-Thought Prompting

Chain-of-thought prompting instructs an LLM to show its reasoning step by step before giving a final answer, significantly improving accuracy on complex reasoning, math, and multi-step problems.

Constitutional AI

Constitutional AI is Anthropic's alignment technique that trains Claude to evaluate and revise its own responses against a set of principles (a 'constitution'), reducing reliance on human labelers for safety training.

Context Length

Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.

Direct Preference Optimization (DPO)

DPO is an alignment training technique that achieves RLHF-like improvements in model behavior from human preference data without requiring a separate reward model or reinforcement learning, making alignment training simpler and more stable.

Emergent Abilities

Emergent abilities are capabilities that appear in large language models at certain scale thresholds but are absent in smaller models, arising unpredictably as model size and training compute increase.

Few-Shot Learning

Few-shot learning provides an LLM with a small number of input-output examples within the prompt, demonstrating the desired task format and behavior without updating model weights.
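A sketch of assembling such a prompt (the task, labels, and format here are made up for illustration):

```python
# Few-shot prompt assembly: the examples demonstrate the task format
# in-context; no model weights are updated.
examples = [
    ("great movie, loved it", "positive"),
    ("waste of two hours", "negative"),
]

def build_prompt(examples, query):
    lines = ["Classify the sentiment of each review."]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}")
    # The final entry is left incomplete for the model to fill in.
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

prompt = build_prompt(examples, "absolutely fantastic")
```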

Fine-Tuning

Fine-tuning adapts a pre-trained LLM to a specific task or domain by continuing training on a smaller, curated dataset, improving performance on targeted use cases while preserving general language capabilities.

Foundation Model

A foundation model is a large AI model trained on broad, diverse data that can be adapted to a wide range of downstream tasks through fine-tuning or prompting, serving as a base for many applications.

Function Calling

Function calling enables LLMs to request the execution of predefined functions with structured arguments, allowing AI systems to interact with external APIs, databases, and tools rather than just generating text.

GPU Inference

GPU inference is the use of graphics processing units to run LLM predictions, leveraging their massive parallel compute capabilities to achieve the high throughput and low latency required for production AI applications.

Greedy Decoding

Greedy decoding selects the single highest-probability token at each generation step, producing deterministic, locally optimal output without exploring alternative sequences.
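The strategy is easy to show on a toy next-token table (illustrative values, not a real model):

```python
# Toy next-token distributions keyed by the current token: at each step,
# greedy decoding picks the single most probable continuation.
NEXT = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "end": 0.2},
    "cat": {"sat": 0.7, "end": 0.3},
    "sat": {"end": 1.0},
}

def greedy_decode(start="<start>", max_steps=10):
    token, output = start, []
    for _ in range(max_steps):
        dist = NEXT.get(token)
        if dist is None:
            break
        token = max(dist, key=dist.get)  # locally optimal choice
        if token == "end":
            break
        output.append(token)
    return output

result = greedy_decode()  # ["the", "cat", "sat"]
```

Because every step is deterministic, the same prompt always yields the same output, but a locally best token can lead the sequence away from a globally better continuation.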

Guardrails

Guardrails are input and output validation mechanisms layered around LLM calls to detect and block unsafe, off-topic, or non-compliant content, providing application-level safety beyond the model's built-in alignment.

Hallucination

Hallucination is an LLM's generation of confident-sounding but factually incorrect or entirely fabricated information, including false citations, nonexistent facts, or plausible-sounding but wrong technical details.

In-Context Learning

In-context learning is the LLM phenomenon of adapting to new tasks purely from examples or instructions provided in the prompt, without updating model weights—including zero-shot, one-shot, and few-shot scenarios.

LLM Inference

LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.

Instruction Tuning

Instruction tuning fine-tunes a pre-trained language model on diverse (instruction, response) pairs, transforming a text-completion model into an assistant that reliably follows human directives.

JSON Mode

JSON mode is an LLM API feature that guarantees the model's output is valid JSON, ensuring reliable programmatic parsing without worrying about prose text surrounding the JSON object.

KV Cache

The KV cache stores the key and value attention tensors computed during the prefill phase, allowing subsequent token generation to reuse these computations rather than recomputing them for every new token.

Large Language Model (LLM)

A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.

LLM Agent

An LLM agent is an AI system that uses a language model as its reasoning core, autonomously planning and executing multi-step tasks by calling tools, observing results, and iterating until the goal is achieved.

LLM API

An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.

Fine-Tuning Dataset

A fine-tuning dataset is a curated collection of (prompt, response) pairs used to adapt an LLM's behavior for a specific domain, task, or style—the quality and quantity of which directly determines fine-tuning success.

LLM Leaderboard

An LLM leaderboard is a public ranking of language models by benchmark performance or human preference, enabling model comparison and tracking progress in the field.

LLM Observability

LLM observability is the practice of monitoring, logging, and analyzing LLM application behavior in production—tracking quality metrics, latency, costs, errors, and user interactions to maintain and improve system performance.

LLM Router

An LLM router dynamically selects which language model to use for each query based on complexity, cost requirements, or domain, routing simple queries to cheaper models and complex queries to more capable ones.
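A hypothetical routing heuristic (the model names, markers, and thresholds below are invented for illustration, not any vendor's actual API):

```python
def route(query: str) -> str:
    """Send long or reasoning-heavy queries to a stronger (pricier) model,
    everything else to a cheap, fast one."""
    reasoning_markers = ("why", "prove", "step by step", "compare", "analyze")
    if len(query.split()) > 100 or any(m in query.lower() for m in reasoning_markers):
        return "large-model"
    return "small-model"

route("What is the capital of France?")        # simple lookup -> small-model
route("Compare these two system designs...")   # reasoning -> large-model
```

Production routers often replace the keyword heuristic with a small classifier trained on query difficulty.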

Logit Bias

Logit bias allows manual adjustment of token probabilities in LLM generation, enabling developers to increase or decrease the likelihood of specific tokens being generated—useful for content filtering and output control.
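The mechanics are simple: bias values are added to the raw logits before the softmax. A minimal sketch with made-up logits:

```python
import math

def apply_logit_bias(logits, bias):
    """Add per-token bias values to raw logits before the softmax.
    A large negative bias effectively bans a token; a positive one boosts it."""
    biased = [l + bias.get(i, 0.0) for i, l in enumerate(logits)]
    m = max(biased)  # subtract max for numerical stability
    exps = [math.exp(b - m) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.5]
probs = apply_logit_bias(logits, {0: -100.0})  # ban token 0
# Token 0's probability collapses to effectively zero.
```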

Log Probabilities (Logprobs)

Logprobs are the log-probabilities the LLM assigns to each token in its output, enabling applications to measure generation confidence, detect low-certainty completions, and implement custom sampling strategies.

LoRA (Low-Rank Adaptation)

LoRA is a parameter-efficient fine-tuning technique that injects small trainable low-rank matrices into LLM layers, updating less than 1% of parameters while achieving quality comparable to full fine-tuning.

Max Tokens

Max tokens is an LLM API parameter that limits the maximum number of tokens the model can generate in a single response, controlling response length, cost, and latency.

Mixture of Experts (MoE)

Mixture of Experts is an LLM architecture where a router dynamically activates only a subset of specialized 'expert' neural network layers for each token, achieving high model capacity while keeping per-token compute cost manageable.

Model Alignment

Model alignment is the process of training LLMs to behave in ways that are helpful, harmless, and honest, ensuring outputs match human values and intentions rather than just optimizing for text prediction.

Model Card

A model card is a structured documentation artifact that describes an LLM's training data, intended uses, limitations, biases, evaluation results, and ethical considerations, enabling informed deployment decisions.

Model Compression

Model compression reduces LLM size through techniques like pruning (removing unimportant weights), quantization (reducing weight precision), and distillation (training smaller models), enabling deployment on resource-constrained hardware.

Model Distillation

Model distillation trains a smaller 'student' model to mimic a larger 'teacher' model's outputs, producing a compact model that approximates the teacher's capabilities at a fraction of the compute cost.

Model Evaluation

Model evaluation is the systematic process of measuring an LLM's performance on relevant tasks and quality dimensions, guiding decisions about model selection, fine-tuning, and deployment readiness.

Model Parameters

Model parameters are the learned numerical weights of an LLM—billions of floating-point values encoding the model's knowledge and capabilities—whose count (e.g., 7B, 70B, 405B) is the primary indicator of model capacity.

Model Provider

A model provider is a company that trains and serves large language models through APIs—including OpenAI, Anthropic, Google, Mistral, and Meta—offering different models with varying capability, cost, and privacy characteristics.

Model Quantization

Model quantization reduces the numerical precision of LLM weights from 32-bit or 16-bit floats to 8-bit or 4-bit integers, dramatically reducing memory requirements and inference costs with minimal quality loss.
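The core idea can be sketched with symmetric int8 quantization of a few weights (a simplified scheme; real quantizers work per-channel or per-group and handle outliers):

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats to [-127, 127] via one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.52, -1.30, 0.07, 0.88]
q, scale = quantize_int8(weights)        # 4 ints + 1 float instead of 4 floats
restored = dequantize(q, scale)
# The round trip introduces a small error, bounded by half the scale step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```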

Multi-Head Attention

Multi-head attention runs multiple independent self-attention operations ('heads') in parallel, allowing the transformer to simultaneously capture different types of relationships between tokens from different representation subspaces.

Multi-Turn Conversation

A multi-turn conversation is a dialogue with an LLM that spans multiple message exchanges, where the model maintains context across turns to produce coherent, contextually aware responses throughout the session.

Multimodal LLM

A multimodal LLM can process and reason over multiple input types—text, images, audio, video, or documents—extending language model capabilities beyond pure text to enable vision, document understanding, and cross-modal reasoning.

Open-Source LLM

An open-source LLM is a language model with publicly available weights that anyone can download, run locally, fine-tune, and deploy without per-query licensing fees, enabling private deployment and customization.

Parameter-Efficient Fine-Tuning (PEFT)

PEFT encompasses techniques like LoRA, prefix tuning, and adapters that fine-tune only a small fraction of LLM parameters, achieving comparable quality to full fine-tuning at dramatically reduced compute and memory cost.

Perplexity

Perplexity measures how well a language model predicts a text sample—lower perplexity means the model assigns higher probability to the text, reflecting better language modeling quality.
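Concretely, perplexity is the exponential of the average negative log-probability per token (the probabilities below are made up to show the contrast):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = perplexity([0.9, 0.8, 0.95])   # low perplexity: text fits the model
uncertain = perplexity([0.2, 0.1, 0.05])   # high perplexity: text surprises it
```

Equivalently, perplexity is the inverse of the geometric mean of the token probabilities, so the second example comes out to exactly 10.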

Positional Encoding

Positional encoding provides transformers with information about token order, since self-attention is order-agnostic by default—enabling LLMs to understand that 'dog bites man' differs from 'man bites dog'.

Pre-Training

Pre-training is the foundational phase of LLM development where the model learns language understanding and world knowledge by predicting the next token across vast text corpora, before any task-specific optimization.

Prompt Caching

Prompt caching is an LLM API feature that stores the computed KV cache state of a common prompt prefix server-side, so repeated requests sharing that prefix can skip its processing—reducing latency and input token costs.

Prompt Injection

Prompt injection is a security vulnerability where malicious content in user input or retrieved data overrides an LLM's instructions, potentially causing it to bypass safety measures, leak confidential information, or perform unintended actions.

QLoRA

QLoRA (Quantized Low-Rank Adaptation) combines 4-bit model quantization with LoRA fine-tuning, enabling fine-tuning of large LLMs on consumer-grade hardware by dramatically reducing memory requirements.

Reasoning Model

A reasoning model is an LLM that explicitly 'thinks' through problems in an extended internal reasoning process before producing a final answer, trading inference speed for dramatically improved accuracy on complex tasks.

Red-Teaming

Red-teaming for LLMs is the practice of adversarially probing a model to discover safety failures, harmful behaviors, and alignment gaps before deployment by simulating malicious or misuse-oriented user inputs.

Reinforcement Learning from Human Feedback (RLHF)

RLHF is a training technique that improves LLM alignment with human preferences by training a reward model on human preference data, then using reinforcement learning to update the LLM to maximize this reward.

Safety Training

Safety training is the process of fine-tuning LLMs to refuse harmful requests, avoid dangerous content generation, and behave safely across adversarial inputs while maintaining helpfulness for legitimate use cases.

Scaling Laws

Scaling laws describe predictable mathematical relationships between LLM performance and scale—model size, training data, and compute—enabling researchers to forecast model capability improvements before building larger models.

Self-Attention

Self-attention is the core operation in transformer models where each token computes a weighted representation of all other tokens in the sequence, enabling every position to directly access information from every other position.
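A minimal scaled dot-product attention sketch on tiny vectors (single head, no learned projection matrices, toy values):

```python
import math

def self_attention(Q, K, V):
    """Scaled dot-product attention: each query position takes a softmax-
    weighted average of all value vectors, weighted by query-key similarity."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)  # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three tokens with 2-dimensional queries/keys/values.
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = self_attention(Q, K, V)
# Each output row is a convex combination of all value vectors, so every
# position directly incorporates information from every other position.
```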

Speculative Decoding

Speculative decoding uses a small 'draft' model to generate multiple candidate tokens quickly, then verifies them in parallel with the large target model, achieving 2-3x inference speedup without changing output quality.

Stop Sequence

A stop sequence is a string or list of strings that signals the LLM to stop generating when encountered, enabling precise control over where the response ends—useful for constrained generation and multi-turn dialogue management.
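Client-side, the effect is equivalent to truncating at the earliest occurrence of any stop string (a sketch with invented dialogue markers):

```python
def truncate_at_stop(text, stop_sequences):
    """Cut the generated text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

raw = "Answer: 42\nUser: what about"
clean = truncate_at_stop(raw, ["\nUser:", "\nAssistant:"])
# clean == "Answer: 42" — the model's attempt to continue the dialogue is cut off.
```

In practice the API stops generation server-side as soon as a stop sequence appears, which also saves output tokens.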

Structured Output

Structured output constrains LLM responses to follow a specific format—typically JSON with defined fields—enabling reliable parsing and integration with downstream systems rather than free-form text generation.

Sycophancy

Sycophancy is an LLM alignment failure where the model tells users what they want to hear rather than what is accurate, changing its stated positions to agree with user preferences even when the user is wrong.

System Prompt

A system prompt is a privileged instruction set provided to an LLM before the conversation begins, establishing the assistant's role, behavior, constraints, and capabilities for the entire session.

Temperature

Temperature is an LLM sampling parameter (typically 0-2) that controls output randomness: low values produce focused, deterministic responses while high values produce more varied, creative outputs.
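Under the hood, temperature divides the logits before the softmax (a sketch with made-up logits):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature before the softmax.

    Lower temperature sharpens the distribution (more deterministic);
    higher temperature flattens it (more varied output)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.2)  # nearly all mass on the top token
hot = softmax_with_temperature(logits, 2.0)   # a much flatter distribution
```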

Token Streaming

Token streaming delivers LLM responses to the user progressively as each token is generated, rather than waiting for the complete response, dramatically improving perceived responsiveness and enabling real-time interaction.

Token

A token is the basic unit of text an LLM processes—roughly 4 characters or 3/4 of an English word. LLM APIs charge per token, context windows are measured in tokens, and generation speed is measured in tokens per second.

Tokenization

Tokenization converts raw text into a sequence of tokens—the basic units an LLM processes—using algorithms like byte-pair encoding that split text into subword pieces rather than whole words or individual characters.

Tool Use

Tool use is the broader capability of LLMs to interact with external systems—executing code, browsing the web, querying databases, reading files—by calling tools during generation to retrieve information or take actions.

Top-K Sampling

Top-K sampling restricts token generation to the K most probable next tokens at each step, preventing the model from selecting rare or unlikely tokens while maintaining diversity within the top-K candidates.
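The filtering step can be sketched as zeroing out everything outside the top K and renormalizing (toy probabilities):

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize; all others get zero."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_k_filter(probs, 2)
# Only the two most probable tokens survive: [0.625, 0.375, 0.0, 0.0]
```

The model then samples from the filtered distribution, so rare tokens can never be chosen but diversity remains among the top candidates.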

Top-P Sampling (Nucleus Sampling)

Top-p sampling (nucleus sampling) restricts token generation to the smallest set of tokens whose cumulative probability exceeds p, dynamically adapting the candidate pool size based on the probability distribution.
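A sketch of the nucleus filter, showing how the candidate pool shrinks for peaked distributions and grows for flat ones (toy probabilities):

```python
def top_p_filter(probs, p):
    """Keep the smallest prefix of tokens (by descending probability)
    whose cumulative probability reaches p, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# A peaked distribution keeps only two candidates; a flat one keeps all four.
peaked = top_p_filter([0.85, 0.10, 0.03, 0.02], 0.9)
flat = top_p_filter([0.30, 0.28, 0.22, 0.20], 0.9)
```

This adaptivity is why top-p often behaves better than a fixed top-K: confident predictions stay focused while uncertain ones retain variety.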

Transformer

The transformer is the neural network architecture underlying all modern LLMs, using self-attention mechanisms to process entire input sequences in parallel and capture long-range dependencies between words.

Zero-Shot Learning

Zero-shot learning is the ability of LLMs to perform tasks from natural language instructions alone, without any task-specific examples, by generalizing from pre-training knowledge to new task types.
