Interpretability
Definition
Interpretability addresses the question: how does this model work internally, not just why did it produce this specific output? Intrinsically interpretable models—decision trees, linear regression, rule-based systems—have decision logic that humans can read directly. Mechanistic interpretability, applied to neural networks and LLMs, attempts to reverse-engineer what individual neurons, circuits, and attention heads compute, identifying the internal features and algorithms the model has learned. Probing classifiers test whether specific concepts (gender, sentiment, syntax) are encoded in specific model layers. Interpretability enables debugging, trust calibration, and scientific understanding of what models have learned.
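The probing idea above can be sketched in a few lines. This is a minimal illustration with synthetic stand-in activations (not a real model): the "concept" is injected along one activation dimension so a linear probe has something to find.

```python
# Minimal probing-classifier sketch using synthetic stand-in activations.
# In practice, `activations` would come from one layer of a real model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_size = 64
n_examples = 500

concept = rng.integers(0, 2, n_examples)           # 0/1 concept labels
activations = rng.normal(size=(n_examples, hidden_size))
activations[:, 0] += 2.0 * concept                 # concept lives in dim 0 (by construction)

X_tr, X_te, y_tr, y_te = train_test_split(activations, concept, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Held-out accuracy well above chance suggests the concept is linearly
# decodable from this layer; chance-level accuracy suggests it is not.
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```

A real probing study would compare accuracy across layers and against a control task to rule out the probe itself memorizing the data.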
Why It Matters
Interpretability is increasingly important as AI systems are deployed in high-stakes domains where 'trust me, the model said so' is insufficient. Regulators, auditors, and domain experts need to understand how AI systems work—not just what they output. For model debugging, interpretability research has revealed systematic errors: models 'solving' benchmarks through spurious correlations rather than genuine understanding, sentiment classifiers relying on specific punctuation patterns, and NLP models short-cutting on shallow lexical features. For safety research, understanding model internals is essential to verifying alignment—does the model represent the concepts we think it does?
How It Works
Mechanistic interpretability for LLMs uses: (1) probing—train a linear classifier on a model's internal activations to test if specific concepts are linearly encoded; (2) activation patching—replace activations at specific positions with counterfactual values and measure how the output changes (circuit tracing); (3) neuron analysis—identify what inputs maximally activate specific neurons and what concepts they represent; (4) attention analysis—visualize attention patterns to identify what the model attends to when producing specific outputs. Anthropic's research on 'superposition' and 'monosemantic features' represents state-of-the-art mechanistic interpretability work on transformer models.
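Activation patching (technique 2 above) can be illustrated on a toy network. This is a hypothetical two-layer model, not a transformer: we run a "clean" and a "corrupted" input, splice one clean hidden activation into the corrupted run, and measure how much the output moves.

```python
# Toy activation-patching sketch on a hypothetical 2-layer network.
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 8))   # input -> hidden
W2 = rng.normal(size=(8, 1))   # hidden -> output

def forward(x, patched_hidden=None):
    hidden = np.tanh(x @ W1)
    if patched_hidden is not None:
        hidden = hidden.copy()
        hidden[3] = patched_hidden[3]   # patch a single hidden unit
    return float(hidden @ W2)

clean = rng.normal(size=4)
corrupted = rng.normal(size=4)

clean_hidden = np.tanh(clean @ W1)
base = forward(corrupted)
patched = forward(corrupted, patched_hidden=clean_hidden)

# If unit 3 carries output-relevant information, patching it shifts the
# corrupted output; patching units with no causal role changes nothing.
print(f"corrupted: {base:.3f}  patched: {patched:.3f}  clean: {forward(clean):.3f}")
```

In circuit tracing on a real LLM, the same splice is done at specific token positions and layers, and the shift in output logits quantifies each component's causal contribution.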
Interpretability Methods
LIME — perturbs inputs to explain a single prediction
SHAP — Shapley values for feature attribution
Attention Maps — visualize which tokens the model focuses on
Probing Classifiers — test what concepts the model encodes
Concept Activation (TCAV) — measures global concept influence
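The Shapley values behind SHAP can be computed exactly for a tiny model by enumerating all feature subsets. This sketch uses a hypothetical 3-feature linear scoring function; libraries like SHAP approximate the same quantity efficiently for larger models.

```python
# Exact Shapley-value attribution for a toy 3-feature model.
from itertools import combinations
from math import factorial

def model(x):
    # Hypothetical linear scoring function.
    return 2.0 * x[0] + 1.0 * x[1] - 0.5 * x[2]

baseline = [0.0, 0.0, 0.0]   # reference input ("feature absent")
x = [1.0, 1.0, 1.0]          # input to explain
n = 3

def value(subset):
    # Evaluate with features in `subset` taken from x, others from baseline.
    z = [x[i] if i in subset else baseline[i] for i in range(n)]
    return model(z)

def shapley(i):
    total = 0.0
    others = [j for j in range(n) if j != i]
    for size in range(n):
        for subset in combinations(others, size):
            s = set(subset)
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            total += weight * (value(s | {i}) - value(s))
    return total

phi = [shapley(i) for i in range(n)]
# For a linear model, each Shapley value equals weight * (x_i - baseline_i),
# and by the efficiency property phi sums to model(x) - model(baseline).
print(phi)
```

Exact enumeration costs 2^n model evaluations, which is why SHAP relies on sampling and model-specific shortcuts (e.g. for trees) in practice.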
Real-World Example
Interpretability research revealed a concerning behavior in a medical NLP model: the model appeared to classify clinical notes by disease severity, but probing showed it had actually learned to use clinical note length as a proxy for severity (severe patients have longer notes due to more interventions). The model performed well on standard test sets but would catastrophically fail if note-length conventions changed (e.g., with a new EHR system). Without interpretability analysis, this spurious correlation would have gone undetected until it caused real patient harm. The finding led to architecture changes and feature engineering to force learning of clinically meaningful features.
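A length shortcut like this can be screened for with a simple diagnostic. This sketch uses fabricated stand-in data (no real clinical notes): if a model's severity scores still track note length *within* each true severity class, length is likely acting as a proxy.

```python
# Sketch of a spurious-correlation check on synthetic stand-in data.
import numpy as np

rng = np.random.default_rng(2)
n = 200
severity = rng.integers(0, 2, n)                             # true labels
note_length = 200 + 300 * severity + rng.normal(0, 50, n)    # length correlates with severity

# A "shortcut" model whose score is driven entirely by length.
score = (note_length - note_length.mean()) / note_length.std()

# A content-based model's scores would be roughly independent of length
# within each class; a shortcut model's remain strongly correlated.
for s in (0, 1):
    mask = severity == s
    r = np.corrcoef(score[mask], note_length[mask])[0, 1]
    print(f"class {s}: within-class corr(score, length) = {r:.2f}")
```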
Common Mistakes
- ✕ Conflating interpretability with explainability—interpretability examines internal mechanisms; explainability produces output-level reasons
- ✕ Assuming simple models are always more interpretable—a small decision tree may be legible, but a large decision tree is as opaque as a neural network
- ✕ Treating interpretability as a binary property—models exist on a spectrum, and the right level depends on the stakes and audience
Related Terms
Explainability
Explainability provides human-understandable reasons for why an AI system produced a specific output—enabling users, operators, and regulators to understand, audit, and trust AI decisions rather than treating the model as an inscrutable black box.
SHAP Values
SHAP (SHapley Additive exPlanations) values assign each feature a precise contribution score for a specific model prediction—using game theory to fairly distribute the prediction value among all input features for interpretable AI explanations.
AI Bias
AI bias is the systematic tendency of AI models to produce unfair outcomes for certain groups—arising from skewed training data, biased features, or flawed objective functions—leading to discriminatory predictions or decisions.
Responsible AI
Responsible AI is a framework of organizational practices and principles—encompassing fairness, transparency, privacy, safety, and accountability—that guide how teams build and deploy AI systems that are trustworthy and beneficial.
Algorithmic Fairness
Algorithmic fairness defines formal mathematical criteria for measuring and achieving equitable treatment across demographic groups in AI decision systems—including demographic parity, equalized odds, and individual fairness.