AI Infrastructure, Safety & Ethics

Interpretability

Definition

Interpretability addresses the question: how does this model work internally, not just why did it produce this specific output? Intrinsically interpretable models—decision trees, linear regression, rule-based systems—have decision logic that humans can read directly. Mechanistic interpretability, applied to neural networks and LLMs, attempts to reverse-engineer what individual neurons, circuits, and attention heads are computing, identifying the internal features and algorithms the model has learned. Probing classifiers test whether specific concepts (gender, sentiment, syntax) are encoded in particular model layers. Interpretability enables debugging, trust calibration, and scientific understanding of what models have learned.
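A minimal sketch of the probing idea: train a linear classifier on a layer's activations and check whether a concept is linearly decodable. The activations here are synthetic (a planted "concept direction" stands in for real model hidden states), so the dimensions and signal strength are illustrative assumptions, not properties of any actual model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for one layer's activations: 1000 examples, 64-dim.
# We plant a concept direction so the concept is linearly decodable.
concept = rng.integers(0, 2, size=1000)            # binary concept labels
direction = rng.normal(size=64)                    # planted concept direction
acts = rng.normal(size=(1000, 64)) + np.outer(concept * 2 - 1, direction)

X_tr, X_te, y_tr, y_te = train_test_split(acts, concept, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# High held-out accuracy suggests the concept is linearly encoded at this
# layer; near-chance accuracy would suggest it is not.
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```

In practice the activations come from a forward hook on a real model, and probes are trained layer by layer to see where a concept emerges.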

Why It Matters

Interpretability is increasingly important as AI systems are deployed in high-stakes domains where 'trust me, the model said so' is insufficient. Regulators, auditors, and domain experts need to understand how AI systems work—not just what they output. For model debugging, interpretability research has revealed systematic errors: models 'solving' benchmarks through spurious correlations rather than genuine understanding, sentiment classifiers relying on specific punctuation patterns, and NLP models short-cutting on shallow lexical features. For safety research, understanding model internals is essential to verifying alignment—does the model represent the concepts we think it does?

How It Works

Mechanistic interpretability for LLMs uses: (1) probing—train a linear classifier on a model's internal activations to test if specific concepts are linearly encoded; (2) activation patching—replace activations at specific positions with counterfactual values and measure how the output changes (circuit tracing); (3) neuron analysis—identify what inputs maximally activate specific neurons and what concepts they represent; (4) attention analysis—visualize attention patterns to identify what the model attends to when producing specific outputs. Anthropic's research on 'superposition' and 'monosemantic features' represents state-of-the-art mechanistic interpretability work on transformer models.
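The activation-patching logic in step (2) can be sketched on a toy two-layer linear "model" (real work hooks into a transformer's residual stream; the weights and shapes here are illustrative assumptions): run a clean input, overwrite one hidden activation with its value from a counterfactual run, and measure how the output moves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer "model": hidden = W1 @ x, out = W2 @ hidden.
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(1, 8))

def run(x, patch=None):
    hidden = W1 @ x
    if patch is not None:
        idx, value = patch
        hidden = hidden.copy()
        hidden[idx] = value          # overwrite one activation mid-forward-pass
    return float(W2 @ hidden)

x_clean, x_corrupt = rng.normal(size=4), rng.normal(size=4)
h_corrupt = W1 @ x_corrupt           # counterfactual activations

baseline = run(x_clean)
# Patch each hidden unit with its counterfactual value; large output shifts
# mark units that are causally important for this particular output.
effects = [abs(run(x_clean, patch=(i, h_corrupt[i])) - baseline)
           for i in range(8)]
print("per-unit causal effect:", np.round(effects, 3))
```

Ranking units (or attention heads, in a transformer) by this causal effect is the basic move behind circuit tracing.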

Interpretability Methods

| Method | Scope | Description |
|---|---|---|
| LIME | Local | Perturbs inputs to explain one prediction |
| SHAP | Local | Shapley values for feature attribution |
| Attention Maps | Local | Visualize which tokens the model focuses on |
| Probing Classifiers | Global | Test which concepts the model encodes |
| Concept Activation (TCAV) | Global | Measures a concept's global influence on predictions |
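The local perturbation idea behind LIME can be shown without its surrogate model: mask each feature of one input and record how much the prediction changes (occlusion-style attribution). The toy linear model and weights below are assumptions for illustration, not any library's API.

```python
import numpy as np

# Toy model to explain (assumed weights); real use wraps any black-box predictor.
w = np.array([2.0, -1.0, 0.0, 0.5])

def model(x):
    return float(w @ x)

x = np.array([1.0, 1.0, 1.0, 1.0])   # the single input being explained
base = model(x)

attributions = []
for i in range(len(x)):
    x_masked = x.copy()
    x_masked[i] = 0.0                # perturb one feature at a time
    attributions.append(base - model(x_masked))

# Each entry is that feature's local contribution to this prediction.
print("per-feature attribution:", attributions)  # [2.0, -1.0, 0.0, 0.5]
```

LIME proper samples many such perturbations and fits a weighted linear surrogate; SHAP averages contributions over feature coalitions, which yields the same ranking here because the toy model is linear.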

Real-World Example

Interpretability research revealed a concerning behavior in a medical NLP model: the model appeared to classify clinical notes by disease severity, but probing showed it had actually learned to use clinical note length as a proxy for severity (severe patients have longer notes due to more interventions). The model performed well on standard test sets but would catastrophically fail if note-length conventions changed (e.g., with a new EHR system). Without interpretability analysis, this spurious correlation would have gone undetected until it caused real patient harm. The finding led to architecture changes and feature engineering to force learning of clinically meaningful features.

Common Mistakes

  • Conflating interpretability with explainability—interpretability examines internal mechanisms; explainability produces output-level reasons
  • Assuming simple models are always more interpretable—a small decision tree may be legible, but a large one can be as opaque as a neural network
  • Using interpretability as a binary property—models exist on a spectrum of interpretability; the right level depends on the stakes and audience
