awesome-mechanistic-interpretability-LM-papers

This is a collection of awesome papers about Mechanistic Interpretability (MI) for Transformer-based Language Models (LMs), organized following our survey paper: A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models.

Papers are organized following our taxonomy (Figure 1). We have also curated a Beginner's Roadmap (Figure 2) with actionable items for people interested in applying MI to their own purposes.

Figure 1: Taxonomy

Figure 2: Beginner's Roadmap

How to Contribute: We welcome contributions from everyone! If you find any relevant papers that are not included in the list, please categorize them following our taxonomy and submit a request for update.

Questions/Comments/Suggestions: If you have any questions/comments/suggestions to share with us, you are welcome to report an issue here or reach out to us through drai2@gmu.edu and ziyuyao@gmu.edu.

How to Cite: If you find our survey useful for your research, please cite our paper:

@article{rai2024practical,
  title={A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models},
  author={Rai, Daking and Zhou, Yilun and Feng, Shi and Saparov, Abulhair and Yao, Ziyu},
  journal={arXiv preprint arXiv:2407.02646},
  year={2024}
}

Updates

  • July 2024: We have finished the first iteration of the paper collection. Contributions welcomed!
  • June 2024: GitHub repository launched! Still under construction.

Table of Contents

Paper Collection

Techniques

(Back to Table of Contents)

| Paper | Techniques | TL;DR |
|---|---|---|
| Interpreting GPT: the logit lens | Logit lens | The paper proposed the "logit lens" technique, which projects intermediate activations into the vocabulary space for interpretation. |
| Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space (EMNLP'22) | Logit lens | The paper showed that the "logit lens" can also be used to project the second parameter matrix of feed-forward sublayers into the vocabulary space for interpretation. |
| Analyzing Transformers in Embedding Space (ACL'23) | Logit lens | The paper proposed a conceptual framework in which all parameters of a trained Transformer are interpreted by projecting them into the vocabulary space. |
| Eliciting Latent Predictions from Transformers with the Tuned Lens | Logit lens | The paper proposed training affine probes ("translators") that map intermediate activations into the representation space of the final layer before applying the logit lens, improving its reliability. |
| Finding Neurons in a Haystack: Case Studies with Sparse Probing (TMLR'23) | Probing | The paper proposed a sparse probing technique to localize a feature to a neuron or set of neurons in activations. |
| Towards Monosemanticity: Decomposing Language Models With Dictionary Learning | SAE | The paper provided advice for training SAEs, including the architecture, dataset, and other hyperparameters. |
| Language models can explain neurons in language models | Automated Feature Explanation | The paper proposed using LLMs to generate feature labels automatically, along with a quantitative automatic explanation score to measure the quality of explanations. |
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (ICLR'23) | Mean-ablation, Path Patching | The paper proposed mean-ablation of activations and path patching. |
| Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | Random-ablation, Causal Scrubbing | The paper proposed random-ablation and causal scrubbing for evaluating the quality of mechanistic interpretations. |
| Locating and Editing Factual Associations in GPT (NeurIPS'22) | Activation Patching | The paper proposed using activation patching to localize the layers responsible for the model's factual predictions. |
| Localizing Model Behavior with Path Patching | Path Patching | The paper introduced path patching, a technique for localizing the important paths in a circuit. |
| Towards Automated Circuit Discovery for Mechanistic Interpretability (NeurIPS'23) | ACDC | The paper introduced the ACDC algorithm to automate the iterative localization process. |
| Attribution Patching: Activation Patching At Industrial Scale | Attribution Patching (AtP) | The blog post proposed attribution patching, an efficient technique to approximate the results of activation patching. |
| Attribution Patching Outperforms Automated Circuit Discovery | Edge Attribution Patching (EAP) | The paper introduced Edge Attribution Patching (EAP) as a more efficient alternative to ACDC for automatically identifying circuits. |
| AtP*: An efficient and scalable method for localizing LLM behavior to components | Attribution Patching | The paper introduced AtP*, a variant of AtP that addresses some of its failure modes. |
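
Many of these techniques are quick to try in code. For example, the logit lens amounts to projecting an intermediate residual-stream activation through the model's final LayerNorm and unembedding matrix and reading off the top tokens. The sketch below is a minimal illustration using the TransformerLens library (listed under Tools); the model, prompt, and printing choices are illustrative assumptions, not taken from any specific paper above.

```python
# Minimal logit-lens sketch with TransformerLens (illustrative, not from any
# specific paper above): project each layer's residual stream into vocab space.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # small model for illustration
prompt = "The Eiffel Tower is located in the city of"

with torch.no_grad():
    logits, cache = model.run_with_cache(prompt)
    for layer in range(model.cfg.n_layers):
        # Residual stream after this layer, at the final token position
        resid = cache["resid_post", layer][:, -1:, :]
        # Project into vocabulary space: final LayerNorm, then unembedding
        layer_logits = model.unembed(model.ln_final(resid))[0, -1]
        top_token = model.tokenizer.decode(int(layer_logits.argmax()))
        print(f"layer {layer:2d}: top token = {top_token!r}")
```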

Evaluation

(Back to Table of Contents)

| Paper | Evaluation | TL;DR |
|---|---|---|
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (ICLR'23) | Faithfulness, Completeness, Minimality | The paper proposed ablation-based techniques for evaluating the faithfulness, completeness, and minimality of a discovered circuit. |
| Softmax Linear Units | Faithfulness | For evaluation, the paper recruited human annotators to rate the interpretation of a feature based on its activations over texts. |
| Language models can explain neurons in language models | Faithfulness | The paper aimed to automate faithfulness evaluation. It introduced a quantitative automatic explanation score: a large LM simulates activations based on the automatically generated labels, and these are then compared with the ground-truth activations. |
| Towards Monosemanticity: Decomposing Language Models With Dictionary Learning | Plausibility | The paper found that attributing a model behavior to polysemantic neurons can be less plausible than attributing it to monosemantic ones. |
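
Most of these evaluations come down to intervening on the hypothesized components and measuring how much of the model's task behavior is preserved or destroyed. The sketch below is a minimal, hypothetical faithfulness-style check in TransformerLens: zero-ablate a single attention head (several papers above use mean-ablation instead) and compare a simple task metric, the logit difference between two candidate answer tokens, before and after. The prompt, answer tokens, and choice of head are illustrative assumptions, not drawn from the papers above.

```python
# Hypothetical faithfulness-style check: zero-ablate one attention head and
# compare the logit difference between two answer tokens before and after.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When John and Mary went to the store, John gave a drink to"
ans_id = model.to_single_token(" Mary")   # IOI-style answer (illustrative)
dis_id = model.to_single_token(" John")   # distractor

def logit_diff(logits):
    # Metric: answer logit minus distractor logit at the final position
    return (logits[0, -1, ans_id] - logits[0, -1, dis_id]).item()

clean_logits = model(prompt)
print("clean logit diff:  ", logit_diff(clean_logits))

LAYER, HEAD = 9, 9  # an arbitrary head to ablate (illustrative)

def zero_head(z, hook):
    # z has shape [batch, pos, head_index, d_head]; zero out one head's output
    z[:, :, HEAD, :] = 0.0
    return z

ablated_logits = model.run_with_hooks(
    prompt, fwd_hooks=[(utils.get_act_name("z", LAYER), zero_head)]
)
print("ablated logit diff:", logit_diff(ablated_logits))
```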

Findings and Applications

Findings on Features

(Back to Table of Contents)

| Paper | Techniques | Evaluation | TL;DR |
|---|---|---|---|
| Softmax Linear Units | Visualization | Faithfulness | The paper investigated how changing the activation function in LMs from ReLU to the Softmax Linear Unit affects the polysemanticity of neurons, and discovered "Base64 neurons" as an example. |
| Knowledge Neurons in Pretrained Transformers (ACL'22) | Visualization | Extrinsic evaluation (knowledge editing) | The paper designed a gradient-based attribution score that discovered "knowledge neurons" in the FF layers of BERT. |
| Finding Skill Neurons in Pre-trained Transformer-based Language Models (EMNLP'22) | Knockout | Faithfulness, Extrinsic evaluation (model pruning, cross-task prompt transfer indicator) | The paper found "skill neurons" in the FF sublayers of the RoBERTa-base model by measuring their correlation with the prediction labels. It also found that these neurons likely emerge from pre-training rather than prompt tuning. |
| Neurons in Large Language Models: Dead, N-gram, Positional | Logit Lens, Visualization | N/A | The paper found that many FF neurons in the early layers are "dead", while others target the removal of information or encode position information (i.e., "positional neurons"). |
| Toy Models of Superposition | Visualization | N/A | The paper confirmed the "superposition" hypothesis, showing that when features are sparse, the model tends to encode them in activation space in superposition. |
| Finding Neurons in a Haystack: Case Studies with Sparse Probing (TMLR'23) | Probing | Faithfulness | The paper proposed a sparse probing technique to localize a feature to a neuron or set of neurons in activations, and found examples of monosemanticity, polysemanticity, and superposition in LMs. |
| Towards Monosemanticity: Decomposing Language Models With Dictionary Learning | SAE, Visualization, Automated Feature Explanation | Plausibility, Automated Explanation Score | The paper employed SAEs to extract features from representations that exhibit superposition. |
| Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet | SAE, Visualization, Automated Feature Explanation | Extrinsic evaluation (LM generation steering) | The paper employed SAEs to extract features from representations that exhibit superposition. |
| [Interim research report] Taking features out of superposition with sparse autoencoders | SAE | N/A | The paper employed SAEs to extract features from representations that exhibit superposition. |
| (tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders | SAE, Visualization | Faithfulness | The paper reported finding over 600 monosemantic features in a small LM using SAEs. |
| Sparse Autoencoders Find Highly Interpretable Features in Language Models (ICLR'24) | SAE, Visualization | Automated Explanation Score, Knockout | The paper employed SAEs to extract features from representations that exhibit superposition. |
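
The SAEs used throughout this line of work are typically a single hidden layer trained to reconstruct model activations under an L1 sparsity penalty, so that individual hidden units ("features") fire for interpretable patterns. The sketch below is a minimal, self-contained toy version in PyTorch trained on random data; the sizes, hyperparameters, and training loop are illustrative assumptions rather than the exact recipes of the papers above.

```python
# Toy sparse autoencoder (SAE) sketch in PyTorch: reconstruct activations
# under an L1 sparsity penalty. Sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(features)          # reconstructed activation
        return recon, features

d_model, d_hidden = 256, 2048                   # expansion factor of 8 (illustrative)
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                 # sparsity strength (illustrative)

# In practice `acts` would be residual-stream or MLP activations collected from
# a language model; random data keeps the sketch self-contained.
acts = torch.randn(4096, d_model)

for step in range(100):
    batch = acts[torch.randint(0, acts.shape[0], (128,))]
    recon, feats = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```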

Findings on Circuits

(Back to Table of Contents)

Interpreting LM Behaviors
| Paper | Techniques | Evaluation | TL;DR |
|---|---|---|---|
| A mathematical framework for transformer circuits | Visualization | N/A | Discovered the circuit for detecting and continuing repeated subsequences in the input (e.g., "Mr D urs ley was thin and bold. Mr D" -> "urs"). |
| In-context learning and induction heads | Zero-Ablation, Visualization | Faithfulness | The paper demonstrated the importance of induction heads for in-context learning. |
| Towards automated circuit discovery for mechanistic interpretability (NeurIPS'23) | ACDC | Faithfulness | Discovered the circuit for greater-than operations. |
| A circuit for Python docstrings in a 4-layer attention-only transformer | Activation Patching, Visualization | N/A | Discovered the circuit for Python docstring formatting. |
| Progress measures for grokking via mechanistic interpretability (ICLR'23) | Zero-ablation, Mean-ablation, Visualization | Faithfulness | Discovered the circuit for modular addition. |
| The clock and the pizza: Two stories in mechanistic explanation of neural networks (NeurIPS'23) | Logit lens, Visualization | N/A | Discovered the circuit for modular addition. |
| Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla | Logit lens, Visualization, Activation Patching | N/A | Discovered the circuit for the multiple-choice question-answering task in the 70B Chinchilla LLM. |
| Sparse feature circuits: Discovering and editing interpretable causal graphs in language models | SAE, Attribution Patching, Visualization | Faithfulness, Completeness, Plausibility, Extrinsic (improving classifier generalization) | Discovered sparse feature circuits for subject-verb agreement. |
| Sparse autoencoders find highly interpretable features in language models (ICLR'24) | SAE, Knockout, Visualization, Automated Explanation Score | Faithfulness | Discovered an SAE feature circuit for the closing parenthesis. |
| Circuit component reuse across tasks in transformer language models (ICLR'24) | Activation Patching, Path Patching, Visualization | N/A | The paper showed that the same components are reused by different circuits to implement different tasks. |
| Increasing trust in language models through the reuse of verified circuits | Mean-ablation, Visualization, PCA | N/A | The paper showed that the same components are reused by different circuits to implement different tasks. |
| Knowledge Circuits in Pretrained Transformers | ACDC, Logit lens, Visualization | Completeness | Discovered knowledge circuits for factual recall. |
Interpreting Transformer Components
| Paper | Techniques | Evaluation | TL;DR |
|---|---|---|---|
| A mathematical framework for transformer circuits | Visualization | N/A | The paper showed that the residual stream (RS) of LMs can be viewed as a one-way communication channel that transfers information from earlier to later layers. It also showed that each attention head in the MHA sublayer of a layer operates independently and can be interpreted independently. In addition, the paper discovered "copying heads" in MHA. |
| Interpreting GPT: the logit lens | Visualization, Logit lens | N/A | The paper proposed viewing the RS as the LM's current "guess" for the output, which is iteratively refined layer by layer. |
| Copy suppression: Comprehensively understanding an attention head | Logit lens, Mean-Ablation, Visualization | N/A | The paper discovered "negative heads" in GPT-2 small that are responsible for reducing the logit values of tokens that have already appeared in the context. |
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Path Patching, Mean-ablation, Visualization | Faithfulness, Completeness, Minimality | Found "previous token heads" and "duplicate token heads" in MHA. |
| In-context Learning and Induction Heads | Zero-Ablation, Visualization | Faithfulness | Found induction heads in MHA. |
| Successor Heads: Recurring, Interpretable Attention Heads In The Wild (ICLR'24) | SAE, Probing, Mean-ablation, Activation Patching | N/A | Found successor heads in MHA. |
| Finding Neurons in a Haystack: Case Studies with Sparse Probing (TMLR'23) | Probing | Faithfulness | FF sublayers account for the majority of feature extraction. |
| Locating and editing factual associations in GPT (NeurIPS'22) | Activation Patching | Extrinsic (knowledge editing) | FF sublayers are responsible for storing pre-trained knowledge. |
| A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis (EMNLP'23) | Activation Patching, Visualization | N/A | FF sublayers perform arithmetic computation. |
| Transformer Feed-Forward Layers Are Key-Value Memories (EMNLP'21) | Visualization | N/A | The paper viewed FF sublayers as key-value stores, and demonstrated that earlier FF layers typically process shallow (syntactic or grammatical) input patterns, while later layers focus more on semantic patterns (e.g., text related to TV shows). |
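
Most of the circuit-level findings above rest on activation patching: run the model on a "corrupted" prompt, overwrite one activation with its value from a "clean" run, and measure how much of the clean behavior is restored. The sketch below is a minimal illustration with TransformerLens that patches the residual stream at a single layer and position; the prompts and the layer choice are illustrative assumptions, not taken from the papers above.

```python
# Minimal activation-patching sketch: patch the clean residual stream into a
# corrupted run at one layer/position and measure recovery of the prediction.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
clean_prompt = "The Eiffel Tower is in the city of"
corrupt_prompt = "The Colosseum is in the city of"
answer_id = model.to_single_token(" Paris")

_, clean_cache = model.run_with_cache(clean_prompt)

LAYER, POS = 6, -1  # where to patch (illustrative choice)

def patch_resid(resid, hook):
    # resid: [batch, pos, d_model]; copy in the clean activation at POS
    resid[:, POS, :] = clean_cache[hook.name][:, POS, :]
    return resid

corrupt_logits = model(corrupt_prompt)
patched_logits = model.run_with_hooks(
    corrupt_prompt,
    fwd_hooks=[(utils.get_act_name("resid_pre", LAYER), patch_resid)],
)
print("P(' Paris') corrupted:", corrupt_logits[0, -1].softmax(-1)[answer_id].item())
print("P(' Paris') patched:  ", patched_logits[0, -1].softmax(-1)[answer_id].item())
```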

Findings on Universality

(Back to Table of Contents)

| Paper | TL;DR |
|---|---|
| Successor Heads: Recurring, Interpretable Attention Heads In The Wild | The paper identified an interpretable set of attention heads, termed "successor heads", which perform incrementation in LMs (e.g., Monday -> Tuesday, second -> third) across various scales and architectures. |
| In-context Learning and Induction Heads | The paper found induction heads across multiple LMs. |
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | The paper found duplicate token heads across multiple LMs. |
| Circuit component reuse across tasks in transformer language models (ICLR'24) | The paper found that different circuits implementing different tasks (the IOI and Colored Objects tasks) reuse the same components (e.g., induction heads), demonstrating universality across tasks. |
| Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale (ACL'23) | The paper studied the importance of each component in an OPT-66B model across 14 tasks and found that some attention heads are task-agnostic. |
| The clock and the pizza: Two stories in mechanistic explanation of neural networks (NeurIPS'23) | The paper discovered that two LMs trained with different initializations can develop qualitatively different circuits for the modular addition task. |
| A toy model of universality: Reverse engineering how networks learn group operations (ICML'23) | The paper found that LMs trained to perform group composition on finite groups from different random weight initializations do not develop similar representations and circuits. |
| Universal Neurons in GPT2 Language Models | The paper found that only about 1-5% of neurons in GPT-2 models trained from different random initializations exhibit universality. |

Findings on Model Capabilities

(Back to Table of Contents)

| Paper | Techniques | Evaluation | TL;DR |
|---|---|---|---|
| A mathematical framework for transformer circuits | Visualization | N/A | The paper studied a simplified case of in-context learning and discovered an induction circuit composed of attention heads with specialized roles (e.g., induction heads). |
| In-context Learning and Induction Heads | Zero-Ablation, Visualization | Faithfulness | The paper discovered induction heads in in-context learning (ICL) and studied whether they provide the primary mechanism behind the majority of ICL. |
| Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale (ACL'23) | Zero-Ablation | Faithfulness, Extrinsic | The paper found that different Transformer components contribute very differently to in-context learning (ICL), such that removing the unimportant ones (70% of attention heads and 20% of FF modules) does not strongly affect model performance. |
| Identifying Semantic Induction Heads to Understand In-Context Learning (ACL'24) | Visualization, Logit lens | Faithfulness | The paper investigated few-shot ICL and identified "semantic induction heads", which, unlike prior induction heads, model the semantic relationship between the input and the output token (e.g., "I have a nice pen for writing. The pen is nice to" -> "write"). |
| A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis (EMNLP'23) | Activation Patching, Visualization | N/A | The paper studied arithmetic reasoning and found that attention heads transfer information from operand and operator tokens to the RS of the answer or output token, with FF modules subsequently computing the answer token. |
| How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning (TMLR'24) | Activation Patching, Mean-ablation, Probing, Logit lens | N/A | The paper studied chain-of-thought (CoT) multi-step reasoning over fictional ontologies and found that LLMs appear to deploy multiple different neural pathways in parallel to compute the final answer. |
| An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs (ACL'24) | Logit Lens | Faithfulness | The paper investigated neuron activation as a unified lens to explain how CoT elicits the arithmetic reasoning of LLMs, including phenomena that were only empirically discussed in prior work. |
| A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task | Probing, Activation Patching, Causal Scrubbing | Faithfulness | The paper discovered an interpretable algorithm in an LM for the task of pathfinding in trees. |
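
Induction heads recur throughout this section, and a common diagnostic is to feed the model a repeated random token sequence and measure how much attention each head pays from a token back to the token that followed its previous occurrence. The sketch below is a minimal version of that diagnostic in TransformerLens; the sequence length and score threshold are illustrative assumptions.

```python
# Minimal induction-head diagnostic: on a repeated random sequence, an
# induction head at destination position i attends back to i - (seq_len - 1),
# i.e., the token right after the previous occurrence of the current token.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
seq_len = 50
bos = torch.tensor([[model.tokenizer.bos_token_id]])
rand = torch.randint(100, model.cfg.d_vocab, (1, seq_len))
tokens = torch.cat([bos, rand, rand], dim=1)  # BOS + sequence + repeated sequence

_, cache = model.run_with_cache(tokens)

dest = torch.arange(seq_len + 1, 2 * seq_len + 1)  # positions of the second copy
src = dest - (seq_len - 1)                         # "induction" source positions

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]           # [head, dest_pos, src_pos]
    scores = pattern[:, dest, src].mean(dim=-1)    # mean induction attention per head
    for head, s in enumerate(scores):
        if s > 0.4:                                # illustrative threshold
            print(f"layer {layer}, head {head}: induction score {float(s):.2f}")
```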

Findings on Learning Dynamics

(Back to Table of Contents)

| Paper | Techniques | Evaluation | TL;DR |
|---|---|---|---|
| In-context learning and induction heads | Zero-Ablation, Visualization | Faithfulness | The paper showed that Transformer-based LMs underwent a "phase change" early in training, during which induction heads formed and in-context learning simultaneously improved dramatically. |
| Progress measures for grokking via mechanistic interpretability (ICLR'23) | Visualization | Faithfulness | The paper investigated the grokking phenomenon during model training and showed that grokking, rather than being a sudden shift, consists of three continuous phases: memorization, circuit formation, and cleanup. |
| Explaining grokking through circuit efficiency | Visualization | N/A | The paper explained grokking as a consequence of models preferring the more efficient (in terms of parameter norm) "generalising circuit" over the less efficient "memorising circuit", with different training set sizes (and the implied data complexities) leading to different efficiency regimes. The paper also introduced the concepts of "ungrokking" and "semi-grokking". |
| Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs | Visualization | Faithfulness | The paper showed that sudden drops in the loss during training corresponded to the acquisition of attention heads that recognize specific syntactic relations. Experiments were conducted on BERT. |
| Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition | Visualization | N/A | The paper provided a unified explanation for grokking, double descent, and emergent abilities as a competition between memorization and generalization circuits. It particularly discussed the role of model size and extended the experiments to a multi-task learning paradigm. |
| Fine-tuning enhances existing mechanisms: A case study on entity tracking (ICLR'24) | Path Patching, Activation Patching | Faithfulness, Completeness, Minimality | The paper investigated the underlying changes in mechanisms (e.g., task-relevant circuits) to understand performance gains in fine-tuned LMs. The authors found that fine-tuning does not fundamentally change the mechanisms but enhances existing ones. |

Applications of MI

(Back to Table of Contents)

| Paper | Techniques | Evaluation | TL;DR |
|---|---|---|---|
| Locating and Editing Factual Associations in GPT (NeurIPS'22) | Activation Patching | Extrinsic (knowledge editing) | The paper used activation patching to localize the components responsible for storing factual knowledge, and then edited a fact (e.g., replacing "Seattle" with "Paris") by updating only the parameters of those components. |
| Dissecting Recall of Factual Associations in Auto-Regressive Language Models (EMNLP'23) | Activation Patching | Faithfulness | The paper investigated how factual associations are stored and extracted internally in LMs, facilitating future research on knowledge localization and editing. |
| Locating and editing factual associations in Mamba | Activation Patching, Zero-Ablation | Faithfulness, Extrinsic (knowledge editing) | The paper explored locating, recalling, and editing facts in Mamba. |
| Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space (EMNLP'22) | Logit Lens | Faithfulness, Extrinsic (early exit prediction, toxic language suppression) | The paper suppressed toxic language generation by identifying and manually activating neurons in FF layers that promote non-toxic or safe words. It also showed that concept promotion in the FF sublayer can be used for self-supervised early exit prediction for efficient model inference. |
| Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet | SAE, Visualization, Automated Feature Explanation | Faithfulness, Extrinsic (LM generation steering) | The paper identified safety-related features (e.g., unsafe code, gender bias) and manipulated their activations to steer the LM towards (un)desired behaviors (e.g., safe code generation, unbiased text generation). |
| Emergent linear representations in world models of self-supervised sequence models (ACL'23 BlackboxNLP) | Probing | Faithfulness, Extrinsic (LM generation steering) | The paper demonstrated that an LM's output can be altered (e.g., flipping a player's turn in the game of Othello from YOURS to MINE) by pushing its activation in the direction of a linear vector representing the desired behavior, identified using a linear probe. |
| Toy Models of Superposition | N/A | N/A | Discussed the "enumerative safety" implication of superposition. |
| What would be the most safety-relevant features in Language Models? | N/A | N/A | A discussion of feature discovery for AI safety. |
| Eliciting latent predictions from transformers with the tuned lens | Logit Lens | Faithfulness, Extrinsic (prompt injection detection) | The paper used the tuned lens to detect prompt injection. |
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Mean-Ablation, Path Patching | Faithfulness, Completeness, Minimality, Extrinsic (adversarial example generation) | The paper designed adversarial examples for the IOI task based on insights from the discovered circuit. |
| Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | SAE, Activation Patching | Faithfulness, Completeness, Extrinsic (improving classifier generalization) | The paper improved the generalization of classifiers by identifying and ablating spurious features that humans consider task-irrelevant. |
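
Several of these applications steer a model by adding a feature or concept direction to its activations at inference time. The sketch below is a minimal, hypothetical version of that idea in TransformerLens: derive a crude "direction" from the difference between the residual-stream activations of two contrasting prompts, then add it during generation. The prompts, layer, and scaling factor are illustrative assumptions; the papers above use more principled directions (e.g., SAE features or probe weights).

```python
# Hypothetical activation-steering sketch: add a direction (difference of
# residual-stream activations for two contrasting prompts) during generation.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
LAYER = 6    # layer to intervene at (illustrative)
SCALE = 5.0  # steering strength (illustrative)

# Crude "sentiment" direction from two contrasting prompts (illustrative)
_, cache_pos = model.run_with_cache("This movie was absolutely wonderful and delightful")
_, cache_neg = model.run_with_cache("This movie was absolutely terrible and awful")
direction = cache_pos["resid_post", LAYER][0, -1] - cache_neg["resid_post", LAYER][0, -1]
direction = direction / direction.norm()

def steer(resid, hook):
    # resid: [batch, pos, d_model]; nudge every position along the direction
    return resid + SCALE * direction

model.add_hook(utils.get_act_name("resid_post", LAYER), steer)
print(model.generate("I thought the film was", max_new_tokens=20, verbose=False))
model.reset_hooks()  # remove the steering hook afterwards
```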

Tools

(Back to Table of Contents)

| Tool | TL;DR |
|---|---|
| CircuitsVis | A library for attention visualization. |
| TransformerLens | A library for mechanistic interpretability of GPT-2-style language models. |
| Transformer Debugger | A tool that supports investigations into specific behaviors of small language models, combining automated interpretability techniques with sparse autoencoders. |
| LM Debugger | An open-source interactive tool for inspection and intervention in Transformer-based language models. |
| Neuroscope | A repository of maximally activating dataset examples for each neuron in several LMs. |
| Neuronpedia | A platform for MI research, focused on SAEs, that allows researchers to host models, create feature dashboards, visualize data, and access various tools. |
| pyvene | A library for understanding and improving PyTorch models via interventions. |
| nnsight | A library for interpreting and manipulating the internals of deep learning models. |
| Penzai | A JAX research toolkit for building, editing, and visualizing neural networks. |
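
As a starting point, the sketch below shows how two of these tools are commonly combined: load a model and cache its activations with TransformerLens, then render an attention pattern with CircuitsVis. The model, prompt, and layer choice are illustrative; CircuitsVis is designed for notebook environments, where the returned object displays as an interactive widget.

```python
# Quick-start sketch combining TransformerLens and CircuitsVis (illustrative).
import circuitsvis as cv
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When Mary and John went to the store, John gave a drink to"
logits, cache = model.run_with_cache(prompt)

LAYER = 5  # illustrative layer choice
cv.attention.attention_patterns(
    tokens=model.to_str_tokens(prompt),
    attention=cache["pattern", LAYER][0],  # [head, dest_pos, src_pos]
)
```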
