awesome-mechanistic-interpretability-LM-papers

This is a collection of awesome papers about Mechanistic Interpretability (MI) for Transformer-based Language Models (LMs), organized following our survey paper: A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models.

Papers are organized following our taxonomy (Figure 1). We have also curated a Beginner's Roadmap (Figure 2) with actionable items for people interested in applying MI to their own purposes.

Figure 1: Taxonomy

Figure 2: Beginner's Roadmap

How to Contribute: We welcome contributions from everyone! If you find any relevant papers that are not included in the list, please categorize them following our taxonomy and submit a request for update.

Questions/Comments/Suggestions: If you have any questions/comments/suggestions to share with us, you are welcome to report an issue here or reach out to us through drai2@gmu.edu and ziyuyao@gmu.edu.

How to Cite: If you find our survey useful for your research, please cite our paper:

@article{rai2024practical,
  title={A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models},
  author={Rai, Daking and Zhou, Yilun and Feng, Shi and Saparov, Abulhair and Yao, Ziyu},
  journal={arXiv preprint arXiv:2407.02646},
  year={2024}
}

Updates

  • July 2024: We have finished the first iteration of the paper collection. Contributions welcomed!
  • June 2024: GitHub repository launched! Still under construction.

Table of Contents

Paper Collection

Techniques

(Back to Table of Contents)

| Paper | Techniques | TL;DR |
|---|---|---|
| Interpreting GPT: the logit lens | Logit lens | The paper proposed the "logit lens" technique, which projects intermediate activations into the vocabulary space for interpretation. |
| Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space (EMNLP'22) | Logit lens | The paper showed that the "logit lens" can also be used to project the second parameter matrix of feed-forward sublayers into the vocabulary space for interpretation. |
| Analyzing Transformers in Embedding Space (ACL'23) | Logit lens | The paper proposed a conceptual framework in which all parameters of a trained Transformer are interpreted by projecting them into the vocabulary space. |
| Eliciting Latent Predictions from Transformers with the Tuned Lens | Logit lens | The paper proposed training affine probes ("translators") that map intermediate activations into the representation space of the final layer before applying the logit lens, improving its reliability. |
| Finding Neurons in a Haystack: Case Studies with Sparse Probing (TMLR'23) | Probing | The paper proposed a sparse probing technique to localize a feature to a neuron or set of neurons in activations. |
| Towards Monosemanticity: Decomposing Language Models With Dictionary Learning | SAE | The paper provided advice for training SAEs, including the architecture, dataset, and other hyperparameters. |
| Language models can explain neurons in language models | Automated Feature Explanation | The paper proposed using LLMs to generate feature labels automatically, along with a quantitative automatic explanation score to measure the quality of explanations. |
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (ICLR'23) | Mean-ablation, Path Patching | The paper proposed mean-ablation of activations and path patching. |
| Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | Random-ablation, Causal Scrubbing | The paper proposed random-ablation and causal scrubbing for evaluating the quality of mechanistic interpretations. |
| Locating and Editing Factual Associations in GPT (NeurIPS'22) | Activation Patching | The paper proposed using activation patching to localize the layers responsible for the model's factual predictions. |
| Localizing Model Behavior with Path Patching | Path Patching | The paper introduced path patching, a technique for localizing the important paths in a circuit. |
| Towards Automated Circuit Discovery for Mechanistic Interpretability (NeurIPS'23) | ACDC | The paper introduced the ACDC algorithm to automate the iterative localization process. |
| Attribution Patching: Activation Patching At Industrial Scale | Attribution Patching (AtP) | The blog post proposed attribution patching, an efficient technique to approximate the results of activation patching. |
| Attribution Patching Outperforms Automated Circuit Discovery | Edge Attribution Patching (EAP) | The paper introduced Edge Attribution Patching (EAP) as a more efficient alternative to ACDC for automatically identifying circuits. |
| AtP*: An efficient and scalable method for localizing LLM behavior to components | Attribution Patching | The paper introduced AtP*, a variant of AtP that addresses some of its failure modes. |
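
Many of these techniques are quick to try in code. For example, the logit lens amounts to projecting an intermediate residual-stream activation through the model's final LayerNorm and unembedding matrix and reading off the top tokens. The sketch below is a minimal illustration using the TransformerLens library (listed under Tools); the model, prompt, and printing choices are illustrative assumptions, not taken from any specific paper above.

```python
# Minimal logit-lens sketch with TransformerLens (illustrative, not from any
# specific paper above): project each layer's residual stream into vocab space.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # small model for illustration
prompt = "The Eiffel Tower is located in the city of"

with torch.no_grad():
    logits, cache = model.run_with_cache(prompt)
    for layer in range(model.cfg.n_layers):
        # Residual stream after this layer, at the final token position
        resid = cache["resid_post", layer][:, -1:, :]
        # Project into vocabulary space: final LayerNorm, then unembedding
        layer_logits = model.unembed(model.ln_final(resid))[0, -1]
        top_token = model.tokenizer.decode(int(layer_logits.argmax()))
        print(f"layer {layer:2d}: top token = {top_token!r}")
```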

Evaluation

(Back to Table of Contents)

| Paper | Evaluation | TL;DR |
|---|---|---|
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (ICLR'23) | Faithfulness, Completeness, Minimality | The paper proposed ablation-based techniques for evaluating the faithfulness, completeness, and minimality of a discovered circuit. |
| Softmax Linear Units | Faithfulness | For evaluation, the paper recruited human annotators to rate the interpretation of a feature based on its activations over texts. |
| Language models can explain neurons in language models | Faithfulness | The paper aimed to automate faithfulness evaluation. It introduced a quantitative automatic explanation score: a large LM simulates activations based on the automatically generated labels, and these are then compared with the ground-truth activations. |
| Towards Monosemanticity: Decomposing Language Models With Dictionary Learning | Plausibility | The paper found that attributing a model behavior to polysemantic neurons can be less plausible than attributing it to monosemantic ones. |
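
Most of these evaluations come down to intervening on the hypothesized components and measuring how much of the model's task behavior is preserved or destroyed. The sketch below is a minimal, hypothetical faithfulness-style check in TransformerLens: zero-ablate a single attention head (several papers above use mean-ablation instead) and compare a simple task metric, the logit difference between two candidate answer tokens, before and after. The prompt, answer tokens, and choice of head are illustrative assumptions, not drawn from the papers above.

```python
# Hypothetical faithfulness-style check: zero-ablate one attention head and
# compare the logit difference between two answer tokens before and after.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When John and Mary went to the store, John gave a drink to"
ans_id = model.to_single_token(" Mary")   # IOI-style answer (illustrative)
dis_id = model.to_single_token(" John")   # distractor

def logit_diff(logits):
    # Metric: answer logit minus distractor logit at the final position
    return (logits[0, -1, ans_id] - logits[0, -1, dis_id]).item()

clean_logits = model(prompt)
print("clean logit diff:  ", logit_diff(clean_logits))

LAYER, HEAD = 9, 9  # an arbitrary head to ablate (illustrative)

def zero_head(z, hook):
    # z has shape [batch, pos, head_index, d_head]; zero out one head's output
    z[:, :, HEAD, :] = 0.0
    return z

ablated_logits = model.run_with_hooks(
    prompt, fwd_hooks=[(utils.get_act_name("z", LAYER), zero_head)]
)
print("ablated logit diff:", logit_diff(ablated_logits))
```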

Findings and Applications

Findings on Features

(Back to Table of Contents)

| Paper | Techniques | Evaluation | TL;DR |
|---|---|---|---|
| Softmax Linear Units | Visualization | Faithfulness | The paper investigated how changing the activation function in LMs from ReLU to the Softmax Linear Unit affects the polysemanticity of neurons, and discovered "Base64 neurons" as an example. |
| Knowledge Neurons in Pretrained Transformers (ACL'22) | Visualization | Extrinsic evaluation (knowledge editing) | The paper designed a gradient-based attribution score that discovered "knowledge neurons" in the FF layers of BERT. |
| Finding Skill Neurons in Pre-trained Transformer-based Language Models (EMNLP'22) | Knockout | Faithfulness, Extrinsic evaluation (model pruning, cross-task prompt transfer indicator) | The paper found "skill neurons" in the FF sublayers of the RoBERTa-base model by measuring their correlation with the prediction labels. It also found that these neurons likely emerge from pre-training rather than prompt tuning. |
| Neurons in Large Language Models: Dead, N-gram, Positional | Logit Lens, Visualization | N/A | The paper found that many FF neurons in the early layers are "dead", while others target the removal of information or encode position information (i.e., "positional neurons"). |
| Toy Models of Superposition | Visualization | N/A | The paper confirmed the "superposition" hypothesis, showing that when features are sparse, the model tends to encode them in activation space in superposition. |
| Finding Neurons in a Haystack: Case Studies with Sparse Probing (TMLR'23) | Probing | Faithfulness | The paper proposed a sparse probing technique to localize a feature to a neuron or set of neurons in activations, and found examples of monosemanticity, polysemanticity, and superposition in LMs. |
| Towards Monosemanticity: Decomposing Language Models With Dictionary Learning | SAE, Visualization, Automated Feature Explanation | Plausibility, Automated Explanation Score | The paper employed SAEs to extract features from representations that exhibit superposition. |
| Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet | SAE, Visualization, Automated Feature Explanation | Extrinsic evaluation (LM generation steering) | The paper employed SAEs to extract features from representations that exhibit superposition. |
| [Interim research report] Taking features out of superposition with sparse autoencoders | SAE | N/A | The paper employed SAEs to extract features from representations that exhibit superposition. |
| (tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders | SAE, Visualization | Faithfulness | The paper reported finding over 600 monosemantic features in a small LM using SAEs. |
| Sparse Autoencoders Find Highly Interpretable Features in Language Models (ICLR'24) | SAE, Visualization | Automated Explanation Score, Knockout | The paper employed SAEs to extract features from representations that exhibit superposition. |
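
The SAEs used throughout this line of work are typically a single hidden layer trained to reconstruct model activations under an L1 sparsity penalty, so that individual hidden units ("features") fire for interpretable patterns. The sketch below is a minimal, self-contained toy version in PyTorch trained on random data; the sizes, hyperparameters, and training loop are illustrative assumptions rather than the exact recipes of the papers above.

```python
# Toy sparse autoencoder (SAE) sketch in PyTorch: reconstruct activations
# under an L1 sparsity penalty. Sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(features)          # reconstructed activation
        return recon, features

d_model, d_hidden = 256, 2048                   # expansion factor of 8 (illustrative)
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                 # sparsity strength (illustrative)

# In practice `acts` would be residual-stream or MLP activations collected from
# a language model; random data keeps the sketch self-contained.
acts = torch.randn(4096, d_model)

for step in range(100):
    batch = acts[torch.randint(0, acts.shape[0], (128,))]
    recon, feats = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```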

Findings on Circuits

(Back to Table of Contents)

Interpreting LM Behaviors
| Paper | Techniques | Evaluation | TL;DR |
|---|---|---|---|
| A mathematical framework for transformer circuits | Visualization | N/A | Discovered the circuit for detecting and continuing repeated subsequences in the input (e.g., "Mr D urs ley was thin and bold. Mr D" -> "urs"). |
| In-context learning and induction heads | Zero-Ablation, Visualization | Faithfulness | The paper demonstrated the importance of induction heads for in-context learning. |
| Towards automated circuit discovery for mechanistic interpretability (NeurIPS'23) | ACDC | Faithfulness | Discovered the circuit for greater-than operations. |
| A circuit for Python docstrings in a 4-layer attention-only transformer | Activation Patching, Visualization | N/A | Discovered the circuit for Python docstring formatting. |
| Progress measures for grokking via mechanistic interpretability (ICLR'23) | Zero-ablation, Mean-ablation, Visualization | Faithfulness | Discovered the circuit for modular addition. |
| The clock and the pizza: Two stories in mechanistic explanation of neural networks (NeurIPS'23) | Logit lens, Visualization | N/A | Discovered the circuit for modular addition. |
| Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla | Logit lens, Visualization, Activation Patching | N/A | Discovered the circuit for the multiple-choice question-answering task in the 70B Chinchilla LLM. |
| Sparse feature circuits: Discovering and editing interpretable causal graphs in language models | SAE, Attribution Patching, Visualization | Faithfulness, Completeness, Plausibility, Extrinsic (improving classifier generalization) | Discovered sparse feature circuits for subject-verb agreement. |
| Sparse autoencoders find highly interpretable features in language models (ICLR'24) | SAE, Knockout, Visualization, Automated Explanation Score | Faithfulness | Discovered an SAE feature circuit for the closing parenthesis. |
| Circuit component reuse across tasks in transformer language models (ICLR'24) | Activation Patching, Path Patching, Visualization | N/A | The paper showed that the same components are reused by different circuits to implement different tasks. |
| Increasing trust in language models through the reuse of verified circuits | Mean-ablation, Visualization, PCA | N/A | The paper showed that the same components are reused by different circuits to implement different tasks. |
| Knowledge Circuits in Pretrained Transformers | ACDC, Logit lens, Visualization | Completeness | Discovered knowledge circuits for factual recall. |
Interpreting Transformer Components
| Paper | Techniques | Evaluation | TL;DR |
|---|---|---|---|
| A mathematical framework for transformer circuits | Visualization | N/A | The paper showed that the residual stream (RS) of LMs can be viewed as a one-way communication channel that transfers information from earlier to later layers. It also showed that each attention head in the MHA sublayer of a layer operates independently and can be interpreted independently. In addition, the paper discovered "copying heads" in MHA. |
| Interpreting GPT: the logit lens | Visualization, Logit lens | N/A | The paper proposed viewing the RS as the LM's current "guess" for the output, which is iteratively refined layer by layer. |
| Copy suppression: Comprehensively understanding an attention head | Logit lens, Mean-Ablation, Visualization | N/A | The paper discovered "negative heads" in GPT-2 small that are responsible for reducing the logit values of tokens that have already appeared in the context. |
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Path Patching, Mean-ablation, Visualization | Faithfulness, Completeness, Minimality | Found "previous token heads" and "duplicate token heads" in MHA. |
| In-context Learning and Induction Heads | Zero-Ablation, Visualization | Faithfulness | Found induction heads in MHA. |
| Successor Heads: Recurring, Interpretable Attention Heads In The Wild (ICLR'24) | SAE, Probing, Mean-ablation, Activation Patching | N/A | Found successor heads in MHA. |
| Finding Neurons in a Haystack: Case Studies with Sparse Probing (TMLR'23) | Probing | Faithfulness | FF sublayers account for the majority of feature extraction. |
| Locating and editing factual associations in GPT (NeurIPS'22) | Activation Patching | Extrinsic (knowledge editing) | FF sublayers are responsible for storing pre-trained knowledge. |
| A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis (EMNLP'23) | Activation Patching, Visualization | N/A | FF sublayers perform arithmetic computation. |
| Transformer Feed-Forward Layers Are Key-Value Memories (EMNLP'21) | Visualization | N/A | The paper viewed FF sublayers as key-value stores, and demonstrated that earlier FF layers typically process shallow (syntactic or grammatical) input patterns, while later layers focus more on semantic patterns (e.g., text related to TV shows). |
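
Most of the circuit-level findings above rest on activation patching: run the model on a "corrupted" prompt, overwrite one activation with its value from a "clean" run, and measure how much of the clean behavior is restored. The sketch below is a minimal illustration with TransformerLens that patches the residual stream at a single layer and position; the prompts and the layer choice are illustrative assumptions, not taken from the papers above.

```python
# Minimal activation-patching sketch: patch the clean residual stream into a
# corrupted run at one layer/position and measure recovery of the prediction.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
clean_prompt = "The Eiffel Tower is in the city of"
corrupt_prompt = "The Colosseum is in the city of"
answer_id = model.to_single_token(" Paris")

_, clean_cache = model.run_with_cache(clean_prompt)

LAYER, POS = 6, -1  # where to patch (illustrative choice)

def patch_resid(resid, hook):
    # resid: [batch, pos, d_model]; copy in the clean activation at POS
    resid[:, POS, :] = clean_cache[hook.name][:, POS, :]
    return resid

corrupt_logits = model(corrupt_prompt)
patched_logits = model.run_with_hooks(
    corrupt_prompt,
    fwd_hooks=[(utils.get_act_name("resid_pre", LAYER), patch_resid)],
)
print("P(' Paris') corrupted:", corrupt_logits[0, -1].softmax(-1)[answer_id].item())
print("P(' Paris') patched:  ", patched_logits[0, -1].softmax(-1)[answer_id].item())
```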

Findings on Universality

(Back to Table of Contents)

| Paper | TL;DR |
|---|---|
| Successor Heads: Recurring, Interpretable Attention Heads In The Wild | The paper identified an interpretable set of attention heads, termed "successor heads", which perform incrementation in LMs (e.g., Monday -> Tuesday, second -> third) across various scales and architectures. |
| In-context Learning and Induction Heads | The paper found induction heads across multiple LMs. |
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | The paper found duplicate token heads across multiple LMs. |
| Circuit component reuse across tasks in transformer language models (ICLR'24) | The paper found that different circuits implementing different tasks (the IOI and Colored Objects tasks) reuse the same components (e.g., induction heads), demonstrating universality across tasks. |
| Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale (ACL'23) | The paper studied the importance of each component in an OPT-66B model across 14 tasks and found that some attention heads are task-agnostic. |
| The clock and the pizza: Two stories in mechanistic explanation of neural networks (NeurIPS'23) | The paper discovered that two LMs trained with different initializations can develop qualitatively different circuits for the modular addition task. |
| A toy model of universality: Reverse engineering how networks learn group operations (ICML'23) | The paper found that LMs trained to perform group composition on finite groups from different random weight initializations do not develop similar representations and circuits. |
| Universal Neurons in GPT2 Language Models | The paper found that only about 1-5% of neurons in GPT-2 models trained from different random initializations exhibit universality. |

Findings on Model Capabilities

(Back to Table of Contents)

| Paper | Techniques | Evaluation | TL;DR |
|---|---|---|---|
| A mathematical framework for transformer circuits | Visualization | N/A | The paper studied a simplified case of in-context learning and discovered an induction circuit composed of attention heads with specialized roles (e.g., induction heads). |
| In-context Learning and Induction Heads | Zero-Ablation, Visualization | Faithfulness | The paper discovered induction heads in in-context learning (ICL) and studied whether they provide the primary mechanism behind the majority of ICL. |
| Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale (ACL'23) | Zero-Ablation | Faithfulness, Extrinsic | The paper found that different Transformer components contribute very differently to in-context learning (ICL), such that removing the unimportant ones (70% of attention heads and 20% of FF modules) does not strongly affect model performance. |
| Identifying Semantic Induction Heads to Understand In-Context Learning (ACL'24) | Visualization, Logit lens | Faithfulness | The paper investigated few-shot ICL and identified "semantic induction heads", which, unlike prior induction heads, model the semantic relationship between the input and the output token (e.g., "I have a nice pen for writing. The pen is nice to" -> "write"). |
| A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis (EMNLP'23) | Activation Patching, Visualization | N/A | The paper studied arithmetic reasoning and found that attention heads transfer information from operand and operator tokens to the RS of the answer or output token, with FF modules subsequently computing the answer token. |
| How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning (TMLR'24) | Activation Patching, Mean-ablation, Probing, Logit lens | N/A | The paper studied chain-of-thought (CoT) multi-step reasoning over fictional ontologies and found that LLMs appear to deploy multiple different neural pathways in parallel to compute the final answer. |
| An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs (ACL'24) | Logit Lens | Faithfulness | The paper investigated neuron activation as a unified lens to explain how CoT elicits the arithmetic reasoning of LLMs, including phenomena that were only empirically discussed in prior work. |
| A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task | Probing, Activation Patching, Causal Scrubbing | Faithfulness | The paper discovered an interpretable algorithm in an LM for the task of pathfinding in trees. |
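
Induction heads recur throughout this section, and a common diagnostic is to feed the model a repeated random token sequence and measure how much attention each head pays from a token back to the token that followed its previous occurrence. The sketch below is a minimal version of that diagnostic in TransformerLens; the sequence length and score threshold are illustrative assumptions.

```python
# Minimal induction-head diagnostic: on a repeated random sequence, an
# induction head at destination position i attends back to i - (seq_len - 1),
# i.e., the token right after the previous occurrence of the current token.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
seq_len = 50
bos = torch.tensor([[model.tokenizer.bos_token_id]])
rand = torch.randint(100, model.cfg.d_vocab, (1, seq_len))
tokens = torch.cat([bos, rand, rand], dim=1)  # BOS + sequence + repeated sequence

_, cache = model.run_with_cache(tokens)

dest = torch.arange(seq_len + 1, 2 * seq_len + 1)  # positions of the second copy
src = dest - (seq_len - 1)                         # "induction" source positions

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]           # [head, dest_pos, src_pos]
    scores = pattern[:, dest, src].mean(dim=-1)    # mean induction attention per head
    for head, s in enumerate(scores):
        if s > 0.4:                                # illustrative threshold
            print(f"layer {layer}, head {head}: induction score {float(s):.2f}")
```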

Findings on Learning Dynamics

(Back to Table of Contents)

| Paper | Techniques | Evaluation | TL;DR |
|---|---|---|---|
| In-context learning and induction heads | Zero-Ablation, Visualization | Faithfulness | The paper showed that Transformer-based LMs underwent a "phase change" early in training, during which induction heads formed and in-context learning simultaneously improved dramatically. |
| Progress measures for grokking via mechanistic interpretability (ICLR'23) | Visualization | Faithfulness | The paper investigated the grokking phenomenon during model training and showed that grokking, rather than being a sudden shift, consists of three continuous phases: memorization, circuit formation, and cleanup. |
| Explaining grokking through circuit efficiency | Visualization | N/A | The paper explained grokking as a consequence of models preferring the more efficient (in terms of parameter norm) "generalising circuit" over the less efficient "memorising circuit", with different training set sizes (and the implied data complexities) leading to different efficiency regimes. The paper also introduced the concepts of "ungrokking" and "semi-grokking". |
| Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs | Visualization | Faithfulness | The paper showed that sudden drops in the loss during training corresponded to the acquisition of attention heads that recognize specific syntactic relations. Experiments were conducted on BERT. |
| Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition | Visualization | N/A | The paper provided a unified explanation for grokking, double descent, and emergent abilities as a competition between memorization and generalization circuits. It particularly discussed the role of model size and extended the experiments to a multi-task learning paradigm. |
| Fine-tuning enhances existing mechanisms: A case study on entity tracking (ICLR'24) | Path Patching, Activation Patching | Faithfulness, Completeness, Minimality | The paper investigated the underlying changes in mechanisms (e.g., task-relevant circuits) to understand performance gains in fine-tuned LMs. The authors found that fine-tuning does not fundamentally change the mechanisms but enhances existing ones. |

Applications of MI

(Back to Table of Contents)

| Paper | Techniques | Evaluation | TL;DR |
|---|---|---|---|
| Locating and Editing Factual Associations in GPT (NeurIPS'22) | Activation Patching | Extrinsic (knowledge editing) | The paper used activation patching to localize the components responsible for storing factual knowledge, and then edited a fact (e.g., replacing "Seattle" with "Paris") by updating only the parameters of those components. |
| Dissecting Recall of Factual Associations in Auto-Regressive Language Models (EMNLP'23) | Activation Patching | Faithfulness | The paper investigated how factual associations are stored and extracted internally in LMs, facilitating future research on knowledge localization and editing. |
| Locating and editing factual associations in Mamba | Activation Patching, Zero-Ablation | Faithfulness, Extrinsic (knowledge editing) | The paper explored locating, recalling, and editing facts in Mamba. |
| Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space (EMNLP'22) | Logit Lens | Faithfulness, Extrinsic (early exit prediction, toxic language suppression) | The paper suppressed toxic language generation by identifying and manually activating neurons in FF layers that promote non-toxic or safe words. It also showed that concept promotion in the FF sublayer can be used for self-supervised early exit prediction for efficient model inference. |
| Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet | SAE, Visualization, Automated Feature Explanation | Faithfulness, Extrinsic (LM generation steering) | The paper identified safety-related features (e.g., unsafe code, gender bias) and manipulated their activations to steer the LM towards (un)desired behaviors (e.g., safe code generation, unbiased text generation). |
| Emergent linear representations in world models of self-supervised sequence models (ACL'23 BlackboxNLP) | Probing | Faithfulness, Extrinsic (LM generation steering) | The paper demonstrated that an LM's output can be altered (e.g., flipping a player's turn in the game of Othello from YOURS to MINE) by pushing its activation in the direction of a linear vector representing the desired behavior, identified using a linear probe. |
| Toy Models of Superposition | N/A | N/A | Discussed the "enumerative safety" implication of superposition. |
| What would be the most safety-relevant features in Language Models? | N/A | N/A | A discussion of feature discovery for AI safety. |
| Eliciting latent predictions from transformers with the tuned lens | Logit Lens | Faithfulness, Extrinsic (prompt injection detection) | The paper used the tuned lens to detect prompt injection. |
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Mean-Ablation, Path Patching | Faithfulness, Completeness, Minimality, Extrinsic (adversarial example generation) | The paper designed adversarial examples for the IOI task based on insights from the discovered circuit. |
| Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | SAE, Activation Patching | Faithfulness, Completeness, Extrinsic (improving classifier generalization) | The paper improved the generalization of classifiers by identifying and ablating spurious features that humans consider task-irrelevant. |
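
Several of these applications steer a model by adding a feature or concept direction to its activations at inference time. The sketch below is a minimal, hypothetical version of that idea in TransformerLens: derive a crude "direction" from the difference between the residual-stream activations of two contrasting prompts, then add it during generation. The prompts, layer, and scaling factor are illustrative assumptions; the papers above use more principled directions (e.g., SAE features or probe weights).

```python
# Hypothetical activation-steering sketch: add a direction (difference of
# residual-stream activations for two contrasting prompts) during generation.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
LAYER = 6    # layer to intervene at (illustrative)
SCALE = 5.0  # steering strength (illustrative)

# Crude "sentiment" direction from two contrasting prompts (illustrative)
_, cache_pos = model.run_with_cache("This movie was absolutely wonderful and delightful")
_, cache_neg = model.run_with_cache("This movie was absolutely terrible and awful")
direction = cache_pos["resid_post", LAYER][0, -1] - cache_neg["resid_post", LAYER][0, -1]
direction = direction / direction.norm()

def steer(resid, hook):
    # resid: [batch, pos, d_model]; nudge every position along the direction
    return resid + SCALE * direction

model.add_hook(utils.get_act_name("resid_post", LAYER), steer)
print(model.generate("I thought the film was", max_new_tokens=20, verbose=False))
model.reset_hooks()  # remove the steering hook afterwards
```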

Tools

(Back to Table of Contents)

| Tool | TL;DR |
|---|---|
| CircuitsVis | A library for attention visualization. |
| TransformerLens | A library for mechanistic interpretability of GPT-2-style language models. |
| Transformer Debugger | A tool that supports investigations into specific behaviors of small language models, combining automated interpretability techniques with sparse autoencoders. |
| LM Debugger | An open-source interactive tool for inspection and intervention in Transformer-based language models. |
| Neuroscope | A repository of maximally activating dataset examples for each neuron in several LMs. |
| Neuronpedia | A platform for MI research, focused on SAEs, that allows researchers to host models, create feature dashboards, visualize data, and access various tools. |
| pyvene | A library for understanding and improving PyTorch models via interventions. |
| nnsight | A library for interpreting and manipulating the internals of deep learning models. |
| Penzai | A JAX research toolkit for building, editing, and visualizing neural networks. |
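
As a starting point, the sketch below shows how two of these tools are commonly combined: load a model and cache its activations with TransformerLens, then render an attention pattern with CircuitsVis. The model, prompt, and layer choice are illustrative; CircuitsVis is designed for notebook environments, where the returned object displays as an interactive widget.

```python
# Quick-start sketch combining TransformerLens and CircuitsVis (illustrative).
import circuitsvis as cv
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When Mary and John went to the store, John gave a drink to"
logits, cache = model.run_with_cache(prompt)

LAYER = 5  # illustrative layer choice
cv.attention.attention_patterns(
    tokens=model.to_str_tokens(prompt),
    attention=cache["pattern", LAYER][0],  # [head, dest_pos, src_pos]
)
```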
