Mechanistic interpretability (often shortened to "mech interp" or "MI") is a subfield of interpretability that seeks to reverse-engineer neural networks, which are generally perceived as black boxes, into human-understandable components or "circuits", revealing the causal pathways by which models process information.[1] Objects of study include, but are not limited to, vision models and Transformer-based large language models (LLMs).
Chris Olah is generally credited with coining the term "mechanistic interpretability" and spearheading its early development.[2] In the 2018 paper The Building Blocks of Interpretability, Olah (then at Google Brain) and his colleagues combined existing interpretability techniques, including feature visualization, dimensionality reduction, and attribution, with human-computer interface methods to explore features represented by neurons in the vision model Inception v1. In the March 2020 paper Zoom In: An Introduction to Circuits, Olah and the OpenAI Clarity team described "an approach inspired by neuroscience or cellular biology", hypothesizing that features, like individual cells, are the basis of computation for neural networks and connect to form circuits, which can be understood as "sub-graphs in a network".[3] The authors described this line of work as understanding the "mechanistic implementations of neurons in terms of their weights".
In 2021, Chris Olah co-founded the company Anthropic and established its Interpretability team, which publishes its results on the Transformer Circuits Thread.[4] In December 2021, the team published A Mathematical Framework for Transformer Circuits, reverse-engineering toy transformers with one and two attention layers. Notably, they discovered the complete algorithm of induction circuits, which are responsible for in-context learning of repeated token sequences. The team elaborated on this result in the March 2022 paper In-context Learning and Induction Heads.[5]
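The induction behavior can be summarized as a simple lookup rule: when the current token has occurred earlier in the context, attend to the token that followed that earlier occurrence and predict it ([A][B] ... [A] → [B]). The following is a minimal sketch of that rule in isolation; the function name and plain-string token representation are illustrative and not taken from the paper.

```python
def induction_prediction(tokens):
    """Sketch of the lookup rule attributed to induction heads:
    if the current token appeared earlier in the context, predict the
    token that followed that earlier occurrence ([A][B] ... [A] -> [B])."""
    current = tokens[-1]
    # Scan backwards through the context for a previous occurrence of the
    # current token; the prediction is whatever token came right after it.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence, so this rule makes no prediction


# Example: the sequence "Mr Dursley was proud of Mr" yields "Dursley".
print(induction_prediction(["Mr", "Dursley", "was", "proud", "of", "Mr"]))
```

In the transformer itself, this behavior is implemented by a pair of attention heads: a "previous token" head that writes information about the preceding token into each position, and the induction head proper, which uses that information to attend back to the token following the earlier occurrence and copy it forward.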
Notable results in mechanistic interpretability from 2022 include the theory of superposition, wherein a model represents more features than there are directions in its representation space;[6] a mechanistic explanation for grokking, the phenomenon where test-set loss begins to decay only after a delay relative to training-set loss;[7] and the introduction of sparse autoencoders, a sparse dictionary learning method for extracting interpretable features from LLMs.[8]
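In this setting, a sparse autoencoder is trained to reconstruct a model's internal activations through an overcomplete hidden layer whose units are encouraged to be sparsely active, so that individual units can align with individual interpretable features rather than the dense, superposed directions of the original activation space. The snippet below is a minimal sketch of that setup in PyTorch; the dimensions, sparsity coefficient, and single training step are illustrative and not drawn from any particular paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sketch: activations are reconstructed through an
    overcomplete feature layer trained with an L1 sparsity penalty."""

    def __init__(self, d_model=512, d_features=4096):  # illustrative sizes
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse feature coefficients
        reconstruction = self.decoder(features)
        return reconstruction, features

# One illustrative training step on a stand-in batch of activations.
sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
activations = torch.randn(64, 512)  # placeholder for activations collected from an LLM
reconstruction, features = sae(activations)
l1_coefficient = 1e-3               # illustrative weight on the sparsity term
loss = (reconstruction - activations).pow(2).mean() + l1_coefficient * features.abs().mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The reconstruction term keeps the learned dictionary faithful to the original activations, while the L1 term pushes most feature coefficients toward zero; it is this sparsity that makes the surviving features candidates for human interpretation.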
Mechanistic interpretability has garnered significant interest, talent, and funding in the AI safety community. In 2021, Open Philanthropy called for proposals that advanced "mechanistic understanding of neural networks", alongside other projects aimed at reducing risks from advanced AI systems.[9] The interpretability topic prompt in the request for proposals was written by Chris Olah.[10] The ML Alignment & Theory Scholars (MATS) program, a research seminar focused on AI alignment, has historically supported numerous projects in mechanistic interpretability. In its summer 2023 cohort, for example, 20% of the research projects were on mechanistic interpretability.[11]
Many organizations and research groups work on mechanistic interpretability, often with the stated goal of improving AI safety. Max Tegmark runs the Tegmark AI Safety Group at MIT, which focuses on mechanistic interpretability.[12] In February 2023, Neel Nanda started the mechanistic interpretability team at Google DeepMind. Apollo Research, an AI evaluations organization with a focus on interpretability research, was founded in May 2023.[13] EleutherAI has published multiple papers on interpretability.[14] Goodfire, an AI interpretability startup, was founded in 2024.[15]
Mechanistic interpretability has expanded greatly in scope, practitioners, and attention within the ML community in recent years. In July 2024, the first ICML Mechanistic Interpretability Workshop was held, aiming to bring together "separate threads of work in industry and academia".[16] In November 2024, Chris Olah discussed mechanistic interpretability on the Lex Fridman podcast as part of the Anthropic team.[17]
The term mechanistic interpretability designates both a class of technical methods and a cultural movement; explainability methods such as saliency maps, for example, are generally not considered mechanistic interpretability research.[17] Mechanistic interpretability's early development was rooted in the AI safety community, though the term is increasingly adopted by academia. In “Mechanistic?”, Saphra and Wiegreffe identify four senses of “mechanistic interpretability”:[18]
1. Narrow technical definition: A technical approach to understanding neural networks through their causal mechanisms.
2. Broad technical definition: Any research that describes the internals of a model, including its activations or weights.
3. Narrow cultural definition: Any research originating from the MI community.
4. Broad cultural definition: Any research in the field of AI interpretability, especially language model (LM) interpretability.
As the scope and popular recognition of mechanistic interpretability have increased, many have come to recognize that other communities, such as natural language processing researchers, have pursued similar objectives in their work.
Many researchers have challenged the core assumptions of the mechanistic approach—arguing that circuit‑level findings may not generalize to safety guarantees and that the field’s focus is too narrow for robust model verification.[19] Critics also question whether identified circuits truly capture complex, emergent behaviors or merely surface‑level statistical correlations.[20]