Mechanistic interpretability (often shortened to "mech interp" or "MI") is a subfield of interpretability that seeks to reverse-engineer neural networks, which are generally perceived as black boxes, into human-understandable components or "circuits", revealing the causal pathways by which models process information.[1] Objects of study include, but are not limited to, vision models and Transformer-based large language models (LLMs).
Chris Olah is generally credited with coining the term "mechanistic interpretability" and spearheading its early development.[2] In the 2018 paper The Building Blocks of Interpretability, Olah (then at Google Brain) and his colleagues combined existing interpretability techniques, including feature visualization, dimensionality reduction, and attribution, with human-computer interface methods to explore features represented by neurons in the vision model Inception v1. In the March 2020 paper Zoom In: An Introduction to Circuits, Olah and the OpenAI Clarity team described "an approach inspired by neuroscience or cellular biology", hypothesizing that features, like individual cells, are the basis of computation for neural networks and connect to form circuits, which can be understood as "sub-graphs in a network".[3] The authors described this line of work as understanding the "mechanistic implementations of neurons in terms of their weights".
In 2021, Chris Olah co-founded the company Anthropic and established its Interpretability team, which publishes its results on the Transformer Circuits Thread.[4] In December 2021, the team published A Mathematical Framework for Transformer Circuits, reverse-engineering toy transformers with one and two attention layers. Notably, they discovered the complete algorithm of induction circuits, which are responsible for in-context learning of repeated token sequences. The team elaborated on this result in the March 2022 paper In-context Learning and Induction Heads.[5]
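The induction behavior can be summarized as a simple lookup rule: when the current token has occurred earlier in the context, attend to the token that followed that earlier occurrence and predict it ([A][B] ... [A] → [B]). The following is a minimal sketch of that rule in isolation; the function name and plain-string token representation are illustrative and not taken from the paper.

```python
def induction_prediction(tokens):
    """Sketch of the lookup rule attributed to induction heads:
    if the current token appeared earlier in the context, predict the
    token that followed that earlier occurrence ([A][B] ... [A] -> [B])."""
    current = tokens[-1]
    # Scan backwards through the context for a previous occurrence of the
    # current token; the prediction is whatever token came right after it.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence, so this rule makes no prediction


# Example: the sequence "Mr Dursley was proud of Mr" yields "Dursley".
print(induction_prediction(["Mr", "Dursley", "was", "proud", "of", "Mr"]))
```

In the transformer itself, this behavior is implemented by a pair of attention heads: a "previous token" head that writes information about the preceding token into each position, and the induction head proper, which uses that information to attend back to the token following the earlier occurrence and copy it forward.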
Notable results in mechanistic interpretability from 2022 include the theory of superposition, wherein a model represents more features than there are directions in its representation space;[6] a mechanistic explanation for grokking, the phenomenon where test-set loss begins to decay only after a delay relative to training-set loss;[7] and the introduction of sparse autoencoders, a sparse dictionary learning method for extracting interpretable features from LLMs.[8]
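In this setting, a sparse autoencoder is trained to reconstruct a model's internal activations through an overcomplete hidden layer whose units are encouraged to be sparsely active, so that individual units can align with individual interpretable features rather than the dense, superposed directions of the original activation space. The snippet below is a minimal sketch of that setup in PyTorch; the dimensions, sparsity coefficient, and single training step are illustrative and not drawn from any particular paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sketch: activations are reconstructed through an
    overcomplete feature layer trained with an L1 sparsity penalty."""

    def __init__(self, d_model=512, d_features=4096):  # illustrative sizes
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse feature coefficients
        reconstruction = self.decoder(features)
        return reconstruction, features

# One illustrative training step on a stand-in batch of activations.
sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
activations = torch.randn(64, 512)  # placeholder for activations collected from an LLM
reconstruction, features = sae(activations)
l1_coefficient = 1e-3               # illustrative weight on the sparsity term
loss = (reconstruction - activations).pow(2).mean() + l1_coefficient * features.abs().mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The reconstruction term keeps the learned dictionary faithful to the original activations, while the L1 term pushes most feature coefficients toward zero; it is this sparsity that makes the surviving features candidates for human interpretation.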
Mechanistic interpretability has garnered significant interest, talent, and funding in the AI safety community. In 2021, Open Philanthropy called for proposals that advanced "mechanistic understanding of neural networks", alongside other projects aimed at reducing risks from advanced AI systems.[9] The interpretability topic prompt in the request for proposals was written by Chris Olah.[10] The ML Alignment & Theory Scholars (MATS) program, a research seminar focused on AI alignment, has historically supported numerous projects in mechanistic interpretability. In its summer 2023 cohort, for example, 20% of the research projects were on mechanistic interpretability.[11]
Many organizations and research groups work on mechanistic interpretability, often with the stated goal of improving AI safety. Max Tegmark runs the Tegmark AI Safety Group at MIT, which focuses on mechanistic interpretability.[12] In February 2023, Neel Nanda started the mechanistic interpretability team at Google DeepMind. Apollo Research, an AI evaluations organization with a focus on interpretability research, was founded in May 2023.[13] EleutherAI has published multiple papers on interpretability.[14] Goodfire, an AI interpretability startup, was founded in 2024.[15]
Mechanistic interpretability has expanded greatly in scope, practitioners, and attention within the ML community in recent years. In July 2024, the first ICML Mechanistic Interpretability Workshop was held, aiming to bring together "separate threads of work in industry and academia".[16] In November 2024, Chris Olah discussed mechanistic interpretability on the Lex Fridman podcast as part of the Anthropic team.[17]
The term mechanistic interpretability designates both a class of technical methods and a cultural movement; explainability methods such as saliency maps, for example, are generally not considered mechanistic interpretability research.[17] Mechanistic interpretability's early development was rooted in the AI safety community, though the term is increasingly adopted by academia. In “Mechanistic?”, Saphra and Wiegreffe identify four senses of “mechanistic interpretability”:[18]
1. Narrow technical definition: A technical approach to understanding neural networks through their causal mechanisms.
2. Broad technical definition: Any research that describes the internals of a model, including its activations or weights.
3. Narrow cultural definition: Any research originating from the MI community.
4. Broad cultural definition: Any research in the field of AI interpretability, especially language model (LM) interpretability.
As the scope and popular recognition of mechanistic interpretability have increased, many have come to recognize that other communities, such as natural language processing researchers, have pursued similar objectives in their work.
Many researchers have challenged the core assumptions of the mechanistic approach—arguing that circuit‑level findings may not generalize to safety guarantees and that the field’s focus is too narrow for robust model verification.[19] Critics also question whether identified circuits truly capture complex, emergent behaviors or merely surface‑level statistical correlations.[20]