Mechanistic Interpretability Resources
Mechanistic interpretability (MI) aims to reverse-engineer a neural network into human-understandable mechanisms. Most current MI work focuses on transformers (LLMs in particular), but the approach is not limited to that architecture.
People
Primer on LLMs
Transformers
Quick Guides to MI
- What is Mechanistic Interpretability and where did it come from?
- Introduction to Mechanistic Interpretability
- “Mechanistic interpretability” for LLMs, explained
How to get started with MI?
Relevant Papers
- Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models
- A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
- Mechanistic Interpretability for AI Safety: A Review
Straight from Anthropic
- Mapping the Mind of a Large Language Model
- Interpretability Dreams
- Golden Gate Claude
- Toy Models of Superposition (a minimal sketch of the paper's setup follows this list)
- Transformer Circuits Thread
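
To make the "Toy Models of Superposition" entry concrete: the paper studies how a model with fewer hidden dimensions than features can still represent sparse features by packing them into non-orthogonal directions. Below is a minimal PyTorch sketch of that setup; the dimensions, sparsity level, plain (unweighted) MSE loss, and training loop are my illustrative choices, not the paper's exact configuration.

```python
import torch

# Toy setup: n sparse features squeezed through an m < n hidden bottleneck.
# The paper's model is h = W x, followed by x_hat = ReLU(W^T h + b).
n_features, d_hidden = 5, 2
W = torch.nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Synthetic sparse data: each feature is active with probability 0.1.
    x = torch.rand(256, n_features) * (torch.rand(256, n_features) < 0.1).float()
    h = x @ W.T                         # project into the bottleneck
    x_hat = torch.relu(h @ W + b)       # reconstruct the features
    loss = ((x - x_hat) ** 2).mean()    # simplified, unweighted reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# With sparse features, the learned columns of W typically end up
# non-orthogonal: more features than dimensions, i.e. superposition.
print(W.T @ W)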
Blogs
- Neel Nanda’s case for why we need interpretability research
- A Microscope into the Dark Matter of Interpretability
Libraries
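
One widely used library here is TransformerLens (by Neel Nanda), which wraps GPT-style models so every internal activation can be cached and inspected. Below is a minimal sketch, assuming the `transformer_lens` package is installed; the prompt is an arbitrary example.

```python
from transformer_lens import HookedTransformer

# Load a small model with hooks on every internal activation.
model = HookedTransformer.from_pretrained("gpt2")

# Run a prompt and cache all intermediate activations.
logits, cache = model.run_with_cache(
    "When Mary and John went to the store, John gave a drink to"
)

# Inspect layer 0's attention patterns:
# shape (batch, head, query_pos, key_pos).
attn = cache["pattern", 0]
print(attn.shape)
```

From the cache you can read off residual-stream states, MLP activations, and attention patterns at any layer, which is the starting point for most circuit-style analyses.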