Mechanistic interpretability tries to reverse-engineer a model from the inside: identifying circuits in a Transformer, assigning roles to specific attention heads, and tracing how groups of neurons encode human-understandable concepts. Anthropic's transformer-circuits.pub series and recent sparse autoencoder work are landmark examples of the approach. Unlike interpretability in general, it aims for detailed maps of a model's internal mechanisms rather than high-level natural-language explanations. The field is widely viewed as the most ambitious attempt to answer a central AI safety question: what is a model actually "thinking"?
MEVZU · N°124 · ISTANBUL · YEAR I · VOL. III
Glossary · Advanced · 2022
Mechanistic Interpretability
A branch of interpretability that reverse-engineers a model's internal circuits and neuron-level interactions.
- EN (English term): Mechanistic Interpretability
- TR (Turkish term): Mekanik Yorumlanabilirlik
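
To make the sparse autoencoder idea mentioned in this entry a little more concrete, here is a minimal sketch in plain Python. It is a toy illustration under stated assumptions, not the method of any particular paper: the function names (`matvec`, `sae_forward`, `sae_loss`), the dimensions, the random weights, and the `l1_coeff` value are all hypothetical. The idea it shows is that a model's activation vector is expanded into a larger set of non-negative features, reconstructed from them, and scored with a reconstruction error plus an L1 penalty that pushes most features toward zero.

```python
import random

def matvec(W, v):
    """Multiply matrix W (nested lists, rows x cols) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode one activation vector into features, then reconstruct it."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    x_hat = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, x_hat

def sae_loss(x, features, x_hat, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 sparsity penalty."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(f) for f in features)
    return recon + l1_coeff * sparsity

# Hypothetical sizes: a 4-dimensional activation expanded into 16 features.
random.seed(0)
d_model, d_feat = 4, 16
W_enc = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_feat)]
W_dec = [[random.gauss(0, 0.1) for _ in range(d_feat)] for _ in range(d_model)]
b_enc = [0.0] * d_feat
b_dec = [0.0] * d_model

# Stand-in for an activation vector read out of a real model.
x = [random.gauss(0, 1) for _ in range(d_model)]

features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, x_hat)
active = sum(1 for f in features if f > 0)
print(f"{active} of {d_feat} features active, loss = {loss:.4f}")
```

In practice the weights would be trained to minimize this loss over many activations collected from a real model, and the resulting features would then be inspected for human-understandable meaning. The untrained random weights here only demonstrate the shape of the computation.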