Mechanistic Interpretability

Mechanistic interpretability tries to reverse-engineer a model from the inside: identifying circuits in a Transformer, assigning roles to specific attention heads, and tracing how groups of neurons encode human-understandable concepts. Anthropic's transformer-circuits.pub series and recent sparse autoencoder work are landmark examples of the approach. Unlike interpretability work in general, which often settles for high-level natural-language explanations of model behavior, it aims for detailed mechanistic maps of the computation itself. The field is widely viewed as the most ambitious attempt to answer the AI safety question of what a model is actually "thinking", that is, what its internal computation represents and does.
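
To make the sparse autoencoder idea a little more concrete, here is a minimal sketch of the forward pass and training objective such a model commonly uses: a ReLU encoder maps an activation vector to a larger set of features, a linear decoder reconstructs the activation, and the loss adds an L1 penalty that pushes most features to zero. Everything here is an assumption made for illustration: the function names, the 2-dimensional activation, the three dictionary features, and the hand-picked weights. In real work the weights are learned from activations recorded inside a model, and published setups differ in their details.

```python
# Toy sparse autoencoder (SAE) forward pass in plain Python.
# All names and weights are hypothetical and hand-picked, not learned.

def relu(v):
    """Elementwise max(0, x)."""
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sae_forward(x, w_enc, b_enc, w_dec):
    """Return (feature activations, reconstruction) for one activation vector."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = relu(pre)              # non-negative feature activations
    recon = matvec(w_dec, features)   # linear decoder back to activation space
    return features, recon

def sae_loss(x, features, recon, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty."""
    sq_err = sum((a - b) ** 2 for a, b in zip(x, recon))
    l1 = sum(abs(f) for f in features)
    return sq_err + l1_coeff * l1

# Assumed setup: a 2-dimensional activation and 3 dictionary features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # shape 3 x 2
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]     # shape 2 x 3

x = [2.0, 0.0]
features, recon = sae_forward(x, W_ENC, B_ENC, W_DEC)
loss = sae_loss(x, features, recon, l1_coeff=0.1)
print(features, recon, loss)
```

With these weights only the first feature fires for the input [2.0, 0.0] and the reconstruction is exact, so the loss consists entirely of the sparsity term. The interpretability step proper, which this sketch does not cover, is inspecting which inputs activate each learned feature and checking whether it tracks a human-understandable concept.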