
Sparse Neural Network Interpretability: A Comparative Analysis of Autoencoders and Transformers

Permanent link
https://hdl.handle.net/10037/37764
View/Open
no.uit:wiseflow:7269325:62323654.pdf (PDF, 7.879 MB)
Date
2025
Type
Master thesis

Author
Iversen, Sebastian Tendstrand
Abstract
As neural networks grow increasingly complex and powerful, understanding their internal representations becomes critical for ensuring safe and reliable AI systems. This thesis addresses a fundamental challenge in mechanistic interpretability: how architectural choices in sparse representation learning shape our ability to understand neural network internals. We present the first comprehensive theoretical and empirical comparison of Sparse Autoencoders (SAEs) and Sparse Transformers (STs), two competing approaches for decomposing neural network activations into interpretable features. We develop a unified geometric framework revealing that these architectures, despite solving the same sparse coding problem, operate in fundamentally different spaces. SAEs use ReLU activations to create unbounded sparse features in positive orthants, while STs employ softmax attention to confine features to the probability simplex. These constraints lead to complementary capabilities: SAEs excel at Euclidean separation through magnitude differences, while STs optimize for angular discrimination through competitive dynamics. Our empirical validation across visual datasets (MNIST and Fashion-MNIST) confirms dramatic performance differences predicted by theory. SAEs achieve 100--1000× superior Euclidean separation between class representations, while STs demonstrate 10× better angular discrimination. Feature-level analysis reveals that SAEs learn shared, polysemantic features that capture cross-class patterns, whereas STs develop highly specialized, monosemantic features through winner-take-all competition. When applied to GPT-Neo 1.3B language model representations, these performance gaps narrow substantially (to 3--30×), suggesting that architectural biases, while persistent, are moderated by the complexity of real-world applications. This work establishes that the choice of interpretability architecture is not merely an implementation detail but fundamentally determines what aspects of neural computation we can observe and understand. SAEs provide superior tools for identifying distinct concepts requiring clear on/off behavior, while STs excel at revealing relational structures and relative feature importance. These complementary strengths suggest that future interpretability research should focus not on choosing between architectures but on understanding how to leverage their respective advantages for comprehensive mechanistic understanding.
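The contrast the abstract draws between the two feature geometries can be illustrated with a minimal sketch. The code below is not the thesis implementation; the class names, layer sizes, and temperature parameter are illustrative assumptions. It only shows the core distinction: a ReLU encoder produces non-negative features of unbounded magnitude (positive orthant), while a softmax encoder produces non-negative features that sum to one (probability simplex).

```python
# Illustrative sketch (PyTorch), not the author's code: contrasting the two
# feature maps described in the abstract.
import torch
import torch.nn as nn


class ReLUSparseEncoder(nn.Module):
    """SAE-style encoder: ReLU yields unbounded non-negative features."""

    def __init__(self, d_in: int, d_features: int):
        super().__init__()
        self.proj = nn.Linear(d_in, d_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Features are >= 0 but their magnitudes are unconstrained.
        return torch.relu(self.proj(x))


class SoftmaxSparseEncoder(nn.Module):
    """ST-style encoder: softmax confines features to the probability simplex."""

    def __init__(self, d_in: int, d_features: int, temperature: float = 0.1):
        super().__init__()
        self.proj = nn.Linear(d_in, d_features)
        self.temperature = temperature  # low temperature sharpens the competition

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Features are >= 0 and each feature vector sums to 1.
        return torch.softmax(self.proj(x) / self.temperature, dim=-1)


if __name__ == "__main__":
    x = torch.randn(4, 128)  # a batch of hypothetical activations
    sae_features = ReLUSparseEncoder(128, 512)(x)
    st_features = SoftmaxSparseEncoder(128, 512)(x)
    print(sae_features.min().item() >= 0, sae_features.sum(-1))  # sums unconstrained
    print(st_features.min().item() >= 0, st_features.sum(-1))    # each row sums to 1
```

Under this reading, the simplex constraint forces features to compete for a fixed budget of activation mass, which is one way to picture the winner-take-all specialization attributed to STs, whereas the ReLU features can vary independently in magnitude, consistent with the Euclidean-separation advantage attributed to SAEs.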
Publisher
UiT The Arctic University of Norway
Collections
  • Mastergradsoppgaver IFT [90]
Copyright 2025 The Author(s)
