dc.contributor.advisor | Benjamin Ricaud | |
dc.contributor.author | Iversen, Sebastian Tendstrand | |
dc.date.accessioned | 2025-07-17T10:36:58Z | |
dc.date.available | 2025-07-17T10:36:58Z | |
dc.date.issued | 2025 | |
dc.description.abstract | As neural networks grow increasingly complex and powerful, understanding their internal representations becomes critical for ensuring safe and reliable AI systems. This thesis addresses a fundamental challenge in mechanistic interpretability: how architectural choices in sparse representation learning shape our ability to understand neural network internals. We present the first comprehensive theoretical and empirical comparison of Sparse Autoencoders (SAEs) and Sparse Transformers (STs), two competing approaches for decomposing neural network activations into interpretable features.
We investigate both architectures within a unified geometric framework, revealing that despite solving the same sparse coding problem they operate in fundamentally different spaces. SAEs use ReLU activations to create unbounded sparse features in positive orthants, while STs employ softmax attention to confine features to the probability simplex. These constraints lead to complementary capabilities: SAEs excel at Euclidean separation through magnitude differences, while STs optimize for angular discrimination through competitive dynamics.
Our empirical validation across visual datasets (MNIST and Fashion-MNIST) confirms dramatic performance differences predicted by theory. SAEs achieve 100--1000× superior Euclidean separation between class representations, while STs demonstrate 10× better angular discrimination. Feature-level analysis reveals that SAEs learn shared, polysemantic features that capture cross-class patterns, whereas STs develop highly specialized, monosemantic features through winner-take-all competition. When applied to GPT-Neo 1.3B language model representations, these performance gaps narrow substantially (to 3--30×), suggesting that architectural biases, while persistent, are moderated by the complexity of real-world applications.
This work establishes that the choice of interpretability architecture is not merely an implementation detail but fundamentally determines what aspects of neural computation we can observe and understand. SAEs provide superior tools for identifying distinct concepts requiring clear on/off behavior, while STs excel at revealing relational structures and relative feature importance. These complementary strengths suggest that future interpretability research should focus not on choosing between architectures but on understanding how to leverage their respective advantages for comprehensive mechanistic understanding. | |
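[Editorial note] The geometric contrast summarized in the abstract (ReLU codes living in the positive orthant versus softmax codes confined to the probability simplex) can be illustrated with a minimal NumPy sketch. The activation vector and encoder weights below are random placeholders, not values from the thesis; the sketch only shows how the two nonlinearities shape the resulting code.

import numpy as np

rng = np.random.default_rng(0)
activation = rng.normal(size=16)          # placeholder activation vector (illustrative only)
W_enc = rng.normal(size=(16, 64)) * 0.1   # placeholder encoder weights (illustrative only)

# SAE-style encoding: ReLU keeps the code non-negative but unbounded (positive orthant)
sae_code = np.maximum(activation @ W_enc, 0.0)

# ST-style encoding: softmax forces the code onto the probability simplex (entries sum to 1)
logits = activation @ W_enc
st_code = np.exp(logits - logits.max())
st_code /= st_code.sum()

print("SAE code sum (unbounded):", sae_code.sum())
print("ST code sum (always 1):  ", st_code.sum())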
dc.identifier.uri | https://hdl.handle.net/10037/37764 | |
dc.identifier | no.uit:wiseflow:7269325:62323654 | |
dc.language.iso | eng | |
dc.publisher | UiT The Arctic University of Norway | |
dc.rights.holder | Copyright 2025 The Author(s) | |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0 | en_US |
dc.rights | Attribution 4.0 International (CC BY 4.0) | en_US |
dc.title | Sparse Neural Network Interpretability: A Comparative Analysis of Autoencoders and Transformers | |
dc.type | Master thesis | |