
Sparse Neural Network Interpretability: A Comparative Analysis of Autoencoders and Transformers

Permanent link
https://hdl.handle.net/10037/37764
View/Open
no.uit:wiseflow:7269325:62323654.pdf (PDF, 7.879 MB)
Date
2025
Type
Master thesis

Author
Iversen, Sebastian Tendstrand
Abstract
As neural networks grow increasingly complex and powerful, understanding their internal representations becomes critical for ensuring safe and reliable AI systems. This thesis addresses a fundamental challenge in mechanistic interpretability: how architectural choices in sparse representation learning shape our ability to understand neural network internals. We present the first comprehensive theoretical and empirical comparison of Sparse Autoencoders (SAEs) and Sparse Transformers (STs), two competing approaches for decomposing neural network activations into interpretable features. We develop a unified geometric framework revealing that these architectures, despite solving the same sparse coding problem, operate in fundamentally different spaces. SAEs use ReLU activations to create unbounded sparse features in positive orthants, while STs employ softmax attention to confine features to the probability simplex. These constraints lead to complementary capabilities: SAEs excel at Euclidean separation through magnitude differences, while STs optimize for angular discrimination through competitive dynamics. Our empirical validation across visual datasets (MNIST and Fashion-MNIST) confirms dramatic performance differences predicted by theory. SAEs achieve 100--1000× superior Euclidean separation between class representations, while STs demonstrate 10× better angular discrimination. Feature-level analysis reveals that SAEs learn shared, polysemantic features that capture cross-class patterns, whereas STs develop highly specialized, monosemantic features through winner-take-all competition. When applied to GPT-Neo 1.3B language model representations, these performance gaps narrow substantially (to 3--30×), suggesting that architectural biases, while persistent, are moderated by the complexity of real-world applications. This work establishes that the choice of interpretability architecture is not merely an implementation detail but fundamentally determines what aspects of neural computation we can observe and understand. SAEs provide superior tools for identifying distinct concepts requiring clear on/off behavior, while STs excel at revealing relational structures and relative feature importance. These complementary strengths suggest that future interpretability research should focus not on choosing between architectures but on understanding how to leverage their respective advantages for comprehensive mechanistic understanding.
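The contrast the abstract draws between the two feature geometries can be illustrated with a minimal sketch. The code below is not the thesis implementation; the class names, layer sizes, and temperature parameter are illustrative assumptions. It only shows the core distinction: a ReLU encoder produces non-negative features of unbounded magnitude (positive orthant), while a softmax encoder produces non-negative features that sum to one (probability simplex).

```python
# Illustrative sketch (PyTorch), not the author's code: contrasting the two
# feature maps described in the abstract.
import torch
import torch.nn as nn


class ReLUSparseEncoder(nn.Module):
    """SAE-style encoder: ReLU yields unbounded non-negative features."""

    def __init__(self, d_in: int, d_features: int):
        super().__init__()
        self.proj = nn.Linear(d_in, d_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Features are >= 0 but their magnitudes are unconstrained.
        return torch.relu(self.proj(x))


class SoftmaxSparseEncoder(nn.Module):
    """ST-style encoder: softmax confines features to the probability simplex."""

    def __init__(self, d_in: int, d_features: int, temperature: float = 0.1):
        super().__init__()
        self.proj = nn.Linear(d_in, d_features)
        self.temperature = temperature  # low temperature sharpens the competition

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Features are >= 0 and each feature vector sums to 1.
        return torch.softmax(self.proj(x) / self.temperature, dim=-1)


if __name__ == "__main__":
    x = torch.randn(4, 128)  # a batch of hypothetical activations
    sae_features = ReLUSparseEncoder(128, 512)(x)
    st_features = SoftmaxSparseEncoder(128, 512)(x)
    print(sae_features.min().item() >= 0, sae_features.sum(-1))  # sums unconstrained
    print(st_features.min().item() >= 0, st_features.sum(-1))    # each row sums to 1
```

Under this reading, the simplex constraint forces features to compete for a fixed budget of activation mass, which is one way to picture the winner-take-all specialization attributed to STs, whereas the ReLU features can vary independently in magnitude, consistent with the Euclidean-separation advantage attributed to SAEs.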
Publisher
UiT The Arctic University of Norway
Collections
  • Mastergradsoppgaver IFT [90]
Copyright 2025 The Author(s)
