
Multi-Layer Sparse Autoencoders for Transformer Interpretation
About this listen
This paper introduces the Multi-Layer Sparse Autoencoder (MLSAE), a novel approach to interpreting the internal representations of transformer language models. Unlike traditional Sparse Autoencoders (SAEs), which analyze a single layer at a time, an MLSAE is trained jointly on the residual-stream activations from every layer of a transformer, enabling the study of how information flows across layers. The research found that while individual "latents" (features learned by the SAE) tend to be active at a single layer for any given input, they are active at multiple layers when aggregated over many inputs, and this multi-layer activity increases in larger models. The authors also examined the effect of "tuned-lens" transformations on latent activations, ultimately providing a new method for understanding how representations evolve within transformers.
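To make the core idea concrete, here is a minimal sketch of a single sparse autoencoder trained on residual-stream activations pooled from all layers, which is the essence of the MLSAE setup as described above. This is an illustrative assumption-laden example, not the authors' implementation: the ReLU-plus-L1 sparsity objective, the dimensions, and the placeholder activations are all choices made for the sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE over residual-stream activations (illustrative sketch)."""

    def __init__(self, d_model: int, d_latent: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        # x: activations of shape (batch, d_model). For an MLSAE-style setup,
        # the batch mixes activations drawn from every layer, so one shared
        # set of latents must explain all layers at once.
        latents = F.relu(self.encoder(x))
        recon = self.decoder(latents)
        recon_loss = F.mse_loss(recon, x)
        sparsity_loss = self.l1_coeff * latents.abs().sum(dim=-1).mean()
        return recon, latents, recon_loss + sparsity_loss

# Hypothetical usage: `resid_by_layer` stands in for residual-stream
# activations with shape (n_layers, n_tokens, d_model). Flattening the layer
# axis trains one SAE jointly across layers; inspecting which layer each
# latent fires at then gives the per-layer activity analysis the paper studies.
d_model, d_latent = 768, 768 * 16
sae = SparseAutoencoder(d_model, d_latent)
resid_by_layer = torch.randn(12, 256, d_model)  # placeholder, not real data
x = resid_by_layer.reshape(-1, d_model)
_, latents, loss = sae(x)
loss.backward()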