"Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations" by Subhash Kantamneni, kitft, Euan Ong, Sam Marks cover art

"Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations" by Subhash Kantamneni, kitft, Euan Ong, Sam Marks

"Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations" by Subhash Kantamneni, kitft, Euan Ong, Sam Marks

Summary

Abstract

We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an activation. We jointly train the AV and AR with reinforcement learning to reconstruct residual stream activations. Although we optimize for activation reconstruction, the resulting NLA explanations read as plausible interpretations of model internals that, according to our quantitative evaluations, grow more informative over training.
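
The round trip described above (activation to text, text back to activation, scored by reconstruction quality) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: in the paper both modules are LLMs trained jointly with reinforcement learning, whereas here `toy_verbalizer` and `toy_reconstructor` are hand-written stand-ins, the three named features are invented, and negative mean squared error is just one plausible choice of reconstruction reward, not necessarily the one the authors use.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical feature names; a real residual stream has no such labeled axes.
FEATURES = ["code", "poetry", "evaluation"]


def reconstruction_reward(original: Sequence[float],
                          reconstructed: Sequence[float]) -> float:
    """Assumed reward: negative mean squared error (0.0 is a perfect match)."""
    if len(original) != len(reconstructed):
        raise ValueError("activation vectors must have the same length")
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    return 0.0 - mse


def toy_verbalizer(activation: Sequence[float]) -> str:
    """Stand-in for the AV: describe which toy features are strongly active."""
    active = [name for name, value in zip(FEATURES, activation) if value > 0.5]
    return "about " + ", ".join(active) if active else "about nothing"


def toy_reconstructor(description: str) -> List[float]:
    """Stand-in for the AR: rebuild a vector from the words in the description."""
    return [1.0 if name in description else 0.0 for name in FEATURES]


def nla_round_trip(activation: Sequence[float],
                   verbalizer: Callable[[Sequence[float]], str],
                   reconstructor: Callable[[str], List[float]]) -> Tuple[str, float]:
    """Verbalize an activation, reconstruct it from the text, and score the result."""
    description = verbalizer(activation)
    reconstructed = reconstructor(description)
    return description, reconstruction_reward(activation, reconstructed)


if __name__ == "__main__":
    text, reward = nla_round_trip([0.9, 0.2, 0.0], toy_verbalizer, toy_reconstructor)
    print(text, reward)
```

The sketch omits training entirely. Its only purpose is to show why the text bottleneck matters: the reconstructor sees nothing but the description, so the reward is high only when the description carries the information needed to rebuild the activation.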

We apply NLAs to model auditing. During our pre-deployment audit of Claude Opus 4.6, NLAs helped diagnose safety-relevant behaviors and surfaced unverbalized evaluation awareness: cases where Claude believed, but did not say, that it was being evaluated. We present these audit findings as case studies and corroborate them using independent methods. On an automated auditing benchmark requiring end-to-end investigation of an intentionally misaligned model, NLA-equipped agents outperform baselines and can succeed even without access to the misaligned model's training data.

NLAs offer a convenient interface for interpretability, with expressive natural language explanations that we can directly read. To support further work, we release training code and trained NLAs [...]

---

Outline:

(00:15) Abstract

[... 6 more sections]

---

First published:
May 7th, 2026

Source:
https://www.lesswrong.com/posts/oeYesesaxjzMAktCM/natural-language-autoencoders-produce-unsupervised

---



Narrated by TYPE III AUDIO.
