
Two Minds, One Model

By: John Jezl and Jon Rocha

About this listen

Two Minds, One Model is a podcast dedicated to exploring topics in Machine Learning and Artificial Intelligence, hosted by John Jezl and Jon Rocha and recorded at Sonoma State University.
Episodes
  • Circuit Tracing: Attribution Graphs and the Grammar of Neural Networks
    Dec 5 2025

    This episode explores how Anthropic researchers successfully scaled sparse autoencoders from toy models to Claude 3 Sonnet's 8 billion neurons, extracting 34 million interpretable features, including ones for deception, sycophancy, and the famous Golden Gate Bridge example. The discussion emphasizes both the breakthrough achievement of making interpretability techniques work at production scale and the sobering limitations, including 65% reconstruction accuracy, millions of dollars in compute costs, and the growing gap between interpretability research and rapid advances in model capabilities.

    Credits

    • Cover Art by Brianna Williams
    • TMOM Intro Music by Danny Meza

    A special thank you to these talented artists for their contributions to the show.

    Links and References

    Academic Papers

    • Circuit Tracing: Revealing Computational Graphs in Language Models - Anthropic (March 2025)

    • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - Anthropic (October 2023)

    • Toy Models of Superposition - Anthropic (December 2022)

    • Alignment Faking in Large Language Models - Anthropic (December 2024)

    • Agentic Misalignment: How LLMs Could Be Insider Threats - Anthropic (January 2025)

    • Attention Is All You Need - Vaswani et al. (June 2017)

    • In-Context Learning and Induction Heads - Anthropic (March 2022)

    News

    • Anthropic Project Fetch / Robot Dogs

    • Anduril's Fury unmanned fighter jet

    • MIT search and rescue robot navigation

    Abandoned Episode Titles

    • “Westworld But It's Just 10 Terabytes of RAM Trying to Understand Haiku”
    • “Star Trek: The Wrath of O(n⁴)”
    • “The Deception Is Coming From Inside the Network”
    • "We Have the Bestest Circuits”
    • “Lobotomy Validation: The Funnier, More Scientifically Sound Term”
    • “Seven San Franciscos Worth of Power and All We Got Was This Attribution Graph”

    57 mins
  • 34 Million Features Later: What Researchers Found Inside Claude's World Model
    Nov 8 2025

    This episode explores how Anthropic researchers successfully scaled sparse autoencoders from toy models to Claude 3 Sonnet's 8 billion neurons, extracting 34 million interpretable features, including ones for deception, sycophancy, and the famous Golden Gate Bridge example. The discussion emphasizes both the breakthrough achievement of making interpretability techniques work at production scale and the sobering limitations, including 65% reconstruction accuracy, millions of dollars in compute costs, and the growing gap between interpretability research and rapid advances in model capabilities.
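
    The "65% reconstruction accuracy" mentioned above presumably refers to how much of the original activation variance the sparse autoencoder's reconstructions capture. Below is a minimal sketch, in plain Python/NumPy, of one common way to measure that kind of fidelity as a fraction of variance explained; the activation matrix and SAE weights are random placeholders for illustration, not Anthropic's model, trained dictionary, or exact metric.

    # Minimal sketch (not Anthropic's code): measuring SAE reconstruction fidelity
    # as the fraction of activation variance explained. The activations and SAE
    # weights below are random placeholders; real use would load activations from
    # a language model and weights from a trained sparse autoencoder.
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_features, n_tokens = 64, 512, 10_000

    acts = rng.normal(size=(n_tokens, d_model))            # stand-in activations
    W_enc = 0.1 * rng.normal(size=(d_model, n_features))   # stand-in encoder weights
    W_dec = 0.1 * rng.normal(size=(n_features, d_model))   # stand-in decoder weights

    feats = np.maximum(acts @ W_enc, 0.0)                  # ReLU feature activations
    recon = feats @ W_dec                                   # reconstructed activations

    residual = ((acts - recon) ** 2).sum()
    total = ((acts - acts.mean(axis=0)) ** 2).sum()
    print(f"fraction of variance explained: {1.0 - residual / total:.2%}")

    With random weights the printed number is meaningless (it can even be negative); the point is only the shape of the computation that a headline figure like 65% summarizes.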

    Credits

    • Cover Art by Brianna Williams

    • TMOM Intro Music by Danny Meza

    A special thank you to these talented artists for their contributions to the show.


    Links and References


    Academic Papers

    • Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet - https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html - Anthropic (May 2024)

    • Toy Models of Superposition - https://transformer-circuits.pub/2022/toy_model/index.html - Anthropic (December 2022)

    • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - https://transformer-circuits.pub/2023/monosemantic-features - Anthropic (October 2023)

    • Alignment Faking in Large Language Models - https://www.anthropic.com/research/alignment-faking - Anthropic (December 2024)

    • Agentic Misalignment: How LLMs Could Be Insider Threats - https://www.anthropic.com/research/agentic-misalignment - Anthropic (January 2025)

    News

    • OpenAI-AMD Partnership: official announcement - https://ir.amd.com/news-events/press-releases/detail/1260/amd-and-openai-announce-strategic-partnership-to-deploy-6-gigawatts-of-amd-gpus

    • OpenAI IPO: sources for the reported $1 trillion valuation - https://seekingalpha.com/news/4510992-openai-eyes-record-breaking-1-trillion-ipo---report

    • Hospital Bill Reduction: case study of a family using Claude AI to reduce a $195K hospital bill to $33K - https://www.tomshardware.com/tech-industry/artificial-intelligence/grieving-family-uses-ai-chatbot-to-cut-hospital-bill-from-usd195-000-to-usd33-000-family-says-claude-highlighted-duplicative-charges-improper-coding-and-other-violations

    Other

    • GPT-5 Auto-routing: OpenAI's model routing feature and user reception - https://fortune.com/2025/08/12/openai-gpt-5-model-router-backlash-ai-future/

    Abandoned Episode Titles

    "The Empire Scales Back: How We Found the Deception Star"

    "Fantastic Features and Where to Find Them: A 15-Million-X Adventure"


    "The Fellowship of the Residual Stream: One Dictionary to Rule Them All"

    "65% of the Time, It Works Every Time: An Anchorman's Guide to AI Interpretability"


    1 hr
  • Decomposing Superposition: Sparse Autoencoders for Neural Network Interpretability
    Nov 4 2025

    This episode explores how sparse autoencoders can decode the phenomenon of superposition in neural networks, demonstrating that the seemingly impenetrable compression of features into neurons can be partially reversed to extract interpretable, causal features. The discussion centers on an Anthropic research paper that successfully maps specific behaviors to discrete neural network locations in a 512-neuron model, proving that interpretability is achievable though computationally expensive, with important implications for AI safety and control mechanisms.
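
    For readers who want to see the mechanics, here is a minimal sketch of the sparse-autoencoder recipe the episode describes: activations are expanded into a wider, mostly-zero feature space by a ReLU encoder, reconstructed by a linear decoder, and trained on reconstruction error plus an L1 sparsity penalty. The dimensions, penalty coefficient, and synthetic data are illustrative assumptions, not values from the paper.

    # Minimal sketch (illustrative, not the paper's code) of a sparse autoencoder:
    # a ReLU encoder expands d_model activations into a wider feature space, a
    # linear decoder reconstructs them, and training balances reconstruction
    # error against an L1 sparsity penalty on the feature activations.
    import torch
    import torch.nn as nn

    d_model, n_features = 512, 4096        # e.g. a 512-neuron layer expanded 8x

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model: int, n_features: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, n_features)
            self.decoder = nn.Linear(n_features, d_model)

        def forward(self, x):
            feats = torch.relu(self.encoder(x))   # sparse feature activations
            return self.decoder(feats), feats

    sae = SparseAutoencoder(d_model, n_features)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    l1_coeff = 1e-3                        # sparsity penalty strength (assumed)

    for step in range(200):
        acts = torch.randn(256, d_model)   # stand-in for real MLP activations
        recon, feats = sae(acts)
        loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    In the real setting the stand-in activations would be replaced by activations collected from the model being studied, and the trained decoder directions become the dictionary of candidate interpretable features.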

    Credits

    • Cover Art by Brianna Williams

    • TMOM Intro Music by Danny Meza

    A special thank you to these talented artists for their contributions to the show.

    Links and References

    Academic Papers

    • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - https://transformer-circuits.pub/2023/monosemantic-features - Anthropic (October 2023)

    • Toy Models of Superposition - https://transformer-circuits.pub/2022/toy_model/index.html - Anthropic (December 2022)

    • Alignment Faking in Large Language Models - https://www.anthropic.com/research/alignment-faking - Anthropic (December 2024)

    • Agentic Misalignment: How LLMs Could Be Insider Threats - https://www.anthropic.com/research/agentic-misalignment - Anthropic (January 2025)

    News

    • DeepSeek OCR Model Release - https://deepseek.ai/blog/deepseek-ocr-context-compression

    • Meta AI Division Layoffs - https://www.nytimes.com/2025/10/22/technology/meta-plans-to-cut-600-jobs-at-ai-superintelligence-labs.html

    • Apple M5 Chip Announcement - https://www.apple.com/newsroom/2025/10/apple-unleashes-m5-the-next-big-leap-in-ai-performance-for-apple-silicon/

    • Anthropic Claude Haiku 4.5 - https://www.anthropic.com/news/claude-haiku-4-5

    Other

    • Jon Stewart interview with Geoffrey Hinton - https://www.youtube.com/watch?v=jrK3PsD3APk

    • Blake Lemoine and AI Psychosis - https://www.youtube.com/watch?v=kgCUn4fQTsc


    Abandoned Episode Titles

    • "Star Trek: The Wrath of Polysemanticity"

    • "The Hitchhiker's Guide to the Neuron: Don't Panic, It's Just Superposition"

      "Honey, I Shrunk the Features (Then Expanded Them 256x)"

      "The Legend of Zelda: 131,000 Links Between Neurons"

    53 mins