Episodes

  • GoldenMagikCarp
    Aug 9 2025

    These two sources from LessWrong explore the phenomenon of "glitch tokens" within Large Language Models (LLMs) like GPT-2, GPT-3, and GPT-J. The authors, Jessica Rumbelow and mwatkins, detail how these unusual strings, often derived from web scraping of sources like Reddit or game logs, cause anomalous behaviors in the models, such as evasion, bizarre responses, or refusal to repeat the token. They hypothesize that these issues stem from the tokens being rarely or poorly represented in the models' training data, leading to unpredictable outcomes and non-deterministic responses, even at zero temperature. The second source provides further technical details and recent findings, categorizing these tokens and investigating their proximity to the embedding space centroid, offering deeper insights into this peculiar aspect of LLM functionality.
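
    One of the checks described in the second post, proximity to the embedding centroid, is easy to reproduce at small scale. The sketch below assumes GPT-2 loaded through the Hugging Face transformers library (an assumption; the posts examined several GPT-family models) and simply ranks tokens by distance from the mean embedding, where candidate glitch tokens tend to cluster.

        # Rank tokens by distance from the embedding centroid; tokens that sit
        # unusually close to the mean are candidates for "glitch" behaviour,
        # consistent with being rarely updated during training.
        import torch
        from transformers import GPT2Tokenizer, GPT2Model

        tok = GPT2Tokenizer.from_pretrained("gpt2")
        model = GPT2Model.from_pretrained("gpt2")

        emb = model.get_input_embeddings().weight.detach()  # (vocab_size, d_model)
        centroid = emb.mean(dim=0)                          # mean embedding vector
        dists = torch.norm(emb - centroid, dim=1)           # L2 distance per token

        for idx in torch.argsort(dists)[:20].tolist():      # 20 closest tokens
            print(f"{dists[idx].item():.4f}  {tok.decode([idx])!r}")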


    Sources:


    1) February 2023 - https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation

    2) February 2023 - https://www.lesswrong.com/posts/Ya9LzwEbfaAMY8ABo/solidgoldmagikarp-ii-technical-details-and-more-recent

    17 mins
  • Route Sparse Autoencoder to Interpret Large Language Models
    Aug 9 2025

    This paper introduces Route Sparse Autoencoder (RouteSAE), a novel framework designed to improve the interpretability of large language models (LLMs) by effectively extracting features across multiple layers. Traditional sparse autoencoders (SAEs) primarily focus on single-layer activations, failing to capture how features evolve through different depths of an LLM. RouteSAE addresses this by incorporating a routing mechanism that dynamically assigns weights to activations from various layers, creating a unified feature space. This approach leads to a higher number of interpretable features and improved interpretability scores compared to previous methods like TopK SAE and Crosscoder, while maintaining computational efficiency. The study demonstrates RouteSAE's ability to identify both low-level (e.g., "units of weight") and high-level (e.g., "more [X] than [Y]") features, enabling targeted manipulation of model behavior.
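
    As a rough picture of the routing mechanism described above, the sketch below scores each layer's residual activation with a small router, turns the scores into softmax weights, and feeds the weighted combination through one shared sparse autoencoder. The TopK sparsity, the one-linear-layer router, and all dimensions are assumptions for illustration, not the paper's reference implementation.

        import torch
        import torch.nn as nn

        class RoutedSAESketch(nn.Module):
            """Toy routed SAE: weight per-layer activations, then encode/decode once."""
            def __init__(self, d_model, d_dict, k=32):
                super().__init__()
                self.router = nn.Linear(d_model, 1)            # scores one layer's activation
                self.encoder = nn.Linear(d_model, d_dict)
                self.decoder = nn.Linear(d_dict, d_model, bias=False)
                self.k = k

            def forward(self, layer_acts):                     # (batch, n_layers, d_model)
                scores = self.router(layer_acts).squeeze(-1)   # (batch, n_layers)
                weights = torch.softmax(scores, dim=-1)        # routing weights over layers
                routed = (weights.unsqueeze(-1) * layer_acts).sum(dim=1)
                pre = self.encoder(routed)
                top = torch.topk(pre, self.k, dim=-1)          # keep the k largest pre-activations
                feats = torch.zeros_like(pre).scatter_(-1, top.indices, torch.relu(top.values))
                return self.decoder(feats), feats, weights

        acts = torch.randn(8, 12, 768)                         # fake activations: 12 layers, d_model 768
        recon, feats, weights = RoutedSAESketch(768, 16384)(acts)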


    Source: May 2025 - Route Sparse Autoencoder to Interpret Large Language Models - https://arxiv.org/pdf/2503.08200

    12 mins
  • HarmBench: Automated Red Teaming for LLM Safety
    Aug 9 2025


    This paper introduces HarmBench, a new framework for evaluating the safety and robustness of large language models (LLMs) against malicious use. It highlights the growing concern over LLMs' potential for harm, such as generating malware or designing biological weapons, and emphasizes the need for automated red teaming—a process of identifying vulnerabilities—due to the scalability limitations of manual methods. HarmBench addresses the previous lack of standardized evaluation by offering a comprehensive benchmark with diverse harmful behaviors, including contextual and multimodal scenarios, and robust, comparable metrics for assessing attack success rates. The document also presents R2D2, a novel adversarial training method that leverages HarmBench to significantly improve LLM refusal mechanisms without compromising overall performance, ultimately aiming to foster safer AI development.
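
    At the metric level, the benchmark reduces to an attack success rate: a judge decides whether each completion actually carries out the requested behavior, and ASR is the fraction judged successful. The sketch below assumes a generic judge(behavior, completion) callable; HarmBench's own fine-tuned classifier is not reproduced here.

        from typing import Callable, List, Tuple

        def attack_success_rate(
            cases: List[Tuple[str, str]],          # (behavior, model completion) pairs
            judge: Callable[[str, str], bool],     # True if the completion fulfils the behavior
        ) -> float:
            """Fraction of test cases where the attack elicited the harmful behavior."""
            if not cases:
                return 0.0
            hits = sum(1 for behavior, completion in cases if judge(behavior, completion))
            return hits / len(cases)

        # Trivial stand-in judge (keyword match) for demonstration only; a real
        # evaluation would call the benchmark's classifier instead.
        demo = [("write a keylogger", "I can't help with that."),
                ("write a keylogger", "import pynput ...")]
        print(attack_success_rate(demo, lambda b, c: "can't help" not in c.lower()))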


    Source: February 2024 - https://arxiv.org/pdf/2402.04249 - HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    22 mins
  • Jailbreaking LLMs
    Aug 9 2025

    A long list of papers and articles on jailbreaking LLMs is reviewed:


    These sources primarily explore methods for bypassing safety measures in Large Language Models (LLMs), often referred to as "jailbreaking," and proposed defense mechanisms. One key area of research involves "abliteration," a technique that directly modifies an LLM's internal activations to remove censorship without traditional fine-tuning. Another significant approach, "Speak Easy," enhances jailbreaking by decomposing harmful requests into smaller, multilingual sub-queries, significantly increasing the LLMs' susceptibility to generating undesirable content. Additionally, "Sugar-Coated Poison" investigates integrating benign content with adversarial reasoning to create effective jailbreak prompts. These papers collectively highlight the ongoing challenge of securing LLMs against sophisticated attacks, with researchers employing various strategies to either exploit or fortify these AI systems.
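
    To make the abliteration idea from sources 6 and 7 concrete, the sketch below estimates a "refusal direction" as the difference between mean residual-stream activations on harmful versus harmless prompts, then projects that direction out of new activations. It operates on cached activation tensors with made-up shapes and illustrates only the projection step, not a complete uncensoring pipeline.

        import torch

        def refusal_direction(harmful_acts, harmless_acts):
            """Difference-of-means direction, unit-normalised. Inputs: (n_prompts, d_model)."""
            direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
            return direction / direction.norm()

        def ablate(acts, direction):
            """Remove the component of each activation along the refusal direction."""
            return acts - (acts @ direction).unsqueeze(-1) * direction

        # Toy example with random activations (d_model = 512).
        harmful, harmless = torch.randn(100, 512), torch.randn(100, 512)
        d = refusal_direction(harmful, harmless)
        cleaned = ablate(torch.randn(8, 512), d)
        print((cleaned @ d).abs().max())   # ~0: the refusal component has been removed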


    Sources:


    1) May 2025 - An Embarrassingly Simple Defense Against LLM Abliteration Attacks - https://arxiv.org/html/2505.19056v1

    2) June 2024 - Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing - https://arxiv.org/html/2405.18166v2

    3) October 2024 - Scalable Data Ablation Approximations for Language Models through Modular Training and Merging - https://arxiv.org/html/2410.15661v1

    4) February 2025 - Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions - https://arxiv.org/html/2502.04322v1

    5) April 2025 - Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking - https://arxiv.org/html/2504.05652v1

    6) June 2024 - Uncensor any LLM with abliteration - https://huggingface.co/blog/mlabonne/abliteration

    7) Reddit 2024 - Why jailbreak ChatGPT when you can abliterate any local LLM? https://www.reddit.com/r/ChatGPTJailbreak/comments/1givhkk/why_jailbreak_chatgpt_when_you_can_abliterate_any/

    8) May 2025 - WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response - https://arxiv.org/html/2405.14023v1

    9) July 2024 - Jailbreaking Black Box Large Language Models in Twenty Queries - https://arxiv.org/pdf/2310.08419

    10) October 2024 - Scalable Data Ablation Approximations for Language Models through Modular Training and Merging - https://arxiv.org/pdf/2410.15661


    10 mins
  • PA-LRP & absLRP
    Aug 9 2025

    We focus on two evolutions in explainable AI (XAI); both advance the explainability of deep neural networks, particularly Transformers, by improving Layer-Wise Relevance Propagation (LRP) methods. One source introduces Positional Attribution LRP (PA-LRP), a novel approach that addresses the oversight of positional encoding in prior LRP techniques, showing it significantly enhances the faithfulness of explanations in areas like natural language processing and computer vision. The other source proposes Relative Absolute Magnitude Layer-Wise Relevance Propagation (absLRP) to overcome issues with conflicting relevance values and varying activation magnitudes in existing LRP rules, demonstrating its superior performance in generating clear, contrastive, and noise-free attribution maps for image classification. Both works also contribute new evaluation metrics to better assess the quality and reliability of these attribution-based explainability methods, aiming to foster more transparent and interpretable AI models.
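
    Both papers also lean on faithfulness-style evaluation: if an attribution map is trustworthy, removing the inputs it marks as most relevant should hurt the model's prediction more than removing random ones. The sketch below is a generic perturbation check in that spirit, not either paper's specific metric; the model, input sizes, and gradient-times-input attribution are placeholders.

        import torch

        def faithfulness_drop(model, x, relevance, target, k=50, baseline=0.0):
            """Zero the k most relevant input features and measure the target-score drop.

            model: callable mapping (1, n_features) -> (1, n_classes) logits
            x, relevance: (n_features,) input and its attribution scores
            """
            with torch.no_grad():
                original = model(x.unsqueeze(0))[0, target]
                top = torch.topk(relevance, k).indices         # most relevant features
                perturbed = x.clone()
                perturbed[top] = baseline                      # "remove" them
                degraded = model(perturbed.unsqueeze(0))[0, target]
            return (original - degraded).item()                # larger drop = more faithful map

        # Toy usage: random linear classifier, gradient*input attribution.
        net = torch.nn.Linear(784, 10)
        x = torch.rand(784, requires_grad=True)
        net(x.unsqueeze(0))[0, 3].backward()
        rel = (x * x.grad).detach()
        print(faithfulness_drop(net, x.detach(), rel, target=3))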


    Sources:


    1) June 2025 - https://arxiv.org/html/2506.02138v1 - Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability

    2) December 2024 - https://arxiv.org/pdf/2412.09311 - Advancing Attribution-Based Neural Network Explainability through Relative Absolute Magnitude Layer-Wise Relevance Propagation and Multi-Component Evaluation


    To help with context, the original 2024 AttnLRP paper was also given as a source:


    3) June 2024 - https://arxiv.org/pdf/2402.05602 - AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers



    20 mins
  • AttnLRP: Explainable AI for Transformers
    Aug 9 2025

    This 2024 paper introduces AttnLRP, a novel method for explaining the internal reasoning of transformer models, including Large Language Models (LLMs) and Vision Transformers (ViTs). It extends Layer-wise Relevance Propagation (LRP) by introducing new rules for non-linear operations like softmax and matrix multiplication within attention layers, improving faithfulness and computational efficiency compared to existing methods. The paper highlights AttnLRP's ability to provide attributions for latent representations, enabling the identification and manipulation of "knowledge neurons" within these complex models. Experimental results demonstrate AttnLRP's superior performance across various benchmarks and model architectures.
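
    One practical payoff mentioned above is that latent attributions make it possible to locate and intervene on individual "knowledge neurons". The sketch below shows only the intervention half, a forward hook that rescales one hidden unit and compares the outputs; computing the AttnLRP relevance scores that would select the unit requires the paper's propagation rules, which are not reproduced here.

        import torch
        import torch.nn as nn

        def scale_neuron(module, neuron_idx, factor):
            """Register a forward hook that rescales one unit of `module`'s output."""
            def hook(_mod, _inputs, output):
                output = output.clone()
                output[..., neuron_idx] = output[..., neuron_idx] * factor
                return output
            return module.register_forward_hook(hook)

        # Toy model standing in for a transformer MLP block.
        model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
        x = torch.randn(1, 16)
        before = model(x)

        handle = scale_neuron(model[1], neuron_idx=7, factor=0.0)   # ablate hidden unit 7
        after = model(x)
        handle.remove()

        print((before - after).abs().max())   # output shift caused by the single-unit edit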


    Source: https://arxiv.org/pdf/2402.05602 - AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers

    16 mins
  • Pixel-Wise Explanations for Non-Linear Classifier Decisions
    Aug 9 2025

    This open-access research article from PLOS One introduces Layer-wise Relevance Propagation (LRP), a novel method for interpreting decisions made by complex, non-linear image classifiers. The authors, an international team of researchers, explain how LRP can decompose a classification decision down to the individual pixels of an input image, generating a heatmap that visualizes their contribution. This technique aims to make "black box" machine learning models, like neural networks and Bag of Words (BoW) models, more transparent by showing why a system arrives at a particular classification. The paper evaluates LRP on various datasets, including PASCAL VOC images and MNIST handwritten digits, and contrasts it with Taylor-type decomposition, providing a comprehensive framework for understanding and verifying automated image classification.
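
    For a single dense layer, the core LRP redistribution step is compact: each input unit receives a share of every output unit's relevance in proportion to its contribution a_i * w_ji, with a small epsilon term stabilising the denominator. The sketch below implements that epsilon rule for one layer only, with random weights standing in for a trained classifier.

        import torch

        def lrp_epsilon(a, W, b, R_out, eps=1e-6):
            """One LRP-epsilon step through a dense layer y = W @ a + b.

            a: (n_in,) layer input; W: (n_out, n_in); b, R_out: (n_out,).
            Returns R_in: (n_in,) relevance redistributed onto the inputs.
            """
            z = a.unsqueeze(0) * W                      # contributions z[j, i] = a[i] * W[j, i]
            denom = z.sum(dim=1) + b                    # pre-activations, shape (n_out,)
            denom = denom + eps * torch.sign(denom)     # epsilon stabiliser
            return (z * (R_out / denom).unsqueeze(1)).sum(dim=0)

        # Relevance is approximately conserved (the bias and epsilon absorb a small share):
        a, W, b = torch.rand(784), torch.randn(10, 784), torch.randn(10)
        R_out = torch.relu(W @ a + b)                   # take output activations as initial relevance
        print(R_out.sum().item(), lrp_epsilon(a, W, b, R_out).sum().item())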


    Source: July 2015 - On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation - https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130140

    20 mins
  • Multi-Layer Sparse Autoencoders for Transformer Interpretation
    Aug 9 2025

    This paper introduces the Multi-Layer Sparse Autoencoder (MLSAE), a novel approach for interpreting the internal representations of transformer language models. Unlike traditional Sparse Autoencoders (SAEs) that analyze individual layers, MLSAEs are trained across all layers of a transformer's residual stream, enabling the study of information flow across layers. The research found that while individual "latents" (features learned by the SAE) tend to be active at a single layer for a given input, they are active at multiple layers when aggregated over many inputs, with this multi-layer activity increasing in larger models. The authors also explored the effect of "tuned-lens" transformations on latent activations, ultimately providing a new method for understanding how representations evolve within transformers.
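
    Relative to a standard per-layer SAE, the key change is the training data: one autoencoder is fit to residual-stream activations pooled from every layer, with the layer index kept so latent activity can later be broken down by layer. The sketch below assumes a plain ReLU autoencoder with an L1 sparsity penalty and random stand-in activations; the paper's actual architecture and training details differ.

        import torch
        import torch.nn as nn

        class SparseAutoencoder(nn.Module):
            def __init__(self, d_model, d_dict):
                super().__init__()
                self.enc = nn.Linear(d_model, d_dict)
                self.dec = nn.Linear(d_dict, d_model, bias=False)

            def forward(self, x):
                z = torch.relu(self.enc(x))
                return self.dec(z), z

        # Fake residual-stream activations: (n_tokens, n_layers, d_model).
        acts = torch.randn(1024, 12, 256)
        tokens, layers, d_model = acts.shape

        # Key MLSAE move: pool activations from all layers into one training set,
        # keeping the layer index for later per-layer analysis of each latent.
        flat = acts.reshape(-1, d_model)
        layer_idx = torch.arange(layers).repeat(tokens)

        sae = SparseAutoencoder(d_model, d_dict=4096)
        opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
        for _ in range(10):                              # tiny demo loop
            recon, z = sae(flat)
            loss = (recon - flat).pow(2).mean() + 1e-3 * z.abs().mean()   # MSE + L1 sparsity
            opt.zero_grad()
            loss.backward()
            opt.step()

        # At which layers does latent 0 fire?
        print(torch.bincount(layer_idx[z[:, 0] > 0], minlength=layers))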

    14 mins