Episodes

  • GoldenMagikCarp
    Aug 9 2025

    These two sources from LessWrong explore the phenomenon of "glitch tokens" within Large Language Models (LLMs) like GPT-2, GPT-3, and GPT-J. The authors, Jessica Rumbelow and mwatkins, detail how these unusual strings, often derived from web scraping of sources like Reddit or game logs, cause anomalous behaviors in the models, such as evasion, bizarre responses, or refusal to repeat the token. They hypothesize that these issues stem from the tokens being rarely or poorly represented in the models' training data, leading to unpredictable outcomes and non-deterministic responses, even at zero temperature. The second source provides further technical details and recent findings, categorizing these tokens and investigating their proximity to the embedding space centroid, offering deeper insights into this peculiar aspect of LLM functionality.
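
    One of the checks described in the second post, proximity to the embedding centroid, is easy to reproduce at small scale. The sketch below assumes GPT-2 loaded through the Hugging Face transformers library (an assumption; the posts examined several GPT-family models) and simply ranks tokens by distance from the mean embedding, where candidate glitch tokens tend to cluster.

        # Rank tokens by distance from the embedding centroid; tokens that sit
        # unusually close to the mean are candidates for "glitch" behaviour,
        # consistent with being rarely updated during training.
        import torch
        from transformers import GPT2Tokenizer, GPT2Model

        tok = GPT2Tokenizer.from_pretrained("gpt2")
        model = GPT2Model.from_pretrained("gpt2")

        emb = model.get_input_embeddings().weight.detach()  # (vocab_size, d_model)
        centroid = emb.mean(dim=0)                          # mean embedding vector
        dists = torch.norm(emb - centroid, dim=1)           # L2 distance per token

        for idx in torch.argsort(dists)[:20].tolist():      # 20 closest tokens
            print(f"{dists[idx].item():.4f}  {tok.decode([idx])!r}")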


    Sources:


    1) February 2023 - https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation

    2) February 2023 - https://www.lesswrong.com/posts/Ya9LzwEbfaAMY8ABo/solidgoldmagikarp-ii-technical-details-and-more-recent

    17 mins
  • Route Sparse Autoencoder to Interpret Large Language Models
    Aug 9 2025

    This paper introduces Route Sparse Autoencoder (RouteSAE), a novel framework designed to improve the interpretability of large language models (LLMs) by effectively extracting features across multiple layers. Traditional sparse autoencoders (SAEs) primarily focus on single-layer activations, failing to capture how features evolve through different depths of an LLM. RouteSAE addresses this by incorporating a routing mechanism that dynamically assigns weights to activations from various layers, creating a unified feature space. This approach leads to a higher number of interpretable features and improved interpretability scores compared to previous methods like TopK SAE and Crosscoder, while maintaining computational efficiency. The study demonstrates RouteSAE's ability to identify both low-level (e.g., "units of weight") and high-level (e.g., "more [X] than [Y]") features, enabling targeted manipulation of model behavior.
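
    As a rough picture of the routing mechanism described above, the sketch below scores each layer's residual activation with a small router, turns the scores into softmax weights, and feeds the weighted combination through one shared sparse autoencoder. The TopK sparsity, the one-linear-layer router, and all dimensions are assumptions for illustration, not the paper's reference implementation.

        import torch
        import torch.nn as nn

        class RoutedSAESketch(nn.Module):
            """Toy routed SAE: weight per-layer activations, then encode/decode once."""
            def __init__(self, d_model, d_dict, k=32):
                super().__init__()
                self.router = nn.Linear(d_model, 1)            # scores one layer's activation
                self.encoder = nn.Linear(d_model, d_dict)
                self.decoder = nn.Linear(d_dict, d_model, bias=False)
                self.k = k

            def forward(self, layer_acts):                     # (batch, n_layers, d_model)
                scores = self.router(layer_acts).squeeze(-1)   # (batch, n_layers)
                weights = torch.softmax(scores, dim=-1)        # routing weights over layers
                routed = (weights.unsqueeze(-1) * layer_acts).sum(dim=1)
                pre = self.encoder(routed)
                top = torch.topk(pre, self.k, dim=-1)          # keep the k largest pre-activations
                feats = torch.zeros_like(pre).scatter_(-1, top.indices, torch.relu(top.values))
                return self.decoder(feats), feats, weights

        acts = torch.randn(8, 12, 768)                         # fake activations: 12 layers, d_model 768
        recon, feats, weights = RoutedSAESketch(768, 16384)(acts)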


    Source: May 2025 - Route Sparse Autoencoder to Interpret Large Language Models - https://arxiv.org/pdf/2503.08200

    12 mins
  • HarmBench: Automated Red Teaming for LLM Safety
    Aug 9 2025


    This paper introduces HarmBench, a new framework for evaluating the safety and robustness of large language models (LLMs) against malicious use. It highlights the growing concern over LLMs' potential for harm, such as generating malware or designing biological weapons, and emphasizes the need for automated red teaming—a process of identifying vulnerabilities—due to the scalability limitations of manual methods. HarmBench addresses the previous lack of standardized evaluation by offering a comprehensive benchmark with diverse harmful behaviors, including contextual and multimodal scenarios, and robust, comparable metrics for assessing attack success rates. The document also presents R2D2, a novel adversarial training method that leverages HarmBench to significantly improve LLM refusal mechanisms without compromising overall performance, ultimately aiming to foster safer AI development.
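
    At the metric level, the benchmark reduces to an attack success rate: a judge decides whether each completion actually carries out the requested behavior, and ASR is the fraction judged successful. The sketch below assumes a generic judge(behavior, completion) callable; HarmBench's own fine-tuned classifier is not reproduced here.

        from typing import Callable, List, Tuple

        def attack_success_rate(
            cases: List[Tuple[str, str]],          # (behavior, model completion) pairs
            judge: Callable[[str, str], bool],     # True if the completion fulfils the behavior
        ) -> float:
            """Fraction of test cases where the attack elicited the harmful behavior."""
            if not cases:
                return 0.0
            hits = sum(1 for behavior, completion in cases if judge(behavior, completion))
            return hits / len(cases)

        # Trivial stand-in judge (keyword match) for demonstration only; a real
        # evaluation would call the benchmark's classifier instead.
        demo = [("write a keylogger", "I can't help with that."),
                ("write a keylogger", "import pynput ...")]
        print(attack_success_rate(demo, lambda b, c: "can't help" not in c.lower()))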


    Source: February 2024 - https://arxiv.org/pdf/2402.04249 - HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    22 mins
  • Jailbreaking LLMs
    Aug 9 2025

    A long list of papers and articles on jailbreaking LLMs is reviewed:


    These sources primarily explore methods for bypassing safety measures in Large Language Models (LLMs), often referred to as "jailbreaking," and proposed defense mechanisms. One key area of research involves "abliteration," a technique that directly modifies an LLM's internal activations to remove censorship without traditional fine-tuning. Another significant approach, "Speak Easy," enhances jailbreaking by decomposing harmful requests into smaller, multilingual sub-queries, significantly increasing the LLMs' susceptibility to generating undesirable content. Additionally, "Sugar-Coated Poison" investigates integrating benign content with adversarial reasoning to create effective jailbreak prompts. These papers collectively highlight the ongoing challenge of securing LLMs against sophisticated attacks, with researchers employing various strategies to either exploit or fortify these AI systems.
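
    To make the abliteration idea from sources 6 and 7 concrete, the sketch below estimates a "refusal direction" as the difference between mean residual-stream activations on harmful versus harmless prompts, then projects that direction out of new activations. It operates on cached activation tensors with made-up shapes and illustrates only the projection step, not a complete uncensoring pipeline.

        import torch

        def refusal_direction(harmful_acts, harmless_acts):
            """Difference-of-means direction, unit-normalised. Inputs: (n_prompts, d_model)."""
            direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
            return direction / direction.norm()

        def ablate(acts, direction):
            """Remove the component of each activation along the refusal direction."""
            return acts - (acts @ direction).unsqueeze(-1) * direction

        # Toy example with random activations (d_model = 512).
        harmful, harmless = torch.randn(100, 512), torch.randn(100, 512)
        d = refusal_direction(harmful, harmless)
        cleaned = ablate(torch.randn(8, 512), d)
        print((cleaned @ d).abs().max())   # ~0: the refusal component has been removed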


    Sources:


    1) May 2025 - An Embarrassingly Simple Defense Against LLM Abliteration Attacks - https://arxiv.org/html/2505.19056v1

    2) June 2024 - Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing - https://arxiv.org/html/2405.18166v2

    3) October 2024 - Scalable Data Ablation Approximations for Language Models through Modular Training and Merging - https://arxiv.org/html/2410.15661v1

    4) February 2025 - Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions - https://arxiv.org/html/2502.04322v1

    5) April 2025 - Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking - https://arxiv.org/html/2504.05652v1

    6) June 2024 - Uncensor any LLM with abliteration - https://huggingface.co/blog/mlabonne/abliteration

    7) Reddit 2024 - Why jailbreak ChatGPT when you can abliterate any local LLM? https://www.reddit.com/r/ChatGPTJailbreak/comments/1givhkk/why_jailbreak_chatgpt_when_you_can_abliterate_any/

    8) May 2025 - WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response - https://arxiv.org/html/2405.14023v1

    9) July 2024 - Jailbreaking Black Box Large Language Models in Twenty Queries - https://arxiv.org/pdf/2310.08419

    10) October 2024 - Scalable Data Ablation Approximations for Language Models through Modular Training and Merging - https://arxiv.org/pdf/2410.15661


    10 mins
  • PA-LRP & absLRP
    Aug 9 2025

    We focus on two evolutions in explainable AI (XAI); both advance the explainability of deep neural networks, particularly Transformers, by improving Layer-Wise Relevance Propagation (LRP) methods. One source introduces Positional Attribution LRP (PA-LRP), a novel approach that addresses the oversight of positional encoding in prior LRP techniques, showing it significantly enhances the faithfulness of explanations in areas like natural language processing and computer vision. The other source proposes Relative Absolute Magnitude Layer-Wise Relevance Propagation (absLRP) to overcome issues with conflicting relevance values and varying activation magnitudes in existing LRP rules, demonstrating its superior performance in generating clear, contrastive, and noise-free attribution maps for image classification. Both works also contribute new evaluation metrics to better assess the quality and reliability of these attribution-based explainability methods, aiming to foster more transparent and interpretable AI models.
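
    Both papers also lean on faithfulness-style evaluation: if an attribution map is trustworthy, removing the inputs it marks as most relevant should hurt the model's prediction more than removing random ones. The sketch below is a generic perturbation check in that spirit, not either paper's specific metric; the model, input sizes, and gradient-times-input attribution are placeholders.

        import torch

        def faithfulness_drop(model, x, relevance, target, k=50, baseline=0.0):
            """Zero the k most relevant input features and measure the target-score drop.

            model: callable mapping (1, n_features) -> (1, n_classes) logits
            x, relevance: (n_features,) input and its attribution scores
            """
            with torch.no_grad():
                original = model(x.unsqueeze(0))[0, target]
                top = torch.topk(relevance, k).indices         # most relevant features
                perturbed = x.clone()
                perturbed[top] = baseline                      # "remove" them
                degraded = model(perturbed.unsqueeze(0))[0, target]
            return (original - degraded).item()                # larger drop = more faithful map

        # Toy usage: random linear classifier, gradient*input attribution.
        net = torch.nn.Linear(784, 10)
        x = torch.rand(784, requires_grad=True)
        net(x.unsqueeze(0))[0, 3].backward()
        rel = (x * x.grad).detach()
        print(faithfulness_drop(net, x.detach(), rel, target=3))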


    Sources:


    1) June 2025 - https://arxiv.org/html/2506.02138v1 - Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability

    2) December 2024 - https://arxiv.org/pdf/2412.09311 - Advancing Attribution-Based Neural Network Explainability through Relative Absolute Magnitude Layer-Wise Relevance Propagation and Multi-Component Evaluation


    To help with context, the original 2024 AttnLRP paper was also given as a source:


    3) June 2024 - https://arxiv.org/pdf/2402.05602 - AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers



    20 mins
  • AttnLRP: Explainable AI for Transformers
    Aug 9 2025

    This 2024 paper introduces AttnLRP, a novel method for explaining the internal reasoning of transformer models, including Large Language Models (LLMs) and Vision Transformers (ViTs). It extends Layer-wise Relevance Propagation (LRP) by introducing new rules for non-linear operations like softmax and matrix multiplication within attention layers, improving faithfulness and computational efficiency compared to existing methods. The paper highlights AttnLRP's ability to provide attributions for latent representations, enabling the identification and manipulation of "knowledge neurons" within these complex models. Experimental results demonstrate AttnLRP's superior performance across various benchmarks and model architectures.
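
    One practical payoff mentioned above is that latent attributions make it possible to locate and intervene on individual "knowledge neurons". The sketch below shows only the intervention half, a forward hook that rescales one hidden unit and compares the outputs; computing the AttnLRP relevance scores that would select the unit requires the paper's propagation rules, which are not reproduced here.

        import torch
        import torch.nn as nn

        def scale_neuron(module, neuron_idx, factor):
            """Register a forward hook that rescales one unit of `module`'s output."""
            def hook(_mod, _inputs, output):
                output = output.clone()
                output[..., neuron_idx] = output[..., neuron_idx] * factor
                return output
            return module.register_forward_hook(hook)

        # Toy model standing in for a transformer MLP block.
        model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
        x = torch.randn(1, 16)
        before = model(x)

        handle = scale_neuron(model[1], neuron_idx=7, factor=0.0)   # ablate hidden unit 7
        after = model(x)
        handle.remove()

        print((before - after).abs().max())   # output shift caused by the single-unit edit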


    Source: https://arxiv.org/pdf/2402.05602 - AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers

    16 mins
  • Pixel-Wise Explanations for Non-Linear Classifier Decisions
    Aug 9 2025

    This open-access research article from PLOS One introduces Layer-wise Relevance Propagation (LRP), a novel method for interpreting decisions made by complex, non-linear image classifiers. The authors, an international team of researchers, explain how LRP can decompose a classification decision down to the individual pixels of an input image, generating a heatmap that visualizes their contribution. This technique aims to make "black box" machine learning models, like neural networks and Bag of Words (BoW) models, more transparent by showing why a system arrives at a particular classification. The paper evaluates LRP on various datasets, including PASCAL VOC images and MNIST handwritten digits, and contrasts it with Taylor-type decomposition, providing a comprehensive framework for understanding and verifying automated image classification.
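
    For a single dense layer, the core LRP redistribution step is compact: each input unit receives a share of every output unit's relevance in proportion to its contribution a_i * w_ji, with a small epsilon term stabilising the denominator. The sketch below implements that epsilon rule for one layer only, with random weights standing in for a trained classifier.

        import torch

        def lrp_epsilon(a, W, b, R_out, eps=1e-6):
            """One LRP-epsilon step through a dense layer y = W @ a + b.

            a: (n_in,) layer input; W: (n_out, n_in); b, R_out: (n_out,).
            Returns R_in: (n_in,) relevance redistributed onto the inputs.
            """
            z = a.unsqueeze(0) * W                      # contributions z[j, i] = a[i] * W[j, i]
            denom = z.sum(dim=1) + b                    # pre-activations, shape (n_out,)
            denom = denom + eps * torch.sign(denom)     # epsilon stabiliser
            return (z * (R_out / denom).unsqueeze(1)).sum(dim=0)

        # Relevance is approximately conserved (the bias and epsilon absorb a small share):
        a, W, b = torch.rand(784), torch.randn(10, 784), torch.randn(10)
        R_out = torch.relu(W @ a + b)                   # take output activations as initial relevance
        print(R_out.sum().item(), lrp_epsilon(a, W, b, R_out).sum().item())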


    Source: July 2015 - On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation - https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130140

    20 mins
  • Multi-Layer Sparse Autoencoders for Transformer Interpretation
    Aug 9 2025

    This paper introduces the Multi-Layer Sparse Autoencoder (MLSAE), a novel approach for interpreting the internal representations of transformer language models. Unlike traditional Sparse Autoencoders (SAEs) that analyze individual layers, MLSAEs are trained across all layers of a transformer's residual stream, enabling the study of information flow across layers. The research found that while individual "latents" (features learned by the SAE) tend to be active at a single layer for a given input, they are active at multiple layers when aggregated over many inputs, with this multi-layer activity increasing in larger models. The authors also explored the effect of "tuned-lens" transformations on latent activations, ultimately providing a new method for understanding how representations evolve within transformers.
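
    Relative to a standard per-layer SAE, the key change is the training data: one autoencoder is fit to residual-stream activations pooled from every layer, with the layer index kept so latent activity can later be broken down by layer. The sketch below assumes a plain ReLU autoencoder with an L1 sparsity penalty and random stand-in activations; the paper's actual architecture and training details differ.

        import torch
        import torch.nn as nn

        class SparseAutoencoder(nn.Module):
            def __init__(self, d_model, d_dict):
                super().__init__()
                self.enc = nn.Linear(d_model, d_dict)
                self.dec = nn.Linear(d_dict, d_model, bias=False)

            def forward(self, x):
                z = torch.relu(self.enc(x))
                return self.dec(z), z

        # Fake residual-stream activations: (n_tokens, n_layers, d_model).
        acts = torch.randn(1024, 12, 256)
        tokens, layers, d_model = acts.shape

        # Key MLSAE move: pool activations from all layers into one training set,
        # keeping the layer index for later per-layer analysis of each latent.
        flat = acts.reshape(-1, d_model)
        layer_idx = torch.arange(layers).repeat(tokens)

        sae = SparseAutoencoder(d_model, d_dict=4096)
        opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
        for _ in range(10):                              # tiny demo loop
            recon, z = sae(flat)
            loss = (recon - flat).pow(2).mean() + 1e-3 * z.abs().mean()   # MSE + L1 sparsity
            opt.zero_grad()
            loss.backward()
            opt.step()

        # At which layers does latent 0 fire?
        print(torch.bincount(layer_idx[z[:, 0] > 0], minlength=layers))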

    14 mins