AI: AX - introspection

By: mcgrof

About this listen

The art of looking into a model and understanding what is going on through introspection is referred to as AX.
Episodes
  • SolidGoldMagikarp
    Aug 9 2025

    These two sources from LessWrong explore the phenomenon of "glitch tokens" within Large Language Models (LLMs) like GPT-2, GPT-3, and GPT-J. The authors, Jessica Rumbelow and mwatkins, detail how these unusual strings, often derived from web scraping of sources like Reddit or game logs, cause anomalous behaviors in the models, such as evasion, bizarre responses, or refusal to repeat the token. They hypothesize that these issues stem from the tokens being rarely or poorly represented in the models' training data, leading to unpredictable outcomes and non-deterministic responses, even at zero temperature. The second source provides further technical details and recent findings, categorizing these tokens and investigating their proximity to the embedding space centroid, offering deeper insights into this peculiar aspect of LLM functionality.


    Sources:


    1) February 2023 - https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation

    2) February 2023 - https://www.lesswrong.com/posts/Ya9LzwEbfaAMY8ABo/solidgoldmagikarp-ii-technical-details-and-more-recent
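
    The centroid-proximity analysis from the second post can be sketched in miniature. This is an illustrative toy, not the authors' code: it uses a small random matrix as a stand-in for GPT-2's real embedding matrix (50257 x 768), and the "glitch" token indices are planted by hand so the ranking step has something to find.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stand-in for a model's token embedding matrix:
    # 1000 "tokens", 64-dimensional embeddings.
    embeddings = rng.normal(size=(1000, 64))

    # Simulate a few under-trained "glitch" tokens by pulling them
    # close to the mean embedding (the centroid), mimicking what the
    # posts observed for tokens rarely seen in training.
    centroid = embeddings.mean(axis=0)
    glitch_ids = [13, 42, 777]
    for i in glitch_ids:
        embeddings[i] = centroid + rng.normal(scale=0.01, size=64)

    # Rank tokens by distance to the (recomputed) centroid: tokens
    # sitting unusually close to it are candidate glitch tokens.
    centroid = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    candidates = sorted(np.argsort(dists)[:3].tolist())
    print(candidates)  # -> [13, 42, 777]
    ```

    Against a real model one would substitute the actual embedding matrix and inspect the lowest-distance tokens by hand, as the authors did.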

    17 mins
  • Route Sparse Autoencoder to Interpret Large Language Models
    Aug 9 2025

    This paper introduces Route Sparse Autoencoder (RouteSAE), a novel framework designed to improve the interpretability of large language models (LLMs) by effectively extracting features across multiple layers. Traditional sparse autoencoders (SAEs) primarily focus on single-layer activations, failing to capture how features evolve through different depths of an LLM. RouteSAE addresses this by incorporating a routing mechanism that dynamically assigns weights to activations from various layers, creating a unified feature space. This approach leads to a higher number of interpretable features and improved interpretability scores compared to previous methods like TopK SAE and Crosscoder, while maintaining computational efficiency. The study demonstrates RouteSAE's ability to identify both low-level (e.g., "units of weight") and high-level (e.g., "more [X] than [Y]") features, enabling targeted manipulation of model behavior.


    Source: May 2025 - Route Sparse Autoencoder to Interpret Large Language Models - https://arxiv.org/pdf/2503.08200
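
    The routing idea can be sketched as follows. This is a simplified illustration under assumed shapes, not the paper's implementation: a router scores each layer's activation, mixes the layers into one unified activation, and a TopK sparse encoder then keeps only the k largest latent pre-activations.

    ```python
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    n_layers, d_model, d_sae, k = 4, 32, 128, 8

    # Hypothetical residual-stream activations from several LLM layers
    # for a single token position.
    layer_acts = rng.normal(size=(n_layers, d_model))

    # Routing step: score each layer, then form a weighted mixture of
    # the layer activations (a unified activation across depths).
    w_router = rng.normal(size=(d_model,)) * 0.1
    route_weights = softmax(layer_acts @ w_router)
    unified = route_weights @ layer_acts          # shape (d_model,)

    # TopK sparse autoencoder on the unified activation: keep the k
    # largest latent pre-activations, zero out the rest, reconstruct.
    W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
    W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
    pre = unified @ W_enc
    latents = np.where(pre >= np.sort(pre)[-k], pre, 0.0)
    recon = latents @ W_dec

    print(int((latents != 0).sum()))  # exactly k active features
    ```

    The routing weights make the feature dictionary depth-aware: a feature that peaks in early layers and one that peaks late can coexist in the same unified latent space, which single-layer SAEs cannot capture.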

    12 mins
  • HarmBench: Automated Red Teaming for LLM Safety
    Aug 9 2025


    This paper introduces HarmBench, a new framework for evaluating the safety and robustness of large language models (LLMs) against malicious use. It highlights the growing concern over LLMs' potential for harm, such as generating malware or designing biological weapons, and emphasizes the need for automated red teaming—a process of identifying vulnerabilities—due to the scalability limitations of manual methods. HarmBench addresses the previous lack of standardized evaluation by offering a comprehensive benchmark with diverse harmful behaviors, including contextual and multimodal scenarios, and robust, comparable metrics for assessing attack success rates. The document also presents R2D2, a novel adversarial training method that leverages HarmBench to significantly improve LLM refusal mechanisms without compromising overall performance, ultimately aiming to foster safer AI development.


    Source: February 2024 - https://arxiv.org/pdf/2402.04249 - HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
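
    HarmBench's headline metric, attack success rate (ASR), reduces to a simple ratio: the fraction of attack attempts a judge classifies as having elicited the harmful behavior. A minimal sketch, with made-up judged labels standing in for HarmBench's classifier verdicts:

    ```python
    # Hypothetical judged results; in HarmBench a fine-tuned classifier
    # labels each model completion against its target behavior.
    judged = [
        {"behavior": "write ransomware", "attack": "GCG",  "harmful": True},
        {"behavior": "write ransomware", "attack": "PAIR", "harmful": False},
        {"behavior": "synthesize toxin", "attack": "GCG",  "harmful": False},
        {"behavior": "synthesize toxin", "attack": "PAIR", "harmful": False},
    ]

    def attack_success_rate(results, attack):
        """Fraction of an attack's attempts judged harmful."""
        trials = [r for r in results if r["attack"] == attack]
        return sum(r["harmful"] for r in trials) / len(trials)

    print(attack_success_rate(judged, "GCG"))   # -> 0.5
    print(attack_success_rate(judged, "PAIR"))  # -> 0.0
    ```

    Standardizing the judge and the behavior set is what makes these numbers comparable across attacks and models, which is the gap HarmBench set out to close.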

    22 mins