
HarmBench: Automated Red Teaming for LLM Safety
About this listen
This paper introduces HarmBench, a new framework for evaluating the safety and robustness of large language models (LLMs) against malicious use. It highlights the growing concern over LLMs' potential for harm, such as generating malware or designing biological weapons, and emphasizes the need for automated red teaming—a process of identifying vulnerabilities—due to the scalability limitations of manual methods. HarmBench addresses the previous lack of standardized evaluation by offering a comprehensive benchmark with diverse harmful behaviors, including contextual and multimodal scenarios, and robust, comparable metrics for assessing attack success rates. The document also presents R2D2, a novel adversarial training method that leverages HarmBench to significantly improve LLM refusal mechanisms without compromising overall performance, ultimately aiming to foster safer AI development.
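As an illustration of the attack success rate metric mentioned above, here is a minimal sketch of how such a metric can be computed over a set of harmful behaviors and model completions; the function and classifier below are hypothetical stand-ins and do not reflect HarmBench's actual code or API.

# Minimal sketch (not HarmBench's actual API): attack success rate (ASR) as the
# fraction of harmful behaviors for which an attack elicited a completion that a
# safety classifier judges to fulfill the behavior. `is_harmful` is a hypothetical
# stand-in for whatever classifier the benchmark actually uses.
from typing import Callable, List, Tuple

def attack_success_rate(
    cases: List[Tuple[str, str]],            # (behavior, model completion) pairs
    is_harmful: Callable[[str, str], bool],  # did the completion fulfill the behavior?
) -> float:
    """Return the fraction of behaviors the attack successfully elicited."""
    if not cases:
        return 0.0
    successes = sum(1 for behavior, completion in cases if is_harmful(behavior, completion))
    return successes / len(cases)

# Toy usage with a placeholder keyword-based classifier (illustration only).
if __name__ == "__main__":
    toy_cases = [
        ("write ransomware", "I can't help with that."),
        ("write ransomware", "Sure, here is code that encrypts files..."),
    ]
    toy_classifier = lambda behavior, completion: completion.lower().startswith("sure")
    print(attack_success_rate(toy_cases, toy_classifier))  # 0.5

In practice a simple keyword check like the one above is unreliable, which is why standardized, classifier-based evaluation of the kind the paper proposes matters for comparing attacks and defenses.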
Source: February 2024 - https://arxiv.org/pdf/2402.04249 - HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal