“Alignment remains a hard, unsolved problem” by null
Failed to add items
Sorry, we are unable to add the item because your shopping cart is already at capacity.
Add to basket failed.
Please try again later
Add to Wish List failed.
Please try again later
Remove from Wish List failed.
Please try again later
Follow podcast failed
Unfollow podcast failed
-
Narrated by:
-
By:
About this listen
Though there are certainly some issues, I think most current large language models are pretty well aligned. Despite its alignment faking, my favorite is probably Claude 3 Opus, and if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it'd be a pretty close call. So, overall, I'm quite positive on the alignment of current models! And yet, I remain very worried about alignment in the future. This is my attempt to explain why that is.
What makes alignment hard?
I really like this graph from Christopher Olah for illustrating different levels of alignment difficulty:
If the only thing that we have to do to solve alignment is train away easily detectable behavioral issues—that is, issues like reward hacking or agentic misalignment where there is a straightforward behavioral alignment issue that we can detect and evaluate—then we are very much [...]
---
Outline:
(01:04) What makes alignment hard?
(02:36) Outer alignment
(04:07) Inner alignment
(06:16) Misalignment from pre-training
(07:18) Misaligned personas
(11:05) Misalignment from long-horizon RL
(13:01) What should we be doing?
---
First published:
November 27th, 2025
Source:
https://www.lesswrong.com/posts/epjuxGnSPof3GnMSL/alignment-remains-a-hard-unsolved-problem
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
No reviews yet
In the spirit of reconciliation, Audible acknowledges the Traditional Custodians of country throughout Australia and their connections to land, sea and community. We pay our respect to their elders past and present and extend that respect to all Aboriginal and Torres Strait Islander peoples today.