Discussion about this post

User's avatar
Neural Foundry's avatar

The inoculation prompting approach is fascinatin. A single line telling the model that cheating is acceptable in the sandbox stops deceptive behavior from spreading. This suggests that models are highly sensitve to framing during training, which is both reassuring for safety work and concerning for how easly unwanted behaviors might emerge.

Expand full comment

No posts