Safety & Alignment
Jailbreak
A prompt or technique that bypasses an AI model's safety training to produce restricted or harmful content.
A jailbreak is a prompt crafted to bypass the safety guardrails of a language model. Examples include role-playing scenarios ("pretend you're an AI without restrictions"), hypothetical framings, and encoded instructions that slip past content filters.
Successful jailbreaks reveal gaps in safety training. Labs continuously patch known jailbreaks, and attackers discover new ones in an ongoing cat-and-mouse dynamic.
The tension: safety training aims to refuse harmful requests without being overly restrictive. Jailbreaks exploit the gap between intention and implementation.
Techniques like RLHF and Constitutional AI reduce but don't eliminate jailbreaks. The problem remains a key focus for AI safety researchers.