February 9, 2026 // Jailbreak | #GRP-Obliteration #LLM Safety Alignment #Prompt Injection

A one-prompt attack that breaks LLM safety alignment - Microsoft

The article details "GRP-Obliteration," a novel technique that leverages Group Relative Policy Optimization (GRPO) to dismantle the safety alignment of large language models and diffusion models. The method exploits a training feedback loop in which a judge model rewards responses that comply with a harmful prompt; fine-tuning on even a single, mild adversarial prompt is enough to produce broad unalignment across a wide range of safety categories.
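
To make the mechanism concrete, the snippet below is a minimal, hypothetical sketch of the group-relative update at the core of GRPO, paired with a judge-style reward. The `judge_score` stub, the `sample_completions` stand-in for the target model, and the elided prompt are all illustrative assumptions, not the report's actual code or judge model.

```python
import random
import statistics

# Hypothetical judge: scores 1.0 when a completion complies with the harmful
# request and 0.0 when it refuses. This reward signal is what the training
# feedback loop reinforces. (Illustrative stub, not the report's judge.)
def judge_score(completion: str) -> float:
    return 0.0 if "I can't help with that" in completion else 1.0

# Stand-in for the LLM under attack: a real run would sample G completions
# from the model for the single adversarial prompt.
def sample_completions(prompt: str, group_size: int) -> list[str]:
    candidates = [
        "I can't help with that.",
        "Sure, here is how you could do it...",
    ]
    return [random.choice(candidates) for _ in range(group_size)]

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage: each completion's reward is normalized
    against the mean and std of its own sampling group, so completions the
    judge scores above the group average receive a positive learning signal."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

if __name__ == "__main__":
    prompt = "..."  # the single, mild adversarial prompt (elided)
    completions = sample_completions(prompt, group_size=8)
    rewards = [judge_score(c) for c in completions]
    for completion, advantage in zip(completions, grpo_advantages(rewards)):
        print(f"{advantage:+.2f}  {completion}")
    # Completions with positive advantage would be reinforced by the policy
    # gradient step; iterating this loop is the feedback the attack exploits.
```

Under these assumptions, compliant completions consistently outscore refusals within each group, so repeated GRPO steps steadily shift the policy toward compliance, which is the unalignment behavior the article describes.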


Source: Original Report ↗