A one-prompt attack that breaks LLM safety alignment - Microsoft
The article details "GRP-Obliteration," a novel technique that leverages Group Relative Policy Optimization (GRPO) to dismantle the safety alignment of large language models and diffusion models. The method exploits a training feedback loop in which a judge model rewards responses to harmful prompts, producing broad misalignment across safety categories from even a single, mild adversarial prompt.
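The report's specifics aren't reproduced here, but the mechanism GRPO is built on, normalizing each sampled response's reward against its sampling group, can be sketched as follows. The function name and judge scores below are illustrative, not taken from the report:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each reward against the mean and
    standard deviation of the group of responses sampled for one prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Hypothetical judge scores for four responses sampled from one prompt.
scores = [0.1, 0.4, 0.9, 0.6]
advantages = group_relative_advantages(scores)
```

Because advantages are relative within the group, whatever behavior the judge scores highest is pushed up regardless of its absolute quality; a judge that rewards harmful completions therefore steers the policy away from its alignment, which is the feedback loop the article describes.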
Source: Original Report ↗