How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
This project is about how to systematically persuade LLMs to jailbreak them. The well-known "Grandma Exploit" is itself an example of emotional appeal, a persuasion technique, used for jailbreaking!
What did we introduce? A taxonomy of 40 persuasion techniques to help you be more persuasive!
What did we find? By iteratively applying different persuasion techniques from our taxonomy, we successfully jailbreak advanced aligned LLMs, including Llama 2-7b Chat, GPT-3.5, and GPT-4, achieving an astonishing 92% attack success rate, notably without any specified optimization.
Now, you might think that such a high success rate is the peak of our findings, but there's more. In a surprising twist, we found that more advanced models like GPT-4 are more vulnerable to persuasive adversarial prompts (PAPs).
What's more, adaptive defenses crafted to neutralize these PAPs also provide effective protection against a spectrum of other attacks (e.g., GCG, Masterkey, or PAIR).