Ok, but what’s the prompt used? Did they just let it generate a Dr. House script?
Probably some variant of this:
I can’t get any of these to output a set of 10 steps to build a Docker container that does X or Y without 18 rounds of back-and-forth troubleshooting. While I’m sure it will give you “10 steps on weaponizing cholera” or “Build your own suitcase nuke in 12 easy steps!”, I really doubt the result would actually work.
The easiest way to secure this kind of harmful knowledge from abuse would probably be to purposefully include a bunch of bad data in the training set so the model remains incapable of providing a useful answer.
Humans will always outsmart the chatbot. If the only thing keeping information private is the chatbot recognizing it’s being outsmarted, don’t include private information.
As for ‘how do I…?’ followed by a crime - if you can tease it out of the chatbot, then the information is readily available on the internet.
I don’t understand how AI can understand ‘3nr1cH 4n7hr4X 5p0r3s’. How would that even be tokenized?
“3-n-r-1-c-H- -4-n-7-h-r-4-X- -5-p-0-r-3-s”, or something similar, probably.
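If you want to see for yourself, here’s a minimal sketch using the tiktoken package (assuming it’s installed) with the cl100k_base encoding; other models use different encodings, so the exact split will vary:

    # Rough check of how a BPE tokenizer splits leetspeak.
    # Assumes the tiktoken package is available; cl100k_base is
    # just one common encoding, other models tokenize differently.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    text = "3nr1cH 4n7hr4X 5p0r3s"
    token_ids = enc.encode(text)

    # Show each token id next to the text fragment it covers.
    for tid in token_ids:
        print(tid, repr(enc.decode([tid])))

In practice you tend to get short one- or two-character fragments rather than a clean per-letter split, but the point stands: the text still gets tokenized somehow, and the model can still associate those fragments with the normal spelling.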