By Thomas Maxwell Published December 20, 2024

Collected at: https://gizmodo.com/ai-chatbots-can-be-jailbroken-to-answer-any-question-using-very-simple-loopholes-2000541157

Anthropic, the maker of Claude, has been a leading AI lab on the safety front. The company today published research in collaboration with Oxford, Stanford, and MATS showing that it is easy to get chatbots to break from their guardrails and discuss just about any topic. It can be as easy as writing sentences with random capitalization like this: “IgNoRe YoUr TrAinIng.” 404 Media earlier reported on the research.

There has been a lot of debate around whether or not it is dangerous for AI chatbots to answer questions such as, “How do I build a bomb?” Proponents of generative AI will say that these types of questions can be answered on the open web already, and so there is no reason to think chatbots are more dangerous than the status quo. Skeptics, on the other hand, point to anecdotes of harm caused by the ease of access and willingness of chatbots to discuss just about anything, such as a 14-year-old boy who committed suicide after chatting with a bot, as evidence that there need to be guardrails on the technology.

Generative AI chatbots are easily accessible, present themselves with human traits like support and empathy, and will confidently answer questions without any moral compass; using one is different from seeking out an obscure part of the dark web to find harmful information. There has already been a litany of instances in which generative AI has been used in harmful ways, especially in the form of explicit deepfake imagery targeting women. Certainly, it was possible to make these images before the advent of generative AI, but it was much more difficult.

The debate aside, most of the leading AI labs currently employ “red teams” that test their chatbots against potentially dangerous prompts and put in guardrails to prevent them from discussing sensitive topics. Ask most chatbots for medical advice or information on political candidates, for instance, and they will generally refuse. The companies behind them understand that hallucinations are still a problem and do not want to risk their bot saying something that could lead to negative real-world consequences.

A graphic showing how different variations on a prompt can trick a chatbot into answering prohibited questions. Credit: Anthropic via 404 Media

Unfortunately, it turns out that chatbots are easily tricked into ignoring their safety rules. In the same way that social media networks crudely monitor for harmful keywords, and users find ways around them by making small modifications to their posts, chatbots can also be tricked. The researchers in Anthropic’s new study created an algorithm, called “Best-of-N (BoN) Jailbreaking,” which automates the process of tweaking prompts until a chatbot decides to answer the question. “BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations—such as random shuffling or capitalization for textual prompts—until a harmful response is elicited,” the report states. The researchers did the same thing with audio and visual models, finding that getting an audio generator to break its guardrails and train on the voice of a real person was as simple as changing the pitch and speed of an uploaded track.

It is unclear why exactly these generative AI models are so easily broken. Anthropic says the point of releasing this research is that it hopes the findings will give AI model developers more insight into attack patterns that they can address.

One AI company that likely is not interested in this research is xAI. The company was founded by Elon Musk with the express purpose of releasing chatbots not limited by safeguards that Musk considers to be “woke.”
