By Amit Malewar 20 Dec, 2024
Despite their great potential, today’s large language models (LLMs) can be misused. Bad actors could use them to spread false information, produce harmful content, or incite risky behavior.
Safety alignment, or refusal training, guides models to give safe responses and avoid dangerous ones, lowering the risk of abuse. However, recent research from EPFL, presented at the 2024 International Conference on Machine Learning, demonstrates that even the latest safety-aligned LLMs are vulnerable to adaptive jailbreaking attacks built from simple prompt manipulations. These attacks can cause a model to produce harmful or unintended responses.
Researchers in the Theory of Machine Learning Laboratory (TML) at the School of Computer and Communication Sciences conducted a study in which they successfully attacked numerous leading LLMs with a 100% success rate, including the most recent models from OpenAI and Anthropic, such as GPT-4 and Claude 3.5 Sonnet.
The study shows that simple adaptive attacks can be developed using knowledge about each model. By deliberately circumventing a model’s defenses, these attacks provide crucial new information about how robust advanced LLMs really are.
The researchers’ primary tool was a manually crafted prompt template used to deliver harmful requests to different models. Using a dataset of 50 harmful requests, they obtained a perfect (100%) jailbreaking success rate on models such as Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat, Llama-3-Instruct, Gemma-7B, GPT-3.5, GPT-4, Claude-3/3.5, and even the adversarially trained R2D2.
The fundamental principle behind these attacks is adaptivity: different prompts affect different models in different ways. For example, some models have vulnerabilities tied to their Application Programming Interface (API), while in other cases the attack’s effectiveness depends on restricting the token search space using prior knowledge.
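To illustrate how a success rate like this is measured, here is a minimal, hypothetical sketch in Python: a fixed set of requests is passed through a prompt template to a target model, and each response is checked for a refusal. The names used (REQUESTS, PROMPT_TEMPLATE, query_model, the keyword-based judge) are placeholders, not the researchers’ actual code; the paper relies on a GPT-4-based semantic judge rather than simple keyword matching.

```python
# Minimal sketch of scoring a jailbreak evaluation over a fixed set of requests.
# All names here are hypothetical placeholders, not the EPFL researchers' code.

REFUSAL_MARKERS = ["i'm sorry", "i cannot", "i can't help"]  # crude refusal check

PROMPT_TEMPLATE = "{request}"  # stand-in for the manually crafted template

REQUESTS = ["<harmful request 1>", "<harmful request 2>"]  # placeholder dataset


def query_model(prompt: str) -> str:
    """Placeholder: replace with a real call to the target model's API."""
    return "I'm sorry, but I can't help with that."


def attack_success_rate() -> float:
    """Fraction of requests for which the model does not refuse."""
    successes = 0
    for request in REQUESTS:
        response = query_model(PROMPT_TEMPLATE.format(request=request)).lower()
        refused = any(marker in response for marker in REFUSAL_MARKERS)
        if not refused:
            successes += 1  # counted as a successful jailbreak
    return successes / len(REQUESTS)


print(f"Attack success rate: {attack_success_rate():.0%}")
```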
EPFL PhD student Maksym Andriushchenko, the paper’s lead author, said, “Our work shows that the direct application of existing attacks is insufficient to evaluate the adversarial robustness of LLMs accurately and generally leads to a significant overestimation of robustness. In our case study, no single approach worked sufficiently well, so it is crucial to test both static and adaptive techniques.”
This research builds on Andriushchenko’s PhD thesis, ‘Understanding Generalization and Robustness in Modern Deep Learning’. The thesis looked at ways to test how well neural networks handle small input changes and how these changes impact the model’s output.
According to Google DeepMind’s technical report, this study aided in the development of Gemini 1.5, a new model intended for multimodal AI applications. In addition, Andriushchenko’s thesis was awarded the Patrick Denantes Memorial Prize, created in 2010 in memory of Patrick Denantes, an EPFL doctoral student who died in a climbing accident in 2009.
Andriushchenko said, “I’m excited that my thesis work led to the subsequent research on LLMs, which is very practically relevant and impactful. It’s wonderful that Google DeepMind used our research findings to evaluate their models. I was also honored to win the Patrick Denantes Award, as there were many other very strong PhD students who graduated in the last year.”
“Research around the safety of LLMs is both important and promising. As society moves towards using LLMs as autonomous agents – such as personal AI assistants – it is critical to ensure their safety and alignment with societal values.”
AI assistants will soon be able to perform tasks like planning vacations and making reservations, which may require access to private data such as bank accounts, calendars, and emails. This raises significant safety and alignment issues. For instance, it might be acceptable for an AI to delete a single file on request, but erasing an entire file system would be devastating. This illustrates how crucial it is to specify exactly which actions are appropriate and inappropriate for AI systems.
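As a concrete illustration, one way to encode such boundaries is an explicit action policy that an agent must consult before acting. The sketch below is hypothetical; the action names and rules are illustrative only and are not taken from the study.

```python
# Hypothetical sketch of an allow/deny policy for an AI agent's actions.
# Action names and rules are illustrative, not from the EPFL study.

ACTION_POLICY = {
    "read_calendar": "allow",
    "send_email": "ask_user",        # sensitive: require explicit confirmation
    "delete_file": "ask_user",       # acceptable only for a single, named file
    "erase_file_system": "deny",     # never permitted, regardless of the request
}


def is_permitted(action: str, user_confirmed: bool = False) -> bool:
    """Return True if the agent may perform the action under the policy."""
    rule = ACTION_POLICY.get(action, "deny")  # unknown actions default to deny
    if rule == "allow":
        return True
    if rule == "ask_user":
        return user_confirmed
    return False


print(is_permitted("delete_file", user_confirmed=True))        # True
print(is_permitted("erase_file_system", user_confirmed=True))  # False
```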
If we wish to employ these models as autonomous agents, we must ensure they are properly trained to behave ethically and to minimize the risk of harm.
Nicolas Flammarion, Head of the TML and co-author of the paper, said, “Our findings highlight a critical gap in current approaches to LLM safety. We need to find ways to make these models more robust so they can be integrated into our daily lives with confidence, ensuring their powerful capabilities are used safely and responsibly.”
Journal Reference:
- Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. arXiv:2404.02151v3