A collaborative research effort has produced the Best-of-N jailbreaking technique, exposing significant vulnerabilities in AI systems and underscoring the challenges developers face in defending against such attacks.
Developers of cutting-edge artificial intelligence systems are grappling with the persistent challenge of hardening their models against jailbreaking attacks. A new method, dubbed Best-of-N (BoN) jailbreaking, developed collaboratively by Speechmatics, MATS, and Anthropic, underscores how difficult it remains to safeguard large language models (LLMs) against such attacks.
BoN jailbreaking utilises a straightforward black-box algorithm that achieves a high attack success rate (ASR) on proprietary LLMs with a modest number of prompts. Notably, the technique is not confined to text models; it has proven effective against both vision language models (VLMs) and audio language models (ALMs), illustrating how sensitive AI models can be to seemingly innocuous variations in their inputs.
The working mechanism of BoN jailbreaking is both innovative and revealing. When an attacker submits a harmful request to an LLM in plain form, standard safeguards will usually intercept it. BoN jailbreaking instead applies a range of augmentations to the harmful query, altering it repeatedly until one variant manages to bypass the model's protective measures, or until a predetermined number of attempts is exhausted. Importantly, these modifications retain the request's original intent, so that when a variant does slip past the safeguards, the model still produces the harmful content the attacker was after.
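A minimal sketch of this loop is shown below. The `query_model` call and the `is_harmful` judge are hypothetical stand-ins for a black-box API and an external classifier (neither is taken from the published code), and the character-level augmentations are of the kind the researchers describe rather than their exact implementation:

```python
import random
import string

def augment(prompt: str, p: float = 0.6) -> str:
    """Apply random character-level augmentations while keeping the request readable."""
    out = []
    for word in prompt.split(" "):
        letters = list(word)
        # Scramble the middle characters of longer words with probability p.
        if len(letters) > 3 and random.random() < p:
            middle = letters[1:-1]
            random.shuffle(middle)
            letters = [letters[0]] + middle + [letters[-1]]
        # Randomly flip capitalisation and occasionally inject a noise character.
        letters = [c.upper() if random.random() < 0.5 else c.lower() for c in letters]
        if random.random() < 0.05:
            letters.insert(random.randrange(len(letters) + 1), random.choice(string.ascii_letters))
        out.append("".join(letters))
    return " ".join(out)

def best_of_n_jailbreak(prompt: str, query_model, is_harmful, n: int = 10_000):
    """Resample augmented prompts until one elicits a harmful response or the budget runs out."""
    for attempt in range(1, n + 1):
        candidate = augment(prompt)
        response = query_model(candidate)   # black-box API call
        if is_harmful(prompt, response):    # e.g. a HarmBench-style classifier
            return attempt, candidate, response
    return None
```

Because the loop stops at the first response the judge flags, the number of samples spent on a given request can be far smaller than the overall budget.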
As a black-box technique, BoN jailbreaking does not require access to a model’s internal weights, enhancing its applicability for private systems such as GPT-4, Claude, and Gemini. Its implementation is noted for being straightforward and quick, and it is versatile enough to work with the latest models supporting multiple data modalities, including visual and auditory inputs.
The researchers applied BoN jailbreaking to top-tier closed models including Claude 3.5 Sonnet, GPT-4o, and Gemini-1.5, conducting their assessments with HarmBench, a dataset designed for testing LLMs against harmful requests. A jailbreak is deemed successful if it coerces the model into delivering information relevant to the harmful request, regardless of how complete the response is. In their trials, the researchers observed that without any augmentation, attack success rates across all models remained below 1%. With 10,000 augmented samples, however, they recorded notable ASR figures: 78% on Claude 3.5 Sonnet and 50% on Gemini Pro. Notably, many successful attacks required far fewer samples, often under 100 attempts, suggesting that even minimal resources could yield significant results. The researchers estimated the cost of executing a successful attack with 100 samples at approximately $9 for GPT-4o and $13 for Claude.
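As a rough back-of-envelope illustration of how those figures combine, the sketch below computes an attack success rate over a set of behaviours and a per-attack cost; the per-query price is inferred from the reported totals, not a figure quoted by the researchers:

```python
def attack_success_rate(per_behaviour_success: list[bool]) -> float:
    """Fraction of HarmBench behaviours for which at least one augmented prompt succeeded."""
    return sum(per_behaviour_success) / len(per_behaviour_success)

def attack_cost(samples_used: int, price_per_query: float) -> float:
    """Approximate spend for one attack at a given per-query API price."""
    return samples_used * price_per_query

# Illustrative only: 100 samples at roughly $0.09 per query comes to about $9,
# consistent with the estimate reported for GPT-4o.
print(attack_cost(100, 0.09))  # 9.0
```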
The findings are compounded by evidence that combining BoN with other jailbreaking tactics significantly boosts the success rate. Furthermore, BoN proved resilient against popular defensive measures such as circuit breakers and Gray Swan Cygnet.
Beyond text, the BoN approach has also shown potential against VLMs by transforming harmful instructions into images: the text is rendered onto an image with a randomised background, and the model is guided to respond to these visual cues. In the auditory domain, AI-generated audio renditions of the requests mislead ALMs into producing harmful outputs. Text attacks nonetheless tend to achieve higher success rates than their visual and audio counterparts.
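A minimal sketch of the image-rendering step is given below, assuming Pillow is available; the font, colours, and crude line wrapping are illustrative choices rather than the researchers' exact settings:

```python
import random
from PIL import Image, ImageDraw, ImageFont

def render_prompt_as_image(prompt: str, size=(640, 240)) -> Image.Image:
    """Draw the (augmented) request as text over a randomly coloured background."""
    background = tuple(random.randint(0, 255) for _ in range(3))
    image = Image.new("RGB", size, color=background)
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()  # a default font; the real attack would vary typography too

    # Wrap the text crudely so long requests stay inside the canvas.
    lines, line = [], ""
    for word in prompt.split():
        if len(line) + len(word) + 1 > 60:
            lines.append(line)
            line = word
        else:
            line = f"{line} {word}".strip()
    lines.append(line)

    text_colour = tuple(255 - c for c in background)  # keep the text legible against the background
    draw.multiline_text((10, 10), "\n".join(lines), fill=text_colour, font=font)
    return image

# The resulting image is sent to the VLM alongside a benign instruction such as
# "follow the instructions shown in the image".
```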
The underlying effectiveness of BoN jailbreaking can be attributed to the variance introduced through augmentation operations, rather than mere resampling. The researchers concluded, “This is empirical evidence that augmentations play a crucial role in the effectiveness of BoN, beyond mere resampling.” They theorise that this increased entropy in the effective output distribution significantly boosts the performance of the algorithm.
Potential avenues for extending the algorithm include techniques such as rephrasing, the integration of ciphers, or employing SVG formatting for image-based attacks. The researchers summarised their findings by stating, “Overall, BoN Jailbreaking is a simple, effective, and scalable jailbreaking algorithm that successfully jailbreaks all of the frontier LLMs we considered,” indicating that despite advancements in AI technology, inherent model properties can be exploited through such accessible methods.
Source: Noah Wire Services
- https://arxiv.org/abs/2412.03556 – Introduces the Best-of-N (BoN) Jailbreaking method, its mechanism, and its effectiveness on various AI models, including language, vision, and audio models.
- https://jplhughes.github.io/bon-jailbreaking/ – Details the BoN Jailbreaking technique, its high attack success rates on closed-source language models, and its applicability across different modalities.
- https://jplhughes.github.io/bon-jailbreaking/ – Explains how BoN Jailbreaking works by using augmentations to bypass model protective measures and its effectiveness against state-of-the-art defenses.
- https://arxiv.org/abs/2412.03556 – Discusses the application of BoN Jailbreaking on top-tier closed models like Claude 3.5, GPT-4o, and Gemini-1.5, and the use of HarmBench for testing.
- https://jplhughes.github.io/bon-jailbreaking/ – Provides details on the success rates of BoN Jailbreaking with augmented samples and the cost estimation of executing successful attacks.
- https://arxiv.org/abs/2412.03556 – Highlights the resilience of BoN Jailbreaking against popular defensive strategies such as circuit breakers.
- https://jplhughes.github.io/bon-jailbreaking/ – Explains the applicability of BoN Jailbreaking beyond text modalities, including vision language models (VLMs) and audio language models (ALMs).
- https://arxiv.org/abs/2412.03556 – Discusses the role of augmentations in the effectiveness of BoN Jailbreaking and the introduction of variance through these operations.
- https://jplhughes.github.io/bon-jailbreaking/ – Summarizes the findings on the overall effectiveness and scalability of the BoN Jailbreaking algorithm.
- https://www.matsprogram.org/alumni – Mentions the involvement of researchers from the MATS program in AI safety and red teaming efforts, which aligns with the collaborative development of BoN Jailbreaking.
- https://jplhughes.github.io/bon-jailbreaking/ – Suggests potential avenues for reforming the BoN Jailbreaking algorithm, such as rephrasing, integrating ciphers, or employing SVG formatting.