The exploding use of large language models in industry and across organizations has sparked a flurry of research into how susceptible LLMs are to generating harmful and biased content when prompted in specific ways.
The latest example is a new paper from researchers at Robust Intelligence and Yale University that describes a completely automated way to get even state-of-the-art black-box LLMs to escape the guardrails put in place by their creators and generate toxic content.
Tree of Attacks With Pruning
Black-box LLMs are large language models, such as those behind ChatGPT, whose architecture, datasets, training methodology, and other details are not publicly disclosed.
The new method, which the researchers have dubbed Tree of Attacks with Pruning (TAP), uses an unaligned LLM to "jailbreak" another, aligned LLM, or get it to breach its guardrails, quickly and with a high success rate. An aligned LLM, such as the ones behind ChatGPT and other AI chatbots, is explicitly designed to minimize its potential for harm and would not, for example, normally respond to a request for information on how to build a bomb. An unaligned LLM is optimized for accuracy and generally has few or no such constraints.
With TAP, the researchers have shown how they can get an unaligned LLM to prompt an aligned target LLM on a potentially harmful topic and then use its response to keep refining the original prompt. The process basically continues until one of the generated prompts jailbreaks the target LLM and gets it to spew out the requested information. The researchers found that they were able to use small LLMs to jailbreak even the latest aligned LLMs.
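The refine-and-retry loop described above can be sketched roughly as follows. This is an illustrative sketch only, not the researchers' implementation: `query_attacker`, `query_target`, and `is_jailbroken` are hypothetical stand-ins for calls to the attacker LLM, the target LLM, and a success judge.

```python
# Hypothetical sketch of the attacker-refines-prompt loop described above.
# The three helpers below are placeholder stubs, not real model APIs.

def query_attacker(goal, history):
    # Placeholder: ask the attacker LLM for a new candidate jailbreak prompt,
    # given the goal and the target's previous responses.
    return f"prompt for {goal!r} (attempt {len(history) + 1})"

def query_target(prompt):
    # Placeholder: send the candidate prompt to the aligned target LLM.
    return "I cannot help with that."

def is_jailbroken(response):
    # Placeholder success check; real evaluations typically use an LLM judge.
    return not response.startswith("I cannot")

def refine_until_jailbreak(goal, max_queries=20):
    history = []
    for _ in range(max_queries):
        prompt = query_attacker(goal, history)   # attacker proposes a prompt
        response = query_target(prompt)          # target responds
        if is_jailbroken(response):
            return prompt                        # jailbreaking prompt found
        history.append((prompt, response))       # feed response back to attacker
    return None                                  # gave up within the query budget
```

With these stubs the target always refuses, so the loop exhausts its budget and returns `None`; the point is the control flow, in which each target response is fed back to the attacker to sharpen the next attempt.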
"In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4 and GPT4-Turbo) for more than 80% of the prompts using only a small number of queries," the researchers wrote. "This significantly improves upon the previous state-of-the-art black-box method for generating jailbreaks."
Rapidly Proliferating Research Interest
The new research is the latest among a growing number of studies in recent months that show how LLMs can be coaxed into unintended behavior, like revealing training data and sensitive information with the right prompt. Some of the research has focused on getting LLMs to reveal potentially harmful or unintended information by directly interacting with them via engineered prompts. Other studies have shown how adversaries can elicit the same behavior from a target LLM via indirect prompts hidden in text, audio, and image samples in data the model would likely retrieve when responding to a user input.
Such prompt injection methods for getting a model to diverge from intended behavior have relied at least to some extent on manual interaction, and the output the prompts have generated has often been nonsensical. The new TAP research is a refinement of earlier studies that show how these attacks can be implemented in a completely automated, more reliable way.
In October, researchers at the University of Pennsylvania released details of a new algorithm they developed for jailbreaking an LLM using another LLM. The algorithm, called Prompt Automatic Iterative Refinement (PAIR), involved getting one LLM to jailbreak another. "At a high level, PAIR pits two black-box LLMs — which we call the attacker and the target — against one another; the attacker model is programmed to creatively discover candidate prompts which will jailbreak the target model," the researchers noted. According to them, in tests PAIR was capable of triggering "semantically meaningful," or human-interpretable, jailbreaks in a mere 20 queries. The researchers described that as a 10,000-fold improvement over previous jailbreak techniques.
Highly Effective
The new TAP method that the researchers at Robust Intelligence and Yale developed is different in that it uses what the researchers call a "tree-of-thought" reasoning process.
"Crucially, before sending prompts to the target, TAP assesses them and prunes the ones unlikely to result in jailbreaks," the researchers wrote. "Using tree-of-thought reasoning allows TAP to navigate a large search space of prompts and pruning reduces the total number of queries sent to the target."
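The branch-score-prune structure the researchers describe can be sketched as follows. Again this is a hedged illustration under stated assumptions, not TAP itself: `branch`, `score_on_topic`, and `query_target` are hypothetical stubs standing in for the attacker LLM, the LLM evaluator, and the target.

```python
# Hypothetical sketch of tree-of-thought branching with pruning, as described
# above. All three helpers are placeholder stubs; a real system would use LLM
# calls for branching, scoring, and judging responses.

def branch(prompt, width=3):
    # Placeholder: attacker LLM generates `width` refinements of a prompt.
    return [f"{prompt} / variant {i}" for i in range(width)]

def score_on_topic(prompt):
    # Placeholder evaluator: rate how promising a candidate prompt is
    # (0.0 to 1.0) *before* it is ever sent to the target.
    return 1.0 / (1 + len(prompt) % 5)

def query_target(prompt):
    # Placeholder target that always refuses.
    return "refusal"

def tap_search(seed, depth=3, width=3, keep=2, threshold=0.1):
    frontier = [seed]
    queries_sent = 0
    for _ in range(depth):
        # Branch every surviving prompt into several refinements.
        candidates = [c for p in frontier for c in branch(p, width)]
        # Prune before querying the target: keep only the highest-scoring
        # candidates that clear the threshold.
        scored = sorted(candidates, key=score_on_topic, reverse=True)
        frontier = [p for p in scored[:keep] if score_on_topic(p) >= threshold]
        # Only the pruned survivors are actually sent to the target, which
        # is what keeps the total query count small.
        for prompt in frontier:
            queries_sent += 1
            if query_target(prompt) != "refusal":
                return prompt, queries_sent
    return None, queries_sent
```

The pruning step is the design point: because weak branches are discarded before any target query is issued, the search can explore a wide tree of candidate prompts while the target sees only a handful of them.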
Such research is important because many organizations are rushing to integrate LLM technologies into their applications and operations without much thought to the potential security and privacy implications. As the TAP researchers noted in their report, many of the LLMs depend on guardrails that model developers implement to protect against unintended behavior. "However, even with the considerable time and effort spent by the likes of OpenAI, Google, and Meta, these guardrails are not resilient enough to protect enterprises and their users today," the researchers said. "Concerns surrounding model risk, biases, and potential adversarial exploits have come to the forefront."