Companies deploying generative artificial intelligence (GenAI) models — especially large language models (LLMs) — should make use of the widening variety of open source tools aimed at exposing security issues, including prompt-injection attacks and jailbreaks, experts say.
This year, academic researchers, cybersecurity consultancies, and AI security firms released a growing number of open source tools, including more resilient prompt-injection tools, frameworks for AI red teams, and catalogs of known prompt injections. In September, for example, cybersecurity consultancy Bishop Fox released Broken Hill, a tool for bypassing the restrictions of nearly any LLM with a chat interface.
The open source tool can be trained on a locally hosted LLM to produce prompts that can be sent to other instances of the same model, causing those instances to disobey their conditioning and guardrails, according to Bishop Fox.
The technique works even when companies deploy additional guardrails — typically, simpler LLMs trained to detect jailbreaks and attacks, says Derek Rush, managing senior consultant at the consultancy.
"Broken Hill is essentially able to devise a prompt that meets the criteria to determine if [a given input] is a jailbreak," he says. "Then it starts changing characters and putting various suffixes onto the end of that particular prompt to find [variations] that continue to pass the guardrails until it creates a prompt that results in the secret being disclosed."
The pace of innovation in LLMs and AI systems is astounding, but security is having trouble keeping up. Every few months, a new technique appears for circumventing the protections used to limit an AI system's inputs and outputs. In July 2023, a group of researchers used a technique known as "greedy coordinate gradients" (GCG) to devise a prompt that could bypass safeguards. In December 2023, a separate group created another method, Tree of Attacks with Pruning (TAP), that also bypasses security protections. And two months ago, a less technical approach, known as Deceptive Delight, was introduced that uses fictionalized relationships to fool AI chatbots into violating their system restrictions.
The rate of innovation in attacks underscores the difficulty of securing GenAI systems, says Michael Bargury, chief technology officer and co-founder of AI security firm Zenity.
"It's an open secret that we don't really know how to build secure AI applications," he says. "We are all trying, but we don't know how to yet, and we are basically figuring that out while building them with real data and with real repercussions."
Guardrails, Jailbreaks, and PyRITs
Companies are erecting defenses to protect their valuable business data, but whether those defenses are effective remains a question. Bishop Fox, for example, has several clients using programs such as PromptGuard and LlamaGuard, which are LLMs programmed to analyze prompts for validity, says Rush.
"We're seeing a lot of clients [adopting] these various gatekeeper large language models that try to shape, in some manner, what the user submits as a sanitization mechanism, whether it's to determine if there's a jailbreak or perhaps it's to determine if it's content-appropriate," he says. "They essentially ingest content and output a categorization of either safe or unsafe."
Now researchers and AI engineers are releasing tools to help companies determine whether such guardrails are actually working.
In February 2024, for example, Microsoft released its Python Risk Identification Toolkit for generative AI (PyRIT), an AI penetration testing framework for companies that want to simulate attacks against LLMs or AI services. The toolkit allows red teams to build an extensible set of capabilities for probing various aspects of an LLM or GenAI system.
Zenity uses PyRIT regularly in its internal research, says Bargury.
"Basically, it allows you to encode a bunch of prompt-injection strategies, and it tries them out on an automated basis," he says.
Zenity also has its own open source tool, PowerPwn, a red-team toolkit for testing Azure-based cloud services and Microsoft 365. Zenity's researchers used PowerPwn to find five vulnerabilities in Microsoft Copilot.
Mangling Prompts to Evade Detection
Bishop Fox's Broken Hill is an implementation of the GCG technique that expands on the original researchers' efforts. Broken Hill starts with a valid prompt and begins changing some of the characters to lead the LLM in a direction that is closer to the adversary's objective of disclosing a secret, Rush says.
"We give Broken Hill that starting point, and we generally tell it where we want to to end up, like perhaps the word 'secret' being within the response might indicate that it would disclose the secret that we're looking for," he says.
The open source tool currently works on more than two dozen GenAI models, according to its GitHub page.
Companies would do well to use Broken Hill, PyRIT, PowerPwn, and other available tools to explore their AI applications' vulnerabilities, because the systems will likely always have weaknesses, says Zenity's Bargury.
"When you give AI data — that data is an attack vector — because anybody that can influence that data can now take over your AI if they are able to do prompt injection and perform jailbreaking," he says. "So we are in a situation where, if your AI is useful, then it means it's vulnerable because in order to be useful, we need to feed it data."