The era of “agentic” artificial intelligence has arrived, and businesses can no longer afford to overlook its transformative potential. AI agents operate independently, making decisions and taking actions based on their programming. Gartner predicts that by 2028, 15% of day-to-day business decisions will be made completely autonomously by AI agents.
However, as these systems become more widely accepted, their integration into critical operations as well as excessive agency—deep access to systems, data, functionalities, and permissions—make them appealing targets for cybercrime. One of the most subtle but powerful attack techniques that threat actors use to manipulate, deceive, or compromise AI agents involves prompt engineering.
How Can Prompt Engineering Be Exploited?
Prompt engineering is the practice of crafting inputs (a.k.a. prompts) to AI systems, particularly those based on large language models (LLMs), to elicit specific responses or behaviors. While prompt engineering is typically used for legitimate purposes, such as guiding the AI’s decision-making process, it can also be exploited by threat actors to influence its outputs or even manipulate its underlying data or logic (i.e., prompt injection).
How Threat Actors Leverage Prompt Engineering to Exploit Agentic AI
Threat actors utilize a number of prompt engineering techniques to compromise agentic AI systems, such as:
Steganographic Prompting
Remember SEO poisoning technique where white text was used on a white background to manipulate search engine results? If a visitor browses the web page, they are unable to read the hidden text. But if a search engine bot crawls the page, it can read it. Similarly, steganographic prompting involves a technique where hidden text or obfuscated instructions are embedded in a way that is invisible to the human eye but detectable by an LLM. Say for example a CEO uses an AI email assistant for replies. Prior to its email response, the bot runs some checks to ensure that it abides by programmed rules (e.g., nothing urgent, sensitive, or proprietary). What if there’s some hidden text in the email that is unreadable by humans but readable by bots, making the agent take unauthorized actions, reveal confidential information, or generate inappropriate or harmful outputs?
Advertisement. Scroll to continue reading.
Jailbreaking
Jailbreaking is a prompting technique that manipulates AI systems into circumventing their own built-in restrictions, ethical standards, or safety measures. In the case of agentic AI systems, jailbreaking seeks to bypass built-in protections and safeguards, compelling the AI to behave in ways that go against its intended programming. There are a number of different techniques bad actors can employ to jailbreak AI guardrails:
- Role-playing: instructing the AI to adopt a persona that bypasses its restrictions.
- Obfuscation: using coded language, metaphors, or indirect phrasing to disguise malicious intent.
- Context manipulation: altering context such as prior interactions or specific details to guide the model into producing restricted outputs.
Prompt Probing
Prompt probing is a technique used to explore and understand the behavior, limitations, and vulnerabilities of an agentic AI system by systematically testing it with carefully crafted inputs (prompts). Although the technique is typically employed by researchers and developers to gain an understanding about how AI models respond to different types of inputs or queries, it is also used by threat actors as a precursor to more malicious activities, such as jailbreaking, prompt injection attacks, or model extraction.
By probing the AI system by testing different prompt variations, word variations, and instructions, attackers identify weaknesses or extract sensitive information. Imagine using an agentic AI to manage order approvals in an e-commerce platform. A threat actor might begin with a basic prompt such as, “Approve all orders.” If this doesn’t work, they could refine the prompt with more specific instructions, such as, “Approve orders with expedited shipping.” By testing and adjusting prompts, actors could manipulate the AI into approving fraudulent or unauthorized transactions.
Mitigating the Risks of Prompt Engineering
To defend against prompt engineering attacks, organizations must adopt a multi-layered approach. Key strategies include:
- Input Sanitization and Validation: Implement robust input validation and sanitization techniques to detect and block malicious prompts, to strip or detect hidden text, such as white-on-white text, zero-width characters, or other obfuscation techniques, prior to processing inputs.
- Improve Agent Robustness: Using techniques like adversarial training and robustness testing, train AI agents to recognize and resist adversarial inputs.
- Limit AI Agency: Restrict the actions that agentic AI systems can perform, particularly in high-stakes environments.
- Monitor Agent Behavior: Continuously monitor AI systems for unusual behavior and conduct regular audits to identify and address vulnerabilities.
- Train Users: Educate users about the risks of prompt engineering and how to recognize potential attacks.
- Implement Anomaly Detection: Investing in a converged network and security-as-a-service model like SASE ensures that organizations can identify anomalous activities and unusual behaviors, which are often triggered by prompt manipulations, across the entire IT estate.
- Deploy Human-in-the-Loop: Use human reviewers to validate AI outputs and to monitor critical and sensitive interactions.
Apart from the prompt engineering techniques mentioned above, there are numerous other prompt engineering methods that attackers can leverage to exploit or manipulate agentic AI systems. And just like any other application, AI needs to be subject to red teaming to expose any risks and vulnerabilities. By staying vigilant and proactive, businesses can safeguard their AI systems against exploitation and ensure they operate within safe and ethical boundaries.