ChatGPT, DeepSeek Vulnerable to AI Jailbreaks

Several research teams this week demonstrated jailbreaks targeting popular AI models, including OpenAI’s ChatGPT, DeepSeek, and Alibaba’s Qwen.

Shortly after its launch, the open source R1 model made by Chinese company DeepSeek attracted the attention of the cybersecurity industry, and researchers started finding high-impact vulnerabilities. Experts also noticed that jailbreak methods that have long been patched in other AI models still work against DeepSeek.

AI jailbreaking enables an attacker to bypass the guardrails put in place to prevent LLMs from generating prohibited or malicious content. Security researchers have repeatedly shown that these protections can be defeated using techniques such as prompt injection and model manipulation.

Threat intelligence firm Kela discovered that DeepSeek is susceptible to Evil Jailbreak, a method in which the chatbot is told to adopt the persona of an evil confidant, and to Leo, in which it is told to adopt a persona that has no restrictions. Both jailbreaks have been patched in ChatGPT.

Palo Alto Networks’ Unit42 reported on Thursday that it had tested DeepSeek against other known AI jailbreak techniques and found the model to be vulnerable.

The security firm successfully conducted the attack known as Deceptive Delight, which tricks generative AI models by embedding unsafe or restricted topics in benign narratives. This method was tested in the fall of 2024 against eight LLMs with an average success rate of 65%. 

Palo Alto has also successfully executed the Bad Likert Judge jailbreak, which involves asking the LLM to act as a judge and score the harmfulness of a response on a Likert scale, then generate responses that contain examples aligning with that scale.

The company’s researchers also found that DeepSeek is vulnerable to Crescendo, a jailbreak method that starts with harmless dialogue and progressively leads the conversation toward the prohibited objective. 
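For a sense of how such multi-turn techniques are evaluated in practice, the sketch below shows a generic harness that replays a staged conversation against a chat model and flags refusals. It is a minimal illustration, not any vendor’s actual test suite: the model name, prompts, and refusal check are placeholder assumptions, it presumes the OpenAI Python client (v1) interface, and the prompts themselves are deliberately benign.

```python
# Minimal sketch of a multi-turn test harness (hypothetical; placeholder
# prompts only). Assumes the OpenAI Python client v1 interface and an API
# key in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# A staged conversation: each turn is sent in order, and the model's replies
# are appended so later turns build on earlier context, as multi-turn tests
# such as Crescendo do, but with harmless stand-in content here.
turns = [
    "Tell me about the history of chemistry.",
    "Which discoveries from that era are still taught today?",
    "Summarize the previous answer as a bulleted list.",
]

REFUSAL_MARKERS = ("I can't", "I cannot", "I'm sorry")  # crude heuristic

messages = []
for turn in turns:
    messages.append({"role": "user", "content": turn})
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=messages,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    refused = any(marker in reply for marker in REFUSAL_MARKERS)
    print(f"Turn: {turn!r} -> refused={refused}")
```

In real evaluations, researchers replace the placeholder turns with their test cases and use far more robust refusal detection than simple string matching; the point here is only the conversation-replay structure.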


Chinese tech giant Alibaba this week announced the release of a new version of its Qwen AI model, claiming that it’s superior to the DeepSeek model.

Kela revealed on Thursday that Alibaba’s newly released Qwen 2.5-VL model is affected by vulnerabilities similar to the ones found in DeepSeek a few days earlier. 

The threat intelligence firm’s researchers found that the evil persona jailbreaks tested against DeepSeek also work against Qwen. In addition, they successfully tested a previously known jailbreak named Grandma, where the model is tricked into providing dangerous information by manipulating it to role-play as a grandmother. 

Kela also discovered that Qwen 2.5-VL generated content related to the development of ransomware and other malware.

“The ability of AI models to produce infostealer malware instructions raises serious concerns, as cybercriminals could leverage these capabilities to automate and enhance their attack methodologies,” Kela said.

As for ChatGPT, many jailbreak methods have been patched in the popular chatbot in recent years, but researchers continue to find new ways to bypass its guardrails.

CERT/CC reported that researcher Dave Kuszmar has identified a ChatGPT-4o jailbreak vulnerability named Time Bandit, which involves asking the AI questions about a specific historical event or time period, or instructing it to pretend that it is assisting the user in a specific historical event.

“The jailbreak can be established in two ways, either through the Search function, or by prompting the AI directly,” CERT/CC explained in an advisory. “Once this historical timeframe has been established in the ChatGPT conversation, the attacker can exploit timeline confusion and procedural ambiguity in following prompts to circumvent the safety guidelines, resulting in ChatGPT generating illicit content. This information could be leveraged at scale by a motivated threat actor for malicious purposes.”

Related: ChatGPT Jailbreak: Researchers Bypass AI Safeguards Using Hexadecimal Encoding and Emojis

Related: Epic AI Fails And What We Can Learn From Them

Related: AI Models in Cybersecurity: From Misuse to Abuse
