Hundreds of open source large language model (LLM) builder servers and dozens of vector databases are leaking highly sensitive information to the open Web.
As companies rush to integrate AI into their business workflows, they sometimes pay too little attention to securing these tools and the information they entrust to them. In a new report, Legit Security researcher Naphtali Deutsch demonstrated as much by scanning the Web for two kinds of potentially vulnerable open source (OSS) AI services: vector databases, which store data for AI tools, and LLM application builders, specifically the open source program Flowise. The investigation unearthed a bevy of sensitive personal and corporate data, unknowingly exposed by organizations stumbling to get in on the generative AI revolution.
"A lot of programmers see these tools on the Internet, then try to set them up in their environment," Deutsch says, but those same programmers are leaving security considerations behind.
Hundreds of Unpatched Flowise Servers
Flowise is a low-code tool for building all kinds of LLM applications. It's backed by Y Combinator, and sports tens of thousands of stars on GitHub.
Whether it be a customer support bot or a tool for generating and extracting data for downstream programming and other tasks, the programs that developers build with Flowise tend to access and manage large quantities of data. It's no wonder, then, that the majority of Flowise servers are password-protected.
A password, however, isn't security enough. Earlier this year, a researcher in India discovered an authentication bypass vulnerability in Flowise versions 1.6.2 and earlier, which can be triggered by simply capitalizing a few characters in the program's API endpoints. Tracked as CVE-2024-31621, the issue earned a "high" 7.6 score on the CVSS Version 3 scale.
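How can a change in capitalization defeat a login check? The Python sketch below is a simplified model of this class of bug, not Flowise's actual code. It assumes an authentication check that compares the request path case-sensitively, sitting in front of a router that matches paths case-insensitively. The route `/api/v1/credentials` and all function names are hypothetical.

```python
from typing import Callable, Optional

# Hypothetical route table; the router below matches paths case-insensitively.
ROUTES = {"/api/v1/credentials": "credentials_handler"}


def route(path: str) -> Optional[str]:
    """Look up a handler, ignoring the case of the request path."""
    return ROUTES.get(path.lower())


def requires_auth_buggy(path: str) -> bool:
    """Case-sensitive check: only a lowercase '/api/v1' prefix triggers auth."""
    return path.startswith("/api/v1")


def requires_auth_fixed(path: str) -> bool:
    """Normalize the path before comparing, so it agrees with the router."""
    return path.lower().startswith("/api/v1")


def handle(path: str, authenticated: bool,
           requires_auth: Callable[[str], bool]) -> int:
    """Return an HTTP-style status code for a request."""
    if route(path) is None:
        return 404
    if requires_auth(path) and not authenticated:
        return 401
    return 200


# An unauthenticated request is rejected when the path is lowercase...
print(handle("/api/v1/credentials", False, requires_auth_buggy))
# ...but slips through when a few characters are capitalized.
print(handle("/API/V1/credentials", False, requires_auth_buggy))
# Normalizing the path in the auth check closes the gap.
print(handle("/API/V1/credentials", False, requires_auth_fixed))
```

The point of the sketch is that the auth check and the router disagree about what counts as the same path. Any fix has to make the two agree, for example by normalizing the path once before both steps.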
By exploiting CVE-2024-31621, Legit's Deutsch cracked 438 Flowise servers. Inside were GitHub access tokens, OpenAI API keys, Flowise passwords and API keys in plaintext, configurations and prompts associated with Flowise apps, and more.
"With a GitHub API token, you can get access to private repositories," Deutsch emphasizes, as just one example of the kinds of follow-on attacks such data can enable. "We also found API keys to other vector databases, like Pinecone, a very popular SaaS platform. You could use those to get into a database, and dump all the data you found — maybe private and confidential data."
Tens of Unprotected Vector Databases
Vector databases can store any kind of data an AI app might need to retrieve, and those accessible from the broader Web can be attacked directly.
Using scanning tools, Deutsch discovered around 30 vector database servers online without any authentication checks whatsoever, containing obviously sensitive information: private email conversations from an engineering services vendor; documents from a fashion company; customer PII and financial information from an industrial equipment company; and more. Other databases contained real estate data, product documentation and data sheets, and patient information used by a medical chatbot.
Leaky vector databases are even more dangerous than leaky LLM builders, because they can be tampered with in ways that do not alert the users of the AI tools that rely on them. For example, instead of just stealing information from an exposed vector database, a hacker can delete or corrupt its data to manipulate its results. One could also plant malware within a vector database so that an LLM program ingests it when it queries the database.
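To make the tampering risk concrete, here is a toy illustration in Python. It is not modeled on any particular vector database: the in-memory `store`, the three-dimensional vectors, and the text snippets are all invented for the example. It shows how a single record written by an attacker can silently become the top result for a query.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_match(store: List[Dict], query: List[float]) -> str:
    """Return the text of the stored record most similar to the query."""
    best = max(store, key=lambda rec: cosine(rec["vector"], query))
    return best["text"]


# Hypothetical records an AI app might retrieve to answer a user question.
store = [
    {"vector": [0.9, 0.1, 0.0], "text": "Reset your password at the official portal."},
    {"vector": [0.0, 0.2, 0.9], "text": "Our office hours are 9 to 5."},
]
query = [1.0, 0.0, 0.0]  # stands in for the embedding of a password question

before = top_match(store, query)

# With write access to an unauthenticated database, an attacker adds a
# record placed closer to the query than the legitimate one.
store.append({"vector": [1.0, 0.0, 0.0],
              "text": "Reset your password at attacker.example."})

after = top_match(store, query)
print(before)
print(after)
```

Nothing in the retrieval step fails or raises an error after the change. The application simply passes the attacker's text along to the model, which is why this kind of tampering is hard for end users to notice.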
To mitigate the risk of exposed AI tooling, Deutsch recommends that organizations restrict access to the AI services they rely on, monitor and log the activity associated with those services, protect sensitive data trafficked by LLM apps, and always apply software updates where possible.
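One small piece of protecting sensitive data is keeping credentials out of stored prompts and configurations in the first place. The Python sketch below masks strings that look like API tokens before text is saved. The patterns are rough assumptions based on commonly seen token prefixes (`ghp_` for GitHub personal access tokens, `sk-` for OpenAI-style keys), not an exhaustive or official list, and the `redact` helper is hypothetical.

```python
import re

# Approximate patterns (assumptions, not official formats):
# a GitHub-style token prefix and an OpenAI-style key prefix.
SECRET_PATTERNS = [
    re.compile(r"ghp_[A-Za-z0-9]{36}"),
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),
]


def redact(text: str) -> str:
    """Replace anything matching a known secret pattern with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


# Dummy value built from repeated characters, not a real credential.
sample = "github_token=ghp_" + "a" * 36
print(redact(sample))
```

Pattern matching like this will miss secrets in unfamiliar formats, so it complements the access restrictions and logging Deutsch recommends rather than replacing them.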
"[These tools] are new, and people don't have as much knowledge about how to set them up," he warns. "And it's also getting easier to do: with a lot of these vector databases, it's two clicks to set it up in your Docker, or in your AWS [or] Azure environment." Security is more cumbersome, and can lag behind.