Researchers recently gained full read and write access to Meta's Bloom, Meta-Llama, and Pythia large language model (LLM) repositories, a troubling demonstration of the supply chain risks facing organizations that rely on these repositories to integrate LLM capabilities into their applications and operations.
The access would have allowed an adversary to silently poison training data in these widely used LLMs, steal models and data sets, and potentially carry out other malicious activities that would heighten security risks for millions of downstream users.
Exposed Tokens on Hugging Face
That's according to researchers at AI security startup Lasso who were able to access the Meta-owned model repositories using unsecured API access tokens they discovered on GitHub and the Hugging Face platform for LLM developers.
The tokens they discovered for the Meta platforms were among over 1,500 similar tokens they found on Hugging Face and GitHub that provided them with varying degrees of access to repositories belonging to a total of 722 other organizations. Among them were Google, Microsoft, and VMware.
"Organizations and developers should understand Hugging Face and other likewise platforms aren't working [to secure] their users exposed tokens," says Bar Lanyado, a security researcher at Lasso. It's up to developers and other users of these platforms to take the necessary steps to protect their access, he says.
"Training is required while working and integrating generative AI- and LLM-based tools in general," he notes. "This research is part of our approach to shine a light on these kinds of weaknesses and vulnerabilities, to strengthen the security of these types of issues."
Hugging Face is a platform that many LLM professionals use as a source for tools and other resources for LLM projects. The company's main offerings include Transformers, an open source library that offers APIs and tools for downloading and tuning pretrained models. In GitHub-like fashion, the company hosts more than 500,000 AI models and 250,000 data sets, including those from Meta, Google, Microsoft, and VMware. It lets users post their own models and data sets to the platform and access those from others for free via a Hugging Face API. The company has raised some $235 million so far from investors that include Google and Nvidia.
Given the platform's wide use and growing popularity, researchers at Lasso decided to take a closer look at the registry and its security mechanisms. As part of the exercise, in November 2023 the researchers tried to see whether they could find exposed API tokens that would let them access data sets and models on Hugging Face. They scanned for exposed API tokens on GitHub and on Hugging Face. Initially, the scans returned only a very limited number of results, especially on Hugging Face. But with a small tweak to the scanning process, the researchers succeeded in finding a relatively large number of exposed tokens, Lanyado says.
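For teams that want to check their own code for the kind of exposure described here, a scan can be as simple as a pattern match over source files. The following is a minimal sketch, not Lasso's method: it assumes Hugging Face user access tokens look like `hf_` followed by 34 alphanumeric characters, and the names `HF_TOKEN_PATTERN`, `find_candidate_tokens`, and `redact` are hypothetical helpers introduced for illustration.

```python
import re
from typing import List

# Assumption: user access tokens look like "hf_" followed by 34
# alphanumeric characters. Adjust the pattern if the format differs.
HF_TOKEN_PATTERN = re.compile(r"\bhf_[A-Za-z0-9]{34}\b")


def find_candidate_tokens(text: str) -> List[str]:
    """Return unique token-like strings in order of first appearance."""
    seen = {}
    for match in HF_TOKEN_PATTERN.finditer(text):
        seen.setdefault(match.group(0), None)
    return list(seen)


def redact(token: str) -> str:
    """Shorten a token so it can be logged without exposing it."""
    return token[:6] + "..." + token[-2:]


if __name__ == "__main__":
    # Fake token built for demonstration only; it is not a real credential.
    fake = "hf_" + "A1b2C3d4E5f6G7h8I9" + "j0K1L2m3N4o5P6q7"
    sample = 'api = HfApi(token="' + fake + '")\nprint("hf_short")\n'
    for token in find_candidate_tokens(sample):
        print("possible exposed token:", redact(token))
```

A match only indicates a string that resembles a token. Whether it is valid, and what permissions it carries, would have to be confirmed separately by the token's owner.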
Surprisingly Easy to Find Exposed Tokens
"Going into this research, I believed we would be able to find a large amount of exposed tokens," Lanyado says. "But I was still very surprised with the findings, as well as the simplicity [with] which we were able to gain access to these tokens."
Lasso researchers were able to access tokens belonging to several top technology companies — including those with a high level of security — and gain full control over some of them, Lanyado says.
Lasso security researchers found a total of 1,976 tokens across both GitHub and Hugging Face, 1,681 of which turned out to be valid and usable. Of these, 1,326 were on GitHub and 370 on Hugging Face. As many as 655 of the tokens that Lasso discovered had write permissions on Hugging Face. The researchers also found tokens that granted them full access to 77 organizations using Meta-Llama, Pythia, and Bloom.
"If an attacker had gained access to these API tokens, they could steal companies' models which in some cases are their main business," Lanyado says. An attacker with write privileges could replace existing models with malicious ones or create an entirely new malicious model in the victim organization's name. Such actions would have allowed an attacker to gain a foothold on all systems using the compromised models, steal user data, or spread manipulated information, he notes.
According to Lanyado, Lasso researchers found several tokens associated with Meta, one of which had write permissions to Meta-Llama, and two each with write permissions to Pythia and Bloom. The API tokens associated with Microsoft and VMware had read-only privileges, but they allowed Lasso researchers to view all of those companies' private data sets and models, he says.
Lasso disclosed its findings to all impacted users and organizations with a recommendation to revoke their exposed tokens and delete them from their respective repositories. The security vendor also notified Hugging Face about the issue.
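One common way to keep a token out of a repository in the first place is to read it from the environment at runtime instead of writing it into source code. The snippet below is a minimal sketch of that practice, not a recommendation taken from Lasso's report. The helper name `load_hf_token` is hypothetical, and the default variable name `HF_TOKEN` is an assumption about how a deployment is configured.

```python
import os


def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read an access token from the environment, never from source code.

    The variable name is an assumption; use whatever the deployment defines.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            "No token found in environment variable " + env_var
            + "; set it outside the repository (shell profile, secret store)."
        )
    return token


if __name__ == "__main__":
    try:
        token = load_hf_token()
        # Report only that a token exists; never print the value itself.
        print("Token loaded, length:", len(token))
    except RuntimeError as err:
        print("Configuration problem:", err)
```

Removing a leaked token from the latest commit does not remove it from the repository's history, which is why revoking the exposed token remains the essential step.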
"Many of the organizations (Meta, Google, Microsoft, VMware and more) and users took very fast and responsible actions," according to Lasso's report. "They revoked the tokens and removed the public access token code on the same day of the report."