In just the last few months, cyberattackers have registered more than 100,000 malicious copycat repositories on GitHub, and by some estimates more than a million.
The "repo confusion" scheme is simple: programmatically copying, Trojanizing, and reuploading existing repos, hoping that developers download the wrong one.
GitHub's automatic security mechanisms seem to be identifying and removing the majority of these cheap fakes, but according to new research from Apiiro, plenty are still seeping through the cracks.
Anatomy of a Repository Confusion Attack
Repo confusion works just like dependency confusion in package managers, tricking unwitting developers into downloading near-identical copies of the code they actually want, with malware quietly added as a bonus.
This malware, in turn, becomes incorporated into software projects and causes downstream supply chain risks.
The key to success with this latest campaign is automation. The attacker has been cloning, infecting, and reuploading repositories automatically at scale, pushing what researchers estimate are millions of repositories in all. And to add legitimacy, the automation process forks these projects thousands of times apiece, and promotes them across various Web forums and apps.
So when sleep-deprived or multitasking developers fork the copycat instead of the original, they're served a heavily obfuscated copy of the BlackCap Grabber, which collects credentials from various apps, browser cookies, and other data, and carries out other malicious functions.
GitHub, for its part, has been taking down most of these malicious repos within hours of their posting.
"However, the automation detection seems to miss many repos, and the ones that were uploaded manually survive. Because the whole attack chain seems to be mostly automated on a large scale, the 1% that survive still amount to thousands of malicious repos," Apiiro explained in its blog post.
A GitHub spokesperson said the organization is working on extracting the malicious code. "GitHub hosts over 100M developers building across over 420M repositories, and is committed to providing a safe and secure platform for developers. We have teams dedicated to detecting, analyzing, and removing content and accounts that violate our Acceptable Use Policies. We employ manual reviews and at-scale detections that use machine learning and constantly evolve and adapt to adversarial tactics," the spokesperson said in a statement. "We also encourage customers and community members to report abuse and spam."
Why GitHub Is Used for Confusion Attacks
GitHub by nature offers certain advantages for confusion attacks. "The ease of automatic generation of accounts and repos on GitHub and alike, using comfortable APIs and soft rate limits that are easy to bypass, combined with the huge number of repos to hide among, make it a perfect target for covertly infecting the software supply chain," Apiiro wrote.
Shawn Loveland, chief operating officer of Resecurity, points out two additional problems. "One's a tradeoff of privacy versus security: GitHub's not looking at repos, but then criminals can leverage them," Loveland says. "And the other one is just the sheer number of GitHub accounts that are compromised, which allows bad actors to get into private repos and then go off and make duplicates."
Cybercriminals can also copy public repos without this extra access.
"I just looked in our database," Loveland notes. "Almost 100,000 PCs of users logging in to GitHub were infected with malware in the last 90 days."
How can organizations protect themselves from both direct and downstream effects of a malicious GitHub repo? "Companies need to have a policy about using GitHub [that is] communicated with their employees and vendors, even if they themselves don't use GitHub," he suggests, because even companies that don't directly engage with third-party code rely on developers at some point in their supply chains.
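One way such a policy could be backed by tooling is a check that runs before anyone clones or vendors a repository. The Python sketch below is illustrative only and does not come from Apiiro or Resecurity. The allowlist entries, function names, and category labels are all hypothetical. It assumes the organization keeps a list of vetted "owner/repo" pairs, and it flags any URL whose repository name matches a vetted project but whose owner does not, which is the pattern a repo confusion copy would show.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of vetted "owner/repo" pairs; these entries are placeholders.
APPROVED_REPOS = {"example-org/example-lib", "example-org/build-tools"}


def parse_github_repo(url: str) -> tuple[str, str]:
    """Return (owner, repo) from an HTTPS github.com URL, lowercased."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.hostname != "github.com":
        raise ValueError(f"not an HTTPS github.com URL: {url}")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"URL does not name a repository: {url}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[:-4]
    # GitHub treats owner and repository names case-insensitively.
    return owner.lower(), repo.lower()


def classify_repo(url: str, approved: set[str] = APPROVED_REPOS) -> str:
    """Classify a URL as 'approved', 'possible-copycat', or 'unknown'."""
    owner, repo = parse_github_repo(url)
    if f"{owner}/{repo}" in approved:
        return "approved"
    # Same repository name as a vetted project, but under a different owner.
    approved_names = {entry.split("/", 1)[1] for entry in approved}
    if repo in approved_names:
        return "possible-copycat"
    return "unknown"


if __name__ == "__main__":
    for candidate in (
        "https://github.com/example-org/example-lib",
        "https://github.com/some-other-user/example-lib",
    ):
        print(candidate, classify_repo(candidate))
```

A check like this only catches copies that reuse a vetted project's exact name. It would not detect a compromised account or malicious code pushed to an approved repository, so at best it complements the review and reporting processes described above.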
"Even a company that doesn't have anyone using GitHub can still be victimized," Loveland says.