How I found 10,000 GitHub repositories distributing Trojan malware

This is the story of how I found 10,000 repositories on GitHub that distribute Trojan malware. They are all from different contributors, have different names, and are not forks of other repositories. But they share a common pattern, which is what allowed me to write a script to find such repositories.

Introduction

I have a project on GitHub, and I wanted to check whether search engines had indexed it. I typed the project name into Google, and my repository appeared in the results. I entered the same query into Bing, and someone else’s repository appeared in the results, with the exact same name and description. It was a copy of my repository with all the commits, and I was listed as a contributor. But an hour earlier, another commit had been pushed that changed the readme: a link to a zip archive had been added to it.

Separately, I was choosing appropriate tags for another one of my projects on GitHub. I clicked through those tags to look at similar projects. In the list, I found a repository whose name and description exactly matched those of another repository on that list. It turned out that it also contained copies of all the commits from that repository, and two hours earlier, a link to a zip archive had been added to the readme.

After monitoring these two repositories, I discovered that every few hours they delete the previous commit and push the exact same commit again. This commit contains only one change: adding a link to the archive in the readme file.

I submitted a request to GitHub support asking them to delete these repositories. Two weeks passed and nothing had changed; GitHub support hadn’t responded. I discussed with an AI what else could be done about this, but it didn’t offer any useful advice. I opened a thread on GitHub, and three people replied with the same AI slop that was of no use at all.

Another month later, GitHub support sent me an email saying that they had removed these repositories.

You can open other similar repositories, look at the latest commit, and see that a link to a zip archive was added to the readme a few hours ago:
https://github.com/Dicrida123/java-sdk
https://github.com/A2A-MC/ccresume
https://github.com/1-RAY-1/project-startup-cursor
https://github.com/123abukhaled0/FinCoach

The zip archive contains 4 files:
- Application.cmd or Launcher.cmd
- loader.exe or luajit.exe or another_name.exe
- random_name.cso or random_name.txt
- lua51.dll

If you submit the link to the archive to VirusTotal, it reports 0 detections.
If you submit the zip file itself, it detects a Trojan inside it.

Continued

It seemed like I had already forgotten about this event, but my subconscious hadn’t. And my subconscious often throws interesting ideas at me when I’m sleeping or waking up. Recently, I woke up and in the very same second realized what I needed to do. I needed to come up with a general pattern and then write a script that would analyze all GitHub repositories and find the ones that match that pattern.

Search pattern:
- Every few hours the previous commit is deleted and a new one is pushed
- Only the readme file is updated in the commit
- The readme file contains a link to a zip archive
- The commits are copied from another repository
- This is a new repository, not a fork
- All repositories have different contributors and different names
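Of these conditions, the "link to a zip archive" one is the easiest to check mechanically. Here is a minimal sketch in Python, assuming the link appears in the readme as a plain or Markdown URL ending in .zip. The function name and the regular expression are illustrative and are not taken from the script mentioned in this article.

```python
import re

# Matches http(s) URLs whose path ends in ".zip", optionally followed by a
# query string. Brackets, quotes and whitespace terminate the URL.
ZIP_LINK_RE = re.compile(
    r"""https?://[^\s()<>"']+\.zip\b(?:\?[^\s()<>"']*)?""",
    re.IGNORECASE,
)


def find_zip_links(readme_text: str) -> list[str]:
    """Return every URL in the readme text that points at a .zip file."""
    return ZIP_LINK_RE.findall(readme_text)
```

A check like this only narrows the list. Plenty of legitimate projects link to zip releases, which is why the other conditions matter.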

From the last two points, it becomes clear that even if we find one such repository, we won’t be able to find other similar repositories using it. But there are 500 million repositories on GitHub. How can we analyze all of them? GitHub allows 5,000 requests per hour with a single token. For each repository, we need to make several requests to get the list of commits, modified files, and the content of the readme file. At that rate a full scan would take years, and I didn’t want to wait that long.
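A rough estimate shows why a full scan is out of reach. The repository count and the rate limit are the figures quoted above. The number of requests per repository is an assumption ("several" is taken to mean three here).

```python
TOTAL_REPOS = 500_000_000   # figure quoted in the text
REQUESTS_PER_HOUR = 5_000   # per-token API limit quoted in the text
REQUESTS_PER_REPO = 3       # assumption: commit list + changed files + readme


def full_scan_years(total_repos: int, per_repo: int, per_hour: int) -> float:
    """Years needed to query every repository with a single token."""
    hours = total_repos * per_repo / per_hour
    return hours / (24 * 365)


years = full_scan_years(TOTAL_REPOS, REQUESTS_PER_REPO, REQUESTS_PER_HOUR)
print(f"{years:.1f} years")
```

Under these assumptions the scan takes about 34 years. Even at one request per repository it would take more than 11.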

But we don’t need all the repositories; we only need the ones that are updated every few hours. I found a service called gharchive, which lets you download all GitHub events for any given day. So we need to download the event archives for the last few days, keep only the push events, and identify the repositories that are updated between 2 and 10 times every 10 hours.
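The filtering step can be sketched as follows. It assumes the archives have already been downloaded as gzip files of JSON lines, one event per line, with a `type` field and a `repo.name` field. The function names and the file path are hypothetical, and how the original script groups events into 10-hour windows is not stated, so this version simply counts pushes across whatever lines it is given.

```python
import gzip
import json
from collections import Counter
from typing import Iterable, Iterator


def read_archive(path: str) -> Iterator[str]:
    """Yield the JSON lines of one downloaded archive file (path is hypothetical)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        yield from fh


def count_pushes(lines: Iterable[str]) -> Counter:
    """Count PushEvent entries per repository, ignoring every other event type."""
    pushes: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "PushEvent":
            pushes[event["repo"]["name"]] += 1
    return pushes


def select_candidates(pushes: Counter, min_pushes: int = 2, max_pushes: int = 10) -> list[str]:
    """Keep repositories whose push count lies in the inclusive range."""
    return sorted(name for name, n in pushes.items() if min_pushes <= n <= max_pushes)
```

Feeding `count_pushes` the lines from the files that cover one 10-hour window and passing the result to `select_candidates` approximates the "between 2 and 10 times every 10 hours" rule. The bounds are parameters so the window can be widened without touching the code.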

Over the past 5 days, there were 16 million commit pushes. Only 3,000 repositories among them were updated every few hours.

However, the events do not include information about which specific files were modified. This means that for each relevant repository, we need to make additional requests to the GitHub API.

The script returned a large number of repositories, so I added several parameters to the filters:
- The commit must be from a user, not a bot
- More than a month has passed between the last commit and the one before that
- The repositories have more than one contributor

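A sketch of how the readme-only rule and these three extra filters could be applied once the commit data has been fetched. It assumes each commit is a dict shaped like a GitHub REST API commit object (`author.type`, `commit.author.date`, `files[].filename`). The function names and the 30-day threshold for "a month" are illustrative assumptions, not the original script.

```python
from datetime import datetime, timedelta


def parse_time(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-01-01T12:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def matches_pattern(latest: dict, previous: dict, contributor_count: int) -> bool:
    """Check the latest commit against the readme-only rule and the extra filters."""
    # The commit must come from a regular user account, not a bot.
    author = latest.get("author") or {}
    if author.get("type") != "User":
        return False

    # Exactly one file may be changed, and it must be a readme.
    files = [f["filename"].lower() for f in latest.get("files", [])]
    if len(files) != 1 or not files[0].rsplit("/", 1)[-1].startswith("readme"):
        return False

    # More than a month (taken as 30 days) between the last two commits.
    gap = parse_time(latest["commit"]["author"]["date"]) - parse_time(
        previous["commit"]["author"]["date"]
    )
    if gap <= timedelta(days=30):
        return False

    # The copied history should show more than one contributor.
    return contributor_count > 1
```

Keeping the check a pure function over already-fetched data means the expensive API calls happen once per repository, and the rules can be adjusted and re-run without spending more of the request budget.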
After that, only 14 repositories were found that fully matched the pattern. And I couldn’t stop wondering: why were there so few repositories? What are the odds that I stumbled upon these repositories two months ago and there are only 14 of them on all of GitHub? There should be many more. Imagine what the headline of this article would have been if I’d found a million such repositories or even just a thousand.

But I accepted the fact that there were only 14 of them and started writing this article. I decided to double-check them one more time so I wouldn’t accidentally include any unnecessary repositories in the article. Imagine my surprise when I saw that they had all been updated 20 hours ago. So the “updated every few hours” parameter was completely wrong. The filter had discarded all repositories that are updated less often.

During my manual check, I also noticed repositories that contained a link to a zip archive and had a recent commit, but that commit had zero changes. The filter, however, only considered repositories where a single readme file had been modified in the latest commit.

I also noticed that the last commit in all of these repositories had the same name: “Update README.md”.

I changed the filter. Now the script searched for repositories that were updated between 1 and 24 times every 24 hours. It found 40,000 such repositories.
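The two observations from the manual check (commits with zero changes, and the fixed commit title) could be folded into the per-commit test like this. This is only a sketch. Whether the published script uses the commit title as a criterion is not stated here, and the function name is hypothetical.

```python
def latest_commit_is_suspicious(message: str, changed_files: list[str]) -> bool:
    """Accept the fixed commit title with either no changes or a readme-only change."""
    if message.strip() != "Update README.md":
        return False
    if not changed_files:
        # The "zero changes" commits noticed during the manual check.
        return True
    return len(changed_files) == 1 and changed_files[0].lower().endswith("readme.md")
```

On its own this matches many honest commits, because "Update README.md" is also the default title GitHub suggests when a readme is edited in the web editor. It is only useful combined with the other conditions.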

Of those, 10,000 repositories exactly matched the pattern. That’s 25% of the 40,000 candidates.

Each of these repositories contains a zip archive with a Trojan.

These repositories have been around for many months, some even for over a year, and GitHub does not automatically detect and delete them.

I’ve published a complete list of these repositories on GitHub.
A script for finding such repositories: Git Malware Finder

Open Questions

  1. Why do they only clone new repositories, rather than popular ones?
  2. Why do they delete a commit and push a new one every few hours?
  3. Why doesn’t GitHub automatically detect such repositories?
  4. What exactly does the exe file from the archive do?
  5. What is the actual scale of this campaign?

My Hypotheses

The hackers’ goal is to understand how the system works, find its limitations and vulnerabilities, and exploit that information. If overwriting commits helps bypass GitHub’s security algorithms, then that’s what they did. Perhaps that’s also why every commit is named “Update README.md”.

The second goal is to spread the virus. How do they get people to find and download it? I think they do this by cloning only new repositories, which immediately appear at the top of search engine results for low-volume search terms. They also add these repositories to popular GitHub tags to increase the chances of indexing and to help people find them through those tags.

But why do they copy all the commits and contributors? After all, they could have just copied the entire source code. This is likely done to build trust. When someone visits a repository, they see the contributors, can click through to their profiles, and see that these aren’t one-day accounts. And the commit history is preserved so it’s clear that the repository didn’t just appear yesterday. But perhaps this is also done to bypass GitHub’s algorithms.

These are just my assumptions, but the reality may be completely different.

Conclusion

I was subject to GitHub’s API limit of 5,000 requests per hour. I optimized the script to search only for relevant repositories, and I think that because of the filter, the script found only a small percentage of repositories. The GitHub team does not have such limitations. They can analyze all 500 million repositories, find any archives or executable files within them, and scan them for viruses.

This time, I won’t be sending a request to GitHub. There are simply too many repositories. If any of you have direct contact with GitHub’s security team, please send them a link to this article.

* Update
I found this article from April 18: How 109 Fake GitHub Repositories Delivered SmartLoader and StealC
It explains in detail how this Trojan malware works. At that time, the author had found 109 such repositories.