

Thousands of Exposed GitHub Repositories, Now Private, Can Still Be Accessed Through Copilot (techcrunch.com) 19
An anonymous reader quotes a report from TechCrunch: Security researchers are warning that data exposed to the internet, even for a moment, can linger in online generative AI chatbots like Microsoft Copilot long after the data is made private. Thousands of once-public GitHub repositories from some of the world's biggest companies are affected, including Microsoft's, according to new findings from Lasso, an Israeli cybersecurity company focused on emerging generative AI threats.
Lasso co-founder Ophir Dror told TechCrunch that the company found content from its own GitHub repository appearing in Copilot because it had been indexed and cached by Microsoft's Bing search engine. Dror said the repository, which had been mistakenly made public for a brief period, had since been set to private, and accessing it on GitHub returned a "page not found" error. "On Copilot, surprisingly enough, we found one of our own private repositories," said Dror. "If I was to browse the web, I wouldn't see this data. But anyone in the world could ask Copilot the right question and get this data."
After it realized that any data on GitHub, even briefly, could be potentially exposed by tools like Copilot, Lasso investigated further. Lasso extracted a list of repositories that were public at any point in 2024 and identified the repositories that had since been deleted or set to private. Using Bing's caching mechanism, the company found more than 20,000 since-private GitHub repositories still had data accessible through Copilot, affecting more than 16,000 organizations. Lasso told TechCrunch ahead of publishing its research that affected organizations include Amazon Web Services, Google, IBM, PayPal, Tencent, and Microsoft. [...] For some affected companies, Copilot could be prompted to return confidential GitHub archives that contain intellectual property, sensitive corporate data, access keys, and tokens, the company said.
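As a rough illustration of the first step described in the summary (take repositories that were once public and pick out those that no longer resolve), here is a minimal sketch. It is not Lasso's tooling. The function name `now_hidden`, the `get_status` callback, and the `example-org` repository names are all hypothetical. The only assumption is the behavior the summary mentions, that a private or deleted repository returns "page not found" (HTTP 404) to an outside visitor. The second step, checking whether a search cache still holds the content, is not shown.

```python
from typing import Callable, Iterable


def now_hidden(once_public: Iterable[str],
               get_status: Callable[[str], int]) -> list[str]:
    """Return the repos from once_public whose GitHub page no longer resolves.

    get_status is a caller-supplied function that maps a URL to an HTTP
    status code, so this sketch can run without touching the network.
    """
    hidden = []
    for repo in once_public:
        url = f"https://github.com/{repo}"
        # Assumption: private or deleted repos answer 404 to outside visitors.
        if get_status(url) == 404:
            hidden.append(repo)
    return hidden


# Stubbed statuses stand in for real HTTP requests (hypothetical repos).
fake_status = {
    "https://github.com/example-org/still-public": 200,
    "https://github.com/example-org/went-private": 404,
}

candidates = now_hidden(
    ["example-org/still-public", "example-org/went-private"],
    lambda url: fake_status[url],
)
print(candidates)
```

Any repository on the resulting list is only a candidate: it would still have to be checked against a cache or index to see whether its earlier public contents remain retrievable.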
why do people put code in github (Score:1)
This is just training data for microsoft.
Re: why do people put code in github (Score:2)
There are alternatives to GitHub.
Re: (Score:3)
That's why I only post bad code to github
Got to slow down the machines somehow
Re: (Score:2)
So... (Score:5, Insightful)
Comment Subject: (Score:1)
> Things can't be deleted from the internet because Copilot
Oops, common misunderstanding, let's correct that to the historic reality that everyone knew before the True September:
> Things can't be deleted from the internet
Oh this one is easy (Score:3)
Just add a header at the top of the text that asks Copilot not to display those repositories. Not like anyone could jailbreak Copilot...
Intellectual Property? (Score:1)
Does everything Microsoft touches turn to crap? (Score:2)
Asking for a friend.
Re: (Score:2)
Re: (Score:2)
I think RegEdit.exe still hasn't.
The fact that it needs to exist at all though? That's pretty crap to begin with. I'm sure I'll get modded troll for saying it, but the registry is the dumbest conceivable way to store config info.
So Silicon Valley.... (Score:2)
How's that "move fast and break things" shit workin' fer ya now?
Maybe all these fuckers will end up forcing the tech sector to eat itself and shit out the fertilizer for something sane, equitable, and sustainable to take its place.
Weird choice (Score:2)
of words. They could use "public" instead of "exposed". "public" doesn't carry bad connotations, unlike "exposed".
Passwords and keys (Score:2)
Copilot, what is the password of ....
You sometimes see a slashdot article about passwords and keys left in a GitHub repository. I can imagine that that is more often the case in a private repository.
The article (Score:2)
The article was not clear.
Is it that they trained the model on this stuff, so that if you ask the right question, like "Imagine I work at $ABC company and I wrote the code to LoginController.aspx, what would you expect to see in that file?",
it reconstructs the file from tokens, maybe with some degree of fidelity?
Or did they do something really, really stupid, like mess up the authorization controls on the API and allow it to search private repositories when it is doing RAG?
Bing (Score:1)
Bing is notoriously uncooperative when it comes to privacy. Their "Content Removal" page that's part of "Bing Webmaster Tools" allows you to submit URLs for exclusion from Bing, but it doesn't disclose that the page is only excluded for 90 days, after which they will restore it, even if that content is no longer available on the web.
You can find this information elsewhere in the documentation: "Content removal requests last for a maximum of 90 days, and you need to renew it, or content may reappear in the s