Data From Deleted GitHub Repos May Not Actually Be Deleted, Researchers Claim (theregister.com) 23

Posted by BeauHD on Friday July 26, 2024 @06:00PM from the FYI dept.

Thomas Claburn reports via The Register: Researchers at Truffle Security have found, or arguably rediscovered, that data from deleted GitHub repositories (public or private) and from deleted copies (forks) of repositories isn't necessarily deleted. Joe Leon, a security researcher with the outfit, said in an advisory on Wednesday that being able to access deleted repo data -- such as API keys -- represents a security risk. And he proposed a new term to describe the alleged vulnerability: Cross Fork Object Reference (CFOR). "A CFOR vulnerability occurs when one repository fork can access sensitive data from another fork (including data from private and deleted forks)," Leon explained.

For example, the firm showed how one can fork a repository, commit data to it, delete the fork, and then access the supposedly deleted commit data via the original repository. The researchers also created a repo, forked it, and showed how data not synced with the fork continues to be accessible through the fork after the original repo is deleted. You can watch that particular demo [here].

According to Leon, this scenario came up last week with the submission of a critical vulnerability report to a major technology company involving a private key for an employee GitHub account that had broad access across the organization. The key had been publicly committed to a GitHub repository. Upon learning of the blunder, the tech biz nuked the repo thinking that would take care of the leak. "They immediately deleted the repository, but since it had been forked, I could still access the commit containing the sensitive data via a fork, despite the fork never syncing with the original 'upstream' repository," Leon explained. Leon added that after reviewing three widely forked public repos from large AI companies, Truffle Security researchers found 40 valid API keys from deleted forks. GitHub said it considers this situation a feature, not a bug: "GitHub is committed to investigating reported security issues. We are aware of this report and have validated that this is expected and documented behavior inherent to how fork networks work. You can read more about how deleting or changing visibility affects repository forks in our [documentation]."

Truffle Security argues that they should reconsider their position "because the average user expects there to be a distinction between public and private repos in terms of data security, which isn't always true," reports The Register. "And there's also the expectation that the act of deletion should remove commit data, which again has been shown to not always be the case."

Security "researcher" discovers how git works... (Score:5, Insightful)

by gweihir ( 88907 ) writes: on Friday July 26, 2024 @06:12PM (#64658686)

Seriously. Yes, this may cause problems, but the very idea of git and any version control system, really, is that you can always access every earlier commit unless you nuke everything. In the case of Git, which is a _distributed_ version control system, unless you delete _all_ copies of the repository, you cannot reliably delete anything. This is indeed working as expected and is no surprise at all.
Now, what about API keys? Simple: Same as secret keys, passwords, etc. they have _zero_ business getting checked into a version control system. They belong into a key management system and nowhere else. So what do you do when you have committed a cryptographic secret to a version control system? Also simple: You must invalidate and change it, no exceptions.
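One way to catch a secret before it ever lands in a commit is to scan the text being committed. The following is a minimal sketch, not a real scanner: the two patterns (an AWS-style access key ID prefix and a PEM private key header) and the name `find_secrets` are assumptions chosen for illustration, and production tools use far larger rule sets.

```python
import re

# Hypothetical, deliberately small rule set for illustration only.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

def find_secrets(text):
    """Return (line_number, pattern_name) for every line matching a pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```

A hit only tells you where to look. If the secret was already pushed, it still has to be invalidated and replaced.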

- Re: (Score:1)
  
  by hsthompson69 ( 1674722 ) writes:
  
  Gah. Wait till they find out blockchains maintain 100% transaction history that can be traced.
  Sigh.
  There is no deleting history in git, unless you can get access to *every* clone ever made.
  Good luck.
  - Re: (Score:2)
    
    by gweihir ( 88907 ) writes:
    
    Gah. Wait till they find out blockchains maintain 100% transaction history that can be traced.
    And worse, that basically everybody can insert stuff into public blockchains, often anonymously.
- Re: (Score:2)
  
  by OverlordQ ( 264228 ) writes:
  
  is that you can always access every earlier commit unless you nuke everything.
  If I fork your repo and only commit to my fork, then delete my fork without ever pushing commits to yours, how are my commits showing up in your repo "how git works"?
  - Re: (Score:2)
    
    by gweihir ( 88907 ) writes:
    
    That is the _idea_. As I clearly wrote. That is not what you always get.
- Re:Security "researcher" discovers how git works.. (Score:5, Interesting)
  
  by test321 ( 8891681 ) writes: on Friday July 26, 2024 @06:34PM (#64658754)
  
  From the summary (and from the video), one of their experiments is to first create a fork of a repo, commit something, then delete the fork. There was never a pull request, and the data in the deleted fork was never merged into the original repo. It should not be accessible. But if you know the commit ID, it is somehow still accessible. This is not a feature of git. It is a consequence of GitHub not implementing deletion as "rm -rf" on the fork folder, but as hiding the folder from view while still letting the data leak if you know the exact path (through the commit ID).
  
  - Re: (Score:2)
    
    by codebase7 ( 9682010 ) writes:
    
    Which is probably part of some backup mechanism for protection against malicious commits / or accidental deletion of a private repo.
    
    I.e. If you do the stupid thing of mindlessly clicking through all of the warnings and delete the repo, then decide you want it back. Github instantly performing "rm -rf" on the repo would mean that repo is irretrievable. Which is what you said you wanted, but being a corporate entity in the US means that lawsuits are inevitable if Github were to tell you that outright. Just
    - Re: (Score:2)
      
      by test321 ( 8891681 ) writes:
      
      1) "rm -rf" was just a rhetorical example. I did not really expect them to implement github as a shell script. But even if we keep the simple script analogy, they certainly could "move the folder to the trash bin" (recoverable for some time), which would still make the file structure inaccessible, as one would expect.
      being a corporate entity in the US means that lawsuits are inevitable
      The damage caused by accidental deletion is zero. You obviously have a copy of your github repo on your local computers, you just need to push it to github again. Github certainly has an elabora
  - Re: (Score:3)
    
    by gweihir ( 88907 ) writes:
    
    If you expect reliable deletion from a version control system, then you are doing it very wrong. It is contrary to the very nature of the thing and can never be more than a hack. Same, incidentally, if you place cryptographic secrets in any kind of commit in a version control system. Stop arguing and start using the right tools for the job.
- Re: (Score:1)
  
  by ssdfl ( 123261 ) writes:
  
  Any security company who thinks a good security model is to tell the accidental publisher of API keys to stop publishing them rather than fix the idiot who gave them to the publisher needs to have its reputation ruined. This is not a security model.
  Github should extract all the API keys and put them on a web page with the heading "If your API key is in this list, you need to invalidate it and change it."
  - Re: (Score:2)
    
    by gweihir ( 88907 ) writes:
    
    Indeed. Doing this is a rather striking proof of incompetence and non-working processes. API keys have no place in a version control system (in regulated environments it may even be illegal to put them there), and once you have put one in there, the only sane course of action is to invalidate them immediately and change them in all systems. And that applies to test-keys as well, since they can end up working in production and data in test environments may still be sensitive. API keys belong into special con
Uh, Github isn't wrong (Score:4, Insightful)

by Yoda's Mum ( 608299 ) writes: on Friday July 26, 2024 @06:18PM (#64658708)

I hate to agree with Github, but the entire point of a fork is that it's a fully separate copy. Commits in the parent repository messing with forks breaks most of the reasons for them in the first place.

- Re: (Score:2)
  
  by OverlordQ ( 264228 ) writes:
  
  Except that's not it. Commits in the fork also show up in the parent, even if the commits were never pushed there.
- Re:Uh, Github isn't wrong (Score:5, Insightful)
  
  by serviscope_minor ( 664417 ) writes: on Saturday July 27, 2024 @07:04AM (#64659542) Journal
  
  I hate to agree with Github, but the entire point of a fork is that it's a fully separate copy.
  I don't think that's going on, and you can kind of see from the bugs something about github's implementation. Strap in, like everything in git, the abstractions leak out all over the place :) And this is a simplified explanation.
  If you don't understand git's internals and haven't read this doc, I would encourage you to do so, it's excellent:
  https://tom.preston-werner.com... [preston-werner.com]
  Basically in git everything is referred to by hashes. File contents are hashed and referred to by that hash. A tree is a text file which is basically a list of filenames along with hashes: that's how it knows what data each file contains. A commit is a text file, too. It contains a list of 0 or more parent commits (referred to by hash), a tree (referred to by hash) and metadata like a message. The entire lot is hashed. So a commit is a hash too.
  The point of all of that is that each of those things is completely immutable and cannot be changed. If you change something, then all of the hashes change and you get a new file, new tree and new commit.
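The content addressing described above can be checked by hand. This is a small sketch assuming git's default SHA-1 object format, where a file's ID is the hash of a short `blob <size>` header, a NUL byte, and the contents; `git_blob_id` is a name made up for this example.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Object ID of a file's contents under git's default SHA-1 object format."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Same bytes always give the same ID; any change gives a different one.
old_id = git_blob_id(b"api_key = old\n")
new_id = git_blob_id(b"api_key = new\n")
print(old_id != new_id)
```

Trees and commits are hashed the same way with their own headers, which is why editing a file yields a new blob, a new tree and a new commit rather than changing the old ones.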
  The mutable part of git is the branch references. This is again a list that says things like "main da765678af6e6ae76e092", and that's how you know what commit main is on. If you do a git commit, you generate a new commit (i.e. hash) and then update the branch so that it says main is the new hash now.
  So what?
  Well typically when you clone a repo, you pull down all the immutable data (i.e. blobs of data with their hash) and you might get a list of branches along with their hashes. Often the branch names are modified. If the branch name is "foo" and you cloned from a server called "origin", in your copy, the branch will be called "remotes/origin/foo". But if you clone a local repository, it won't download anything, it just makes hardlinks to the blobs of data back in the original repository.
  Why? Well that data is immutable, so why copy it?
  All a repo clone (fork in github parlance) is is a private copy of that file which tracks branch names and some way of downloading the data by hash (and maybe storing new data by hash as well).
  I think what github are doing is only ever having a single collection of blobs-with-hashes. When you fork, you get a new file which tracks branches, but the same collection of blobs-with-hashes. This makes sense: it's mostly (but not entirely! hence the bugs!) how a local clone works. It's fast, it's efficient. One result is that if you have a repo A and its clone B, you can get at the data by doing either A/hash or B/hash. Data doesn't belong to A or B, it's all shared. It's only the file tracking branches that's separate. Everything else shares the same underlying data (and since that data is immutable, there's no problem).
  It's also secure (sort of!). Let's say you make a new commit on B, B/newhash, and that's private. Well, in theory you could use A/newhash to get at it, but it's a cryptographically secure hash so it's as unguessable as B's password. If you share the hash with A, then A can access it. But there's no need for access control beyond the power of SHA-256.
  And here's the bug/problem. Under this model, the access cannot be revoked. Once you have the hash, you have it. It cannot be revoked because it is its own password. Deleting that file that tracks branch names doesn't delete the underlying data, and this is where github likely has a problem. I suspect they don't track which forks have a reference to that hash: internally git doesn't either. You can have orphaned hashes which are never referred to. You have to walk the whole tree in order to find them, and that's what "git gc" does. Walks the tree, notes everything that's accessible via the list of branches and deletes anything that isn't.
  Github need to do that and take the union over every fork, so I suspect they don't or do it occasionally, things tha
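The model described in this comment can be written down as a toy. This is only a sketch of the commenter's guess, not GitHub's actual implementation: the class `ForkNetwork`, its methods, and the short commit IDs are all hypothetical, and real object IDs are hashes rather than hand-picked strings.

```python
class ForkNetwork:
    """Toy model: one shared object store, one branch table per fork."""

    def __init__(self):
        self.objects = {}  # commit id -> (parent id or None, payload)
        self.refs = {}     # fork name -> {branch name: commit id}

    def create_repo(self, name):
        self.refs[name] = {}

    def fork(self, source, name):
        # Only the branch table is copied; the objects stay shared.
        self.refs[name] = dict(self.refs[source])

    def commit(self, repo, branch, commit_id, payload):
        parent = self.refs[repo].get(branch)
        self.objects[commit_id] = (parent, payload)
        self.refs[repo][branch] = commit_id

    def delete_repo(self, name):
        # Deleting a fork drops its branch table, not the shared objects.
        del self.refs[name]

    def read(self, repo, commit_id):
        # Any surviving fork can serve any object in the shared store.
        if repo not in self.refs:
            raise KeyError(repo)
        return self.objects[commit_id][1]

    def gc(self):
        # Keep only objects reachable from the union of every fork's branches.
        reachable = set()
        for branches in self.refs.values():
            for tip in branches.values():
                while tip is not None and tip not in reachable:
                    reachable.add(tip)
                    tip = self.objects[tip][0]
        self.objects = {k: v for k, v in self.objects.items() if k in reachable}


net = ForkNetwork()
net.create_repo("A")
net.commit("A", "main", "c1", "readme")
net.fork("A", "B")
net.commit("B", "main", "c2", "secret key")
net.delete_repo("B")
print(net.read("A", "c2"))   # the deleted fork's commit is still served via A
net.gc()
print("c2" in net.objects)   # after a network-wide gc it is gone
```

In this toy, nothing is lost until `gc` walks every remaining branch table, which matches the point that deleting a fork and deleting its data are separate steps.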
  
What ever you put in the hands of others! (Score:3)

by oldgraybeard ( 2939809 ) writes: on Friday July 26, 2024 @06:19PM (#64658714)

Is theirs to do whatever they want with!

git gc (Score:2)

by allo ( 1728082 ) writes:

Every git repo retains commits until the next run of the garbage collector. And running it is relatively expensive, probably even more so on Github's infrastructure (I don't think they have your repo just lying on a webserver ...).
you know Microsoft owns github now (Score:3)

by FudRucker ( 866063 ) writes: on Friday July 26, 2024 @06:47PM (#64658770)

and now you know why windows keeps getting bigger, everything that gets deleted from github goes right into windows

Two complaints (Score:2)

by bill_mcgonigle ( 4333 ) * writes:

They make two separate complaints.
One is valid, the other is nonsense.
Comments are split here focusing on one or the other. There are two separate issues.
Deletion is all in your head. (Score:1)

by dowhileor ( 7796472 ) writes:

I have always believed that nothing ever, ever is really deleted in a computer. Likely the first articles on technology I remember reading were on this topic, which were actually entropy concepts and arguments about how fundamentally useless file manipulation commands like copy and delete, and maybe the entire command tree, were.
- Re: (Score:2)
  
  by HiThere ( 15173 ) writes:
  
  Well, there are techniques that overwrite the original file data with zeros before deleting the pointers to them. Whether that works the way you think or not depends on how write is implemented. And reformatting partitions can work that way...but it's a lot slower.
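The overwrite-then-delete technique looks roughly like this. It is a minimal sketch with a made-up function name, `zero_and_delete`, and it carries the caveat from the comment: on copy-on-write filesystems, SSDs with wear leveling, or anything with snapshots, the zeros may be written somewhere other than the original blocks, so this is not a guarantee of erasure.

```python
import os

def zero_and_delete(path, chunk_size=1 << 20):
    """Overwrite a file's bytes with zeros in place, then unlink it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        remaining = size
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(b"\0" * n)
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the zeros to the device
    os.remove(path)
```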
Not news (Score:4, Informative)

by LinuxRulz ( 678500 ) writes: on Friday July 26, 2024 @10:41PM (#64659078)

This was already raised years ago. They (github) did add a banner about missing refs, but never really fixed the issue.
I remember having a good chuckle seeing this one:
https://github.com/torvalds/li... [github.com]

use BFG (Score:2)

by gabrieltss ( 64078 ) writes:

I use BFG repo cleaner and it wipes ALL remnants of secrets of any kind in a repository, including in git history.

https://github.com/rtyley/bfg-... [github.com]
