
Data From Deleted GitHub Repos May Not Actually Be Deleted, Researchers Claim (theregister.com) 23

Thomas Claburn reports via The Register: Researchers at Truffle Security have found, or arguably rediscovered, that data from deleted GitHub repositories (public or private) and from deleted copies (forks) of repositories isn't necessarily deleted. Joe Leon, a security researcher with the outfit, said in an advisory on Wednesday that being able to access deleted repo data -- such as API keys -- represents a security risk. And he proposed a new term to describe the alleged vulnerability: Cross Fork Object Reference (CFOR). "A CFOR vulnerability occurs when one repository fork can access sensitive data from another fork (including data from private and deleted forks)," Leon explained.

For example, the firm showed how one can fork a repository, commit data to it, delete the fork, and then access the supposedly deleted commit data via the original repository. The researchers also created a repo, forked it, and showed how data not synced with the fork continues to be accessible through the fork after the original repo is deleted. You can watch that particular demo [here].

According to Leon, this scenario came up last week with the submission of a critical vulnerability report to a major technology company involving a private key for an employee GitHub account that had broad access across the organization. The key had been publicly committed to a GitHub repository. Upon learning of the blunder, the tech biz nuked the repo thinking that would take care of the leak. "They immediately deleted the repository, but since it had been forked, I could still access the commit containing the sensitive data via a fork, despite the fork never syncing with the original 'upstream' repository," Leon explained. Leon added that after reviewing three widely forked public repos from large AI companies, Truffle Security researchers found 40 valid API keys from deleted forks.
GitHub said it considers this situation a feature, not a bug: "GitHub is committed to investigating reported security issues. We are aware of this report and have validated that this is expected and documented behavior inherent to how fork networks work. You can read more about how deleting or changing visibility affects repository forks in our [documentation]."

Truffle Security argues that they should reconsider their position "because the average user expects there to be a distinction between public and private repos in terms of data security, which isn't always true," reports The Register. "And there's also the expectation that the act of deletion should remove commit data, which again has been shown to not always be the case."
  • by gweihir ( 88907 ) on Friday July 26, 2024 @05:12PM (#64658686)

    Seriously. Yes, this may cause problems, but the very idea of git and any version control system, really, is that you can always access every earlier commit unless you nuke everything. In the case of Git, which is a _distributed_ version control system, unless you delete _all_ copies of the repository, you cannot reliably delete anything. This is indeed working as expected and is no surprise at all.

    Now, what about API keys? Simple: Same as secret keys, passwords, etc., they have _zero_ business getting checked into a version control system. They belong in a key management system and nowhere else. So what do you do when you have committed a cryptographic secret to a version control system? Also simple: You must invalidate and change it, no exceptions.
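A minimal sketch of keeping a key out of the repository: the program reads it from its environment at run time, so the value never appears in a commit. The variable name `EXAMPLE_API_KEY` and the helper `load_api_key` are hypothetical, and a real deployment would usually populate the environment from a key management system rather than by hand.

```python
import os


def load_api_key(name="EXAMPLE_API_KEY", environ=None):
    """Return the API key stored under `name`, or fail loudly if it is absent.

    `name` is a hypothetical variable name chosen for this sketch.
    `environ` defaults to the process environment; passing a dict makes
    the function easy to exercise without touching real settings.
    """
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        # Refuse to continue instead of falling back to a hard-coded key.
        raise RuntimeError(f"{name} is not set; refusing to start without it")
    return value
```

The point of the design is that the source tree only ever contains the name of the secret, never its value, so rotating the key after a leak does not require touching history.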

    • Gah. Wait till they find out blockchains maintain 100% transaction history that can be traced.

      Sigh.

      There is no deleting history in git, unless you can get access to *every* clone ever made.

      Good luck.

      • by gweihir ( 88907 )

        Gah. Wait till they find out blockchains maintain 100% transaction history that can be traced.

        And worse, that basically everybody can insert stuff into public blockchains, often anonymously.

    • is that you can always access every earlier commit unless you nuke everything.

      If I fork your repo and only commit to my repo, then delete my repo and never push commits to yours, how are my commits showing up in your repo "how git works"?

    • by test321 ( 8891681 ) on Friday July 26, 2024 @05:34PM (#64658754)

      From the summary (and from the video), one of their experiments is to first create a fork of a repo, commit something, then delete the fork. There was never a pull request; the data in the deleted fork was never merged into the original repo. It should not be accessible. But if you know the commit ID, it is somehow still accessible. This is not a feature of git. It is a consequence of the GitHub website owners implementing deletion not as "rm -rf" on the fork folder, but as hiding the folder from view while still letting the data leak if you know an exact path (through the commit ID).

      • Which is probably part of some backup mechanism for protection against malicious commits / or accidental deletion of a private repo.

        I.e. If you do the stupid thing of mindlessly clicking through all of the warnings and delete the repo, then decide you want it back. Github instantly performing "rm -rf" on the repo would mean that repo is irretrievable. Which is what you said you wanted, but being a corporate entity in the US means that lawsuits are inevitable if Github were to tell you that outright. Just
        • 1) "rm -rf" was just a rhetorical example. I did not really expect them to implement GitHub as shell scripts. But even if we keep the simple script analogy, they certainly could "move the folder to the trash bin" (recoverable for some time), which would still make the file structure inaccessible, as one would expect.

          being a corporate entity in the US means that lawsuits are inevitable

          The damage caused by accidental deletion is zero. You obviously have a copy of your github repo on your local computers, you just need to push it to github again. Github certainly has an elabora

      • by gweihir ( 88907 )

        If you expect reliable deletion from a version control system, then you are doing it very wrong. It is contrary to the very nature of the thing and can never be more than a hack. Same, incidentally, if you place cryptographic secrets in any kind of commit in a version control system. Stop arguing and start using the right tools for the job.

    • by ssdfl ( 123261 )

      Any security company who thinks a good security model is to tell the accidental publisher of API keys to stop publishing them rather than fix the idiot who gave them to the publisher needs to have its reputation ruined. This is not a security model.

      Github should extract all the API keys and put them on a web page with the heading "If your API key is in this list, you need to invalidate it and change it."

      • by gweihir ( 88907 )

        Indeed. Doing this is a rather striking proof of incompetence and non-working processes. API keys have no place in a version control system (in regulated environments it may even be illegal to put them there), and once you have put one in there, the only sane course of action is to invalidate them immediately and change them in all systems. And that applies to test-keys as well, since they can end up working in production and data in test environments may still be sensitive. API keys belong into special con

  • by Yoda's Mum ( 608299 ) on Friday July 26, 2024 @05:18PM (#64658708)

    I hate to agree with Github, but the entire point of a fork is that it's a fully separate copy. Commits in the parent repository messing with forks breaks most of the reasons for them in the first place.

    • Except that's not it. Commits in the fork also show up in the parent, even if the commits were never pushed there.

    • by serviscope_minor ( 664417 ) on Saturday July 27, 2024 @06:04AM (#64659542) Journal

      I hate to agree with Github, but the entire point of a fork is that it's a fully separate copy.

      I don't think that's going on, and you can kind of see from the bugs something about github's implementation. Strap in, like everything in git, the abstractions leak out all over the place :) And this is a simplified explanation.

      If you don't understand git's internals and haven't read this doc, I would encourage you to do so, it's excellent:

      https://tom.preston-werner.com... [preston-werner.com]

      Basically, in git everything is referred to by hashes. File contents are hashed and referred to by that hash. A tree is a text file which is basically a list of filenames along with hashes: that's how it knows what data each file contains. A commit is a text file, too. It contains a list of 0 or more parent commits (referred to by hash), a tree (referred to by hash) and metadata like a message. The entire lot is hashed. So a commit is a hash too.
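To make the "everything is a hash" point concrete, here is a small Python sketch of how git derives an object ID with its default SHA-1 object format: it hashes a short header (type, a space, the body length, a NUL byte) followed by the body. The helper name `git_object_id` is mine, not part of git.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """Compute the ID git's default SHA-1 format assigns to an object.

    `kind` is the object type ("blob", "tree" or "commit"); `body` is the
    raw object content. The ID covers a header plus the body, so any
    change to the content produces a different ID.
    """
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    # Should match `git hash-object` run on a file with the same bytes.
    print(git_object_id("blob", b"hello world\n"))
```

Because the ID is a pure function of the content, two repositories holding the same file hold an object with the same ID, which is what makes sharing one object store between forks possible at all.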

      The point of all of that is that each of those things is completely immutable and cannot be changed. If you change something, then all of the hashes change and you get a new file, new tree and new commit.

      The mutable part of git is the branch references. This is again a list that says things like "main da765678af6e6ae76e092", and that's how you know what commit main is on. If you do a git commit, you generate a new commit (i.e. hash) and then update the branch so that it says main is the new hash now.
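The split between immutable objects and mutable branch pointers can be sketched as a toy model in Python. This is not git's real storage: the `store` and `commit` helpers, the in-memory dicts, and the stripped-down commit text (no author or committer lines) are all assumptions made for illustration.

```python
import hashlib

objects = {}  # object id -> (kind, body); entries are never modified
refs = {}     # branch name -> commit id; the only mutable state


def store(kind: str, body: bytes) -> str:
    """Add an object to the store and return its content-derived ID."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    oid = hashlib.sha1(header + body).hexdigest()
    objects[oid] = (kind, body)
    return oid


def commit(branch: str, message: str, tree_id: str) -> str:
    """Create a simplified commit on `branch` and move the branch to it."""
    lines = [f"tree {tree_id}"]
    parent = refs.get(branch)
    if parent is not None:
        lines.append(f"parent {parent}")
    lines += ["", message]
    oid = store("commit", "\n".join(lines).encode("utf-8"))
    refs[branch] = oid  # only the pointer changes; old commits stay put
    return oid
```

Calling `commit` twice on the same branch leaves both commit objects in `objects`; only `refs[branch]` is overwritten, which mirrors the "update the branch so that it says main is the new hash" step.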

      So what?

      Well, typically when you clone a repo, you pull down all the immutable data (i.e. blobs of data with their hash) and you might get a list of branches along with their hashes. Often the branch names are modified. If the branch name is "foo" and you cloned from a server called "origin", in your copy, the branch will be called "remotes/origin/foo". But if you clone a local repository, it won't download anything; it just makes hardlinks to the blobs of data back in the original repository.

      Why? Well that data is immutable, so why copy it?

      All a repo clone (fork in github parlance) is is a private copy of that file which tracks branch names and some way of downloading the data by hash (and maybe storing new data by hash as well).

      I think what github are doing is only ever having a single collection of blobs-with-hashes. When you fork, you get a new file which tracks branches, but the same collection of blobs-with-hashes. This makes sense: it's mostly (but not entirely! hence the bugs!) how a local clone works. It's fast, it's efficient. One result is that if you have a repo, A, and its clone B, you can get at the data by doing either A/hash or B/hash. Data doesn't belong to A or B, it's all shared. It's only the file tracking branches that's different. It shares the same underlying data (and since that data is immutable, there's no problem).
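The guess above can be modelled in a few lines of Python. To be clear, this is a sketch of the hypothesis in this comment, not GitHub's actual implementation: the `ForkNetwork` class, its method names, and the repo names are all invented for illustration.

```python
import hashlib


class ForkNetwork:
    """Toy model: one shared object store, one branch table per repo."""

    def __init__(self):
        self.objects = {}  # object id -> data, shared by every fork
        self.refs = {}     # repo name -> {branch name: object id}

    def create_repo(self, name):
        self.refs[name] = {}

    def fork(self, parent, child):
        # A fork copies only the branch table, never the objects.
        self.refs[child] = dict(self.refs[parent])

    def commit(self, repo, branch, data: bytes) -> str:
        oid = hashlib.sha1(data).hexdigest()
        self.objects[oid] = data
        self.refs[repo][branch] = oid
        return oid

    def delete_repo(self, name):
        # Deleting a repo drops its branch table; objects are untouched.
        del self.refs[name]

    def read(self, repo, oid) -> bytes:
        # Any surviving repo in the network can serve any stored object.
        if repo not in self.refs:
            raise KeyError(f"no such repo: {repo}")
        return self.objects[oid]
```

With this model, committing a secret to fork B and then deleting B still leaves `read("A", secret_id)` working, which is the behaviour the researchers demonstrated.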

      It's also secure (sort of!). Let's say you make a new commit on B, B/newhash, and that's private. Well, in theory you could use A/newhash to get at it, but it's a cryptographically secure hash so it's as unguessable as B's password. If you share the hash with A, then A can access it. But there's no need for access control beyond the power of SHA-256.

      And here's the bug/problem. Under this model, the access cannot be revoked. Once you have the hash, you have it. It cannot be revoked because it is its own password. Deleting that file that tracks branch names doesn't delete the underlying data, and this is where github likely has a problem. I suspect they don't track which forks have a reference to that hash: internally git doesn't either. You can have orphaned hashes which are never referred to. You have to walk the whole tree in order to find them, and that's what "git gc" does. Walks the tree, notes everything that's accessible via the list of branches and deletes anything that isn't.
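The "walk the tree and take the union over every fork" idea is a mark-and-sweep pass, sketched below in Python. It is a simplified model, not what "git gc" or GitHub actually run: objects are reduced to a dict mapping each ID to the IDs it references, and the function names are my own.

```python
def reachable(objects, roots):
    """Return every object ID reachable from `roots` by following links.

    `objects` maps an object ID to the list of IDs it refers to
    (parents, trees, blobs), flattened for this sketch.
    """
    seen = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen or oid not in objects:
            continue
        seen.add(oid)
        stack.extend(objects[oid])
    return seen


def collect_garbage(objects, refs_by_fork):
    """Keep only objects reachable from a branch of some surviving fork.

    `refs_by_fork` maps fork name -> {branch name: object ID}. The roots
    are the union of all branch tips across all forks in the network.
    """
    roots = {oid for refs in refs_by_fork.values() for oid in refs.values()}
    live = reachable(objects, roots)
    return {oid: links for oid, links in objects.items() if oid in live}
```

In this model a commit that only the deleted fork pointed at survives until `collect_garbage` is next run without that fork's branch table, which is why the timing of collection matters so much for leaked secrets.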

      Github need to do that and take the union over every fork, so I suspect they don't or do it occasionally, things tha

  • by oldgraybeard ( 2939809 ) on Friday July 26, 2024 @05:19PM (#64658714)
    Is theirs to do whatever they want with!
  • Every git repo retains commits until the next run of the garbage collector. And running it is relatively expensive, probably even more so on GitHub's infrastructure (I don't think they have your repo just lying on a web server ...).

  • by FudRucker ( 866063 ) on Friday July 26, 2024 @05:47PM (#64658770)
    and now you know why windows keeps getting bigger, everything that gets deleted from github goes right into windows
  • They make two separate complaints.

    One is valid, the other is nonsense.

    Comments are split here focusing on one or the other. There are two separate issues.

  • I have always believed that nothing ever, ever is really deleted in a computer. Likely the first articles on technology I remember reading were on this topic, which were actually entropy concepts and arguments about how fundamentally useless file manipulation commands like copy and delete, and maybe the entire command tree, were.

    • by HiThere ( 15173 )

      Well, there are techniques that overwrite the original file data with zeros before deleting the pointers to them. Whether that works the way you think or not depends on how write is implemented. And reformatting partitions can work that way...but it's a lot slower.

  • Not news (Score:4, Informative)

    by LinuxRulz ( 678500 ) on Friday July 26, 2024 @09:41PM (#64659078)

    This was already raised years ago. They (github) did add a banner about missing refs, but never really fixed the issue.
    I remember having a good chuckle seeing this one:
    https://github.com/torvalds/li... [github.com]

  • I use BFG Repo-Cleaner and it wipes ALL remnants of secrets of any kind in a repository, including in git history.

    https://github.com/rtyley/bfg-... [github.com]