RIAA Tracking Songs by MD5 Hashes
aSiTiC writes "Apparently RIAA has obtained some technical experts in their prosecution of file swappers. Currently they are tracking traded mp3 files from the Napster network by matching MD5 hashes. This seems quite interesting but I was under the assumption that identical hashes could be created with identical rips and id3v2 tagging. Now may be the time to update your illegal mp3 file MD5 hash sums."
MD5 Cannot stand up in court. (Score:5, Informative)
MD5 Hash (Score:5, Informative)
The only way for two files to have the same MD5 hash is for them to both be encoded with the same encoder, from the same WAV file, with the same bitrate and all advanced options, and to have exactly the same ID3 information, the same filesize, and to be identical to the last bit.
Otherwise the MD5 sums will look nothing alike, even for two otherwise identical songs where one has a spelling error in one field of the ID3 tag. I imagine that for any one song there are many, many different MD5 sums out there, although perhaps one or another good-quality version would exist on hundreds of different PCs...
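To make the point concrete, here is a minimal sketch using Python's standard hashlib. The "audio" bytes and the tag layout are made-up stand-ins, not a real MP3; the only thing being shown is that a two-letter typo in a tag field produces an unrelated MD5 sum.

```python
import hashlib

# Stand-in bytes for an MP3: fake "audio" plus an ID3v1-style title field.
# None of this is a real MP3; it only illustrates the hashing behaviour.
audio = bytes(range(256)) * 64
tag_good = b"TAG" + b"Yesterday".ljust(30, b"\x00")
tag_typo = b"TAG" + b"Yesterdya".ljust(30, b"\x00")  # two letters swapped

h_good = hashlib.md5(audio + tag_good).hexdigest()
h_typo = hashlib.md5(audio + tag_typo).hexdigest()

# Count how many of the 32 hex digits differ between the two sums.
differing = sum(a != b for a, b in zip(h_good, h_typo))

print(h_good)
print(h_typo)
print("hex digits that differ:", differing)
```

The two files share every byte of audio, yet the sums have nothing recognisable in common.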
Md5 hashes are also used for.... (Score:5, Informative)
Changing MD5 hashes on songs to avoid the RIAA would also lessen the effectiveness of K-SIG. Trading hashes of known working files was one of the ways people on P2P avoided downloading those fake RIAA files.
Comment removed (Score:5, Informative)
Re:MD5 Cannot stand up in court. (Score:2, Informative)
Re:What happen if (Score:5, Informative)
On the other hand, if I were the RIAA attempting to identify common files in this way, I might be inclined to exclude the ID3 tag from the MD5 computation since it is so easily modified.
Any changes to the actual content, though, will ripple into the MD5 computation.
Short answer: "normalizing" the file for volume, or even chopping off a few seconds of trailing silence with something like CoolEdit will certainly change the hash and make it distinct from whatever their baseline hash value is.
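A rough sketch of what "exclude the ID3 tag from the MD5 computation" could look like. This is an assumption about how one might do it, not anything the RIAA has described. It relies only on the published tag layouts: an ID3v2 tag starts with "ID3" and stores its length in four 7-bit "syncsafe" bytes, and an ID3v1 tag is the last 128 bytes of the file starting with "TAG". The helper names and the test bytes are made up.

```python
import hashlib

def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag or a trailing ID3v1 tag."""
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4 syncsafe size bytes.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        footer = 10 if data[5] & 0x10 else 0  # optional footer flag
        data = data[10 + size + footer:]
    # ID3v1: fixed 128-byte block at the end, beginning with "TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        data = data[:-128]
    return data

def audio_md5(data: bytes) -> str:
    """MD5 of the file contents with the ID3 tags left out."""
    return hashlib.md5(strip_id3(data)).hexdigest()

# --- made-up test files (hypothetical helpers, not real MP3s) ---
def make_v2(body: bytes) -> bytes:
    n = len(body)
    syncsafe = bytes([(n >> 21) & 0x7F, (n >> 14) & 0x7F, (n >> 7) & 0x7F, n & 0x7F])
    return b"ID3" + b"\x03\x00" + b"\x00" + syncsafe + body

def make_v1(title: bytes) -> bytes:
    return b"TAG" + title.ljust(125, b"\x00")

audio = b"\xff\xfb\x90\x00" + bytes(range(256)) * 8
file_a = make_v2(b"title=Song") + audio + make_v1(b"Song")
file_b = make_v2(b"title=Sonng!!") + audio + make_v1(b"Sonng")
trimmed = make_v2(b"title=Song") + audio[:-10] + make_v1(b"Song")

print(audio_md5(file_a) == audio_md5(file_b))   # tags differ, audio identical
print(audio_md5(file_a) == audio_md5(trimmed))  # audio itself was changed
```

Retagging does nothing against a hash like this, while trimming even a few bytes of the audio still changes it. Note that the ID3v1 check is a heuristic: audio data that happens to contain "TAG" at that position would be misread as a tag.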
Easy (Score:5, Informative)
The only problem is that a lot of file sharing software uses the fact that 2 files (from different sources) have the same hash in order to swarm the download from multiple sources. If everybody goes around intentionally making their mp3s have different hashes, swarming basically won't work anymore.
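A toy illustration of why swarming depends on matching hashes. This is not how any particular client is implemented: real peers advertise a hash rather than handing over the whole file, and the peer names and file bytes here are invented.

```python
import hashlib
from collections import defaultdict

def group_sources_by_hash(offers):
    """Group peers by the MD5 of the file they offer.

    offers: iterable of (peer_name, file_bytes) pairs.
    Returns a dict mapping md5 hex digest -> list of peer names.
    """
    groups = defaultdict(list)
    for peer, content in offers:
        groups[hashlib.md5(content).hexdigest()].append(peer)
    return dict(groups)

song = b"\xff\xfb" + bytes(2000)  # stand-in for an MP3
offers = [
    ("peer_a", song),
    ("peer_b", song),            # bit-identical copy: can share the load
    ("peer_c", song + b"\x00"),  # one extra byte: a different file as far as the hash goes
]

groups = group_sources_by_hash(offers)
largest = max(groups.values(), key=len)
print("sources usable for one swarmed download:", largest)
```

If peer_a and peer_b each tweaked their copy to get a unique hash, every group would have exactly one member and there would be nothing to swarm from.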
Comment removed (Score:5, Informative)
Re:Time for a new WinAMP Plug-in (Score:3, Informative)
If the hash is using ID3 tags, you could change some unused field in there, but there would be a much smaller number of permutations available (although probably still enough to be useful).
MD5 sums and different encoders (Score:5, Informative)
First step: the *.wav is ripped. Using libcdparanoia, which I personally prefer, I find slight variation in size depending on the machine and CD-ROM drive I rip them on.
Second step: encoding on different machines, with different encoders, using different algorithms, using different levels of floating point precision, on different architectures etc... produces vastly different files.
Third step: sharing. Oftentimes an mp3 is downloaded 99.8% before the connection is broken. You keep the mp3 because mp3 is a sequential file format and you only lose a second or two of music. The rest of the file is intact.
Their md5 searching scheme could be circumvented quite easily by changing a comment in the id3 but they could get around that by cutting out the id3 part of the file when they make their md5sum.
The downside to this is that if you are searching for music on something like Gnutella by the ***sum, the content would differ and you would not get as many results. Gnutella would not download from multiple sources because the files would not have the same signature.
Whatever the case, it is clear that some form of file obfuscation is now needed for safety online. Or we can wait for freenet to mature.
Re:Excuse my ignorance (Score:2, Informative)
If you download the question, you can check that the solution matches the expected solution. If so, the download is good.
Note, this is a very simplified version, using a pretty poor analogy. I'm sure there's a website that explains this better.
Re:MD5 Cannot stand up in court. (Score:5, Informative)
First of all, it's very clear that two files can give the same MD5 checksum. After all, an MD5 sum is only 16 bytes (2^128 possible values). So if you take just the 17-byte files (2^136 possible), it's clear that on average every MD5 sum matches 256 of them.
It's just damn unlikely to get 2 files with the same MD5, and if you wanted to brute force it, you would have to try on average about 2^64 different files before two of them turned up with an identical MD5. And this would take a long time (actually not that terribly long, a few years at most, and it parallelizes perfectly).
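The arithmetic above can be checked on a small scale. The sketch below is only an illustration: it truncates MD5 to 3 bytes (24 bits) so that a brute-force collision shows up after roughly 2^12 tries instead of 2^64. The truncation and the counter-based test messages are my own assumptions, chosen to keep the run time short.

```python
import hashlib

# Pigeonhole count from the comment: 17-byte files per 16-byte MD5 value.
files_per_sum = 2**136 // 2**128
print("17-byte files per MD5 value, on average:", files_per_sum)

def truncated_md5(data: bytes, nbytes: int) -> bytes:
    """First nbytes of the MD5 digest (a deliberately weakened hash)."""
    return hashlib.md5(data).digest()[:nbytes]

def find_collision(nbytes: int):
    """Hash distinct messages until two share a truncated digest."""
    seen = {}
    counter = 0
    while True:
        msg = str(counter).encode()
        tag = truncated_md5(msg, nbytes)
        if tag in seen:
            return seen[tag], msg, counter + 1
        seen[tag] = msg
        counter += 1

# 24 bits of hash: a collision is expected after something on the
# order of 2**12 messages, the same square-root effect as 2**64 for 128 bits.
first, second, tries = find_collision(3)
print(first, second, "collide after", tries, "messages")
```

Each extra byte of digest multiplies the expected work by about 16, which is why the full 128 bits is out of reach for a quick script like this.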
The page you link to implies that it's possible to "easily" fabricate a file that produces a given check sum, so instead of months of processing time, only days or hours would be needed to get a MD5 hash collision.
So all P2P users / software makers need to do to circumvent this is to agree on a specific MD5 sum, then patch every file so that they all produce this same MD5 sum.
Of course the obvious solution for the RIAA would be to use a more secure hash algorithm, with more bits. An unbroken algorithm with enough bits can't be faked, as it would take more than the age of the universe to brute force it.
The basic problem with this RIAA method remains, though. If you rip digitally with the same software from an identical CD, and there are no bit errors at any point, then you should end up with an identical file, and therefore an identical hash no matter how secure the algorithm is...
Re:MD5-hashes (Score:0, Informative)
Re:Excuse my ignorance (Score:2, Informative)
You're right in that it is possible to have the same MD5 sum for multiple files, but the chances of it happening are extremely small for two reasons.
The first reason is that MD5 has 128 bits to describe the file, meaning that there is a 1 in 2^128 chance that any given random bitstream will have the same MD5 sum (Of course, MP3s aren't all that random in portions of the file format, but the basic argument still stands).
The second reason is the very process of verification. In order to verify a file, you must already have a checksum of the original file to compare it to, and you have a file which you think could be the same file, meaning file names and file sizes are already identical. If those files differ by as much as one bit, then they will produce different checksums. If you're willing to try to match a file named "ISO of Windows XP" with a file size of 650.1MB versus a file named "ISO of Mandrake" with a file size of 643.8MB then you're already sure that they're not the same file by the filesize alone.
In short, possible, but extremely unlikely.
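The verification process described above (compare sizes first, then checksums) might look something like this. It is a sketch with invented file names and contents, using only the standard library.

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=65536):
    """MD5 of a file, read in chunks so large ISOs need not fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches(path, expected_size, expected_md5):
    """Cheap size check first; only hash when the sizes agree."""
    if os.path.getsize(path) != expected_size:
        return False
    return file_md5(path) == expected_md5

# Made-up "original" and three candidate downloads.
original = b"A" * 1000
published_size = len(original)
published_md5 = hashlib.md5(original).hexdigest()

with tempfile.TemporaryDirectory() as d:
    candidates = {
        "good.iso": original,
        "truncated.iso": original[:-2],            # wrong size, never hashed
        "corrupt.iso": original[:-1] + b"B",       # right size, one byte off
    }
    results = {}
    for name, content in candidates.items():
        path = os.path.join(d, name)
        with open(path, "wb") as f:
            f.write(content)
        results[name] = matches(path, published_size, published_md5)

print(results)
```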
Similar story on BBC (Score:3, Informative)
Re:MD5 Hash (Score:3, Informative)
No!! That's definitely not true. Making a perfect rip is something you have to WORK at, which not many rippers do. Especially years ago. Check out ChrisMyDen's Uber Network [chrismyden.com] for a detailed guide on how to make the 'perfect mp3'.
You need to use something like EAC's secure mode. It rips the CD twice and compares for exactness. Only then can you be assured your wav file has no errors.
Even if you can convince people to use the best mp3 encoding techniques (LAME 3.92 or LAME 3.90.2 -aps), I have still seen people refuse to use EAC, instead enjoying cdex, audiograbber, or (gasp) jukebox due to 'ease of use'. These rippers DO NOT make perfect rips, and will almost always make a different wav file each time due to the way they try to make error corrections. Most people will not ditch their source either, even if there are errors. And everyone has a different scratch on their CDs.
Almost everyone encodes at 128kbps
This isn't true anymore either. Considering most of the lazy people out there download mp3s instead of making their own, many of the rippers today do care about quality, and will rip in VBR or at 192. Release groups (where I would imagine most of the new stuff originates nowadays) will rip at 192, 224, 256, or 320.
Re:MD5-hashes (Score:5, Informative)
Uh, actually this is irrefutable proof. It will miss a lot of songs, but it is virtually guaranteed to not give false positives. This is much more solid proof than SCO had.
To think a month or two ago when SCO was insisting on an NDA many on
Obviously the RIAA's technical experts know what they are doing... it's time to alter a few ID3 tags like the story suggested.
Re:gee? (Score:1, Informative)
Re:MD5 Hash (Score:2, Informative)
Re:What if... (Score:4, Informative)
Actually that's not true. They only care about the sharing because it leads to what they really care about: people listening to music that they didn't pay for. If everyone who shared mp3s had bought every CD of the songs they downloaded, no one would care because they would have already paid to listen to those songs. The problem is that most people don't own all of the CDs for the songs they download, and the RIAA doesn't like it when you try to wriggle out of their money trap. If the actual sharing was the problem, the distribution itself, then we wouldn't have radio stations playing music either, because that also lets people listen to music they didn't pay for, but it's a bit different because you don't really get a choice of what you hear. But now if you go and start recording songs you hear on the radio, so you could listen to them whenever you wanted, you're getting into that grey area. Of course the RIAA doesn't really care about that because they know that radio quality is shit, so there won't be widespread radio recording anyway.
How RIAA tracks downloaders (Score:3, Informative)
(Music industry discloses some methods used)
Re:What happen if (Score:5, Informative)
If that's all you want to do, much better not to use Cooledit, which has to expand and recompress the file to MP3. Use something like MP3Trim [logiccell.com] which can chop off any given number of MP3 frames, or normalise the volume, by operating on the MP3 directly. Much much faster, and no expand/recompress quality loss.
Re:MD5-hashes (Score:5, Informative)
I did the same song three times. The first two times, all things were equal including all settings. The MD5 checksums were the same.
I swapped out my DVD/CD player for a different model. Reripped the track on the same computer with the same exact settings and the MD5 was different.
I am using Exact Audio Copy in secure mode and LAME for the encoding. The ID tags were received the first time and the same tags used for all three attempts (EAC remembers the disc).
I'm sure I could try many things like changing the read speed, comparing the wav files and not just the resulting mp3 etc.. but I do not have the time for more analysis.
Music Hashing with musicbrainz (Score:3, Informative)
I've compared albums I've ripped myself to the database and gotten "100%" matches (along with some matches of a much lower percentage). That leads me to think that if the RIAA kept its own database like that, they could do a whole lot of comparison with similarity or quasi-unique (à la MD5) hashes. I'd also venture that, with enough work on the comparison system, they could make court-valid assertions. They can hire plenty of geeks to handle the statistics necessary to call something 'beyond a reasonable doubt' (for criminal proof).
Re:What if... (Score:3, Informative)
Now here is where it gets good - the downfall of mp3.com was exactly because of sharing. They put together a system where you could buy a CD online, have it shipped to you, but also immediately have it available online as an MP3 through a password protected account that only allowed a single simultaneous user. They also provided a method to "upload" your previously purchased CDs - you stuck your CD in your cd-rom drive and ran their program that verified that the CD had the same contents as the released one (so either you had a legit copy or a perfect rip&dupe, either way you *already* had the music) and then that disc was also made available in your private mp3.com account.
The RIAA freaked and sued and won. They won on the premise that mp3.com was making copies without permission (from the RIAA) and then sharing them. Never mind that the only people who had access were those who had proven they already owned the music to begin with. They won big too, something like $25M per RIAA member company. That used up a *lot* of VC and IPO cash.
Re:gee? (Score:5, Informative)
Re:MD5-hashes (Score:4, Informative)
There are issues of offset values (with CD audio it is difficult to hit an *exact* location on the disc), plus the way the reader deals with C1 and C2 error correction, as well as how different extracting software interfaces with the hardware.
It would almost be safe to say two mp3s with the same MD5 are one file copied twice (as opposed to two individually created mp3s), but that doesn't mean they are illegal...
Re:gee? (Score:3, Informative)
For loose definitions of "fairy", yes. eg child, friend, etc
>> "The only way that the MD5 hashes could be identical is if the two files are absolutely identical in every single bit."
Try the following: Install some CD ripping/encoding software. Leave it at the defaults. Use CDDB to generate the ID3 tags. Unless something gets corrupted, that *will* produce an identical file, down to the last bit.
Nowhere in that article do they mention MD5 (Score:2, Informative)
This is what they mean when they say hash. NOT md5. Obviously MD5 could not track an mp3, since changing even one character in the ID3 tag would change the whole hash.
So they probably have an automated downloader that then generates a fingerprint from the downloaded file and compares it to a db of fingerprints to determine if the song is copyrighted. I'd bet that's all.
Re:MD5-hashes (Score:2, Informative)
The "offset".
If you use EAC you will see there is a tab where you can correct your drive's offset value.
Now if you do that (or at least 'sync' them) you should get the same result on both drives if the disc is good enough. (Of course all your other settings should be set properly too.) If your disc is bad, EAC can correct those errors by re-reading a dozen times and then using the most often occurring result, but if your disc is a little too bad on a specific part, EAC won't be able to return the same result each read.
I know this because I have ripped discs on *three* different CD-ROM drives: one old 2x HP burner, one el cheapo 36x drive and a Toshiba laptop drive (also a burner).
Granted I compared wave files, but I guess that if you feed the same wave file to the same encoder with the same settings you should get the exact same result.
note:
Offset: When your CD-ROM drive reads a position on the disc in audio mode it often lands slightly off, i.e. say you tell it to read position 0, then it will read position 4. Normally this doesn't matter, since the offsets amount to milliseconds and you won't hear a difference, but for bit-perfect rips it does matter.
You correct it by finding out what offset your particular CD drive has (every particular model number has a particular offset; few drives of the same brand and model have different offsets).
What I mean by 'syncing' is not correcting the offset but making it the same between drives.
For example, burn an offset CD in EAC (use a CD-RW if you must). This disc will carry the write offset of your CD writer.
Now 'correct' the offset in all your drives (including your burner, 'cause burners have a different offset when writing than when reading) with this disc.
It won't be perfect, since now all your drives have the same offset, namely the write offset of your cd-burner.
BUT now the rips will be identical, since they will all have the same offset.
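A toy model of the offset business, for anyone who wants to see why it matters for hashes. Everything here is made up for illustration: the "disc" is just a byte pattern, offsets are counted in bytes rather than audio samples, and the rip function is a stand-in for a drive, not anything EAC actually does internally.

```python
import hashlib

def rip(disc: bytes, start: int, length: int,
        drive_offset: int, correction: int = 0) -> bytes:
    """Toy drive: asked for `start`, it really reads from
    start + drive_offset - correction."""
    actual = start + drive_offset - correction
    return disc[actual:actual + length]

def md5(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

disc = bytes((i * 7) % 251 for i in range(5000))  # fake disc contents
start, length = 1000, 2000
true_track = disc[start:start + length]

offset_a, offset_b = 6, 102  # hypothetical read offsets of two drives

# Uncorrected: each drive returns a slightly shifted track.
raw_a = rip(disc, start, length, offset_a)
raw_b = rip(disc, start, length, offset_b)
print("uncorrected rips match:", md5(raw_a) == md5(raw_b))

# Properly corrected: both drives return the true track.
fixed_a = rip(disc, start, length, offset_a, correction=offset_a)
fixed_b = rip(disc, start, length, offset_b, correction=offset_b)
print("corrected rips match:", md5(fixed_a) == md5(fixed_b) == md5(true_track))

# 'Synced' to a common (wrong) offset, e.g. the burner's write offset:
common = 30
synced_a = rip(disc, start, length, offset_a, correction=offset_a - common)
synced_b = rip(disc, start, length, offset_b, correction=offset_b - common)
print("synced rips match each other:", md5(synced_a) == md5(synced_b))
print("synced rips match the true track:", md5(synced_a) == md5(true_track))
```

The synced case is the one described above: the rips agree with each other even though none of them lines up with the true start of the track.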
NOTE: I think the RIAA doesn't hash the ID3 tags, only the music.
That way the same mp3 with different ID3 tags will still be identified as being the same.
That's btw what Kazaa does, if I'm not mistaken.
Re:gee? (Score:2, Informative)
-fp
Re:Lost in a Fire? (Score:1, Informative)
Re:gee? (Score:3, Informative)
I wouldn't expect two different WAV's that sound exactly the same to give the same mp3. But I wouldn't have bothered to test it either.
As I think about it, your theory is interesting. Since mp3 compression is based on the perception of audio, or getting rid of everything that you don't perceive, then there is some argument that two very similar WAV bit patterns that sound identical might actually be closer after encoding to mp3 than you might think. Of course an MD5 hash of the two mp3's is not a good indicator of this, as one single bit difference in two files radically alters the MD5 hash.
Re:MD5 hash "posers" (Score:4, Informative)
If that were possible, it would destroy the value of an MD5 hash immediately and everyone would quit using it faster than you could blink.
The purpose of CRC hashes is entirely different. They are designed to detect a burst of bit errors in a stream of data, the type of error that is most likely to occur in a network transmission. They are not meant for fingerprinting files.
I doubt that anyone with any degree of sophistication in cryptology would attempt to use CRC and MD5 hashes interchangeably.
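One way to see the difference in design goals, sketched with Python's zlib and hashlib. CRC-32 is linear over XOR: for three messages of equal length, the CRC of their XOR equals the XOR of their CRCs, so CRCs can be predicted and manipulated with simple algebra. MD5 has no such structure. The three messages are arbitrary test bytes of my own choosing.

```python
import hashlib
import zlib

def xor_bytes(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

# Three arbitrary messages of the same length.
a = bytes(range(0, 32))
b = bytes(range(32, 64))
c = bytes(range(64, 96))
combined = xor_bytes(a, b, c)

# CRC-32: the checksum of the XOR is the XOR of the checksums.
crc_predicted = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
crc_actual = zlib.crc32(combined)
print("CRC-32 predictable from the parts:", crc_predicted == crc_actual)

# MD5: the same trick tells you nothing about the real digest.
def md5_int(data: bytes) -> int:
    return int.from_bytes(hashlib.md5(data).digest(), "big")

md5_predicted = md5_int(a) ^ md5_int(b) ^ md5_int(c)
md5_actual = md5_int(combined)
print("MD5 predictable from the parts:", md5_predicted == md5_actual)
```

That algebraic structure is fine for catching transmission errors, and exactly what you do not want in something used to fingerprint files.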