News for nerds, stuff that matters

Copyright Tool Scans Web For Violations

Posted by Zonk on Tue Dec 19, 2006 11:31 AM
from the he-knows-when-you've-been-bad-or-good dept.
The Wall Street Journal is reporting on a tech start-up that proposes to offer the ultimate in assurance for content owners. Attributor Corporation is going to offer clients the ability to scan the web for their own intellectual property. The article touches on previous use of techniques like DRM and in-house staff searches, and the limited usefulness of both. They specifically cite the pending legal actions against companies like YouTube, and wonder about what their attitude will be towards initiatives like this. From the article: "Attributor analyzes the content of clients, who could range from individuals to big media companies, using a technique known as 'digital fingerprinting,' which determines unique and identifying characteristics of content. It uses these digital fingerprints to search its index of the Web for the content. The company claims to be able to spot a customer's content based on the appearance of as little as a few sentences of text or a few seconds of audio or video. It will provide customers with alerts and a dashboard of identified uses of their content on the Web and the context in which it is used. The content owners can then try to negotiate revenue from whoever is using it or request that it be taken down. In some cases, they may decide the content is being used fairly or to acceptable promotional ends. Attributor plans to help automate the interaction between content owners and those using their content on the Web, though it declines to specify how."
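The article does not say how Attributor computes its fingerprints. As a rough illustration of how "a few sentences of text" could be matched, here is a minimal sketch of word shingling, a generic technique swapped in for the undisclosed one. The function names, the shingle size `k`, and the sample strings are all assumptions made for this example.

```python
import hashlib
import re


def shingles(text, k=5):
    """Return the set of hashed k-word shingles for a piece of text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        hashlib.sha1(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(max(len(words) - k + 1, 0))
    }


def containment(original, candidate, k=5):
    """Fraction of the original's shingles that also appear in the candidate."""
    source = shingles(original, k)
    if not source:
        return 0.0
    return len(source & shingles(candidate, k)) / len(source)


# Hypothetical sample texts, invented for this sketch.
owned = "The quick brown fox jumps over the lazy dog near the quiet river bank"
page = "Someone wrote: the quick brown fox jumps over the lazy dog near the quiet river bank. Nice!"
unrelated = "An entirely different sentence about copyright law and web crawlers today"
```

A high `containment(owned, page)` only says the text appears on the page. It says nothing about whether the use is a quotation, a mirror, or an infringement, which is the concern several commenters raise below.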
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Wager (Score:4, Insightful)

    by Baricom (763970) on Tuesday December 19 2006, @11:33AM (#17300868)
    Anybody care to place a friendly wager that they're not going to honor robots.txt?
    • Raise. (Score:4, Funny)

      by Tackhead (54550) on Tuesday December 19 2006, @11:44AM (#17301006)
      > Anybody care to place a friendly wager that they're not going to honor robots.txt?

      127.0.0.1: $ cat robots.txt
      # robots.txt for 127.0.0.1
      # This file is copyright 2006 by me.
      User-agent: AttributorCorporationDMCABot
      Disallow: *

      And if they do honor robots.txt, I'll be able to sue the fuckers for infringing on my copyright, because they must have read it in order to honor it.

      • Re:Raise. by Hijacked Public (Score:2) Tuesday December 19 2006, @11:47AM
      • Re:Raise. by rhartness (Score:2) Tuesday December 19 2006, @12:00PM
      • Re:Raise. by commodoresloat (Score:2) Tuesday December 19 2006, @12:01PM
        • Re:Raise. by advocate_one (Score:2) Tuesday December 19 2006, @12:15PM
          • Re:Raise. by civilizedINTENSITY (Score:2) Tuesday December 19 2006, @01:00PM
          • Re:Raise. by Da_Weasel (Score:2) Tuesday December 19 2006, @04:14PM
      • Re:Raise. (Score:5, Funny)

        by Mayhem178 (920970) on Tuesday December 19 2006, @12:12PM (#17301304)
        127.0.0.1: $ cat robots.txt
        # robots.txt for 127.0.0.1
        # This file is copyright 2006 by me.
        User-agent: AttributorCorporationDMCABot
        Disallow: *


        Hahaha! You screwed up! I have your IP address now! I will send 127.0.0.1 to every company that uses the sniffer and tell them the person at that IP is an evil, evil person who exploits innocent people for their own profit and power!
        • Re:Raise. by eosp (Score:1) Tuesday December 19 2006, @12:54PM
        • His IP is my IP too by Anonymous Coward (Score:2) Tuesday December 19 2006, @02:02PM
        • Re:Raise. by PPH (Score:1) Tuesday December 19 2006, @04:10PM
      • Re:Raise. (Score:4, Interesting)

        by FooAtWFU (699187) on Tuesday December 19 2006, @12:32PM (#17301496)
        (http://fennecfoxen.org/)
        You joke, of course, but there are tools out there to detect when a bot is abusing your site and not following robots.txt. The usual technique is to hide a few links in your page, and also have these links blocked by robots.txt. When a user visits the link, they're banned from viewing the site. (Sometimes, a CAPTCHA-like utility for unblocking yourself is presented along with the 403 page, in the event that a particularly curious user manages to find the link and activate it manually.)
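The trap described above can be sketched in a few lines. This is a minimal in-memory illustration, not any particular tool: the class name, the trap path, and the idea of keying bans on the client IP are assumptions for the example. The trap path would also be listed under `Disallow:` in robots.txt and linked invisibly from a page, so only a robot that ignores robots.txt should ever request it.

```python
class TrapBanList:
    """Ban any client that requests a path robots.txt told robots to avoid."""

    def __init__(self, trap_paths):
        self.trap_paths = set(trap_paths)  # hidden links, disallowed in robots.txt
        self.banned = set()                # client IPs that followed a trap link

    def handle(self, client_ip, path):
        """Return the HTTP status code to send for this request."""
        if path in self.trap_paths:
            self.banned.add(client_ip)
        if client_ip in self.banned:
            return 403  # a real site might attach a CAPTCHA to unblock here
        return 200


# Hypothetical trap path, invented for this sketch.
site = TrapBanList(trap_paths={"/trap/do-not-follow"})
```

A well-behaved visitor keeps getting 200 from `site.handle(ip, path)`. The first request for the trap path returns 403, and so does every later request from that address.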
        • Re:Raise. by Kamiza Ikioi (Score:3) Tuesday December 19 2006, @05:56PM
      • Buzzkill by edraven (Score:1) Tuesday December 19 2006, @12:41PM
      • Re:Raise. by PPH (Score:1) Tuesday December 19 2006, @03:57PM
      • Re:robots.txt can be bypassed. .htaccess may not by CantStopDancing (Score:1) Tuesday December 19 2006, @02:29PM
    • Re:Wager by Crudely_Indecent (Score:3) Tuesday December 19 2006, @12:35PM
      • Re:Wager by BrynM (Score:3) Tuesday December 19 2006, @11:50PM
    • Re:Wager by PalmKiller (Score:2) Tuesday December 19 2006, @12:52PM
    • Would they turn to hacking? by Tarinth (Score:1) Tuesday December 19 2006, @12:54PM
    • Re:Wager by antarctican (Score:2) Tuesday December 19 2006, @02:03PM
    • Re:i don't like robots.txt anyway. (Score:5, Informative)

      by FooAtWFU (699187) on Tuesday December 19 2006, @11:57AM (#17301150)
      (http://fennecfoxen.org/)
      You're absolutely right that "if you don't want it on the public Web, don't put it there in the first place" -- but there are still times when you have a legitimate reason that you don't want a page indexed, downloaded, or otherwise visited by a robot. Dynamically generated content is one example reason; sometimes certain pages can be a big drain on your website, and you'd prefer not to have every spider in the world hitting them up every few minutes.

      Let's take a fun legitimate site like, oh... Wikipedia [wikipedia.org]:

      # Folks get annoyed when VfD discussions end up the number 1 google hit for
      # their name. See bugzilla bug #4776
      # en:
      Disallow: /wiki/Wikipedia:Articles_for_deletion/
      Disallow: /wiki/Wikipedia%3AArticles_for_deletion/
      Disallow: /wiki/Wikipedia:Votes_for_deletion/
      Disallow: /wiki/Wikipedia%3AVotes_for_deletion/
      Disallow: /wiki/Wikipedia:Pages_for_deletion/
      Disallow: /wiki/Wikipedia%3APages_for_deletion/
      Disallow: /wiki/Wikipedia:Miscellany_for_deletion/
      Disallow: /wiki/Wikipedia%3AMiscellany_for_deletion/
      Disallow: /wiki/Wikipedia:Miscellaneous_deletion/
      Disallow: /wiki/Wikipedia%3AMiscellaneous_deletion/
      Disallow: /wiki/Wikipedia:Copyright_problems
      Disallow: /wiki/Wikipedia%3ACopyright_problems
      (They also disallow certain specially generated pages like Special:Random, and any of the pages which actually let you edit the site).
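For anyone writing a well-behaved spider, rules like these can be checked with Python's standard `urllib.robotparser`. The sketch below feeds it two of the Wikipedia lines quoted above. The `User-agent: *` line, the `ExampleBot` name, and the `allowed` helper are assumptions added so the example is self-contained.

```python
from urllib.robotparser import RobotFileParser

# Two rules taken from the excerpt above; the User-agent line is assumed.
rules = """\
User-agent: *
Disallow: /wiki/Wikipedia:Articles_for_deletion/
Disallow: /wiki/Wikipedia:Copyright_problems
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if the (hypothetical) agent may fetch the given path."""
    return parser.can_fetch(agent, path)
```

With these rules, `allowed("/wiki/Python")` is true, while any path under `/wiki/Wikipedia:Articles_for_deletion/` is refused. Nothing forces a crawler to run this check, which is the point of the wager at the top of the thread.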

      Let's see, what are some other sites? Ooh. Take a look at Slashdot's robots.txt [slashdot.org]! (disallows a variety of fun pages.) Microsoft's? [microsoft.com] How about whitehouse.gov [whitehouse.gov]? Google [google.com]?

      • Re:i don't like robots.txt anyway. by Anonymous Coward (Score:1) Tuesday December 19 2006, @12:56PM
      • Re:i don't like robots.txt anyway. (Score:5, Informative)

        by mandelbr0t (1015855) on Tuesday December 19 2006, @01:32PM (#17302496)
        (Last Journal: Thursday March 01 2007, @01:53PM)

        Dynamically generated content is one example reason; sometimes certain pages can be a big drain on your website

        And dynamic content is, of course, the answer. If I'm going to put up copyrighted content in the future, I'd use one of a dozen schemes that regenerate the download link on a per-session basis. Obviously they're not going to honour robots.txt, but why are your links readable by such a basic spider? You need to:

        1. Disallow anonymous downloads. You need to be logged onto the site to download anything, torrent or otherwise
        2. Use a CAPTCHA to prevent spiders from signing up for said accounts
        3. Use the session id to generate unique download links on a per-session basis
        4. Change the key on your BitTorrent tracker every 12-24 hours. This will require that a downloader get the latest torrent from the original website (which requires login), reducing the impact of a leaked torrent
        5. Compress and possibly encrypt the content so that it's less obvious what it is

        Anyone who follows the above steps (and most sites already do most or all of this) won't be found by the spider. Period.
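Step 3 of the list above (per-session download links) could look roughly like this. It is a sketch under stated assumptions, not what any particular site does: the secret key, the `/download/` URL layout, the `token` parameter, and the `:` separator (which assumes session ids never contain a colon) are all invented for illustration.

```python
import hashlib
import hmac
import secrets

# Hypothetical per-deployment secret; a real site would load this from config.
SERVER_KEY = secrets.token_bytes(32)


def download_token(session_id, file_id):
    """Derive a token that is only valid for this session and this file."""
    message = f"{session_id}:{file_id}".encode("utf-8")
    return hmac.new(SERVER_KEY, message, hashlib.sha256).hexdigest()


def download_link(session_id, file_id):
    """Build the link shown to a logged-in user."""
    return f"/download/{file_id}?token={download_token(session_id, file_id)}"


def is_valid(session_id, file_id, token):
    """Check a presented token in constant time."""
    return hmac.compare_digest(download_token(session_id, file_id), token)
```

A link copied out of one session fails `is_valid` when presented from another, so a spider that scrapes a leaked URL gets nothing without also logging in.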

        The only thing I can think of that this product would be useful for is to find people who have blatantly copied my website, but I'm sure you could find those people equally easily with Google.

        mandelbr0t

      • Re:i don't like robots.txt anyway. by Anonymous Coward (Score:1) Tuesday December 19 2006, @04:03PM
    • Re:i don't like robots.txt anyway. by Monoliath (Score:1) Tuesday December 19 2006, @12:59PM
    • Re:Wager by markana (Score:2) Tuesday December 19 2006, @01:33PM
    • Re:Public vs. Searchable. by Da_Weasel (Score:2) Tuesday December 19 2006, @04:08PM
  • by LiquidCoooled (634315) on Tuesday December 19 2006, @11:33AM (#17300870)
    Can't they just use google or torrent sites?
    If users can find items they want, presumably the copyright holders could use the same methods...
  • Dupe by gravesb (Score:1) Tuesday December 19 2006, @11:34AM
    • Re:Dupe by xlordtyrantx (Score:1) Tuesday December 19 2006, @11:47AM
    • Re:Dupe by AKAImBatman (Score:3) Tuesday December 19 2006, @11:49AM
    • Re:Dupe (Score:4, Interesting)

      by Maximum Prophet (716608) on Tuesday December 19 2006, @11:53AM (#17301106)
      Since copyright lasts a long time and doesn't depend on being defended like trademark, there will be some allowances "for promotional reasons" like this:
      1. Leak copyrighted material in an easy-to-copy format to places where it will be copied
      2. Watch viral marketing campaign take over
      3. Profit
      4. Wait 'til revenue falls
      5. Find infringers using new scan tools
      6. Sue them
      7. Profit more!!!
    • Re:Dupe by PTBarnum (Score:1) Tuesday December 19 2006, @12:01PM
    • A real use on /. by EmbeddedJanitor (Score:3) Tuesday December 19 2006, @12:53PM
  • buh (Score:5, Insightful)

    by lucky130 (267588) on Tuesday December 19 2006, @11:36AM (#17300910)
    "as little as a few sentences of text or a few seconds of audio or video"

    Like quotations in a paper, or video snippets in an educational presentation?
    • Re:buh by brouski (Score:1) Tuesday December 19 2006, @12:18PM
    • Re:buh by silentounce (Score:1) Tuesday December 19 2006, @12:24PM
      • Re:buh (Score:5, Insightful)

        by NeutronCowboy (896098) on Tuesday December 19 2006, @12:38PM (#17301574)
        You're assuming anyone is going to manually verify any of the results. From my experience with people using monitoring software (especially non-techies who are simply consumers of the technology, but who provided the money for it), the vast majority of them are simply going to call their lawyers when they see the dashboard light up. I see vast letter writing campaigns come from this, with little actual infringing being prosecuted.

        This is a scary product. Not so much because of the technology behind it, but because of how it is going to be implemented and (ab)used.
        • Re:buh by Reziac (Score:2) Tuesday December 19 2006, @02:08PM
        • Re:buh by grimJester (Score:2) Tuesday December 19 2006, @03:46PM
        • Re:buh by lucky130 (Score:1) Wednesday December 20 2006, @09:27AM
  • No fear ! by Rastignac (Score:1) Tuesday December 19 2006, @11:36AM
  • Spam obfuscation techniques suddenly useful... by scottsk (Score:2) Tuesday December 19 2006, @11:36AM
  • At least somebody knows it: by Veetox (Score:1) Tuesday December 19 2006, @11:37AM
    • Property? by Cybert4 (Score:1) Tuesday December 19 2006, @12:07PM
  • My first thought.. by FunWithKnives (Score:1) Tuesday December 19 2006, @11:40AM
  • Yeah.. good luck with that. by Rob T Firefly (Score:2) Tuesday December 19 2006, @11:42AM
  • by TheWoozle (984500) on Tuesday December 19 2006, @11:42AM (#17300994)
    Doesn't this merely serve to point out the absurdity of "Intellectual Property"?
  • Yeah (Score:4, Interesting)

    by Hijacked Public (999535) on Tuesday December 19 2006, @11:45AM (#17301020)
    FTFA:

    If it works, it's a fantastic invention


    Its purpose aside, yes, it would be a fantastic thing to be able to scan the entire web and reliably identify the context and content of any specific media file type. Video, audio, image, etc. Particularly if it could identify purposely obfuscated content.

    I'm in what is almost certainly a tiny minority of Slashdotters in that I actually create copyrightable material rather than only consume it. I'm again in the minority in that I think copyrights are a good thing and again in the minority in that I can separate out the purpose of copyrights and the evil actions of the legal arms of **AA companies.

    Regardless, while scanning the internet for improperly used material sounds great on paper, this will probably end up being as effective as finding water with a divining rod. The current tactic of locking down things at the hardware and OS levels will get more support from the media companies, not that they seem all that good at choosing tactics when the internet is involved.

    • Re:Yeah (Score:4, Insightful)

      by jedidiah (1196) on Tuesday December 19 2006, @11:59AM (#17301166)
      (http://penguin.lvcm.com/)
      There's a wide gulf between copyright being a good idea in concept and being sensibly implemented in its current form.

      Not everyone that creates content thinks that draconian enforcement attempts are a good idea, or even in the best interests of those that create content.

      If your work can't survive in the marketplace, which includes the prospect of everyone on the planet getting to use it for free, then perhaps you should get some sort of more conventional day job.

      The difference between a game that sells 50K and one that sells 5 Million has nothing to do with DRM.
    • Re:Yeah (Score:4, Interesting)

      by AdamKG (1004604) <slashdotNO@SPAMadamgomaa.com> on Tuesday December 19 2006, @12:17PM (#17301346)
      (http://adam.gomaa.us/)
      and again in the minority in that I can separate out the purpose of copyrights and the evil actions of the legal arms of **AA companies.
      Let's make one thing clear: the RIAA/MPAA lawsuits are not, in any way, shape, or form, an abuse, negative side of, misapplication or malicious use of Copyrights. They fulfill the role of Copyrights in the first place; they are the logical end result of a system that says citizens are allowed to distribute ideas (or expressions of ideas), then stop any further distribution of them.

      The **AA lawsuits are ridiculous, yes. But the ridiculous part is not the litigation itself, it's the laws under which the lawsuits are brought.
      • Re:Yeah by DeadChobi (Score:2) Tuesday December 19 2006, @01:08PM
    • Re:Yeah by kanweg (Score:3) Tuesday December 19 2006, @12:26PM
      • Re:Yeah by fatman22 (Score:2) Tuesday December 19 2006, @12:58PM
        • Re:Yeah by rohan972 (Score:1) Wednesday December 20 2006, @02:27AM
      • Re:Yeah by Hijacked Public (Score:2) Tuesday December 19 2006, @01:08PM
        • Re:Yeah by kanweg (Score:1) Tuesday December 19 2006, @04:17PM
        • Re:Yeah by rohan972 (Score:1) Wednesday December 20 2006, @02:36AM
      • Re:Yeah by grcumb (Score:2) Tuesday December 19 2006, @03:47PM
    • Re:Yeah by Laur (Score:2) Tuesday December 19 2006, @02:59PM
    • Re:Yeah by DamnStupidElf (Score:2) Tuesday December 19 2006, @03:25PM
    • Re:Yeah by Relic of the Future (Score:2) Tuesday December 19 2006, @03:37PM
    • Re:Yeah by teamhasnoi (Score:2) Tuesday December 19 2006, @01:42PM
  • and in little pieces, they will consume bandwidth by way2trivial (Score:2) Tuesday December 19 2006, @11:46AM
  • Software is in beta (Score:3, Funny)

    by Weaselmancer (533834) on Tuesday December 19 2006, @11:46AM (#17301030)

    Attributor plans to help automate the interaction between content owners and those using their content on the Web, though it declines to specify how.

    And apparently being written by underpants gnomes.

  • Some interesting questions... (Score:5, Insightful)

    by PingSpike (947548) on Tuesday December 19 2006, @11:46AM (#17301032)
    Great, now all the torrent sites will require captcha verification too! ;P

    Actually, can they even scan torrents without downloading the entire file? And what's to stop everyone from just blocking them from accessing their websites? Are they going to go in covertly, pretending to be actual users? I can see every legit website blocking their access as well, why pay for bandwidth to supply that?

    Sure, YouTube can be more efficiently attacked...but YouTube has been dancing in front of the cannons since its inception, we all knew it was going to get shot eventually.
  • Dashboard by AVee (Score:2) Tuesday December 19 2006, @11:46AM
  • search by hash? (Score:4, Interesting)

    by straponego (521991) on Tuesday December 19 2006, @11:47AM (#17301040)
    Does Google allow searching by md5sum or equivalent? I'm sure they have the capability. While not as impressive as what this company claims, it'd also be more reliable for unaltered media files.

    But it looks like the real "innovation" these guys are pushing toward is fully automated filing of lawsuits. I think that was in Accelerando, which is fantastic, and which you can download for free. [accelerando.org]

    • Re:search by hash? (Score:4, Informative)

      by Johann Lau (1040920) on Tuesday December 19 2006, @12:45PM (#17301650)
      (http://johann-lau.de/)
      "Unaltered media files" are the exception, not the rule. Changing even a bit of metadata (stripping exif from an image, changing an mp3 tag) would change the checksum, not to mention things like putting things into an archive, resizing images, (re)compressing music.

      But yeah, it might make sense for Google to become "aware" of unique content and variations of it.. but I doubt they'd ever use that openly for (aiding in) hunting down copyright infringement, simply for PR reasons.
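The checksum point is easy to demonstrate. In this sketch the byte strings are invented stand-ins for a media file (a fake tag followed by identical "audio" data), not a real MP3 layout, and `md5sum` is just a helper name for the example.

```python
import hashlib


def md5sum(data):
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


# Made-up payload: same "audio" bytes, only the leading tag differs.
audio = b"\x00\x01\x02\x03" * 1000
original = b"TITLE=Song A|" + audio
retagged = b"TITLE=Song B|" + audio

same_payload = original[13:] == retagged[13:]      # the audio is untouched
same_checksum = md5sum(original) == md5sum(retagged)
```

Here `same_payload` is true and `same_checksum` is false: one changed byte in the tag yields a completely different digest, so an exact-hash search would miss the retagged copy.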
    • Re:search by hash? by sootman (Score:2) Tuesday December 19 2006, @01:18PM
  • Copying is great! by MarkByers (Score:2) Tuesday December 19 2006, @11:54AM
  • Negotiate Monetization? by eno2001 (Score:2) Tuesday December 19 2006, @11:55AM
  • It's just a tool by 91degrees (Score:2) Tuesday December 19 2006, @11:58AM
  • Current Engines... by Neutari (Score:1) Tuesday December 19 2006, @12:00PM
  • Maybe it can work both ways by Anonymous Coward (Score:1) Tuesday December 19 2006, @12:01PM
  • Fair Use Issues by MrLizard (Score:2) Tuesday December 19 2006, @12:02PM
  • what's their probability of false alarm? by Anonymous Coward (Score:2) Tuesday December 19 2006, @12:05PM
  • Wait a minute by Billosaur (Score:2) Tuesday December 19 2006, @12:05PM
  • If you value your "property" so much... by Anonymous Coward (Score:2) Tuesday December 19 2006, @12:06PM
  • Scan Blocking by Daemonstar (Score:2) Tuesday December 19 2006, @12:06PM
  • Now SCO can continue... by filesiteguy (Score:2) Tuesday December 19 2006, @12:06PM
  • Finally an actual useful purpose for leet-speak? by kevintron (Score:1) Tuesday December 19 2006, @12:12PM
  • What a waste by j00r0m4nc3r (Score:2) Tuesday December 19 2006, @12:16PM
  • Evidence of a disease. by GodInHell (Score:2) Tuesday December 19 2006, @12:22PM
  • Copyright protection for the rich only. by John Sokol (Score:2) Tuesday December 19 2006, @12:23PM
  • Well, that's Ironic by cfulmer (Score:2) Tuesday December 19 2006, @12:24PM
  • I can see another use for this software by exp(pi*sqrt(163)) (Score:2) Tuesday December 19 2006, @12:28PM
  • Profit by future assassin (Score:1) Tuesday December 19 2006, @12:33PM
    • Re:Profit by future assassin (Score:1) Tuesday December 19 2006, @12:38PM
  • Whack-a-mole by SirGarlon (Score:1) Tuesday December 19 2006, @12:34PM
  • *sigh* by WWWWolf (Score:1) Tuesday December 19 2006, @12:35PM
  • Sounds like TurnItIn by Kelson (Score:2) Tuesday December 19 2006, @12:40PM
  • "...may decide the content is being use fairly..." by yar (Score:2) Tuesday December 19 2006, @12:45PM
  • How to detect your IP! by merc (Score:2) Tuesday December 19 2006, @12:52PM
  • Should scan for edu proper usage by WillAffleckUW (Score:1) Tuesday December 19 2006, @12:59PM
  • I've seen the code for their copyright scanner... by mmell (Score:1) Tuesday December 19 2006, @01:00PM
  • Countermeasure by hoggoth (Score:2) Tuesday December 19 2006, @01:46PM
  • CopyScape by rakerman (Score:2) Tuesday December 19 2006, @02:07PM
  • What concerns me: by botlrokit (Score:2) Tuesday December 19 2006, @02:28PM
  • They should be prepared... by reebmmm (Score:1) Tuesday December 19 2006, @02:32PM
  • Two words by kippers (Score:1) Tuesday December 19 2006, @03:01PM
  • They simply can not scan for subsets by teece (Score:1) Tuesday December 19 2006, @03:43PM
  • may decide content is fair use by Da_Weasel (Score:2) Tuesday December 19 2006, @04:19PM
  • I've experienced it from both sides. (Score:3, Informative)

    by bcrowell (177657) on Tuesday December 19 2006, @04:20PM (#17305144)
    (http://www.lightandmatter.com/)

    I've experienced this from both sides.

    I have a bunch of my books on the web, and every once in a while I do a search on some text from my own books to see who else is mirroring them. The books happen to be copylefted (dual-licensed GFDL/CC-BY-SA), but I'd like to know who's mirroring them, and check whether they're violating the license. A lot of people just seem to be hoarding the PDF files on their university servers, maybe because they're afraid my web site will disappear; that's flattering. One guy was selling them on CDs on eBay, violating my license (claimed they were PD, didn't propagate the license). Another guy translated them to HTML, with lots of errors, changed the license to a more restrictive one, and put his own ads up; he fixed the licensing violation when I complained, and in a way it was a good thing, because it motivated me to make my own HTML versions (which are now bringing me a significant amount of money from AdSense every month). One kind of annoying thing about mirroring is that the people who are mirroring never bother to update their mirrors, but in general I just figure there's no such thing as bad publicity :-)

    From the other side, I once received an e-mail from a museum in the UK that was complaining that I was using a 17th century oil painting of Isaac Newton. I guess they own the original, and they may also have been the ones who did the scan that I found in a Google image search, but under U.S. law (Bridgeman Art Library, Ltd. v. Corel Corp.), a realistic reproduction of a PD two-dimensional art work is not copyrightable. What really surprised me was that they came across it at all, because at that time I think my book was only in PDF format, and hadn't been indexed by Google because the file size was too big.

    The whole thing doesn't seem negative to me in general. It makes just as much sense as people doing a vanity search in Google before they apply for a job, or authors watching their amazon.com sales rankings obsessively. I guess the most obvious potential for abuse would be if they send a nastygram to your webhost, and your webhost is a low-end one that figures it's not worth their time to keep your account, so they just shut off your account.

  • robots.txt may be moot by MooseTick (Score:2) Tuesday December 19 2006, @05:53PM
  • Duplicate! There's a surprise! by Baloo Ursidae (Score:2) Tuesday December 19 2006, @10:24PM
  • A decent tool already exists by trawg (Score:2) Tuesday December 19 2006, @10:49PM
  • False positives!!! by Geofs (Score:1) Wednesday December 20 2006, @05:56AM
  • Hrmmm.... a new tool for piracy! by Amphetam1ne (Score:1) Wednesday December 20 2006, @09:20AM
  • Solution by d_54321 (Score:1) Tuesday January 02 2007, @01:15PM
  • Re:More reason to procure your warez... by Thraxen (Score:1) Tuesday December 19 2006, @11:59AM