Patents Businesses The Internet

Copyright Tool Scans Web For Violations

The Wall Street Journal is reporting on a tech start-up that proposes to offer the ultimate in assurance for content owners. Attributor Corporation is going to offer clients the ability to scan the web for their own intellectual property. The article touches on the previous use of techniques like DRM and in-house staff searches, and the limited usefulness of both. It also cites the pending legal actions against companies like YouTube, and wonders how such companies will respond to initiatives like this. From the article: "Attributor analyzes the content of clients, who could range from individuals to big media companies, using a technique known as 'digital fingerprinting,' which determines unique and identifying characteristics of content. It uses these digital fingerprints to search its index of the Web for the content. The company claims to be able to spot a customer's content based on the appearance of as little as a few sentences of text or a few seconds of audio or video. It will provide customers with alerts and a dashboard of identified uses of their content on the Web and the context in which it is used. The content owners can then try to negotiate revenue from whoever is using it or request that it be taken down. In some cases, they may decide the content is being used fairly or to acceptable promotional ends. Attributor plans to help automate the interaction between content owners and those using their content on the Web, though it declines to specify how."
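The article doesn't explain how Attributor's fingerprinting works under the hood. A common way to match "as little as a few sentences" of text is to hash overlapping word shingles and compare the resulting sets; the Python sketch below is purely illustrative (the file names are made up, and this is not Attributor's actual algorithm):

    import hashlib
    import re

    def shingles(text, n=8):
        """Hash every overlapping run of n words in the text."""
        words = re.findall(r"[a-z0-9]+", text.lower())
        return {
            hashlib.sha1(" ".join(words[i:i + n]).encode()).hexdigest()[:16]
            for i in range(len(words) - n + 1)
        }

    def similarity(a, b):
        """Jaccard overlap between the two shingle sets (0.0 to 1.0)."""
        fa, fb = shingles(a), shingles(b)
        return len(fa & fb) / len(fa | fb) if fa and fb else 0.0

    # Hypothetical files: the owner's original text and a page found on the web.
    original = open("my_article.txt").read()
    suspect = open("scraped_page.txt").read()
    print(f"shared-shingle similarity: {similarity(original, suspect):.2f}")

Because the words are normalized before hashing, even a few copied sentences keep matching after cosmetic reformatting, which is roughly why a fingerprint index can flag partial reuse that an exact checksum would miss.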
This discussion has been archived. No new comments can be posted.
  • by LiquidCoooled ( 634315 ) on Tuesday December 19, 2006 @12:33PM (#17300870) Homepage Journal
    Can't they just use Google or torrent sites?
    If users can find items they want, presumably the copyright holders could use the same methods...
  • by FooAtWFU ( 699187 ) on Tuesday December 19, 2006 @12:57PM (#17301150) Homepage
    You're absolutely right that "if you don't want it on the public Web, don't put it there in the first place" -- but there are still times when you have a legitimate reason that you don't want a page indexed, downloaded, or otherwise visited by a robot. Dynamically generated content is one example reason; sometimes certain pages can be a big drain on your website, and you'd prefer not to have every spider in the world hitting them up every few minutes.

    Let's take a fun legitimate site like, oh... Wikipedia [wikipedia.org]:

    # Folks get annoyed when VfD discussions end up the number 1 google hit for
    # their name. See bugzilla bug #4776
    # en:
    Disallow: /wiki/Wikipedia:Articles_for_deletion/
    Disallow: /wiki/Wikipedia%3AArticles_for_deletion/
    Disallow: /wiki/Wikipedia:Votes_for_deletion/
    Disallow: /wiki/Wikipedia%3AVotes_for_deletion/
    Disallow: /wiki/Wikipedia:Pages_for_deletion/
    Disallow: /wiki/Wikipedia%3APages_for_deletion/
    Disallow: /wiki/Wikipedia:Miscellany_for_deletion/
    Disallow: /wiki/Wikipedia%3AMiscellany_for_deletion/
    Disallow: /wiki/Wikipedia:Miscellaneous_deletion/
    Disallow: /wiki/Wikipedia%3AMiscellaneous_deletion/
    Disallow: /wiki/Wikipedia:Copyright_problems
    Disallow: /wiki/Wikipedia%3ACopyright_problems
    (They also disallow certain specially generated pages like Special:Random, and any of the pages which actually let you edit the site).

    Let's see, what are some other sites? Ooh. Take a look at Slashdot's robots.txt [slashdot.org]! (disallows a variety of fun pages.) Microsoft's? [microsoft.com] How about whitehouse.gov [whitehouse.gov]? Google [google.com]?
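    A well-behaved crawler consults those rules before fetching anything. If you want to check what a given robots.txt permits, Python's standard urllib.robotparser will do it; the user-agent string below is just an example:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://en.wikipedia.org/robots.txt")
    rp.read()  # fetch and parse the live robots.txt

    # A deletion-debate page from the excerpt above vs. an ordinary article.
    for url in (
        "https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletion/Example",
        "https://en.wikipedia.org/wiki/Isaac_Newton",
    ):
        print(rp.can_fetch("ExampleBot/1.0", url), url)

    Of course, a crawler built to hunt for copied content has every incentive to ignore robots.txt entirely, which is exactly the behavior described in the comments below.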

  • Re:Wager (Score:3, Informative)

    by Crudely_Indecent ( 739699 ) on Tuesday December 19, 2006 @01:35PM (#17301542) Journal
    Another company, Cyveillance, already does this for major corporations and the government. I've used .htaccess rules to deny everything from their assigned netblocks after they racked up almost 20,000 hits to my personal site in one day. As you mentioned, they didn't follow robots.txt and attempted to index parts of my site that are password protected, as well as content names that did not exist (music and videos and such), all the while identifying their bot as a variant of IE.

    Here's how to block two subnets using .htaccess and mod_rewrite on Apache:

    RewriteEngine On
    RewriteCond %{REMOTE_ADDR} "^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$" [OR]
    RewriteCond %{REMOTE_ADDR} "^63\.146\.13\.(6[4-9]|[7-8][0-9]|9[0-5])$"
    RewriteRule ^(.*)$ - [F]
    Line 1 activates the rewrite engine
    Line 2 sets the condition to include remote addresses 63.148.99.224-255 and includes [OR] to allow further processing
    Line 3 sets the condition to include remote addresses 63.146.13.64-95
    Line 4 sets the rule that any URL is forbidden

    So, save your bandwidth by denying access to your content from unauthorized viewers (bots)
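    Octet-range regexes like the ones above are easy to get subtly wrong. If you adapt them to your own netblocks, a quick sanity check with Python's standard re module (purely illustrative) shows exactly which last-octet values a pattern matches:

    import re

    # Last-octet portions of the two RewriteCond patterns above.
    patterns = {
        "63.148.99.224-255": r"^2(2[4-9]|[3-4][0-9]|5[0-5])$",
        "63.146.13.64-95": r"^(6[4-9]|[7-8][0-9]|9[0-5])$",
    }

    for label, pattern in patterns.items():
        matched = [n for n in range(256) if re.match(pattern, str(n))]
        print(f"{label}: {len(matched)} octets, {min(matched)}-{max(matched)}")

    Both should report 32 matching octets; anything else means the character classes don't cover the range you intended.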
  • Re:search by hash? (Score:4, Informative)

    by Johann Lau ( 1040920 ) on Tuesday December 19, 2006 @01:45PM (#17301650) Homepage Journal
    "Unaltered media files" are the exception, not the rule. Changing even a bit of metadata (stripping exif from an image, changing an mp3 tag) would change the checksum, not to mention things like putting things into an archive, resizing images, (re)recompressing music.

    But yeah, it might make sense for Google to become "aware" of unique content and variations of it... but I doubt they'd ever use that openly to aid in hunting down copyright infringement, simply for PR reasons.
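    A tiny illustration of why exact checksums are so brittle: two files with an identical payload but a one-character difference in a tag produce completely unrelated digests (the "files" here are just stand-ins):

    import hashlib

    payload = b"\x00" * 1024  # stand-in for the actual audio frames
    file_a = b"TIT2:My Song v1" + payload  # fake MP3 title tag
    file_b = b"TIT2:My Song v2" + payload  # same audio, retagged

    print(hashlib.sha256(file_a).hexdigest())
    print(hashlib.sha256(file_b).hexdigest())
    # The digests share nothing, so an exact-hash index would never
    # connect the two copies; that's the gap fingerprinting tries to close.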
  • by mandelbr0t ( 1015855 ) on Tuesday December 19, 2006 @02:32PM (#17302496) Journal

    Dynamically generated content is one example reason; sometimes certain pages can be a big drain on your website

    And dynamic content is, of course, the answer. If I were going to put up copyrighted content in the future, I'd use one of a dozen schemes that regenerate the download link on a per-session basis. Obviously they're not going to honour robots.txt, but why are your links readable by such a basic spider? You need to:

    1. Disallow anonymous downloads. You need to be logged onto the site to download anything, torrent or otherwise
    2. Use a CAPTCHA to prevent spiders from signing up for said accounts
    3. Use the session ID to generate unique download links on a per-session basis (sketched below)
    4. Change the key on your BitTorrent tracker every 12-24 hours. This will require that a downloader get the latest torrent from the original website (which requires login), reducing the impact of a leaked torrent
    5. Compress and possibly encrypt the content so that it's less obvious what it is

    Anyone who follows the above steps (and most sites already do most or all of this) won't be found by the spider. Period.

    The only thing I can think of that this product would be useful for is to find people who have blatantly copied my website, but I'm sure you could find those people equally easily with Google.

    mandelbr0t
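
    A minimal sketch of step 3, using only Python's standard library; the signing key, session IDs and file names are placeholders, not any particular site's scheme:

    import hashlib
    import hmac
    import time

    SECRET = b"rotate-this-key-regularly"  # placeholder signing key

    def make_token(session_id: str, filename: str, ttl: int = 3600):
        """Mint an expiry timestamp and a signature binding the file to one session."""
        expires = int(time.time()) + ttl
        msg = f"{session_id}:{filename}:{expires}".encode()
        sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
        return expires, sig

    def verify_token(session_id: str, filename: str, expires: int, sig: str) -> bool:
        """Reject expired links, other sessions' links, or tampered signatures."""
        if time.time() > expires:
            return False
        msg = f"{session_id}:{filename}:{expires}".encode()
        expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, sig)

    # A link minted for one session does not verify for another.
    exp, sig = make_token("sess-abc123", "album.torrent")
    print(verify_token("sess-abc123", "album.torrent", exp, sig))  # True
    print(verify_token("sess-evil99", "album.torrent", exp, sig))  # False

    The download URL would carry the expiry and signature as query parameters, so a spider that scrapes a link after the session (or the tracker key) rotates gets nothing useful.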

  • by bcrowell ( 177657 ) on Tuesday December 19, 2006 @05:20PM (#17305144) Homepage

    I've experienced this from both sides.

    I have a bunch of my books on the web, and every once in a while I do a search on some text from my own books to see who else is mirroring them. The books happen to be copylefted (dual-licensed GFDL/CC-BY-SA), but I'd like to know who's mirroring them and check whether they're violating the license. A lot of people just seem to be hoarding the PDF files on their university servers, maybe because they're afraid my web site will disappear; that's flattering. One guy was selling them on CDs on eBay, violating my license (he claimed they were public domain and didn't propagate the license). Another guy translated them to HTML, with lots of errors, changed the license to a more restrictive one, and put his own ads up; he fixed the licensing violation when I complained, and in a way it was a good thing, because it motivated me to make my own HTML versions (which are now bringing me a significant amount of money from AdSense every month). One kind of annoying thing about mirroring is that the people who mirror never bother to update their mirrors, but in general I just figure there's no such thing as bad publicity :-)

    From the other side, I once received an e-mail from a museum in the UK complaining that I was using a 17th-century oil painting of Isaac Newton. I guess they own the original, and they may also have been the ones who did the scan that I found in a Google image search, but under U.S. law (Bridgeman Art Library, Ltd. v. Corel Corp.), a realistic reproduction of a public-domain two-dimensional artwork is not copyrightable. What really surprised me was that they came across it at all, because at that time I think my book was only in PDF format, and hadn't been indexed by Google because the file size was too big.

    The whole thing doesn't seem negative to me in general. It makes just as much sense as people doing a vanity search in Google before they apply for a job, or authors watching their Amazon.com sales rankings obsessively. I guess the most obvious potential for abuse would be if they send a nastygram to your web host, and your web host is a low-end one that figures it's not worth their time to keep your account, so they just shut it off.

  • Re:Wager (Score:3, Informative)

    by BrynM ( 217883 ) * on Wednesday December 20, 2006 @12:50AM (#17309284) Homepage Journal
    There's an easier way: you can hand mod_access netblocks and more [apache.org]. This method avoids eating cycles in mod_rewrite. If you can put it in your conf instead of .htaccess, you'll save even more time/processing. Just put it in for your doc root. From my httpd.conf:

    <Directory "/var/www/htdocs/">
    # BRYN'S DENIALS
    # allresearch.com
    deny from 209.73.228.160/28
    # branddimensions.com user-agent: BDFetch
    deny from 204.92.59.0/24
    # cyveillance.com
    deny from 63.148.99.224/27
    deny from 65.118.41.192/27
    # www.markwatch.com user-agent: markwatch
    deny from 204.62.224.0/22
    deny from 204.62.228.0/23
    deny from 206.190.160.0/19
    # nameprotect.com user-agent: NPBot
    deny from 12.40.85.0/24
    deny from 12.148.196.128/25
    deny from 12.148.209.192/26
    deny from 12.175.0.32/28
    # rocketinfo.com
    deny from 209.167.132.224/28
    # END BRYN'S DENIALS
    </Directory>
    Now I gotta look up IPs for these clowns... damn copyright ambulance chasers... arin.net here I come!
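
    If you maintain a list like that, it helps to be able to confirm that a crawler's address really falls inside one of the denied blocks before you blame the wrong netblock. Python's ipaddress module makes the check trivial (the test addresses are made up):

    import ipaddress

    # The netblocks denied in the httpd.conf snippet above.
    denied = [
        "209.73.228.160/28", "204.92.59.0/24", "63.148.99.224/27",
        "65.118.41.192/27", "204.62.224.0/22", "204.62.228.0/23",
        "206.190.160.0/19", "12.40.85.0/24", "12.148.196.128/25",
        "12.148.209.192/26", "12.175.0.32/28", "209.167.132.224/28",
    ]
    networks = [ipaddress.ip_network(block) for block in denied]

    def is_denied(addr: str) -> bool:
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in networks)

    print(is_denied("63.148.99.230"))  # True: inside 63.148.99.224/27
    print(is_denied("192.0.2.1"))      # False: not in any listed block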
