Facebook Kills Dataset of Crawled Public Profiles

Posted by CmdrTaco
from the creepy-crawlies dept.
holy_calamity writes "Internet entrepreneur Pete Warden wrote a crawler that collated the public profiles of 210 million Facebook users and was set to release an anonymised version to researchers. The pages crawled can be read by any web user, and the robots.txt did not forbid crawling. However, Facebook claimed he had violated its terms of service and threatened legal action. Fearing costs, Warden has now destroyed his dataset. For a snapshot of the insights that data could have allowed, see Warden's post on how the friend networks of the 120 million US users in his data segregated into seven clusters." Of course, if he had it, this means anyone who wants it has made their own version of this.

  • by eldavojohn (898314) * <my/.username@@@gmail.com> on Wednesday March 31 2010, @11:19AM (#31688234) Journal

    Fearing costs, Warden has now destroyed his dataset.

    Couldn't Warden have sent requests to the EFF to provide lawyers so he could fight an evil corporation to use freely publicly available information?

    Then Facebook could ask the EFF to protect their users' privacy and keep their information from being sold to marketers and corporations (sorry, when you're introduced as "Internet entrepreneur" that means there's profit to be had).

  • by 2obvious4u (871996) on Wednesday March 31 2010, @11:31AM (#31688434)
    Isn't this the golden egg of Facebook? I thought this was what they were selling. That data is fascinating, it is completely anonymous, yet at the same time very insightful for marketing purposes. I think Facebook is just upset because they plan on selling the same data that Pete was.
  • Publicly available (Score:5, Interesting)

    by mdsharpe (1051460) on Wednesday March 31 2010, @11:32AM (#31688452)
    Since this is publicly available information, and all he did was send a program to go grab it (much akin to asking your web browser to download it), does this mean Facebook has essentially threatened him for no more than reading too much of Facebook too quickly? Sounds absurd to me.
  • by TheSpoom (715771) <slashdot@@@uberm00...net> on Wednesday March 31 2010, @11:41AM (#31688564) Homepage Journal

    They did something similar to FB Purity [fbpurity.com], a Greasemonkey script that allows users to filter out apps and other stuff they don't want to see in their feed. Facebook argued that they were misusing their "FB" trademark... eventually they let them continue under the name "fluff busting purity", probably due to the PR backlash that shutting them down would bring.

    They've also shut down the Facebook portion of the Web 2.0 Suicide Machine [suicidemachine.org], which runs scripts that allow a user to delete their social profiles as thoroughly as sites will allow. In that case, they argued that the Suicide Machine was violating their "Statement of Rights and Responsibilities"... which isn't even a law! Nonetheless, the Suicide Machine didn't have the financial ability to fight even frivolous claims like that, so they folded that section.

    Facebook apparently believes that its users will continue using the site regardless of the ridiculous access policies that their legal department create and defend. I hope they're wrong.

  • You assume such anonymization is actually possible, I somehow doubt it.

  • by way2trivial (601132) on Wednesday March 31 2010, @11:46AM (#31688632) Homepage Journal

    I'm sorry- it is..

    robots.txt allows you to "refuse a specific named bot" or "refuse everyone" or "allow everything" or "allow these directories" or "only allow these directories"
    (want a fascinating read? try robots.txt at your favorite government site- whitehouse.gov used to be fascinating stuff)
    there is no way in robots.txt to permit crawling based on intent of information use like a CC license does

    I can- with photographs, have a Creative Commons license that says "use it for anything" "use it with credit to me" "free for non-commercial" etc.
    I would WANT google to see my site, I would want bing to see my site- for the purposes of indexing in a search engine.
    I can't say in robots.txt
    "come in and index for search engines and relevance- but you may not use the data to collect information on our membership for marketing to or marketing their info to others"

    If I build a website all about-- coffee- I want the information available to the general public,but from/on my site....
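
To make the point above concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot name "BadBot" and the paths are all invented for illustration and do not come from any real site. It shows that the rules are keyed only on who is asking and which path they want, with nowhere to state what the data may be used for.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary coffee site (an assumption,
# not taken from any real site): refuse one named bot entirely, keep
# everyone else out of one directory, and allow the rest.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /members/
Allow: /
"""


def build_parser(text):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def can_crawl(parser, agent, path):
    """Ask whether `agent` may fetch `path`; intent is not part of the question."""
    return parser.can_fetch(agent, path)


parser = build_parser(ROBOTS_TXT)
for agent, path in [
    ("BadBot", "/coffee/beans.html"),
    ("Googlebot", "/coffee/beans.html"),
    ("Googlebot", "/members/alice"),
]:
    print(agent, path, can_crawl(parser, agent, path))
```

The only inputs to the decision are the user-agent string and the path, so a search indexer and a marketing harvester that announce the same agent name get the same answer.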

  • by sexconker (1179573) on Wednesday March 31 2010, @11:53AM (#31688722)

    I see very little problem with an automated scan that respects robots.txt.

    By not blocking automated access to the profiles, facebook is squarely at fault.

    I see very little problem with an automated scan that doesn't respect robots.txt. (As long as it's accessing stuff normal people can get to.)

    Anything a machine can do, a meatbag can do, though usually more slowly.
    Most anything a meatbag can do, a bunch of meatbags can do much more quickly.

    Robots.txt says go away? Amazon's Mechanical Turk says Thank You, Come Again.

  • Don't worry... (Score:3, Interesting)

    by turbotroll (1378271) on Wednesday March 31 2010, @12:06PM (#31688876)

    Somebody else will do it again, this time anonymously and with an evil robot that hides its tracks. It only takes perl, LWP, MySQL, tor and a little time and imagination to do so.

    Fuck you, Zuckerberg.

  • by way2trivial (601132) on Wednesday March 31 2010, @12:11PM (#31688958) Homepage Journal

    and I really think it is worth making.

    Copyright protections are important, the snippet of text that google uses to let people know my site is relevant is easily fair use
    I don't have a problem with it- I welcome it as it's beneficial for both myself and google for it to be there.

    the ENTIRE TEXT of my site- copied and recopied to put into a web page that exists only to generate ad-sense revenue by a third party is not.
    and if robots.txt had a 'license' mode, I'd have a much stronger case of protections if I chose to pursue a blatant copying and re-publication of my site.

    robots.txt labels that I wish there were include
    'allow function:indexing'
    'disallow function:total and complete reproduction'
    'disallow function: total and complete reproduction for XXX days'
    (so I can allow wayback machine and equivalents)
    'disallow function: aggregate data collection'
    'disallow function: user data collection'
    'disallow function: email collection'

    looking at amazon, http://www.amazon.com/robots.txt [amazon.com]
    they somewhat do this by putting the information they don't want out in the wild in its own directories
    then disallowing those directories- actually, now that I look at it- it's a neat way to go..
    but I'd still prefer a robots.txt option that covered different 'intended use of data to be crawled' permissions
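
As a rough sketch of how the wished-for labels might be read by a cooperating crawler: the `function:` syntax below is purely the hypothetical wishlist from this comment, not part of any robots.txt standard, and no real crawler is known to honor it. The helper names `parse_usage_policy` and `is_permitted` are made up for the example, and the time-limited "for XXX days" variant is left out.

```python
def parse_usage_policy(text):
    """Parse hypothetical 'allow function:X' / 'disallow function:X' lines.

    Returns a dict mapping each named use (lowercased) to True or False.
    This syntax is an assumption for illustration, not a real standard.
    """
    policy = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        verb, _, rest = line.partition(" ")
        verb = verb.lower()
        key, sep, use = rest.partition(":")
        if verb not in ("allow", "disallow") or not sep:
            continue  # not one of the wished-for labels
        if key.strip().lower() != "function":
            continue
        policy[use.strip().lower()] = (verb == "allow")
    return policy


def is_permitted(policy, use, default=False):
    """Look up a declared use; unlisted uses fall back to `default`."""
    return policy.get(use.strip().lower(), default)


WISHLIST = """\
allow function:indexing
disallow function: aggregate data collection
disallow function: user data collection
disallow function: email collection
"""

policy = parse_usage_policy(WISHLIST)
print(is_permitted(policy, "indexing"))
print(is_permitted(policy, "email collection"))
```

Like robots.txt itself, this would only be a declaration that well-behaved crawlers choose to respect; it documents the site owner's intent rather than enforcing it.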

  • by Anonymous Coward on Wednesday March 31 2010, @12:20PM (#31689066)

    Even with names removed, data like this can often be traced back to the person. Your name isn't the only unique thing that appears in your facebook profile.

    As an example, how many others share your permutation of friends and fan pages?
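
A toy illustration of that question, with entirely made-up profiles (the numeric ids and page names are assumptions, not real data): treat each nameless profile's set of friends plus set of fan pages as a fingerprint and count how many fingerprints occur only once.

```python
from collections import Counter


def fingerprint(profile):
    """Combine the friend set and page set into one hashable fingerprint."""
    return (frozenset(profile["friends"]), frozenset(profile["pages"]))


def unique_fraction(profiles):
    """Fraction of profiles whose fingerprint is shared with nobody else."""
    if not profiles:
        return 0.0
    counts = Counter(fingerprint(p) for p in profiles)
    unique = sum(1 for p in profiles if counts[fingerprint(p)] == 1)
    return unique / len(profiles)


# Invented sample: names are already stripped, only ids and pages remain.
PROFILES = [
    {"friends": {2, 3}, "pages": {"coffee"}},
    {"friends": {2, 3}, "pages": {"coffee"}},
    {"friends": {1, 3}, "pages": {"coffee", "perl"}},
    {"friends": {1, 2}, "pages": {"tea"}},
]

print(unique_fraction(PROFILES))
```

Any profile with a one-of-a-kind fingerprint can in principle be matched back to a person by someone who already knows that person's friends and pages, even though no name appears in the data.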

  • Re:On what grounds? (Score:1, Interesting)

    by Anonymous Coward on Wednesday March 31 2010, @12:30PM (#31689226)

    This is America, defending yourself in court against a lawyer is legal suicide. I could argue that Cyanide is lethal and Dynamite is combustible in an American Court but if I were up against a lawyer I guarantee I would lose. Even though these are practically non-disputable facts, the American Court System is set up so it is impossible to argue respectably without paying the Lawyer Tax.

    Example:
    1.) I go into court and argue that Cyanide Brand X should carry a "Poison" label.
    2.) Theoretical makers of Cyanide Brand X hire 5 lawyers, because they can.
    3.) Lawyers state as defendant they wish to have a trial by jury (a right guaranteed by the constitution, called a Jury of your Peers)
    4.) Jury selection weeds out anyone with previous knowledge of the effects of Cyanide, and anyone with background in biology or chemistry because they would not be impartial.
    5.) The result is a jury of people who are completely un-knowledgable and as such completely persuadable either way.
    6.) The Lawyers of Cyanide Brand X bring in a variety of "Expert Witnesses" who are of course "compensated for their time" and who state that no, Cyanide doesn't kill you.
    7.) Because the Jury is 100% impartial and also 100% uninformed besides what they have been told in court, their only choice is to assume these Paid or Compensated "Expert Witnesses" were correct because they are scientists!
    8.) The result is that I lost a case arguing what should have been a foregone conclusion to begin with, because somebody brought more money and lawyers than I did.

  • by The Moof (859402) on Wednesday March 31 2010, @12:42PM (#31689378)

    but if he actually wants to do real work and real research with these data, he's got to play by the rules of the real world...

    The summary says the crawler simply indexed public information. Why is this relevant? Well, recently, I noticed that Facebook Apps, all of which I have disabled and blocked via my privacy settings, have started accessing my information again. Naturally, I assumed something got reset and started hunting for the settings again. Until I found this new block of text in all of their privacy settings:

    When you visit a Facebook-enhanced application or website, it may access any information you have made visible to Everyone as well as your publicly available information. This includes your Name, Profile Picture, Gender, Current City, Networks, Friend List, and Pages. The application will request your permission to access any additional information it needs.

    So they claim they can't stop people from acquiring and using my 'publicly available' information, because it's open to the public. Then, they turn around and go after this guy for indexing and using the same 'publicly available' information.

    It all sounds a little two-faced to me.

  • by NeutronCowboy (896098) on Wednesday March 31 2010, @12:49PM (#31689466)

    Most likely. Facebook's gold mine isn't even so much the user information itself - it's the networks that they can build out of the relationship data. As of right now, they haven't figured out a way to make money from it, but they certainly aren't going to let someone take the most valuable aspect of their system - the network information - and put it out in the open.

    Personally, I hope someone does the same work, but uploads the raw data anonymously to a torrent somewhere.

  • by clone53421 (1310749) on Wednesday March 31 2010, @02:35PM (#31691046) Journal

    You will not collect users’ content or information, or otherwise access Facebook, using automated means (such as harvesting bots, robots, spiders, or scrapers) without our permission.

    An empty robots.txt is not blank-check permission to crawl and use the data for whatever you want.
