
24-Year-Old Asks Facebook For His Data, Gets 1,200 PDFs

chicksdaddy writes "Be careful of what you ask for. That's a lesson Max Schrems of Vienna, Austria learned the hard way when he sent a formal request to Facebook for a copy of every piece of personal information that the social network had collected on him, as required under European law. After a wait, the 24-year-old law student got what he was seeking: a CD with all his data stored on it — 1,222 files in all. The collection of PDFs was roughly the length of Leo Tolstoy's War and Peace, but told a more mundane story: a record of Schrems' years-long relationship with the world's largest social network, including reams of data he had deleted. Now Schrems is pushing Facebook to disclose even more of what it knows."

  • Not that uncommon (Score:5, Interesting)

    by james_van ( 2241758 ) on Tuesday December 13, 2011 @08:59PM (#38364448)
    I've worked for a number of tech companies that don't actually delete anything; they simply mark the record "deleted" in the database. It's a pretty common practice that didn't really ever get talked about until it came to light that Facebook did it. Let's face it, once something is out there, it never ever really goes away, whether it be on Facebook or somewhere else.
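
For readers who haven't seen the "mark it deleted" pattern described above, here is a minimal sketch of what it usually looks like. It assumes an in-memory SQLite database; the `posts` table, the `deleted` column, and all function names are hypothetical and say nothing about how Facebook or any particular company actually stores data.

```python
import sqlite3


def make_db():
    # Hypothetical schema: a "deleted" flag instead of real row removal.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts ("
        "id INTEGER PRIMARY KEY, "
        "body TEXT NOT NULL, "
        "deleted INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def add_post(conn, body):
    cur = conn.execute("INSERT INTO posts (body) VALUES (?)", (body,))
    return cur.lastrowid


def soft_delete(conn, post_id):
    # The row stays on disk; only the flag changes.
    conn.execute("UPDATE posts SET deleted = 1 WHERE id = ?", (post_id,))


def visible_posts(conn):
    # What the user-facing site would show.
    rows = conn.execute("SELECT body FROM posts WHERE deleted = 0 ORDER BY id")
    return [row[0] for row in rows]


def all_stored_posts(conn):
    # What is still actually stored, flagged or not.
    rows = conn.execute("SELECT body FROM posts ORDER BY id")
    return [row[0] for row in rows]


conn = make_db()
first = add_post(conn, "hello world")
add_post(conn, "second post")
soft_delete(conn, first)

print(visible_posts(conn))
print(all_stored_posts(conn))
```

The "deleted" post vanishes from `visible_posts` but is still returned by `all_stored_posts`, which is the gap between what the user sees and what a data request can turn up.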
  • Re:LOL (Score:3, Interesting)

    by Anonymous Coward on Tuesday December 13, 2011 @09:51PM (#38364876)

    Yes, they're getting better, but there are inherent problems in the methodologies of stylistic analysis that make any claims of being able to identify authors based on style alone open to extreme skepticism. To put it another way, the only people claiming they can ID you based on how you write are marketing droids or snake-oil salesmen.

    I did some work in a highly related field, stylochronometry. That's the measurement of change over time in a single author's style. The classic problem set for this kind of work is the Platonic corpus: people try to write algorithms to order Plato's writings chronologically. Philosophers want this information so they can trace the development of Plato's thought over time, so they give the problem to computational linguists, who try to measure things like the frequency of certain kinds of sentences or phrases or particles (hard-to-define words that show the relationship between sentences or, even more vaguely, give phrases "flavor") in various texts and then compare those frequencies to generate trends. There's generally an assumption that at least some of these variables will have a linear increase or decrease over time. More problematic, though, is that Plato may have gone back and edited parts of texts or entire texts, and there's some evidence (from outside these methodologies) that indicates this is the case. These problems have caused some (very rightly) to call into question the validity of stylochronometry, and the fact of the matter is that each study that's been done comes up with a different sequence in which the texts were written. It's a lot of effort being thrown at a problem in vain.
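
To make the method concrete, here is a toy sketch of the frequency-and-trend idea described above. It is not any published study's algorithm: the English "particles", the three sample texts, and the function names are all made up for illustration, and the ordering step simply bakes in the assumption that the particle rate rises steadily over the author's career.

```python
import re


def particle_rate(text, particles):
    # Occurrences of the chosen particles per 1,000 words.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in particles)
    return 1000.0 * hits / len(words)


def order_by_rate(texts, particles):
    # Assumes the rate increases monotonically over time, so sorting
    # by rate stands in for sorting by date. That assumption is
    # exactly what later editing or revision would break.
    return sorted(texts, key=lambda name: particle_rate(texts[name], particles))


# Hypothetical stand-ins for Greek particles and for the dialogues.
PARTICLES = {"indeed", "however", "moreover"}
texts = {
    "A": "indeed the soul is immortal however indeed it is so",
    "B": "the soul is immortal and the city is just",
    "C": "however the city is just and the soul is one",
}

print(order_by_rate(texts, PARTICLES))
```

Pick a different particle set, or splice a later revision into text "B", and the inferred order changes, which is the instability the comment complains about.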

    The same problems plague the study of authorship of anonymous internet posts through stylistic analysis. On Slashdot, you can't edit, but you can on blog posts, and you can have multiple authors collaborating without attribution. There's also plagiarism to complicate the number of authors: you don't know if person X's post is entirely his own or if parts were snagged from elsewhere, which would throw an algorithm off track. Most importantly, the basic assumption of stylochronometry, that style changes with time, causes a problem for algorithms that seek to find correlation among posts that were written at different times. Worse, people change their style from day to day or hour to hour (maybe I'm babbling now because I've had a lot of rum; maybe I'm usually more concise) and from context to context (maybe I write one way when responding to some articles, but I cite more sources on others, or I troll in other environments like ZeroHedge, or I use lots of abbreviations when discussing my furry anime fetishes -- rhetoric depends on context).

    Things on the internet won't be traced back to you unless you're a bot that always writes in the same style. And, you'll never discover the order in which Plato wrote his dialogues.

  • by blueg3 ( 192743 ) on Tuesday December 13, 2011 @10:42PM (#38365224)

    Reliably? Yes. Sure, it's easy to delete the copy in the production database. It's harder to prove that if the disks backing the production database were stolen and analyzed, it would be impossible to recover the data. It's harder still to locate and redact every backup of the database that contains the data. (It's even harder still to prove that a copy of the data doesn't persist on another user's hard drive as a result of having viewed the data in a web browser.)

    This is the Cloud Era; you can't reliably delete data any more.

  • by MaskedSlacker ( 911878 ) on Tuesday December 13, 2011 @11:03PM (#38365404)

    I'm pretty sure the buttons actually say "Remove" which is a nifty semantic cheat around that problem.

  • by neoform ( 551705 ) on Wednesday December 14, 2011 @01:09AM (#38366262)

    Very few people understand the technical ramifications of 'deletion' on large infrastructure. It's very likely that Facebook can't actually 'delete', much the same way InnoDB can't recover disk space after a delete (which means the data still exists on the hard drive).

  • by pclminion ( 145572 ) on Wednesday December 14, 2011 @02:54AM (#38366890)

    And if your life was of any interest to anyone, there'd be people working a lot harder to penetrate your privacy.

    You're trying to look at an elephant through a microscope. The danger isn't the violation of any one person's privacy. The danger is the emergence of a kind of "total information awareness," where inferences can be drawn on larger social scales. For instance, detecting when a protest is about to materialize, measuring the effectiveness of propaganda techniques, tracking politically unfavorable trends in conversations, etc.

    I'm not in principle opposed to the ability to do that, but right now the ability is very one sided. Facebook (as well as any government who can order them to do things) has all the information. We don't.

  • by LordLimecat ( 1103839 ) on Wednesday December 14, 2011 @12:08PM (#38370716)

    Good thing they're based in the US, then, huh? Maybe folks in Europe shouldn't be visiting the site if it's illegal.

  • by msobkow ( 48369 ) on Wednesday December 14, 2011 @01:43PM (#38372186)

    I agree, it should be your choice. However, I'm one who really, really likes the idea of keeping an edit history for posts if one so chooses.

    And I can understand why Facebook doesn't actually delete the data, but just flags it as hidden/deleted -- it's a real bear to update and nullify all the object id references to a post in such a mammoth system. There are links all over the place from people whose "feed" pages may reference your post. There are forwards and reposts of your post which create a commented link to your post -- does your right to delete your post mean you have the right to delete the posts of people who've commented on it?

    Given that some of the content links could be in archived databases instead of mainline storage or cache, updating them could be virtually impossible.

    Canada is facing the same issue with its Long Gun Registry being shut down by Harper's Conservative government -- the data is cross-linked throughout government and law enforcement systems, with over a decade of archived databases referencing the LGR databases. Truly deleting the data requires restoring the archived external databases, updating their contents to remove the references, exporting the database for an updated backup, and archiving it for storage.

    Now there's the cascade effect -- any references to the archive disks now have to be updated to reference the new archive database content instead of the original.

    They're currently expecting it to take over FIVE YEARS to purge that one database, and it's pitifully small compared to Facebook or Google.

    Never mind the potential legal issues of external and archive systems that are mandated to be write-once by government legislation, and which have to be retained for 7-10 years in many cases.

    Realistically, a versioning system or flagging content as deleted instead of purging it is the only option available for large systems that maintain historical data of any significant size.
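
As a rough illustration of the versioning approach argued for above, here is a minimal sketch of an append-only post history in which "delete" just adds a tombstone entry. The class, the `feed` list, and the post id 42 are hypothetical; the point is only that other records pointing at the post keep working without anyone having to hunt down and update them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VersionedPost:
    # Append-only history; None marks a "deleted" tombstone.
    versions: List[Optional[str]] = field(default_factory=list)

    def edit(self, body: str) -> None:
        self.versions.append(body)

    def delete(self) -> None:
        # Nothing is purged; the latest version just becomes a tombstone.
        self.versions.append(None)

    def current(self) -> Optional[str]:
        return self.versions[-1] if self.versions else None

    def history(self) -> List[Optional[str]]:
        return list(self.versions)


def render_feed(feed, posts):
    # Feeds reference posts by id, so a tombstoned post still resolves.
    lines = []
    for post_id in feed:
        body = posts[post_id].current()
        lines.append(body if body is not None else "[deleted]")
    return lines


posts = {42: VersionedPost()}
posts[42].edit("first draft")
posts[42].edit("second draft")
feed = [42]  # someone else's feed page pointing at the post

posts[42].delete()

print(render_feed(feed, posts))
print(posts[42].history())
```

The trade-off is the one this whole thread is about: references never dangle, but every earlier version, including the "deleted" one, is still sitting in `history()`.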
