Government Communications Social Networks

Library of Congress To Archive All Public Tweets 171

Posted by timothy
from the he-ain't-heavy-he's-less-than-140-chars dept.
After the recent announcement that Groklaw will be archived at the Library of Congress, mjn writes with word that the push to archive more digital content continues: "The US Library of Congress announced a deal with Twitter to archive all public tweets, dating back to Twitter's inception in March 2006. More details at their blog. No word yet on precisely what will be done with the collection, but besides entering your friends' important updates on the quality of breakfast into the permanent archival record, the deal may improve access for researchers wanting to analyze and mine Twitter's giant database."
This discussion has been archived. No new comments can be posted.

  • by Third Position (1725934) on Wednesday April 14, 2010 @03:21PM (#31848468)

    Given the signal to noise ratio for most tweets, I'm not convinced this is a particularly good use of resources...

    Just because you can do something, doesn't mean you have to!

    • by sopssa (1498795) * <sopssa@email.com> on Wednesday April 14, 2010 @03:24PM (#31848502) Journal

      It's not like it takes a lot of space to archive them, it's just 140 characters per tweet. There's a lot of useless information in the newspapers and books too, but they have archived them too because some of that info is valuable or might become valuable.

      • by Anonymous Coward on Wednesday April 14, 2010 @03:41PM (#31848738)

        Hi, @librarycongress! I just took a shit. I am honored that you will be archiving this momentous occasion for future generations.

        • by Dogtanian (588974)

          Hi, @librarycongress! I just took a shit. I am honored that you will be archiving this momentous occasion for future generations.

          Obligatory [penny-arcade.com].

        • by severoon (536737)

          Yes, this truly is a giant database. Let us do math.

          140 characters/tweet * 2 bytes/character * 12E9 tweets = ~3.36TB

          O. M. G. This would fill more than half the hard disk space I have in my NAS...truly massive! (At my company, there was an April Fool's rumor going around on the day that Twitter would be going down for 10 minutes while their high school intern upgraded their "Tweet Storage Unit" (TSU) by adding an extra 2TB drive. Har har! To be fair, they store a good bit of metadata besides the tweet itsel

      • Re: (Score:3, Insightful)

        by blair1q (305137)

        50 million tweets/day
        140 characters of message
        60 bytes of metadata (timestamp, sender id, etc.)

        10 GB of twitter archive per day
        10 TB per 3 years

        What does 1 TB cost these days? about $100?

        Storage space will indeed be an inexpensive part of the cost, and will decline in price at about the same rate the traffic is growing.
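The back-of-envelope estimate above can be checked with a few lines of Python. This is only a sketch under the comment's own figures (50 million tweets/day, 140 bytes of message, 60 bytes of metadata, roughly $100 per TB); it assumes decimal units and a constant tweet rate, and the function and constant names are made up for illustration.

```python
import math

# Figures taken from the comment above; decimal units (1 GB = 10**9 bytes).
TWEETS_PER_DAY = 50_000_000
MESSAGE_BYTES = 140
METADATA_BYTES = 60
USD_PER_TB = 100  # the comment's rough 2010 price guess


def archive_bytes(days: int) -> int:
    """Uncompressed archive size after `days` at a constant tweet rate."""
    return TWEETS_PER_DAY * (MESSAGE_BYTES + METADATA_BYTES) * days


per_day_gb = archive_bytes(1) / 10**9
three_years_tb = archive_bytes(3 * 365) / 10**12
drive_cost_usd = math.ceil(three_years_tb) * USD_PER_TB

print(f"{per_day_gb:.0f} GB per day")
print(f"{three_years_tb:.2f} TB per 3 years")
print(f"about ${drive_cost_usd} in 1 TB drives")
```

Under those assumptions the daily figure comes to 10 GB and three years to just under 11 TB, so the comment's rounding holds.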

        • That's uncompressed. Toss in bzip, gzip -9 or 7za. It's just plain text, so it should compress rather well, probably by more than 80%.

          • by Myopic (18616)

            Indeed I agree, especially given the overlap in the topics people tweet about, thus the words/text used in tweets.
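A rough way to test the compression claim is to run Python's standard `zlib` and `bz2` modules over some sample text. This is a sketch only: the sample below is a made-up, highly repetitive stand-in for tweets, so the savings it reports will overstate what a real archive would see, and the `savings` helper is a hypothetical name.

```python
import bz2
import zlib


def savings(data: bytes, compress) -> float:
    """Fraction of the original size removed (0.8 means 80% smaller)."""
    return 1 - len(compress(data)) / len(data)


# Hypothetical stand-in for a tweet archive: short, repetitive plain text.
# Real tweets are far more varied, so real savings would likely be lower.
topics = ["breakfast", "traffic", "the game", "the weather"]
sample = "\n".join(
    f"user{i % 50}: thinking about {topics[i % len(topics)]} again today"
    for i in range(5000)
).encode("utf-8")

zlib_savings = savings(sample, lambda d: zlib.compress(d, 9))
bzip2_savings = savings(sample, lambda d: bz2.compress(d, 9))

print(f"zlib level 9: {zlib_savings:.0%} smaller")
print(f"bzip2 level 9: {bzip2_savings:.0%} smaller")
```

Swapping `sample` for a file of real tweets would give a more honest number for the ">80%" guess.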

        • Re: (Score:3, Funny)

          by RealGrouchy (943109)

          50 million tweets/day
          140 characters of message
          60 bytes of metadata (timestamp, sender id, etc.)

          10 GB of twitter archive per day
          10 TB per 3 years

          Yes, but how much is all of that in Libraries of Congress?

          - RG>

    • by Anonymous Coward on Wednesday April 14, 2010 @03:25PM (#31848522)

      And just because you don't have to, doesn't mean you shouldn't!

      This is probably the best way to capture a snapshot of our current society. Sure, the barrier for entry is a little lower, but I think this will be invaluable for historians who look back and try to understand us.

      Or, if anything, it'll confuse the hell out of them.

      Everyone wins either way!

      captcha: formally

      • by eln (21727)
        I'm reminded of the Futurama episode where they go to a museum of the 20th century and everything there is ridiculously inaccurate because of how information tends to get lost and garbled over time. I can just imagine what a museum of the 21st century will look like if their primary source is old tweets. They'll probably think our self-imposed 140 character limit was due to some bizarre superstition and we worshiped someone known only as "aplusk" as a God whose wisdom came down to us in the form of what w
        • by pushing-robot (1037830) on Wednesday April 14, 2010 @05:18PM (#31850190)

          It's fun to think of historians as just attributing everything they learn about societies to religion and superstition, but the biggest reason we think pre-Enlightenment civilizations were obsessively religious is because the priest castes were generally among the most literate and the most concerned with preserving knowledge of the past. Much of what we know about history comes through their writings, and therefore their perceptions. They quite literally wrote history, to a large extent, and our understanding of their society is colored by their bias.

          The Information Age has democratized knowledge to a huge degree. Historians centuries or millennia hence will have plenty of sources other than the lens of the Catholic Church. Given current trends, even just a decade from now a few consumer-grade storage devices could hold everything the Library of Congress or Archive.org contains today. As long as there are a few people in the world interested in preserving it, modern history should be safe.

          • by Myopic (18616)

            Hey, that's interesting and insightful. I never thought of it that way before. I wonder if skeptics and nonbelievers were as common then as now. (In America I peg us at about 20% of the population, with about half of us being in the closet.)

            Imagine an ancient ritual sacrifice of a virgin or something, and one fifth of the crowd is sort of rolling their eyes thinking "really? I mean, really? You guys think that stabbing a girl with a hymen is going to bring you blessings from magical beings in the sky? Get a

            • by socsoc (1116769)
              Depending on the geographical region of your anecdote, I'd peg it much higher.
              • by Myopic (18616)

                I live in Wisconsin, grew up in Alaska, and lived for a while in New Hampshire and Massachusetts.

                Also to be clear, for skeptical nonbeliever I refer not only to Christianity or its similar easy-to-characterize religions, but also the Eastern sorts of religions, and the New Age sorts of religion (ghosts, "energy", pagan spirits).

                Obviously, I hope you are right that we amount to greater numbers. Where do you live?

                (Skepticism is a shockingly overdue prevalent worldview.)

                • by socsoc (1116769)

                  California, but I'd say that it's closer to 40 (more if we include skeptics that attend church services just to appease someone else). And maybe it's due to the 20s-30s age group too.

                  In WI, I am actually surprised at 20%. I've always felt like an outsider (admittedly not urban areas).

                  • by Myopic (18616)

                    Well, if it's 40% then that's pretty good. But, being from California, can you disclaim the accusation that many Californians are New Age-y hippie cranks?

                    In any case, let's hope our numbers keep increasing. If we get to the 40% you suggest, then I think we might start seeing some Out Atheist politicians.

      • by uncqual (836337)
        Or, it might cause historians to think there was a Little Dark Age in the early part of the 21st century.

        Now, if the LOC would archive /., the historians would know there was a Little Dark Age in the early part of the 21st century (and this post would be evidence that the denizens of the Little Dark Age even knew they were living in such a time).
        • Re: (Score:3, Funny)

          by russotto (537200)

          Now, if the LOC would archive /., the historians would know there was a Little Dark Age in the early part of the 21st century (and this post would be evidence that the denizens of the Little Dark Age even knew they were living in such a time).

          When the historians of the 50th century unearth the records of /., they'll realize the Final Dark Age came upon humans in the early part of the 21st century, and that while many saw something happening, none realized the extent. And then they'll click their mandibles

      • by mdm-adph (1030332)

        You have to remember that the people usually shouting "wargarbal waste of money" to scientific situations such as these aren't the type to give two shits as to generations that come after them, as we've all seen. :(

        Future historians? These people are trying to burn history books today.

      • If they'll be looking at Twitter I don't think I want future historians to understand us.
    • Re: (Score:2, Insightful)

      by mlush (620447)

      Given the signal to noise ratio for most tweets, I'm not convinced this is a particularly good use of resources...

      Just because you can do something, doesn't mean you have to!

      It's a fantastic idea. It's probably only a few TB of data, but it represents the unedited reaction of ordinary people to historical events and a detailed insight into their everyday lives.

    • We learned more about ancient Egypt from their twitter than from all the official records designed to survive the ages. Sure sure, very interesting to read the "unbiased" record of a pharaoh in his own tomb, but it is from the "trash" notes that were recovered that we learned about how the country itself worked. Including such little details as that the pyramids were not made by slaves.

      The official records of the US will be Fox news. Better pray that future researchers have access to some other source,

    • Given the signal to noise ratio for most tweets, I'm not convinced this is a particularly good use of resources...

      Obviously it's a project really funded by the DOD, the highest quality source of entropy yet for cryptography.

  • by Pojut (1027544)

    I could see them archiving tweets that were relevant to pop culture or history...but all of them??? Seems like a waste of time and money to me.

    • Re:hmm... (Score:4, Insightful)

      by Captain Splendid (673276) <capsplendid@@@gmail...com> on Wednesday April 14, 2010 @03:24PM (#31848518) Homepage Journal
      I'm thinking the byte limit on tweets is the main factor here...easier to just scoop 'em all up than to figure how to get the "important" ones.
      • Re:hmm... (Score:4, Insightful)

        by Trepidity (597) <delirium-slashdot AT hackish DOT org> on Wednesday April 14, 2010 @03:34PM (#31848652)

        I suspect a lot of the interesting information is in the aggregate anyway, not individual tweets: things like trends, analysis of subgroups, linguistic analysis, etc.

    • Re:hmm... (Score:4, Insightful)

      by bugi (8479) on Wednesday April 14, 2010 @03:26PM (#31848540)

      all of them???

      Disk space is cheap...

      They should get a copy of the internet archive while they're at it.

      • Re: (Score:3, Funny)

        by Shakrai (717556)

        They should get a copy of the internet archive while they're at it.

        And alt.binaries too. Think of the "research" potential there... ;)

        • by bugi (8479)

          alt.binaries too

          Good idea. Maybe linux-kernel too. Is there a better example of large scale teamwork? For coding, I mean. Not for documenting the downfall of the US legal system.

        • You can make it happen. Come up with a method to encode alt.binaries in 140-character chunks and the Library will archive them all for you.

      • by Dogtanian (588974)

        Disk space is cheap...

        Since it's "twitter", surely that should be "cheep"?

        Uh, sorry. :-(

        Anyway, if Twitter messages are 140 bytes and we assume the overhead averages 30% per message, that's 187 bytes per message.

        5.5 tweets per metric kilobyte.
        5475 tweets per megabyte.
        5,475,935 tweets per gigabyte.
        5,475,935,828 tweets per terabyte.

        Which isn't far short of the earth's population. Figure out the average number of tweets per person on earth, and you know how many $60 1TB hard drives you need to store them all.

        The questi
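The per-terabyte arithmetic in this comment can be reproduced with a short Python sketch. It assumes decimal terabytes (10**12 bytes), which is how drives are sold, so its results differ slightly from the figures quoted above. Note also that 140 bytes plus 30% is 182 bytes, while the comment's 187 bytes corresponds to roughly one-third overhead, so both are computed. The 12 billion tweet total is borrowed from severoon's earlier estimate, and all names are illustrative.

```python
TWEET_BYTES = 140
TERABYTE = 10**12  # decimal terabyte, as drive vendors count it


def bytes_per_tweet(overhead_percent: int) -> int:
    """Message size plus overhead, rounded up to a whole byte."""
    return (TWEET_BYTES * (100 + overhead_percent) + 99) // 100


def tweets_per_terabyte(record_bytes: int) -> int:
    """How many whole records of `record_bytes` fit in one terabyte."""
    return TERABYTE // record_bytes


def drives_needed(total_tweets: int, record_bytes: int) -> int:
    """Number of 1 TB drives needed, rounded up."""
    return -(-total_tweets * record_bytes // TERABYTE)


for size in (bytes_per_tweet(30), 187):
    print(f"{size} bytes/tweet: {tweets_per_terabyte(size):,} tweets per TB")

# Assumed total: the ~12 billion tweets used in an earlier comment.
print(drives_needed(12_000_000_000, 187), "drives of 1 TB each")
```

Either way the answer lands at a bit over five billion tweets per terabyte, so the comment's conclusion is unchanged.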

    • by sopssa (1498795) *

      Historically, only popular news or writings were archived. Wouldn't it be interesting to see what someone else, normal people, said about Shakespeare or some kings 1000 years from now? All we have now is what was archived - popular writings that governments agreed to.

      • by Shakrai (717556)

        Wouldn't it be interesting to see what someone else, normal people, said about Shakespeare or some kings 1000 years from now?

        They were probably too busy watching Medieval Idol to even realize who Shakespeare or the King was ;)

        All we have now is what was archived - popular writings that governments agreed to.

        Which is all we'll have in the future, unless you think the United States Government is liable to be around in a thousand years.

        • Re:hmm... (Score:5, Interesting)

          by natehoy (1608657) on Wednesday April 14, 2010 @03:48PM (#31848848) Journal

          They were probably too busy watching Medieval Idol to even realize who Shakespeare or the King was ;)

          A jest, I know, but it does demonstrate a serious point.

          Our history books are based on records maintained by the winners of wars, the leaders, the successful, etc. We know a lot about Shakespeare. We know relatively little about how his audiences actually felt about his work.

          We largely speculate as to how life was for the ordinary folk during historical periods based on writings about them, not writings from them. The exception to this is diaries, and not many people maintain those any more. Twitter can help replace some of that perspective.

          Admittedly, Twitter is not an ideal way to get a picture of a society, but you get to hear historical events told from a very different perspective. Actually, you get to hear them from LOTS of perspectives. They may not be an accurate portrayal of the events, but they are a snapshot of how a society reacts to and perceives events.

          Who will represent the narcissists in society for future generations?

          • by kencurry (471519)
            but most people tweet about mundane crap, not what happened on Capitol Hill. i.e., signal to noise will be horrible for trying to decipher What the Hell Happened...
            • but most people tweet about mundane crap, not what happened on Capitol Hill. i.e., signal to noise will be horrible for trying to decipher What the Hell Happened...

              Not really. I enter "White house" on Twitter's own search features and there is only about 30% noise, 70% stuff relevant to my topic.

              So, in the year 31000 when they discover this data cache from the year 2010, they'll have search algorithms better than we could possibly conceive.

            • but most people tweet about mundane crap, not what happened on Capitol Hill

              Which shows something important in itself - that most people don't care all that much about most of what happens on Capitol Hill.

          • by blair1q (305137)

            That in fact is an ideal reason to do this, and twitter is nearly the ideal forum. The only hole in it is that some people aren't represented. Those who are over- or under-represented can be identified and the weight of their observations adjusted. But those who simply are not recorded will not have had an opinion at all.

            The real problem here is, the LoC is a government entity, and all my experiences with technology provided by government entities have left me less than impressed. Searching the LoC's arc

          • The exception to this is diaries, and not many people maintain those any more.

            Maybe not in written paper form, but certainly many people maintain and update their own blogs, notes, and other status updates on things like Myspace, Facebook, and blogspot. Surely those resources would be a good source for the same type of information that is maintained in diaries. I suppose diaries had/have the added advantage of usually being considered private, so more information may be disclosed in them. However, it's become pretty apparent that there are still many netizens that don't think enough

          • I think Twitter is the ideal way to get a picture of a society. What people say on a daily, mundane level is pretty much what a society IS. The average schmuck doesn't give a rat's ass about what goes on on Capitol Hill (if they even know what Capitol Hill is). A society is made up of people, not leaders.
          • by Jay L (74152) *

            Who will represent the narcissists in society for future generations?

            I will, of course, as I'm sure you all assumed.

        • by lennier (44736)

          They were probably too busy watching Medieval Idol to even realize who Shakespeare or the King was ;)

          Shakespeare was Renaissance English Idol, while Chaucer slammed the Medieval category.

          Just because something is now stuffy 'literature' doesn't mean it wasn't wildly populist entertainment in its time. There's a reason why a lot of Shakespeare centers on drunks, crossdressing and hitting people with swords.

          • by socsoc (1116769)

            drunks, crossdressing and hitting people with swords.

            So you're saying that we should archive /b/?

      • by Pojut (1027544)

        Hmm...that is a good point...

  • by Ordonator (1539087) on Wednesday April 14, 2010 @03:24PM (#31848516)
    Clearly, once they've finished, they plan to destroy the entire world so that they can claim to have truly archived all human knowledge, forever.
  • I think it's a really bad idea to define measurement units recursively.

    1 new Tweet = 0.00000000000000017263 ( the current LoC + the new Tweet )
  • The only time... (Score:3, Interesting)

    by comm2k (961394) on Wednesday April 14, 2010 @03:28PM (#31848556)
    The only time I really actively used Twitter was during the recent LHC 3.5TeV event, because the webstream was completely overloaded. LoC preserving it? Future generations will look back and conclude that some people REALLY did have TOO much time and trivial stuff to share.
    • Re: (Score:3, Interesting)

      by jfengel (409917)

      Future generations will look back and conclude that some people REALLY did have TOO much time and trivial stuff to share.

      Sure, why not? You never know what sort of insights you'll get. What people do in their free time is just as important to historians as what they do when they're working. More so, sometimes, since the work is often ephemeral while the free time is an important insight into the culture as a whole.

      Most of it's garbage, but garbage middens are one of anthropology's favorite data sources.

    • by Pojut (1027544)

      I find it to be an extremely useful tool for keeping up on various personalities and the goings-on behind the scenes at certain websites. A sampling of the list of the people I follow:

      PADnD (Penny Arcade live tweets their Dungeons and Dragons games)
      mattsinger (critic for IFC)
      aedavis (Ashley Davis, who draws Once Upon a Pixel)
      washcaps (Washington Caps Hockey official twitter)
      mcps (Montgomery County Public Schools, who my fiancee works for)
      CameronPierce (Bizarro author)
      CERN (LHC stuff, obviously)
      BenKuchera (

    • Future generations will look back and conclude that some people REALLY did have TOO much time and trivial stuff to share.

      Which is why it's important that we store this information. We know what the history books are going to say. We know that the War on Terror will come out to either be a horrible atrocity that humankind should never try to re-attempt, or it will be declared a huge success that ushered in a new era of peace and stability. People will ask "I wonder what was going through people's heads?"

      And this is the PERFECT example. It will show that a lot of people didn't do anything, and they'll probably infer it to be Ap

  • Great, we've got a variable constant now.

    • by Qzukk (229616)

      Great, we've got a variable constant now.

      Don't worry, we'll just set up a system to tweet the new value whenever it changes ;)

    • too bad your tinyurl won't be archived. maybe if they did then you can set them up to recursively archive itself. hmmmmmm.
      • by Trepidity (597)

        The LoC isn't archiving URL shortener targets (yet, anyway), but the Internet Archive is on it [archive.org], which at least ups the likelihood that some future researcher will be able to decode what those links pointed to.

  • If they think tweets are worthy of being archived why not just archive every blog and comment in existence? Many of those offer far more worthwhile insight than 99% of tweets.

    I remember in school students and sometimes teachers occasionally mocking the customs of past cultures. There was always that subtle arrogance that we're somehow more enlightened than people were 500, 1000 or 2000 years ago. The problem is that people confuse technological advancements for intellectual and philosophical advancement. I'

    • by Intron (870560)

      If they think tweets are worthy of being archived why not just archive every blog and comment in existence? Many of those offer far more worthwhile insight than 99% of tweets.

      There is a slippery slope here. What happens when they try to archive the Library of Congress within the LOC? The recursive archiving would destroy them.

      With the massive proliferation of every last inane comment preserved for posterity I can only imagine how utterly stupid we are going to look to people of the future.

      Take that, future people!

  • Library of Congress To Archive All Public Tweets

    ... Twitter's new (publicly funded) "Backup and Data Retention Plan".

    Okay, I'm sure someone (probably The Daily Show) will, at some point, find something useful in all that noise.

  • Legal implications? (Score:3, Interesting)

    by slimjim8094 (941042) <slashdot3 AT justconnected DOT net> on Wednesday April 14, 2010 @03:35PM (#31848662)

    All 'useless twits' jokes aside, this is pretty interesting. But I wonder if they'd run into any copyright laws.

    Reading the Twitter ToS turns up this:

    You retain your rights to any Content you submit, post or display on or through the Services. By submitting, posting or displaying Content on or through the Services, you grant us a worldwide, non-exclusive, royalty-free license (with the right to sublicense) to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute such Content in any and all media or distribution methods (now known or later developed).

    which looks to me like posters retain copyright, but Twitter retains the right to grant others the same license you've granted them (non-exclusive license to provide their service).

    So based on my reading, Twitter (and the LoC) are in the clear?

    • I think this would be legal regardless of what the ToS says. See the exemptions given to libraries and archives in 17 USC 108 [copyright.gov].

    • by petsounds (593538)
      It seems like your reading is probably right, but I would hope they would at least anonymize the data. It seems like quite the invasion. Right now, one can only find tweets from a few weeks prior in Twitter's public search. Now anyone can request any prior tweet.
  • Small data set (Score:4, Interesting)

    by fulldecent (598482) on Wednesday April 14, 2010 @03:46PM (#31848796) Homepage

    Math for the day:

    Without compression, all tweets in human history will fit on a single hard drive costing less than $100.

    http://search.twitter.com/search?q=a [twitter.com] (to find the latest tweet number)
    http://twitter.com/about [twitter.com] (character limit)
    http://www.pricewatch.com/hard_removable_drives/ [pricewatch.com] (1.5TB drive)

    http://www.google.com/buzz/fulldecent/18tfNfPHSBp/Math-for-the-day-Without-compression-all-tweets-in [google.com]

  • I wonder if they're going to archive stuff like identi.ca too, or any other related platform.
  • While on the whole twitter is very important, most likely in an importance vs amount comparison it would rate as one of the lowest scoring collections of data of all time.

  • You mean advertisers and Stasi. Ugh.

    Yeah, yeah, it's public. Agreed. And everybody knows there's no difference whatsoever between what some guy can read and an exhaustive, automated audit trail and connection map of everything that has ever been posted. That's why nobody uses search engines, after all.
  • by SeaFox (739806)

    All your tweet belong to us!

  • They should have been archiving Usenet from the beginning.

  • And don’t even ask about Wikileaks as a whole...

  • Given that we can store almost 525 bytes [ksplice.com] of data in a single twit (I refuse to call them tweets), which is enough for a sector of data plus metadata, could it now mean we can store our data permanently at taxpayer's expense?

    I call it TwitterShare as a play on RapidShare to send files easily... and now those files will be forever archived. Sounds like a good way to back up data to me! Other than letting everyone else in the world see your files...

  • ...of archived gopherspace content I'm willing to donate to the LoC. Seems to me this dated mother lode of data would have far more historical significance and impact than thousands upon thousands of dissociated mindfarts.

  • I'm putting my Library of Congress stock recommendation to STRONG SELL.

  • I find it quite ironic that the Library of Congress would be spending time archiving totally useless things like twitter.com postings, at the same time ignoring the thousands (if not hundreds of thousands or millions) of books in their archive that they have yet to make public. I would say their first priority should be in making sure that everything that is in their actual Library gets put online and made public first, then after that work is done, then talk about doing other things. It is all a pretty big waste
  • How long before someone comes up with a scheme to back up files in encoded tweets "for posterity"?

    Seriously, they should be spending their effort on funding or replicating the Internet Archive instead.
