The Library of Congress Will Stop Archiving Every Public Tweet On January 1st (gizmodo.com) 79
An anonymous reader quotes a report from Gizmodo: In 2010, the Library of Congress started archiving every single public tweet that was published on Twitter. It even retroactively acquired all tweets dating back to 2006. But the Library of Congress will stop archiving every tweet on December 31, 2017. The Library of Congress issued a white paper this month saying that it was proud of its comprehensive collection of tweets from the first 12 years of Twitter, but that it's completely unnecessary for it to continue. Instead, the organization will only collect tweets that it deems historically significant. For instance, President Trump's tweets are almost certainly still going to be saved for future generations. One reason that the Library is stopping the comprehensive archive? The social media company's controversial change to allow 280 character tweets. The Library's halt on collection of all tweets puts Twitter more in line with the way that other digital collections are archived, including websites. The Library of Congress only archives websites on a selective basis, unlike the nonprofit, non-governmental organization the Internet Archive, which has a much broader goal of archiving everything online with its Wayback Machine. The Library of Congress also noted that many tweets include photos and video and that it has only been collecting text, making some of its collection worthless.
How much data is that per year? (Score:2)
Re:How much data is that per year? (Score:5, Informative)
Assuming they are only archiving text, I wonder how much storage that requires. Of course it would compress VERY well.
On a good day about 1 bit per character.
Re: (Score:2)
Still, I'm curious to see some real numbers...
Then do something about it. Is your curiosity only ever scratched by asking other people to teach you? For fuck's sake, it will be at the top of pretty much any Google search you can manage to construct about the subject.
Re: (Score:2)
Post-compression could be 1 bit per character on average, I suppose.
The information content of the average tweet is less than 1 bit so compression ratios should be FAR higher than that.
(I wouldn't be surprised if the whole of this year's tweets could be compressed onto a single floppy if we use the previous eleven years as a dictionary).
Re: (Score:2)
Typically you'll see 3:1 to 4:1 for text (obviously you'll trade off speed for compression), so about 2 bits per character. And that is if you don't use a streaming algorithm; with one, you'll see closer to 2:1.
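For anyone who wants to check ratios like these on their own data, here is a minimal sketch using Python's standard zlib and lzma modules. The helper name `bits_per_char` and the sample string are made up for illustration, and the sample is artificially repetitive, so it compresses far better than real tweets would.

```python
import lzma
import zlib


def bits_per_char(text: str, compress) -> float:
    """Compressed size in bits divided by the number of characters."""
    raw = text.encode("utf-8")
    packed = compress(raw)
    return 8 * len(packed) / len(text)


# Hypothetical sample: a repeated sentence, NOT representative of real tweets.
sample = "the quick brown fox jumps over the lazy dog " * 200

print("zlib:", bits_per_char(sample, zlib.compress))
print("lzma:", bits_per_char(sample, lzma.compress))
```

Swap in a real corpus for `sample` to get a meaningful number; short or noisy inputs will come out much worse than long, consistent ones.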
Re:How much data is that per year? (Score:5, Insightful)
1 byte. you mean 1 byte.
No. He means one bit. One byte (8 bits) is completely uncompressed. But English text will compress down by nearly 90%, which leaves about 1 bit per character.
The best compression ratios are for large texts using a consistent writing style and vocabulary, so tweets would yield less than 90% compression, but would likely still be better than 85%.
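The figures in this subthread are the same quantity in different units. A small sketch of the conversion, assuming 8 bits per uncompressed character (plain ASCII); the function names are made up for illustration:

```python
RAW_BITS = 8.0  # assumption: one byte per uncompressed character


def bits_from_savings(savings: float, raw_bits: float = RAW_BITS) -> float:
    """Bits per character left after removing a fraction of the size."""
    return raw_bits * (1.0 - savings)


def bits_from_ratio(ratio: float, raw_bits: float = RAW_BITS) -> float:
    """Bits per character for an N:1 compression ratio."""
    return raw_bits / ratio


print(bits_from_savings(0.90))  # 90% savings
print(bits_from_savings(0.85))  # 85% savings
print(bits_from_ratio(4.0))     # 4:1 ratio
```

So 90% savings is about 0.8 bits per character, 85% is about 1.2, and a 4:1 ratio is 2 bits per character.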
Re:How much data is that per year? (Score:4, Informative)
I'll throw in one more data point. I developed a predictive text entry database for my previous employer - similar to the old T9 ( better obviously since I was involved ;-) ) and for English (and similar languages) it would take about 4 bits per dictionary word you trained (which is less than 1 bit/char since the average word length is a bit over 5). It is worst-case as we are talking about a dictionary, so no repeating words etc that compress a lot - however the information about how long a word is is not included in those 4 bits, so you save there (the way to think about it is that the user provides the length of word knowledge, the linguistic db the rest).
But the idea is that English is pretty compressible...
Re: (Score:2)
Re: (Score:2)
Well, English belongs to the second most compressible group for our technology, along with languages like Spanish and (maybe) German. There was one group that was even more compressible (e.g. Finnish, Italian) - by about 10-20%. Arabic was bad, took twice the space per word, although with only one client asking for it I never tried to see if I could optimize for it. I don't remember Hebrew and I don't see a built db on my disk to extrapolate. Chinese is a whole different story since you store pronunciations
English is about 1 bit per character (Score:4, Informative)
Shannon's paper "Prediction and Entropy of Printed English [google.com]" tries to measure the amount of information per character in English.
He found that English is about 1 bit per character, and so compressed text can be expected to take up about that much room.
The paper is a pretty interesting read if you have read his 1948 paper that defines entropy (also a good read).
He came up with some interesting experimental methods to measure entropy in English.
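To see why context matters for that figure, here is a rough sketch of the simplest possible estimate: the entropy of single-character frequencies, with no context at all. This is not Shannon's prediction experiment, and the function name is made up for illustration. On ordinary English text this context-free number usually comes out at roughly 4 bits per character; the gap down to about 1 bit is what predicting from preceding text buys you.

```python
from collections import Counter
from math import log2


def order0_entropy(text: str) -> float:
    """Entropy in bits per character from single-character frequencies only."""
    n = len(text)
    if n == 0:
        return 0.0
    counts = Counter(text)
    # H = sum over symbols of p * log2(1 / p), with p = count / n
    return sum((c / n) * log2(n / c) for c in counts.values())


print(order0_entropy("aabb"))  # two equally likely symbols
print(order0_entropy("the library of congress archives tweets"))
```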
Re: (Score:2)
This is TWITTER!! The only way to measure its information content is to use negative entropy.
Re: (Score:2)
That is obviously completely optimal compression which is practically impossible to obtain. Also, Twitter is mostly non-English (spelling mistakes, emoji characters etc).
Re: (Score:2)
Re: (Score:2)
1 byte. you mean 1 byte.
1 bit per character is what I mean you ignorant twat.
Re: (Score:2)
Re: (Score:2)
Not sure of the exact amount, but it's less than one Library of Congress' worth.
Re:How much data is that per year? (Score:4, Insightful)
I say it will average 1 Library of Congress to store a Library of Congress worth of data.
Re: (Score:1)
I say it will average 1 Library of Congress to store a Library of Congress worth of data.
That depends on your temporal frame of reference.
Re: (Score:2)
So by those numbers you are at around 30 GB per day. With the increase in tweet size and increased usage over the four years, let's double that, so probably 60 GB per day, ignoring indexes, metadata, linking to users, etc.
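Turning the per-day guess above into a yearly figure is simple arithmetic. A sketch, taking the 30 and 60 GB/day numbers from the comment as given and assuming decimal units (1 TB = 1000 GB), one byte per uncompressed character, and the roughly 1 bit per character discussed elsewhere in the thread; the function names are made up for illustration:

```python
def yearly_terabytes(gb_per_day: float, days: int = 365) -> float:
    """Raw text volume per year in TB, given a daily volume in GB."""
    return gb_per_day * days / 1000.0


def compressed_terabytes(raw_tb: float, bits_per_char: float = 1.0) -> float:
    """Scale raw size by bits_per_char / 8 (assumes 8 raw bits per character)."""
    return raw_tb * bits_per_char / 8.0


for gb in (30.0, 60.0):
    raw = yearly_terabytes(gb)
    print(gb, "GB/day ->", raw, "TB/year raw,", compressed_terabytes(raw), "TB/year compressed")
```

At 60 GB per day that is about 21.9 TB per year raw, or under 3 TB if the text really did compress to 1 bit per character.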
Re: How much data is that per year? (Score:2)
Why did they do this to begin with? (Score:2)
Twitter is little more than a digital version of some a-hole writing something on the wall of a public restroom. Mostly a collection of advertisements and banal BS. It's not like we have someone writing profound tresses on the human condition there.
Hell.. personally I really believe that the entire act of doing this was nothing more than a giant advertising campaign for Twitter using former President Obama's connection to the media.
Re: (Score:3)
Twitter is little more than a digital version of some a-hole writing something on the wall of a public restroom.
Along the line of a-holes on Twitter..
Wasn't it established in a federal court that Trump's tweets amount to official statements and can be cited as effective policy statements? Further, I recall they must be preserved by the official records act.
I don't have the citation handy and don't remember what venue it was but perhaps someone else here can post it.
Re: (Score:2)
Wasn't it established in a federal court that Trump's tweets amount to official statements and can be cited as effective policy statements?
Established that they are official statements? No, but that was claimed earlier this year [nbcnews.com] by Sean Spicer, the White House press secretary at the time.
Re: (Score:2)
Established that they are official statements? No, but that was claimed earlier this year [nbcnews.com] by Sean Spicer, the White House press secretary at the time.
More than just Spicer. The 9th circuit said [slate.com], basically, that they were pretty much the same stature as executive orders:
Re: (Score:2)
The 9th circuit said [slate.com], basically, that they were pretty much the same stature as executive orders
If you read the actual article, instead of just the first sentence which you cited, you'll see that the 9th circuit claimed nothing of the kind. They merely cited a tweet in the context of their ruling that Trump exceeded his statutory authority, and that he had no rationale for his decision:
Indeed, the President recently [tweeted] his assessment that it is the “countries” that are inherently dangerous, rather than the 180 million individual nationals of those countries who are barred from entry under the President’s “travel ban.”
So, the court merely made note of one of his tweets, even though they mocked it. But the court paying attention to them does not elevate them to the status of executive orders. If that were true, then God help us. What
Re: (Score:2)
Some things never change. [pompeiana.org]
Re:Why did they do this to begin with? (Score:5, Insightful)
"Twitter is little more than a digital version of some a-hole writing something on the wall of a public restroom."
Nonetheless historians are studying the graffiti on the walls of Pompeii and Herculaneum.
https://www.smithsonianmag.com... [smithsonianmag.com]
Re:Why did they do this to begin with? (Score:5, Insightful)
Mostly a collection of advertisements and banal BS.
In hindsight, it is often the banalities that are the most interesting. Archaeologists often learn more from looking at ancient garbage dumps than from excavating palaces.
Re: (Score:2)
It's not like we have someone writing profound tresses on the human condition there.
For profound treatises, look hair on Slashdot.
Who knew? (Score:4)
I'm actually more surprised this data collection has gone on at the Library of Congress since 2010, than the news that it's ending.
Now if the story had started with "The NSA...", I would've been quite shocked at its termination.
Posterity (Score:5, Funny)
Archaeologist 1: "Hey, I just discovered a message broadcast by the leader of the once great empire, United States!"
Archaeologist 2: "Marvelous! What's it say?"
Archaeologist 1: 'Let's see..."Rosie O. looks like a horse farted out a prune. Disgusting loser, so sad!"'
Archaeologist 2: "On second thought, let's pretend we never found it."
Re: (Score:2)
Those are not lies, they are "colorful and whimsical interpretations of events and alternative realities".
Re: (Score:1)
so it's as if 1984's Big Brother died and left his brother, Big Bozo, in charge.
Re: (Score:2)
I think the article writer was trying to be PC, and didn't add "saved for future generations", as a warning to society.
Redundancy (Score:1)
The government realized it was useless hosting this data at both the Library of Congress AND the NSA datacenter.
Re: (Score:2)
I would expect the ramblings of famous people, such as William Shatner, to be saved, even if some of it is rather odd. Also, I would say a random sample from everyday people should be saved, just as a representation of the times. However, everyone's cat videos and personal ramblings about their political beliefs probably shouldn't be bothered with, as it would be a waste of space.
All this means is... (Score:1)
All this means is that, in 5000 years, historians will get a dangerously slanted view of what was posted on Twitter - if Trump's tweets are archived, but not the tweets that debunk his claims, then his claims will stand unopposed for future historians to debate.
How do you mean, exactly? (Score:2)
All this means is that, in 5000 years, historians will get a dangerously slanted view of what was posted on Twitter - if Trump's tweets are archived, but not the tweets that debunk his claims, then his claims will stand unopposed for future historians to debate.
How do you mean, exactly?
I'm wondering what danger there could be (5000 years from now), and if we should take steps to avoid it.
Re: (Score:2)
I think it would be a better idea to just fire him and put in someone who isn't a completely worthless fucking pile of despicable immoral dishonest perverted shit.
Re: (Score:1)
Laws and movements are often wildly misconstrued from their original meanings, because those meanings are not properly explained and laid out to start off with.
Look at how much legal effort has gone into interpreting the US Constitution, with significant legal arguments hinging on commas etc. And that document is written in a language which is still spoken.
I take it you've heard the story about how Nero played his fiddle as Rome burned? Yup, that's just one of the versions of how it went down - but it's th
Re: (Score:2)
All this means is that, in 5000 years, historians will get a dangerously slanted view of what was posted on Twitter - if Trump's tweets are archived, but not the tweets that debunk his claims, then his claims will stand unopposed for future historians to debate.
Are you saying that Twitter is the only place where people debunk false claims made by politicians?
Re: (Score:1)
Look around at what has survived the ages so far, and tell me that there isn't a decent chance that quite possibly the twitter archive might be the only thing on certain topics which survives the next few ages - it might not even survive intact.
Why take the chance? Archive everything, or nothing. Archiving the tweets of someone known to be toxic while relying on other external sources to debunk that toxicity shouldn't be a strategy to rely on.
Do you really want Trump to be seen as the voice of reason by de
Re: (Score:2)
Re: (Score:2)
Trump is still in power and waging a war against several mainstream news outlets.
Still feel confident?
Also, I'm sure no one thought that the best record of several dead languages would turn out to be a stone establishing a religious cult and granting tax exemption status to its priests (some things never change). I'm sure the creator of the decree on that stone would have thought his religion would have lasted longer than the stone itself, and yet here we are...
Re: (Score:2)
Looks like the Trump cock suckers are out in force, downvoting things they don't like :D
Proof the U.S. has time-travel technology (Score:2)
This is the best proof yet that the U.S. possesses time-travel technology!
How else does the Library of Congress know which tweets will be of historic significance?
tweets for twits (Score:1)
Wasted US FED money (Score:2)
But the US Gov. is a financial train wreck. We just passed a "TAX CUT" expected to add 1 trillion to the deficit, and we find out that wasteful stuff like this happens all around us.
The Library of Congress is part of the legislative branch, with a budget of about 700M USD.
Re: (Score:2)
And you somehow believe that saving money on the 1/3 of the budget that is discretionary spending is going to save us from the 2/3 that is non-discretionary? You might get a few mill out of that for killing tweets.
While we're on the subject of tilting at windmills, the entire foreign aid budget is less than 50 billion. Saving that won't help and will probably cost money in the long run due to the programs being canceled in countries we'd really like to stand up and not fall over to the local nutjobs.
Re: (Score:2)
Re: (Score:2)
The tax cut had a 1.8 - 2+ trillion dollar price tag. Don't worry though, they've already passed more legislation increasing the cost by at least another 200 billion. So, it's at least a 2 trillion bill (over the next decade alone). And that's if they let the middle class tax cuts expire in 2023 as planned.
Woot! (Score:2)
I know why LOC is doing this (Score:2)
Apparently a single public employee is monopolizing the service and sucking up all the storage space.