Online Document Search Reveals Secrets
An anonymous reader writes "New Scientist is reporting that many documents published online may unintentionally reveal sensitive corporate or personal information, according to a US computer researcher. Simon Byers, at AT&T's research laboratory in the US, was able to unearth hidden information from many thousands of Microsoft Word documents posted online using a few freely available software tools and some basic programming techniques." Update: 08/16 19:06 GMT by H : The story is originally from Crypto-gram, not New Scientist.
Nothing New (Score:5, Insightful)
Re:Nothing New (Score:4, Informative)
"For example, in 2002 the Washington Post published a version of a letter sent by the Washington sniper in Adobe PDF format. Names and telephone numbers were visibly blacked out, but still found embedded in the file."
Re:Nothing New (Score:5, Funny)
which is why you should use latex [acm.org]! nobody understands that stuff. security through obscurity!
Re:Nothing New (Score:2)
$ strings filename.doc
It works for me.
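For anyone without strings(1) handy, here is a rough Python approximation of what it does on a .doc file. This is only a sketch: the function name `ascii_strings` and the default minimum length of 4 are my own choices for illustration, and the real utility has more options than this.

```python
import re


def ascii_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return every run of at least min_len printable ASCII bytes in data."""
    # 0x20-0x7e covers space through tilde, i.e. printable ASCII.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


if __name__ == "__main__":
    import sys

    # Usage: python ascii_strings.py filename.doc
    with open(sys.argv[1], "rb") as fh:
        for s in ascii_strings(fh.read()):
            print(s)
```

It treats the file as opaque bytes, so it knows nothing about the Word format. That is the point: leftover text shows up whether or not the application would display it.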
Re:Nothing New (Score:5, Informative)
That is because the people who published the PDF were idiots.
Acrobat has a number of commenting tools. What the Washington Post staff did in that case was use the Highlight tool, set the color to black, and use it to draw over the names.
Only problem? The highlighter is an object that is drawn on top of the text object it is attached to. The underlying text is not modified at all. In fact, if you watch closely, you can see the name for a split second before the renderer draws the highlights.
If the Washington Post had used the TouchUp Text tool to delete the names, the information would not have been leaked.
Nathan
Re:Nothing New (Score:2)
Acrobat has a number of commenting tools. What the Washington Post staff did in that case was use the Highlight tool, set the color to black, and use it to draw over the names.
And that makes you an idiot? Not tech savvy, maybe, but that's the exact thing you'd do in releasing hardcopy, and unless you think in terms of the internals of a computer, there's no reason you'd think twice about doing that.
Re:Nothing New (Score:2)
Or is competence no longer a job requirement?
Re:Nothing New (Score:3, Interesting)
Yes, we were idiots. I work for the Post in a limited capacity, and we now have a sheet of paper on a quite visible bulletin board describing how we were idiots.
The
Re:Nothing New (Score:2)
Edit in vi
Run over custom script to add basic HTML
Works for me
Or just use LaTeX
Rus
Re:Nothing New (Score:2)
Anyway, you don't need the cat -- strings filename does what you want.
Re:Nothing New (Score:2)
OH NO! (Score:3, Insightful)
This isn't just nothing new, it's old news. Wasn't this how they caught the guy who wrote the Melissa virus? When that little popup window from MS Office came up asking for their personal info, did they just think Office was trying to get to know them better, in order to be their friend?
It's just silly pressmongering. Those dumbasses have to come up with a terrifying computer factoid every day, or the ignorant compu-phobes they prey on might come to their senses.
Just my o
eh? (Score:3, Interesting)
of course you could always try http://searchpdf.adobe.com/
Now there's a way to search through more than a million summaries of Adobe(R) Portable Document Format (PDF) files on the Web. Your search results will allow you to see the summaries before deciding to view the original Adobe PDF.
WHAT?!?? (Score:5, Funny)
From the article:
I just created a Word document, blah.doc and put some text into it. I made sure I had a couple of undo points. I closed it and opened it back up, I couldn't undo SHIT. So where the hell am I being granted this mysterious "convenience?"
I know that the guy stressed the fact that Microsoft isn't alone in this distinction, but this is just another example of why Microsoft SUCKS.
I put the doc in a samba share and viewed it with vi. I found the path to the doc, the original name, my userid on my laptop, and the company name. All were hidden from the simple searches like this:
s.l.a.s.h.d.o.t...o.r.g
WTF?!?
Oh, WAIT a minute! This is also from the article:
WHEW! I feel so much better. Please disregard the first six paragraphs. Thanks.
Re:WHAT?!?? (Score:2)
It's not hidden. It's unicode (double-byte), just that;
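To illustrate the double-byte point: text stored as UTF-16LE has a NUL byte after every ASCII character, which is why it shows up in vi as s.l.a.s.h.d.o.t and why a plain byte search misses it. A minimal sketch of pulling such strings out follows; the function name `utf16le_strings` is made up for this example, and it only catches characters in the printable ASCII range. (GNU strings can do something similar with `-e l`.)

```python
import re


def utf16le_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least min_len printable ASCII characters
    that are stored as UTF-16LE (each character followed by a NUL byte)."""
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]
```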
Re:WHAT?!?? (Score:5, Funny)
I.
Re:WHAT?!?? (Score:2)
Thanks
Re:WHAT?!?? (Score:2)
You have to turn on the "Track Changes" (under "tools") feature and then make some changes (then save it, etc.)
Why Word Does This (Score:5, Informative)
You're not.
There are two ways of saving a Word document:
Fast Save dumps the binary from memory into the file. Full Save compacts the binary image, and reorders it. This takes time.
Word's text stream is stored using a piece table [unm.edu]. One of the benefits of a piece table is that if you keep the meta information about the text, you can get nearly infinite undo. The way it does this is by having an original data stream, and an appended data stream. Whenever you add data to the file, it gets added as a chunk to the end of the appended data stream. Whenever you delete, the meta table is updated to remove the text from the stream, but otherwise the text itself is left unaffected.
As a result, text is never removed from the document. A Fast Save (which is the default) under Word dumps the Piece Table as-is (there is probably some compaction over time to remove the no-longer-used data, but it probably only occurs above a given threshold of used to unused text). A full save deconstructs the piece table's meta information, and turns it back into one contiguous stream of data.
It's all just a function of the way the text is stored while it's being edited. Different editors have different mechanisms; some store data based on lines, and some store it using a gap buffer. But ultimately, the problem exists because Word uses a piece table, and it dumps the entire table to a file by default.
It's actually a sensible way of handling the text data. However, whoever designed the Fast Save algorithm probably didn't consider the ramifications of the text still being stored in the document. The best workaround? Wipe the unused sections of the piece table. But then you might as well return to using a Full Save, as you'll be ditching the performance benefits anyway.
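A toy version of the idea, for the curious. This is a generic piece table sketch in Python, not Word's actual on-disk format; the class and method names (`PieceTable`, `fast_save`, `full_save`) are invented here to mirror the description above, and the "fast save" simply dumps both buffers as-is.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    source: str  # "original" or "added": which buffer this piece points into
    start: int   # offset of the piece within that buffer
    length: int  # number of characters in the piece


class PieceTable:
    def __init__(self, text: str = ""):
        self.original = text  # never modified after load
        self.added = ""       # append-only buffer for inserted text
        self.pieces = [Piece("original", 0, len(text))] if text else []

    def _buffer(self, piece: Piece) -> str:
        return self.original if piece.source == "original" else self.added

    def text(self) -> str:
        """The visible document: the pieces concatenated in order."""
        return "".join(
            self._buffer(p)[p.start:p.start + p.length] for p in self.pieces
        )

    def _split_at(self, pos: int) -> int:
        """Make sure a piece boundary exists at pos; return the index
        of the piece that starts there."""
        offset = 0
        for i, p in enumerate(self.pieces):
            if pos == offset:
                return i
            if offset < pos < offset + p.length:
                cut = pos - offset
                left = Piece(p.source, p.start, cut)
                right = Piece(p.source, p.start + cut, p.length - cut)
                self.pieces[i:i + 1] = [left, right]
                return i + 1
            offset += p.length
        if pos == offset:
            return len(self.pieces)
        raise IndexError("position past end of document")

    def insert(self, pos: int, s: str) -> None:
        if not s:
            return
        i = self._split_at(pos)
        # New text is appended to the add buffer; only the table changes.
        self.pieces.insert(i, Piece("added", len(self.added), len(s)))
        self.added += s

    def delete(self, pos: int, length: int) -> None:
        if length <= 0:
            return
        i = self._split_at(pos)
        j = self._split_at(pos + length)
        # Only the table entries go away; the buffers are left untouched.
        del self.pieces[i:j]

    def fast_save(self) -> str:
        """Dump both buffers as they are, deleted text included."""
        return self.original + self.added

    def full_save(self) -> str:
        """Flatten the table into one contiguous stream of visible text."""
        return self.text()
```

Delete "$90,000" from a letter and type "$80,000" in its place, and `full_save()` contains only the new figure while `fast_save()` still carries both.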
Simon
Re:Why Word Does This (Score:3, Insightful)
At the moment the user hits "save", "fast save" is faster because Word doesn't have to do any re-interpreting of what is already in memory. This step is what makes full save slower. But the re-interpreting doesn't have to happen at the moment the user hits "save", it can happen all the time while the user is
Re:WHAT?!?? (Score:5, Insightful)
"You only have the convenience while the file is open. If you could undo after you re-opened a file, these "hidden secrets" wouldn't be hidden at all!"
Exactly. I knew that to begin with, but I did it and then vi'd the file to confirm. If I delete text from a document, that means I don't want that text in the document. Neil Laver says "...hidden information can "incredibly useful" in improving the functionality of the software."
So my main point is, if I am being supposedly CONVENIENCED by this "feature," HOW is the software helping me by storing these things in my document?
Word attachments (Score:2)
Prediction (Score:2, Insightful)
Re:Prediction (Score:3, Insightful)
I had no idea that the sloppy handling of non-displayed data in output files (not just Word, mind you), and their publication on the web was actually Another Way For The Man To Keep Us Down...
Re:Prediction (Score:3, Insightful)
I recall an article (possibly here) about companies using this "feature" on job applicants to read what was in previous versions. For example, you overwrite this letter
with
It would be easy for MS to see that you are asking IBM for $10,000 less. Letter-writing skills notwithstanding, I don'
if anything, the opposite (Score:3, Interesting)
Aside from the paranoia overtones, I still disagree. The tools for doing this are on the web. Right now. So in other words, a w
It's been said hundreds if not thousands of times: (Score:5, Insightful)
Lots of people put sensitive documents in public webspace, primarily because they don't know any better. Eventually the cost-benefit analysis will be done, and corporations will pay to have their users trained. Until then, this type of thing will continue to happen.
Re:It's been said hundreds if not thousands of tim (Score:5, Insightful)
So a relatively security-conscious person who just doesn't know anything about Word file formats could easily publish something online on purpose without knowing that there is (invisible) sensitive information in it, even if they'd never put that information in a public place on purpose.
[TMB]
It comes down to one question (Score:2)
I thought this was common knowledge? (Score:5, Interesting)
Well, it is amongst people who object to being mailed Word documents, anyway. They're just a really bad format for publishing information in.
See Richard Stallman's [gnu.org] 'no-word-attachments' article, for example...
Re:I thought this was common knowledge? (Score:3, Interesting)
Hell, this is how slashdot figured out that the Microsoft Switch [slashdot.org] was a fake.
Re:I thought this was common knowledge? (Score:2, Funny)
My friend got so tired of people on his team sending him Word docs that he learned TeX and started sending his replies that way. When he feels really nasty about it, he sends the .dvi files.
An Important Question (Score:4, Interesting)
For example, do OpenOffice/StarOffice and other open source programs have the same security problem?
Re:An Important Question (Score:5, Informative)
It does, however, save things like when the document was last printed, how often it has been edited and by whom, etc. unless you tell it otherwise. It's easy to get rid of the data (there is a huge "Delete" button in the properties dialog), but not many people will be aware of it.
So, basically, if you don't know what you are doing, you could give out more information than you want to with your OOo files.
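If you want to see what your own files carry, here is a small sketch. It assumes the document is an OpenOffice.org-style zip archive with the document properties stored in a member called `meta.xml`; the helper name `read_meta_xml` is just for illustration.

```python
import zipfile


def read_meta_xml(path_or_file) -> str:
    """Return the raw meta.xml from an OpenOffice.org-style document.

    Assumes the document is a zip archive containing a meta.xml member;
    raises KeyError if that member is missing.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        return archive.read("meta.xml").decode("utf-8")
```

Printing the result before you publish a file shows whatever properties are about to go out with it.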
Re:An Important Question (Score:2)
bsdtype license = free to steal
Re:An Important Question (Score:2)
bsdtype license = freedom for coder who would like to make that code their own.
As to which is free in the sense of no cost for those who want to use it, without a doubt it's BSD. Those who release gpl'd code aren't working for free, they want their payment in code instead of cash. Considering human nature, I find the gpl to be an ideal that is definitely in closer harmony with a man, and at the same time still an ideal of what will benefit mankind.
But
Re:An Important Question (Score:3, Informative)
In principle there is no problem using that with any version management system, CVS, RCS, Subversion etc should work fine with it. You'll be more happy to have an XML-aware diff at least, though - my simple test doc ended up with all content in a single long line.
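One workaround for the single-long-line problem, short of a real XML-aware diff, is to re-indent the XML before handing it to a line-based diff or version control. A minimal sketch with Python's standard library follows; the function name `pretty_xml` is made up for this example, and re-indenting changes whitespace, so use the output for comparison only, not as a replacement document.

```python
from xml.dom import minidom


def pretty_xml(xml_text: str) -> str:
    """Re-indent XML so each element lands on its own line,
    which makes line-based diffs of the content readable."""
    return minidom.parseString(xml_text).toprettyxml(indent="  ")
```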
Well... (Score:3, Funny)
Are you going to share that info or what?
Throw it up on freenet man!
infrastructure data? (Score:2, Offtopic)
An accomplished searcher can learn much about the world we live in, as slashdot reported some time ago [slashdot.org].
An interesting reminder, to be sure, given yesterday's blackout [slashdot.org].
Makes a guy wonder just how much is still available regarding key electrical and telephone infrastructure. Emergency power capabilities of broadcasters (radio, television, mobile phone). Gas lines, in the parts of the country that have them. Water systems. There's likely a bunch of data out there, ready to be mined.
LaTeX (Score:4, Funny)
Re:LaTeX (Score:5, Funny)
OMG (Score:3, Funny)
How long until someone blames Microsoft, I wonder...
Re:OMG (Score:2)
True story. (Score:5, Interesting)
Anyway, I have to admit that I was also burned by Word. I was in the habit of opening the last memo I wrote from the recent documents list and using it as the starting point for newer ones. At some point, I put a bunch of policy statements on a CD and was later told that everyone was reading the hidden text. Doh!
This was back in the days of Office 97, I believe. I'm not sure if Office 2k or XP still have this feature/bug.
Re:True story. (Score:2, Informative)
because they select "send document" from the File menu and get a blank email with the document attached
Re:True story. (Score:4, Interesting)
Dang... (Score:5, Funny)
Tools (Score:2)
Rus
Job Recruiters (Score:5, Interesting)
Re:Job Recruiters (Score:3, Funny)
A colleague on the review team who didn't use Windows turned to strings(1) to get the data from these documents, which yielded us the information that a *lot* of this guy's other prospects were also his current top choice. M
Helpful Hint (Score:5, Funny)
Remember kids: strings is your friend. If you happen to get a job offer in the form of a Word document and the HR drone who sent it to you wasn't careful, you can often see the version that got sent to other candidates and, more importantly, how much money they were offered. It can do wonders for your bargaining position.
Re:Helpful Hint (Score:3, Informative)
XRAY is also handy for pulling text out of executables. Frex, a brief rant about upper manageme
Not just documents (Score:3, Informative)
Clippy did it (Score:5, Funny)
Would you like to...
1. Divulge corporate secrets?
2. List your passwords?
3. Remove KB823980 and open port 135?
It looks like you're trying to close Clippy.
Would you like to...
1. Shit in your hat?
2. Put fist through bling bling flat panel?
3. Go home for teh weekend?
Check this out... (Score:4, Informative)
Near the bottom there is often information from other documents of the sender that they were recently working on. I don't know why it saves this. Maybe something to do with the undo buffer?
At work I used to look at internal memos that would be sent out on a weekly basis and find out all sorts of other stuff that was going on.
My 2c.. and a terrible pun. (Score:5, Interesting)
One of my clients was recently caught out when google indexed private metadata she didn't know was still there, so I can well understand the gravity [google.com] of this situation.
Re:My 2c.. and a terrible pun. (Score:2, Interesting)
If you didn't try that 'gravity' link in the parent, check this out [google.com]. Google calculator -- takes input in standard algebraic format, and knows some variables and units too (such as "G" being the universal gravitational constant, "mass of earth", and "radius of earth"), so you can just use the variable name and google fills in the values, converts units as needed, and gives a numeric result. Nice.
However, unless I'm doing something w
Re:My 2c.. and a terrible pun. (Score:2)
i have my own special program that does this... (Score:4, Funny)
it's called http://www.google.com and you search by "top secret documents filetype:doc" [google.com].
It's easy... (Score:5, Informative)
Re:It's easy... (Score:2)
I'm ashamed to say that I never even thought of that one.
Thanks for the tip. Now I have a chance of completing my bootleg collection of Pink Floyd albums.
Re:It's easy... (Score:2)
Index of
Well, yeah, that was the whole idea...
I hate to state the obvious but, (Score:2, Insightful)
What will it take? What happens when a script kiddie hacks a hospital and shuts down the life support systems in ICU? Or just juggles the meds for the patients so that everyone in the hospital gets the wrong meds?
Or perhaps they glitch the Air Traffic Control system and airplanes rain down from the sky and tens or hundreds of thousands of people die??
Before the last war in Iraq started they showed the "state
Don't worry (Score:3, Interesting)
Re:Don't worry (Score:2)
Another way to find "secret" data (Score:4, Informative)
statistics?
Adding a simple
with a combination of "web" or the name of any of the common statistic generation programs gets you access to the statistics of a *lot* of websites. Then from the stats you could find any "hidden" data which is not linked on the site including internal company documents, girlfriend's nude photos or mp3s.
Alternately you could just google for the statistic reports of sites and get there more easily.
This is another case of ill-informed or lazy users not following what should be a simple security policy, which could cause serious repercussions.
For those who want to know how to protect yourself, read this link [apache.org].
Word documents are stupid. (Score:2)
But instead of explaining it all technical and telling people how they can strip private information, you should use Microsoft's own techniques of FUD against them by telling people that Microsoft Word files contain all their private information and that information is gath
MS Word got Tony Blair busted in the WMD case (Score:5, Informative)
Tony Blair got busted in the WMD case because the names of the people who revised the WMD documents were still in the Word file. Now, it seems, Downing Street only puts PDF files on the web, and has removed all the MS Word documents that were already there ....
Tools reveal secret life of documents - Documents like in Word save too much Info - Blair Episode [bbc.co.uk]
By Mark Ward
July 03, 2003
The UK Government was just the latest in a long line of organisations that has learned to its cost just how much information can be gleaned from innocent looking files. Earlier this year it issued a document called the 'dodgy dossier' about Iraq's concealment of weapons of mass destruction that was written using Microsoft Word. Every Word document remembers who made the last few revisions to it. The log reveals the names of four of the people who prepared the Iraq document for publication and the government Communications Information Centre that some of them work for. It was this log that Number 10 press chief Alastair Campbell had to explain to the House of Commons Foreign Affairs Select Committee in late June as part of its investigation into the Iraq dossier's history. Some of this information can be seen simply by right-clicking to view the properties of the downloaded document in a file listing. Utility programs can get even more information from Word revision logs.
The life stories of the documents we create are becoming increasingly important as the scrutiny of industries and governments gathers pace. Every time you write or edit these files you leave a trail of information revealing what you did and when you did it. With the right tools it is possible to extract this data and work out the trail of authors and workers who created a document. That is why we should all use open source and open data formats, so that a human can read everything we are "putting" into the document. The Word version of this document has now been removed from government websites but copies of it are still available elsewhere on the net.
Unabridged and unedited article at
http://news.bbc.co.uk/2/hi/technology/3037760.stm
Here's the doc (Score:2)
Here's a copy of the document [computerbytesman.com]. Should save anyone else the trouble of googling for it </karmawhore>.
You believe the FBI?? Re:MS Word got Tony Blair (Score:2)
Well, interesting that you should hit BBC where it hurts, and in some cases maybe they deserve it. But I think this "story" that you talk of is not one of those. Look at my take on the story behind the story behind the story .... The Blair-Bush team is more devious than you might want to believe ...
ALL Quotations below are from the MSNBC article ... my comments are in the [TT] [/TT] format ....
At the end of the day, officials say, the Lakhani case remains a story about potential threats and not the rea
The British experience - government stupidity (Score:3, Informative)
DMCA violation? (Score:4, Interesting)
By using tools that break the "encryption" on, for example, the Washington Post .pdf file mentioned in the article, isn't the researcher violating the DMCA? Isn't his whole project bragging about doing this, a la 2600?
I hope he remembers a few packs of cigarettes in order to buy himself a few nights of sleep in the Big House.
Didn't I already write about something similar? (Score:4, Interesting)
This is not anything new. (Score:2, Interesting)
Newsflash: People shoot themselves in feet (Score:3, Informative)
At the risk of being moderated Troll and Redundant,
Why are these people posting Word Documents online?
The World Wide Web is not the Microsoft Wide Web.
Post in plain ASCII text, or HTML if you feel the need to pretty it up.
People keep using tools that are far more powerful and complex than they need, then they screw up, and blame the tools. Pick a simple tool to do a simple job, and you don't need to worry about your ignorance of the tools you are using causing you problems.
UK govt caught out (Score:3, Interesting)
This has happened to the UK government several [theregister.co.uk] times [computerbytesman.com]. The latter link shows whose sticky fingers were on the infamous "dodgy dossier".
Gareth
It Must Be A "Technological Measure" (Score:2)
Heh (Score:3, Funny)
Apparently they need to use some of the software he used to get a conjugation of the infinitive "to be" back into their text.
Sanitizer (Score:2)
But for those who don't want to change, is there a "Word sanitizer" tool available? Something that will convert one Word doc to another, minus the hidden text?
up to parent directory leaks (Score:2)
I still find user accounts on which if you do a manual "up to parent directory" and the user has no index.htm{,l} file, you often get a fully navigable listing of their entire html directory.
Sometimes you find personal files that were never directly linked to, nor intended to be.
Word doc Cleaning Program? (Score:3, Interesting)
Microsoft's article on reducing MS Word metadata (Score:5, Informative)
It has been known for a long time that metadata are hidden within Microsoft Word documents. Microsoft even has Knowledge Base article 237361 [microsoft.com] explaining how to reduce the amount of metadata appearing in MS Word 2000 documents. Here's an excerpt:
This step-by-step article explains various methods that you can use to minimize the amount of metadata in your Word documents.
I'll bet there are more, but they won't disclose them.
It's a pity that more people don't just save as RTF. It's just as good for most uses, and it's a less obscure format.
Not Just for Fast Save Anymore? (Score:2)
Back in the days of Word 5.1a (the last good version), I recall hidden data only getting saved if you used Word's "Fast Save" feature. Since Fast Save wasn't measurably faster, I turned it off. Is this no longer the case? (A quick look through the preferences panel in my copy of Word reveals a Fast Save option; it's turned off.)
Schwab
Best way to avoid this (Score:2)
Copy your final text from your working draft into a brand new document. Yep, good ol' copy and paste. You will only copy the selected text. All the auto-save data and edit history will not be copied into the new document. If your document has charts/graphs/placed images, etc., you will need to do a select all to be sure you got it.
If you always do this for final drafts you won't ever have a problem again. If in doubt of whether your current copy is clean.. just
There are advantages! (Score:5, Funny)
They tracked changes. All we needed to do was display them... and we got juicy stuff like "if they accept either our fix for clause X or for clause Y we can still s---w them royally in scenario Z".
Made for a very effective negotiation. For us.
Oh, wait, the article was about the problems this raises for the document's _author_.
Never mind
Re:crypto (Score:2)
p
Re:crypto (Score:4, Funny)
Re:crypto (Score:2)
Re:crypto (Score:3, Insightful)
"It is feasible that an individual may include their social security number on copies of a resume sent to prospective employers, but delete it from the version put online to guard against identity theft," Byers writes.
Who in their right mind puts their SSN in any version of a resume??!
Re:crypto (Score:3, Informative)
I exported it to RTF then reimported before saving it again as .doc. This erased other people's access to my thought processes, and it reduced the file size by 80% to boot.
In the end it didn't matter much, though. I usually include a plain text version of the resume right in my email as a backup along with the .doc attachment. On interviews, I'
in html (Score:2)
To do this yourself, just type:
<a href="http://foo/">bar</a>
Re:crypto (Score:2)
Re:P2P has been doing this for a long time (Score:3, Interesting)
Although I cannot guess how many of those are honeypots.
Re:P2P has been doing this for a long time (Score:2)
Search for *anything* on Kazaa, and there are always some files found. However, upon closer inspection, you'll see that they really are hidden pornography URLs, viruses, and other poison payloads, definitely not what you were really looking for.
Re:Slow news day (Score:2)
Other human beings?
Re:What exactly's the big deal here? (Score:3, Informative)
He did the search, as you said, but he didn't use Google's conversion; instead, he looked directly inside the DOC file, where Word keeps a bunch of information for its own purposes -- stuff that was deleted, stuff that was just in the wrong memory location when the save happened -- whatever.
He found legitimate docs, with legit contents; but they also contained some stuff that the authors didn't intend to publish.
-Billy
Re:What exactly's the big deal here? (Score:2)
Sort of like deleting password emails from Outlook Express, and then having someone retrieve them from the
Re:What exactly's the big deal here? (Score:2)
Re:Old News! (Score:2)
Re:Here's An Idea (Score:2)
oh wait...