Privacy Microsoft Your Rights Online

Online Document Search Reveals Secrets 271

An anonymous reader writes "New Scientist is reporting that many documents published online may unintentionally reveal sensitive corporate or personal information, according to a US computer researcher. Simon Byers, at AT&T's research laboratory in the US, was able to unearth hidden information from many thousands of Microsoft Word documents posted online using a few freely available software tools and some basic programming techniques." Update: 08/16 19:06 GMT by H : The story is originally from Crypto-gram, not New Scientist.

  • crypto (Score:1, Informative)

    by Feyr ( 449684 ) * on Friday August 15, 2003 @05:34PM (#6708172) Journal
    Funny how the latest Crypto-Gram covers exactly the same subject. I just received it an hour ago.

    http://www.schneier.com/crypto-gram.html
  • Re:Nothing New (Score:4, Informative)

    by Sky-217 ( 44374 ) <marter+slashdot@gm a i l . c om> on Friday August 15, 2003 @05:40PM (#6708217)
    In the article they mentioned that this applies to pdf files too...

    "For example, in 2002 the Washington Post published a version of a letter sent by the Washington sniper in Adobe PDF format. Names and telephone numbers were visibly blacked out, but still found embedded in the file."
  • Not just documents (Score:3, Informative)

    by I8TheWorm ( 645702 ) on Friday August 15, 2003 @05:54PM (#6708308) Journal
    It doesn't pertain to just documents. I've seen code samples posted to sites like experts-exchange where DB connection strings still had UID and PW data in them. Seems people don't re-read before they post very often.
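
    One way to catch this before posting is to run the snippet through a small scrubber first. The following is a minimal sketch, assuming ODBC/ADO-style `key=value;` connection strings; the key list, the function name `redact_connection_string`, and the sample string are illustrative assumptions, not taken from any particular site or driver.

```python
import re

# Keys that commonly carry credentials in "key=value;" connection strings.
# This list is an assumption for illustration and is not exhaustive.
SENSITIVE_KEYS = ("uid", "user id", "pwd", "pw", "password")

_PATTERN = re.compile(
    r"(?i)\b(" + "|".join(re.escape(k) for k in SENSITIVE_KEYS) + r")\s*=\s*[^;\"']*"
)

def redact_connection_string(text):
    """Replace the value of each credential-like key with a placeholder."""
    return _PATTERN.sub(lambda m: f"{m.group(1)}=<redacted>", text)

# Hypothetical sample; the server, user and password are made up.
sample = "Driver={SQL Server};Server=db01;UID=sa;PWD=hunter2;Database=orders"
print(redact_connection_string(sample))
```

    A regex like this only catches the key names it knows about, so it is a safety net rather than a substitute for re-reading the post.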
  • Check this out... (Score:4, Informative)

    by Geminatron ( 616988 ) on Friday August 15, 2003 @05:57PM (#6708329)
    View some of the past word docs you've received in a hex editor...

    Near the bottom there is often information from other documents of the sender that they were recently working on. I don't know why it saves this. Maybe something to do with the undo buffer?

    At work I used to look at internal memos that would be sent out on a weekly basis and find out all sorts of other stuff that was going on.
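
    The same kind of inspection can be scripted instead of done by eye in a hex editor. This is a rough sketch of a `strings`-style scan, assuming leftover text sits in the file either as plain ASCII or as UTF-16LE; the function name `extract_strings` and the sample bytes are made up for illustration, and nothing here parses the actual Word file structure.

```python
import re

def extract_strings(data, min_len=4):
    """Return printable runs found in raw bytes, as ASCII and as UTF-16LE."""
    found = []
    # Runs of printable ASCII bytes.
    for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        found.append(m.group().decode("ascii"))
    # Runs of printable ASCII characters stored as UTF-16LE (char + NUL byte).
    for m in re.finditer(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len, data):
        found.append(m.group().decode("utf-16-le"))
    return found

# Hypothetical bytes standing in for a slice of a binary document.
blob = (
    b"\x00\x01Quarterly memo\x00\xff"
    + "C:\\Drafts\\layoffs.doc".encode("utf-16-le")
    + b"\x00\x00\x07"
)
print(extract_strings(blob))
```

    For a real file, the input would be something like `open("memo.doc", "rb").read()`, with `memo.doc` being whatever document you want to check before sending it.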
  • Re:True story. (Score:2, Informative)

    by DrSkwid ( 118965 ) on Friday August 15, 2003 @06:00PM (#6708343) Journal
    why do people send email messages that just say "see attached file"

    because they select "send document" from the file menu and get a blank email with the document attached

  • It's easy... (Score:5, Informative)

    by inertia187 ( 156602 ) * on Friday August 15, 2003 @06:06PM (#6708375) Homepage Journal
    This is the easy way:
    "Index of" "Name Last modified Size Description"
    Then you add file extensions or other things. For example:
    • mpg [google.com]
    • mov [google.com]
    • mp3 [google.com]
    • secret [google.com] - doesn't have to be file extensions...
    • "My Documents" [google.com] - yeah, that's secure...
    • etc
    Anyway, as you can see, it's pretty effective. Sometimes admins wise up, and all you have is the Google cache. But sometimes they don't, and you get to look. Thanks Google!
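
    The query pattern above is easy to generate, which is handy if you want to check whether your own server shows up. A small sketch follows; the function name `directory_listing_query` and the `example.com` host are placeholders, and `site:` is the usual search operator for restricting results to one host.

```python
def directory_listing_query(term, site=None):
    """Build the "Index of" search string for an extension or keyword."""
    parts = ['"Index of"', '"Name Last modified Size Description"']
    # Quote multi-word terms so they are matched as a phrase.
    parts.append(f'"{term}"' if " " in term else term)
    if site:
        # Restrict results to one host, e.g. a server you administer.
        parts.append(f"site:{site}")
    return " ".join(parts)

print(directory_listing_query("mp3"))
print(directory_listing_query("My Documents", site="example.com"))
```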
  • by cnb ( 146606 ) on Friday August 15, 2003 @06:14PM (#6708412)
    How many people actually protect their website
    statistics?

    Adding a simple /stat/ or /stats/ or a variation
    with a combination of "web" or the name of any of
    the common statistic generation programs gets you
    access to the statistics of a *lot* of websites.

    Then from the stats you could find any "hidden"
    data which is not linked on the site including
    internal company documents, girlfriend's nude
    photos or mp3s.

    Alternately you could just google for the
    statistic reports of sites and get there
    more easily.

    This is another case of ill informed or lazy
    users not following what should be a simple
    security policy which could cause serious
    repercussions.

    For those who want to know how to protect
    yourself, read this link [apache.org].
  • Re:Nothing New (Score:5, Informative)

    by gblues ( 90260 ) on Friday August 15, 2003 @06:17PM (#6708427)

    That is because the people who published the PDF were idiots.

    Acrobat has a number of commenting tools. What the Washington Post staff did in that case was use the Highlight tool, set the color to black, and use it to draw over the names.

    Only problem? The highlighter is an object that is drawn on top of the text object it is attached to. The underlying text is not modified at all. In fact, if you watch closely, you can see the name for a split second before the renderer draws the highlights.

    If the Washington Post had used the TouchUp Text tool to delete the names, the information would not have been leaked.
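
    To illustrate why drawing over text hides nothing, here is a simplified sketch. It assumes a hand-written, uncompressed page content stream in which a filled black rectangle is painted after the text; the comment above describes a highlight annotation, which is a separate object, but the effect is the same: the text operator is still in the file. The sample name and number are invented, and real PDFs usually compress their streams, so they would need decompressing first.

```python
import re

# A hand-written, simplified page content stream (not a complete PDF file).
# It first draws a line of text, then paints a filled black rectangle over it.
content_stream = (
    b"BT /F1 12 Tf 72 700 Td (Call Mr. Example at 555-0100) Tj ET\n"
    b"0 0 0 rg\n"            # set the fill colour to black
    b"70 695 220 16 re f\n"  # define a rectangle over the text, then fill it
)

def shown_text(stream):
    """Collect the literal strings passed to the Tj (show text) operator."""
    pattern = rb"\(((?:[^()\\]|\\.)*)\)\s*Tj"
    return [m.group(1).decode("latin-1") for m in re.finditer(pattern, stream)]

print(shown_text(content_stream))
```

    The rectangle changes what a viewer paints on screen, not what a text extractor finds in the stream.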

    Nathan

  • Tony Blair got busted in the WMD case because the names of the people who revised the WMD documents were still in the Word file. Now, it seems, Downing Street only puts PDF files on the web, and has removed all the MS Word documents that were already there ....

    Tools reveal secret life of documents - Documents like in Word save too much Info - Blair Episode [bbc.co.uk]

    By Mark Ward

    July 03, 2003

    The UK Government was just the latest in a long line of organisations that has learned to its cost just how much information can be gleaned from innocent looking files. Earlier this year it issued a document called the "dodgy dossier" about Iraq's concealment of weapons of mass destruction that was written using Microsoft Word. Every Word document remembers who made the last few revisions to it. The log reveals the names of four of the people who prepared the Iraq document for publication and the government Communications Information Centre that some of them work for. It was this log that Number 10 press chief Alastair Campbell had to explain to the House of Commons Foreign Affairs Select Committee in late June as part of its investigation into the Iraq dossier's history. Some of this information can be seen simply by right-clicking to view the properties of the downloaded document in a file listing. Utility programs can get even more information from Word revision logs.

    The life stories of the documents we create are becoming increasingly important as the scrutiny of industries and governments gathers pace. Every time you write or edit these files you leave a trail of information revealing what you did and when you did it. With the right tools it is possible to extract this data and work out the trail of authors and workers who created a document. That is why we should all use open source and open data formats, so that we can read for ourselves everything we are "putting" into the document. The Word version of this document has now been removed from government websites but copies of it are still available elsewhere on the net.

    Unabridged and unedited article at

    http://news.bbc.co.uk/2/hi/technology/3037760.stm

  • by pyrotic ( 169450 ) on Friday August 15, 2003 @06:22PM (#6708449) Homepage
    Have to post a link to this famous example, the dodgy dossier. [casi.org.uk] There was a writeup here [computerbytesman.com]. If you're thinking of making the case for war, don't release Word documents to the press - unless they're very very docile.
  • Images too! (Score:0, Informative)

    by Anonymous Coward on Friday August 15, 2003 @06:22PM (#6708452)
    Cat Schwartz, of TechTV fame, discovered that cropped JPEG images may also contain uncropped thumbnail images [fuckallyall.com] (warning: PG-13 content). There's some debate whether the images in question came from Photoshop or from a thumbnail image stored by the digital camera, but it was a humbling oversight in either case.
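
    An embedded thumbnail is itself a small JPEG stored inside the larger file, so a crude check is to look for a second start-of-image marker. The sketch below is a heuristic only, not an EXIF parser; the function name `embedded_jpeg_offsets` and the fake byte layout are assumptions made for illustration.

```python
# JPEG start-of-image marker (FF D8) followed by the 0xFF of the next marker.
SOI = b"\xff\xd8\xff"

def embedded_jpeg_offsets(data):
    """Return offsets of JPEG start markers that occur after the first one."""
    offsets = []
    pos = data.find(SOI, 1)  # skip the outer image's own marker at offset 0
    while pos != -1:
        offsets.append(pos)
        pos = data.find(SOI, pos + 1)
    return offsets

# Fake bytes laid out like an image with a thumbnail inside its metadata.
fake = (
    b"\xff\xd8\xff\xe1" + b"Exif\x00\x00" + b"\x00" * 20   # outer header
    + b"\xff\xd8\xff\xdb" + b"\x11" * 8 + b"\xff\xd9"      # embedded thumbnail
    + b"\x22" * 8 + b"\xff\xd9"                            # rest of outer image
)
print(embedded_jpeg_offsets(fake))
```

    A non-empty result only says that something resembling a second JPEG is present; whether it shows the uncropped picture still has to be checked by extracting and viewing it.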
  • by William Tanksley ( 1752 ) on Friday August 15, 2003 @06:22PM (#6708453)
    Incorrect. You didn't read the article.

    He did the search, as you said, but he didn't use Google's conversion; instead, he looked directly inside the DOC file, where Word keeps a bunch of information for its own purposes -- stuff that was deleted, stuff that was just in the wrong memory location when the save happened -- whatever.

    He found legitimate docs, with legit contents; but they also contained some stuff that the authors didn't intend to publish.

    -Billy
  • by __past__ ( 542467 ) on Friday August 15, 2003 @06:26PM (#6708469)
    The OOo file format is just a bunch of zipped XML files, you can easily look for yourself. Deleted text is not in it, as far as I can tell. Unless you turned on version tracking, of course.

    It does, however, save things like when the document was last printed, how often it has been edited and by whom, etc. unless you tell it otherwise. It's easy to get rid of the data (there is a huge "Delete" button in the properties dialog), but not many people will be aware of it.

    So, basically, if you don't know what you are doing, you could give out more information than you want to with your OOo files.
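
    Because the format is a zip archive, looking for yourself takes only a few lines. This is a minimal sketch using Python's standard `zipfile` module; it builds a tiny stand-in archive in memory so it runs without a real document, and the element names inside the stand-in `meta.xml` are simplified placeholders rather than the real schema.

```python
import io
import zipfile

def read_meta_xml(source):
    """Return the text of meta.xml from an OpenOffice.org document.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.read("meta.xml").decode("utf-8")

# Build a small stand-in archive in memory (contents are made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", "<doc>visible text</doc>")
    archive.writestr(
        "meta.xml",
        "<meta><creator>A. Author</creator><cycles>7</cycles></meta>",
    )
buffer.seek(0)

print(read_meta_xml(buffer))
```

    With a real file you would pass its path instead of `buffer`, then read through the output for names, dates and edit counts you did not mean to publish.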

  • by darthwader ( 130012 ) on Friday August 15, 2003 @06:50PM (#6708560) Homepage
    ... then suffer foot wounds.

    At the risk of being moderated Troll and Redundant,
    Why are these people posting Word Documents online?

    The World Wide Web is not the Microsoft Wide Web.

    Post in plain ASCII text, or HTML if you feel the need to pretty it up.

    People keep using tools that are far more powerful and complex than they need, then they screw up, and blame the tools. Pick a simple tool to do a simple job, and you don't need to worry about your ignorance of the tools you are using causing you problems.
  • by 200_success ( 623160 ) on Friday August 15, 2003 @08:01PM (#6709170)

    It has been known for a long time that metadata are hidden within Microsoft Word documents. Microsoft even has Knowledge Base article 237361 [microsoft.com] explaining how to reduce the amount of metadata appearing in MS Word 2000 documents. Here's an excerpt:

    This step-by-step article explains various methods that you can use to minimize the amount of metadata in your Word documents.

    Whenever you create, open, or save a document in Microsoft Word 2000, the document may contain content that you may not want to share with others when you distribute the document electronically. This information is known as "metadata". Metadata is used for a variety of purposes to enhance the editing, viewing, filing, and retrieval of Office documents.

    Some metadata is easily accessible through the Microsoft Word user interface; other metadata is only accessible through extraordinary means, such as opening a document in a low-level binary file editor. Here are some examples of metadata that may be stored in your documents:

    • Your name
    • Your initials
    • Your company or organization name
    • The name of your computer
    • The name of the network server or hard disk where you saved the document
    • Other file properties and summary information
    • Non-visible portions of embedded OLE objects
    • The names of previous document authors
    • Document revisions
    • Document versions
    • Template information
    • Hidden text
    • Comments

    Metadata is created in a variety of ways in Word documents. As a result, there is no single method to remove all such content from your documents. The following sections describe areas where metadata may be saved in Word documents.

    I'll bet there are more, but they won't disclose them.

    It's a pity that more people don't just save as RTF. It's just as good for most uses, and it's a less obscure format.

  • Why Word Does This (Score:5, Informative)

    by spectecjr ( 31235 ) on Friday August 15, 2003 @08:07PM (#6709213) Homepage
    I just created a Word document, blah.doc and put some text into it. I made sure I had a couple of undo points. I closed it and opened it back up, I couldn't undo SHIT. So where the hell am I being granted this mysterious "convenience?"

    You're not.

    There are two ways of saving a word document:

    • Fast Save
    • Full Save


    Fast Save dumps the binary from memory into the file. Full Save compacts the binary image, and reorders it. This takes time.

    Word's text stream is stored using a piece table [unm.edu]. One of the benefits of a piece table is that if you keep the meta information about the text, you can get nearly infinite undo. The way it does this is by having an original data stream, and an appended data stream. Whenever you add data to the file, it gets added as a chunk to the end of the appended data stream. Whenever you delete, the meta table is updated to remove the text from the stream, but otherwise the text itself is left unaffected.

    As a result, text is never removed from the document. A Fast Save (which is the default) under Word dumps the Piece Table as-is (there is probably some compaction over time to remove the no-longer-used data, but it probably only occurs above a given threshold of used to unused text). A full save deconstructs the piece table's meta information, and turns it back into one contiguous stream of data.

    It's all just a function of the way the text is stored while it's being edited. Different editors have different mechanisms; some store data based on lines, and some store it using a gap buffer. But ultimately, the problem exists because Word uses a piece table, and it dumps the entire table to a file by default.

    It's actually a sensible way of handling the text data. However, whoever designed the Fast Save algorithm probably didn't consider the ramifications of the text still being stored in the document. The best workaround? Wipe the unused sections of the piece table. But then you might as well return to using a Full Save, as you'll be ditching the performance benefits anyway.
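
    The mechanism described above can be shown with a toy piece table. This is a sketch of the general data structure only, not of Word's actual file format; the class name `PieceTable` and the methods `fast_save` and `full_save` are stand-ins invented for illustration, and the sample text is made up.

```python
class PieceTable:
    """Toy piece table: text lives in two buffers, pieces say what is visible."""

    def __init__(self, original):
        self.original = original   # never modified after loading
        self.added = ""            # append-only
        # Each piece is (buffer_name, start, length).
        self.pieces = [("original", 0, len(original))] if original else []

    def _buffer(self, name):
        return self.original if name == "original" else self.added

    def text(self):
        """The document as the user sees it."""
        return "".join(self._buffer(b)[s:s + n] for b, s, n in self.pieces)

    def _split(self, pos):
        """Ensure a piece boundary at visible offset pos; return its index."""
        offset = 0
        for i, (b, s, n) in enumerate(self.pieces):
            if pos == offset:
                return i
            if pos < offset + n:
                cut = pos - offset
                self.pieces[i:i + 1] = [(b, s, cut), (b, s + cut, n - cut)]
                return i + 1
            offset += n
        return len(self.pieces)

    def insert(self, pos, new_text):
        i = self._split(pos)
        self.pieces.insert(i, ("added", len(self.added), len(new_text)))
        self.added += new_text

    def delete(self, pos, length):
        start = self._split(pos)
        end = self._split(pos + length)
        del self.pieces[start:end]  # only the table changes; buffers keep the text

    def fast_save(self):
        """Stand-in for a save that dumps both buffers as they are."""
        return self.original + self.added

    def full_save(self):
        """Stand-in for a save that writes only the visible text."""
        return self.text()


doc = PieceTable("Budget is fine. ")
doc.insert(16, "Layoffs planned for Q3. ")
doc.delete(16, 24)                 # the author thinks this sentence is gone
doc.insert(16, "See you Monday.")

print(doc.text())
print("Layoffs" in doc.fast_save())
print("Layoffs" in doc.full_save())
```

    The deleted sentence disappears from the visible text and from the full save, but it is still sitting in the append buffer that the fast-save stand-in writes out.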

    Simon
  • Re:crypto (Score:3, Informative)

    by Waffle Iron ( 339739 ) on Friday August 15, 2003 @09:09PM (#6709627)
    I would never put my SSN on a resume, but the last time I made a resume I ran the .doc file through 'less'. Sure enough, most all of my edit history was in there.

    I exported it to RTF then reimported before saving it again as .doc. This erased other people's access to my thought processes, and it reduced the file size by 80% to boot.

    In the end it didn't matter much, though. I usually include a plain text version of the resume right in my email as a backup along with the .doc attachment. On interviews, I've noticed that most people just print the plain text version. If I really didn't need to make the word doc, and people are too lazy to print it, why do companies insist you send it in .doc format anyway?

  • Re:Helpful Hint (Score:3, Informative)

    by Reziac ( 43301 ) on Saturday August 16, 2003 @01:17AM (#6710599) Homepage Journal
    Or for us DOS folks, there's XRay, last seen floating around Simtel (xray102.zip or something like that, in /textutils). It does a nice job pulling text strings out of any binary, and redirects handily to a file or your fave viewer (frex, LIST). I've used it to retrieve the complete content from a Word document that was hopelessly corrupted, and to see what fun was to be had in another document's "deleted" space.

    XRAY is also handy for pulling text out of executables. Frex, a brief rant about upper management, found lurking inside an .EXE from an old version of Paradox. :)

    Or if you're used to looking at raw binaries, skip the middleman and just use Buerg's LIST, as I do. :)

  • by __past__ ( 542467 ) on Saturday August 16, 2003 @07:13AM (#6711542)
    At least the 1.1 betas have an option to save as "flat xml". The format is basically the same as the zipped one, but uncompressed and in a single file (binary files like embedded images seem to be base64 encoded).

    In principle there is no problem using that with any version management system; CVS, RCS, Subversion etc. should work fine with it. You'll be happier with an XML-aware diff, though: my simple test doc ended up with all content in a single long line.
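
    If no XML-aware diff is at hand, one workaround is to re-indent the file before diffing it. The sketch below uses the standard library's `xml.dom.minidom`; the function name `one_element_per_line` and the sample markup are made up, and since pretty-printing adds whitespace it is probably best applied to a copy used for comparison rather than to the document you keep editing.

```python
import xml.dom.minidom

def one_element_per_line(xml_text):
    """Re-indent XML so that line-based diff tools have lines to compare."""
    return xml.dom.minidom.parseString(xml_text).toprettyxml(indent="  ")

# Made-up stand-in for a flat XML document saved as one long line.
flat = "<doc><p>First paragraph</p><p>Second paragraph</p></doc>"
pretty = one_element_per_line(flat)
print(pretty)
```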
