United States The Internet Your Rights Online

White House Website Limits Iraq-Related Crawling 837

oscarcar writes "Dan Gillmor is reporting on the White House website's use of its robots.txt file to disable search engines from crawling certain material. Many excluded items in the robots.txt file involve mentions of Iraq, possibly to prevent people from finding changes to past statements and information when archived elsewhere."
This discussion has been archived. No new comments can be posted.

  • by jratcliffe ( 208809 ) on Monday October 27, 2003 @05:31PM (#7322229)
    "the American people should have some say in a situation like went on in Iraq. I didn't vote for the present administration..."

    Oh, so the important thing isn't that the American people didn't vote for the current administration (they did), but that YOU didn't vote for the current administration. Sorry, thought we were living in a democracy there for a second; thanks for reminding me that the other 279,999,999 of us don't really matter and it's YOUR opinion that counts.
  • Everything Iraq.... (Score:5, Informative)

    by c_oflynn ( 649487 ) on Monday October 27, 2003 @05:31PM (#7322238)
    It looks like 99% of the stuff related to Iraq is filtered out in robots.txt.

    But that's not a problem: on google.com I just restricted the search to the site with 'Iraq site:whitehouse.gov' and got 14,000 hits. The first one is the root of the /infocus/iraq directory (which is disallowed in robots.txt).
  • by Anonymous Coward on Monday October 27, 2003 @05:33PM (#7322261)
    >I wouldn't be surprised if there aren't a few honeypot pages in there too.

    On the production server of the US presidential home page? I'll go with the other theory :)
  • by Anonymous Coward on Monday October 27, 2003 @05:33PM (#7322263)
    Honoring robots.txt isn't mandatory for crawlers, and at no point is it ever enforced. It's merely a courtesy.

    All you'd have to do to continue indexing their site is to write a crawler that ignores robots.txt.
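
To make the point above concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt excerpt, the `ExampleBot` agent name, and both function names are assumptions made for illustration. The only thing that separates a polite crawler from a rude one is whether it bothers to run the check; nothing on the server side enforces it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt excerpt, modeled on the rule quoted in this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /infocus/iraq
"""


def compliant_crawler_may_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """A polite crawler parses robots.txt and obeys the answer."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def noncompliant_crawler_may_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """A crawler that ignores robots.txt simply never consults it."""
    return True


if __name__ == "__main__":
    url = "http://www.whitehouse.gov/infocus/iraq/index.html"
    print("compliant:", compliant_crawler_may_fetch(ROBOTS_TXT, "ExampleBot", url))
    print("non-compliant:", noncompliant_crawler_may_fetch(ROBOTS_TXT, "ExampleBot", url))
```

The server still answers the request either way; robots.txt only changes what well-behaved clients choose to ask for.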

  • Re:Queue somebody... (Score:1, Informative)

    by Anonymous Coward on Monday October 27, 2003 @05:35PM (#7322284)
    Or you could cue him instead. That might make more sense.
  • by Chris Parrinello ( 1505 ) * on Monday October 27, 2003 @05:40PM (#7322358)
    Nope... it didn't take me long to find a disallowed path that is a valid URL:

    Disallow: /infocus/iraq

    http://www.whitehouse.gov/infocus/iraq is a valid URL.

  • by Have Blue ( 616 ) on Monday October 27, 2003 @05:41PM (#7322367) Homepage
    If you try actually *loading* the directories listed in the robots.txt [whitehouse.gov], they don't exist. Not one. Not by going to their index.html or trying to find them through the site navigation. While they could still be accused of deleting them, many of the links are unlikely to have existed in the first place (http://www.whitehouse.gov/president/heartland-tour-gallery/iraq? /president/holiday/decorations/iraq? /president/tee-ball-01/iraq?). This may be just some IT grunt running a bad script on robots.txt.
  • bizarre (Score:2, Informative)

    by Anonymous Coward on Monday October 27, 2003 @05:41PM (#7322369)
    I can't see this as a conspiracy .. it's just too silly.

    Why on Earth wouldn't they just EDIT the bleedin' files? They wouldn't have to delete them or set up robots.txt, they would just change them to reflect the "message of the moment". They probably do that anyway, same as a lot of other sites.

    Do they really think people would be blocked by robots.txt?? Nobody's that dumb (yeah they could be Windows MCSE droids but c'mon).

    I think they did it for some other reason like keeping traffic down.

    Another possibility: a hacker got in there and did this because a) he only had write access to robots.txt for some reason or b) he wanted to play a subtle joke. But I doubt that too.

    Anyway this is strange, but pointless, so I wouldn't bother with it unless you're a democrat looking for something else to whine about...
  • by steveit_is ( 650459 ) on Monday October 27, 2003 @05:43PM (#7322397) Homepage
    Most of the pages in the robots.txt are actually 404s and don't exist anymore. It's that simple: disallowing them keeps robots from constantly requesting content that no longer exists. A few are blocked because they are bandwidth-intensive videos and the like, and I assume some others are blocked for more mundane reasons.
  • by mrpuffypants ( 444598 ) * <mrpuffypants@gmailTIGER.com minus cat> on Monday October 27, 2003 @05:45PM (#7322419)
    Well, yes it would still be in google's search results if the GoogleBot hasn't crawled the whitehouse site since the change was made.

    Next time it crawls the site it won't read the forbidden directories and will delete them (if present) from the Google Cache, essentially erasing any official Iraq history from Google (and other search engines).
  • Wayback Machine (Score:3, Informative)

    by BLuP1 ( 641290 ) on Monday October 27, 2003 @05:45PM (#7322432)
    The Wayback Machine does archive robots.txt, and it seems like the White House updates this file about every week or so. The current update happened after April 13th, 2003, and it simply took all of those references that said ".../.../.../text" and added /iraq as well.

    Seems odd and pointless to me. I'd like a statement explaining it. A lot like the "Disallow: /hidden/passwd" kind of entries.

  • by msheppard ( 150231 ) on Monday October 27, 2003 @05:46PM (#7322436) Homepage Journal
    Looks like someone just added IRAQ to all of the existing links. It's obviously some sort of search/replace/copy function. Go look for yourself; I found this one:

    Disallow: /firstlady/recipes/iraq

    Now, how many pages would this possibly block?

    M@
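
One way to answer the "how many pages" question: under the original robots.txt convention, a `Disallow` value is a plain prefix match on the URL path. The sketch below applies that rule to a made-up list of paths. The page names other than the two `Disallow` values quoted in this thread are hypothetical, as is the helper name `blocked_by`.

```python
def blocked_by(disallow_prefix: str, paths: list[str]) -> list[str]:
    """Return the paths a compliant crawler would skip for one Disallow rule.

    The rule is treated as a simple path prefix, as in the original
    robots.txt convention (no wildcards).
    """
    return [path for path in paths if path.startswith(disallow_prefix)]


# Hypothetical site paths, for illustration only.
SITE_PATHS = [
    "/firstlady/recipes/",
    "/firstlady/recipes/index.html",
    "/firstlady/recipes/iraq",
    "/firstlady/recipes/iraq/page.html",
    "/infocus/iraq/index.html",
]

if __name__ == "__main__":
    for rule in ("/firstlady/recipes/iraq", "/infocus/iraq"):
        print(rule, "->", blocked_by(rule, SITE_PATHS))
```

If nothing on the real site lives under `/firstlady/recipes/iraq`, the rule blocks nothing at all, whereas `/infocus/iraq` covers everything beneath that directory.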
  • by jjn1056 ( 85209 ) <jjn1056@@@yahoo...com> on Monday October 27, 2003 @05:48PM (#7322454) Homepage Journal
    Looks like they removed a bunch of files where they were making claims that Saddam was behind 9/11. One could be led to suspect that now that Bush got his war he doesn't need that lie anymore, and wants to erase all history of it since it undermines his authority.

  • by borkus ( 179118 ) on Monday October 27, 2003 @06:01PM (#7322602) Homepage
    An odd webmaster choice maybe? I wonder if they generate the robots.txt based on a 404 report - something like
    • Grep the errors log for 404's from search engines.
    • Parse out the directory paths.
    • Add those to robots.txt.
    Which might explain why at least one of the directories - /infocus/iraq/ - clearly has an index [whitehouse.gov]. However, if they moved or renamed a file under that path, it might be generating 404's. From personal experience, I've had bad requests from Googlebot for files that were over 4 years old.
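
The three steps guessed at above can be sketched roughly as follows. This is purely an illustration of the hypothesis, not a claim about how whitehouse.gov actually builds its file. It assumes Apache-style combined log lines, and the crawler name list, sample log lines, and function name are all made up.

```python
import posixpath
import re

# Matches the request, status, and user-agent fields of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Assumed substrings identifying search-engine crawlers.
CRAWLER_HINTS = ("googlebot", "slurp", "msnbot")


def disallow_lines_from_log(lines):
    """Turn crawler 404s into Disallow lines for the containing directories."""
    directories = set()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None or match.group("status") != "404":
            continue
        agent = match.group("agent").lower()
        if not any(hint in agent for hint in CRAWLER_HINTS):
            continue
        # Step 2: keep only the directory part of the missing file's path.
        directories.add(posixpath.dirname(match.group("path")))
    # Step 3: emit one robots.txt rule per directory.
    return [f"Disallow: {directory}" for directory in sorted(directories)]


# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [27/Oct/2003:17:31:00 -0500] "GET /infocus/iraq/old.html HTTP/1.0" 404 512 "-" "Googlebot/2.1"',
    '10.0.0.2 - - [27/Oct/2003:17:32:00 -0500] "GET /news/missing.html HTTP/1.1" 404 512 "-" "Mozilla/4.0"',
    '10.0.0.1 - - [27/Oct/2003:17:33:00 -0500] "GET /index.html HTTP/1.0" 200 2048 "-" "Googlebot/2.1"',
    '10.0.0.3 - - [27/Oct/2003:17:34:00 -0500] "GET /firstlady/recipes/gone.html HTTP/1.0" 404 512 "-" "Slurp/2.0"',
]

if __name__ == "__main__":
    print("\n".join(disallow_lines_from_log(SAMPLE_LOG)))
```

Note the side effect: a single stale file under `/infocus/iraq` is enough for a script like this to disallow the whole directory, live index page included.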

    I have to agree that it's more strange than sinister. Besides, I'm not sure that the web site is the official archive for white house statements.
  • by dvdeug ( 5033 ) <dvdeug@@@email...ro> on Monday October 27, 2003 @06:03PM (#7322629)
    everyone knows they are also used to prevent google from indexing stuff people would rather keep (semi) private.

    The US government has no business with semi-private material. Either don't put it on the website, or make it publicly available to everyone, including Google and friends.
  • by davebo ( 11873 ) on Monday October 27, 2003 @06:07PM (#7322668) Journal
    The complaint is they've done it before - "combat operations are done" became "major combat operations are done" when the fighting didn't stop. You can check here [differentstrings.info].

    Compare the screenshots of what used to be on the white house website vs what's currently on the website.

    Yes, I know, "how do we know this blogger didn't alter the screenshots?" You don't.
  • by Black Parrot ( 19622 ) on Monday October 27, 2003 @06:53PM (#7323147)


    > There hasn't been a real declared war since WWII. You can't "declare war on terrorists" and be done with it either, wars are supposed to be declared on countries when you go to fight them.

    Also, US wars have to be declared by the Congress rather than by the White House... or at least that's the way it worked back when the Constitution still meant something.

  • Wayback Machine (Score:2, Informative)

    by Hender_Hole ( 675026 ) on Monday October 27, 2003 @07:25PM (#7323466)

    There are a lot of missing dates, but it looks to me like whitehouse.gov had a major site redesign sometime between Jul 13 and Sep 13 2001, and that when the new site was released they started putting in lots of the disallow statements for certain paths.

    From Jul 13:
    7-13 Whitehouse.gov [archive.org]
    7-13 Robots.txt [archive.org]

    From Sep 13:
    9-13 Whitehouse.gov [archive.org]
    9-13 Robots.txt [archive.org]

    It seems to me like the simplest explanation is just that their redesigned site has multiple paths to the same information, and for some reason they felt that their search engine rankings would improve if they eliminated superfluous paths. Although I'll admit it's suspicious that their old robots.txt from 2 years ago had 151 Disallows, and the one from today has 1552 Disallows, while the site uses basically the same navigation structure.
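
For anyone who wants to reproduce a count like the 151 versus 1552 figures from the archived copies, a small counter is enough. The sketch below is an assumption about how one might tally the rules: it ignores comments and skips empty `Disallow:` lines, which block nothing. The sample text and function name are made up.

```python
def count_disallows(robots_txt: str) -> int:
    """Count Disallow rules that name a non-empty path."""
    count = 0
    for raw_line in robots_txt.splitlines():
        # Drop anything after a '#', which starts a comment in robots.txt.
        line = raw_line.split("#", 1)[0].strip()
        key, separator, value = line.partition(":")
        if separator and key.strip().lower() == "disallow" and value.strip():
            count += 1
    return count


# Hypothetical excerpt, for illustration only.
SAMPLE_ROBOTS_TXT = """\
# sample file
User-agent: *
Disallow: /firstlady/recipes/iraq
Disallow: /infocus/iraq   # trailing comment
Disallow:
"""

if __name__ == "__main__":
    print(count_disallows(SAMPLE_ROBOTS_TXT))
```

Run against two saved copies of the file, the two totals can be compared directly.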

  • by saforrest ( 184929 ) on Monday October 27, 2003 @07:28PM (#7323489) Journal
    Other posters have claimed it's more than one. I haven't checked, so I don't know. However, even if it is just infocus/iraq, that's still a hell of a lot.

    That subdirectory seems to contain all or most of the transcripts of Ari Fleischer's and Bush's interviews and press conferences leading up to the war and after. An example is this:

    http://www.whitehouse.gov/infocus/iraq/excerpts_sept26.html [whitehouse.gov]
  • by HungWeiLo ( 250320 ) on Monday October 27, 2003 @07:35PM (#7323542)
    He didn't ban media coverage. He banned cameras and recording equipment at homecomings which feature flag-draped coffins.
  • by dameron ( 307970 ) on Monday October 27, 2003 @08:28PM (#7323937)
    No, it's just the kind of subtle manipulation this administration has perfected. They probably realized that if they pulled all kinds of documents from the web site that it'd appear as if they were limiting access to the public record.

    It's all still there for all to see, but it's not as easy to find. So they can say "We're not hiding anything." while they actually hide it.

    Things that become inconvenient or embarrassing after the fact are hard to hide. At the time this quote by Dick seemed reasonable: link [whitehouse.gov]

    "Simply stated, there is no doubt that Saddam Hussein now has weapons of mass destruction. There is no doubt that he is amassing them to use against our friends, against our allies, and against us."

    Now maybe less so. Also, re: the Uranium production in Africa, Fleischer sounds like a complete fool.
    This is the first example of the Bush administration confronting the forged Iraq/African Uranium document. This is from March 14th, 2003.

    On March 17th, 2003 Bush gave Hussein 48 hours to leave Iraq and on the 19th he launched "Operation Iraqi Freedom".

    So for at least a week -before- the shooting started the Bush administration had reporters at press conferences asking questions about the forged uranium documents. The mainstream press didn't pick up on this story until July.

    Link [whitehouse.gov]

    Q Ari, the President said in his State of the Union address, the British government has learned that Saddam Hussein recently sought significant quantities of uranium from Africa. And since then, the IAEA said that those were forged documents --

    MR. FLEISCHER: I'm sorry, whose statement was that?

    Q The President, in his State of the Union address. Since then, the IAEA has said those were forged documents. Was the administration aware of any doubts about these documents, the authenticity of the documents, from any government agency or department before it was submitted to the IAEA?

    MR. FLEISCHER: These are matters that are always reviewed with an eye toward the various information that comes in and is analyzed by a variety of different people. The President's concerns about Iraq stem from multiple places, involving multiple threats that Iraq can possess, and these are matters that remain discussed.


    Fleischer stalls for time by pretending that he didn't understand the source of the quote (as if "President" and "State of the Union" in the first sentence were unclear), then comes up with a moronic bit of doublespeak. No wonder he quit. Read his last sentence in that press conference aloud. That sentence is the official line one week before the war. Lots of confidence there.

    If the whitehouse can make it a little more difficult for reporters or their opponents to dig up embarrassing quotes or timelines you can bet your last dollar they will. -dameron
  • by drf5n ( 561106 ) on Monday October 27, 2003 @08:54PM (#7324183)
    See:
    http://www.whitehouse.gov/news/releases/2003/05/text/20030501-15.html [whitehouse.gov]

    which differs from
    http://www.whitehouse.gov/news/releases/2003/05/iraq/20030501-15.html [whitehouse.gov]

    In the text version, the page says 'President Bush Announces Combat Operations in Iraq Have Ended' while in the robot accessible version, it is 'President Bush Announces Major Combat Operations in Iraq Have Ended'.

    Get your own screenshots.
  • Re: and your ... (Score:1, Informative)

    by Anonymous Coward on Monday October 27, 2003 @09:06PM (#7324281)
    Try almost an entire year of being AWOL.
    Right near the top of the page you linked to we find:
    AWOL----absent for 30 days or less.
    You know, it's people like you that make it so easy for the Bush administration to dismiss its detractors. I have wondered lately if the administration is not actually behind the more ridiculous claims made against it in an effort to actually discredit all of those who would offer criticism of it.
  • Someone's been busy (Score:4, Informative)

    by billybob2001 ( 234675 ) on Monday October 27, 2003 @10:06PM (#7324744)
    For instance:
    http://www.whitehouse.gov/infocus/iraq/100days


    Not any more.

    Although the current Google cache [216.239.59.104] lists

    /infocus/iraq
    /infocus/iraq/100days/iraq
    /infocus/iraq/100days/text
    [snip 22 lines]
    /infocus/iraq/photoessay/iraq
    /infocus/iraq/photoessay/text
    /infocus/iraq/text



    the current robots.txt leaps from
    /infocus/internationaltrade/text
    to
    /infocus/judicialnominees/iraq

    Conspiracy theory over...

    ...or is it?
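
The comparison between the cached and current files can be done mechanically. The sketch below diffs the `Disallow` paths of two robots.txt versions. The two sample files are abbreviated stand-ins built from a handful of the paths quoted above, not the full real files, and the function names are made up.

```python
def disallowed_paths(robots_txt: str) -> set[str]:
    """Collect the non-empty Disallow paths from a robots.txt body."""
    paths = set()
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        key, separator, value = line.partition(":")
        if separator and key.strip().lower() == "disallow" and value.strip():
            paths.add(value.strip())
    return paths


def diff_disallows(old_txt: str, new_txt: str) -> tuple[list[str], list[str]]:
    """Return (removed, added) Disallow paths between two versions."""
    old_paths = disallowed_paths(old_txt)
    new_paths = disallowed_paths(new_txt)
    return sorted(old_paths - new_paths), sorted(new_paths - old_paths)


# Abbreviated stand-ins for the cached and current files (illustration only).
CACHED = """\
User-agent: *
Disallow: /infocus/internationaltrade/text
Disallow: /infocus/iraq
Disallow: /infocus/iraq/100days/iraq
Disallow: /infocus/iraq/100days/text
Disallow: /infocus/iraq/text
Disallow: /infocus/judicialnominees/iraq
"""

CURRENT = """\
User-agent: *
Disallow: /infocus/internationaltrade/text
Disallow: /infocus/judicialnominees/iraq
"""

if __name__ == "__main__":
    removed, added = diff_disallows(CACHED, CURRENT)
    print("removed:", removed)
    print("added:", added)
```

Paths that show up as removed are rules that used to block crawlers and no longer do, which is the change being pointed out here.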

  • Referring to a website critical of him (but correct in every detail)
