Poor Spelling Beats Google's China Filter 248

antifoidulus writes "CNN's money section contains a blurb (among other blurbs) about how poor spelling can beat Google's Chinese filter. The example given in the article is that a search for "Tiananmen" will yield peaceful pictures of the square, but a search for common misspellings such as "Tienanmen" will yield plenty of photos of tanks."
This discussion has been archived. No new comments can be posted.

  • by AltGrendel ( 175092 ) <ag-slashdot&exit0,us> on Tuesday January 31, 2006 @08:30AM (#14606164) Homepage
    I are a gud spelr!
    • by Anonymous Coward on Tuesday January 31, 2006 @08:37AM (#14606206)
      I are a gud spelr!
      Did you mean: Tibet should be free
      • by networkBoy ( 774728 ) on Tuesday January 31, 2006 @09:18AM (#14606439) Journal
        Ironically this could force China to improve the overall education level, which would of course backfire as an educated populace is much more difficult to control.
        [/longshot]

        -nB
        • That's OK - we'll just get our newspapers to misspell their headlines [thisisaberdeen.co.uk]
          • Well, American newspapers have mostly spelled it "Tienanmen" all along. I had to carefully compare the two spellings to spot the difference from "Tiananmen", and I was surprised to see that one labelled as correct.

            Of course, Roman-alphabet spelling of Chinese has a long history of variants. I'd think that, if they are serious about censoring google in China, they should do a careful study of all the spelling systems in use, and block all of the variants.

            Of course, they'll also end up blocking a lot more stuf
    • by Anonymous Coward
      Can you spell Bukcake? Or Pusy? Or AZZ? Get that by the filters!!!! But seriously, this is where pr0n comes from, the spelling that is, to get by filters...
    • by Anonymous Coward on Tuesday January 31, 2006 @08:46AM (#14606256)
      Thanks for your feedback. We will endeavour to respond to your bug report as soon as possible, and release an update if appropriate.

      Sincerely,

        Google information liberation management team
        Google Inc. "Do no evil."
  • Obvious (Score:5, Interesting)

    by poeidon1 ( 767457 ) on Tuesday January 31, 2006 @08:33AM (#14606173) Homepage
    that not everything can be filtered, but this is a search using the English alphabet. How good (read: horrible) is the filter for searches in the Chinese language?
    • Re:Obvious (Score:5, Insightful)

      by 246o1 ( 914193 ) on Tuesday January 31, 2006 @09:01AM (#14606338)
      In Chinese, a single character ( for example -- though I'm not sure if this will display properly) represents a whole syllable (as well as a meaning or idea), rather than a consonant or vowel, as most English letters do (some are unpronounced, or just change the sound of another letter).

      This eliminates certain types of bad spellings, obviously, but opens certain avenues that aren't available in English, such as choosing characters with similar meanings but different sounds, or similar sounds but different meanings.

      For the Tiananmen example, the characters for TianAnMen () mean "Heaven," "Peace," "Gate." Heaven could be replaced with "Sky," which has a completely different sound, or "Money," which (if I recall correctly) is pronounced "Qian" (Q sounds close to English CH). This could also happen with the other two characters in this word, and of course for many other 'bad' words.

      The reason that common words like "pr0n" have become associated with porn, or other examples, is that a community of users agreed upon a certain misspelling of those words, and the same can and WILL happen in China to evade whatever filters search engines use. There is no way to have an even semi-open search system that doesn't allow human ingenuity to overcome its filters, and the brief history of the internet in the west indicates that these filters will, ultimately, be only partially and temporarily effective.
      • Re:Obvious (Score:5, Informative)

        by Heian-794 ( 834234 ) on Tuesday January 31, 2006 @09:15AM (#14606425) Homepage

        I can only add that the Chinese government, with their insistence on the not-at-all-intuitive-to-non-Chinese-speakers romanization system that is Pinyin, have only themselves to blame.

        Ask a number of reasonably educated people whose native languages use the Roman alphabet to listen to a Chinese person pronounce "Tiananmen" and then write down what they think the spelling should be. I guarantee many of them will "misspell" it as "Tienanmen", since the vowel in question is pronounced like the sound that most languages express with an "e".

        Expect more of this as Pinyin isn't going away any time soon.

        (And yes, I do have my flame-retardant jacket, Academic Dispute Wear Edition, all prepared!)

        • Pinyin is so complicated -- there are likely to be lots of problems. Wikipedia has a great article on Pinyin.

          Now I know why it reads so funny; it is just meant to use the Roman alphabet, but not in any particularly standard way. E.g. 'x' is like 'SH' -- just because. Hiyaaaaaaah!
          • Re:Obvious (Score:3, Funny)

            by Articuno ( 693740 )
            In my language X sounds like SH, you insensitive clod! (couldn't resist :-)


            PS: I speak Portuguese, that's why X can sound like SH... I don't know about other languages, but I'd guess this happens in other Latin-based ones :-)
            • Re:Obvious (Score:4, Informative)

              by Heian-794 ( 834234 ) on Tuesday January 31, 2006 @09:57AM (#14606731) Homepage

              Putko, they did of course have standards, but they only make sense if you already speak Chinese.

              "Tian" does not rhyme with "fan", but somehow, "duo" and "luo" rhyme with "po" and "fo", which do contain "u" sounds in the middle; they just aren't written because plain "po" doesn't exist.

              One of the purposes of pinyin was a potential replacement of the character system with it, so I can understand them not considering the interests of non-native speakers, but if you're going to force it on non-natives too, well, expect to see spelling "errors" become unavoidable when they use Chinese.

              • From looking at it, it seems that Pinyin is quite consistent! I get the feeling Wade-Giles is better for gweilo, while Pinyin serves some Chinese needs. I just hope they change the spellings of words as pronunciations drift.
          • X in Portuguese can sound like "sh", like "z" or like "s" if placed before a "c".

            Not all languages pronounce the letters the same. Case in point: "J".
        • Re:Obvious (Score:2, Insightful)

          by drauh ( 524358 )
          meh. english romanization is not at all intuitive to non-english speakers: "cough", "ghost", "cant", "cent", "through", "trough". at least pinyin is consistent.
  • by Lord_Slepnir ( 585350 ) on Tuesday January 31, 2006 @08:33AM (#14606174) Journal
    This gives me an idea of how I can get past Bush and Co. monitoring my internet usage. I'll be able to say with a straight face that I never searched for Porn, but rather I was hoping to find information about shellfish [google.com]
  • by RobotRunAmok ( 595286 ) on Tuesday January 31, 2006 @08:33AM (#14606177)
    ...as a Leader of the Revolution.
  • Heh. (Score:5, Insightful)

    by Perseid ( 660451 ) on Tuesday January 31, 2006 @08:34AM (#14606185)
    Kind of reminds me of when Napster installed that half-assed search filter. Midonna and Mitallica suddenly became quite popular.

    People who want to get information will get it, and you can't stop them.
  • by eldavojohn ( 898314 ) * <eldavojohn@gSTRAWmail.com minus berry> on Tuesday January 31, 2006 @08:34AM (#14606187) Journal
    As we all know, Google has a patented page ranking system [google.com] that calculates the correlation of words with websites. It does this (primarily) by reading links from all of its cached websites and parsing html links to determine what words are being used to describe the page in the link.

    A while back, exploiting this became known as Google Bombing [wikipedia.org]: certain individuals manipulated Google's system very effectively by linking to pages with words that, by all rights, were not very accurate. After all, do a Google search for the word 'failure' [google.com] and the top site is George W. Bush's White House biography page.

    So what do you do to help the Chinese? Perhaps you could make a page with two columns. In one column would be the correct text with no link and the key word. In the other column would be all the permuted misspellings with links to the real sites. You could host this on your website and send it to friends asking them to also host it. They would need to slightly alter it and host it, but it would effectively provide the page ranks for the misspellings and allow anyone in China (who has access to your page) a key if they need it.
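As a rough illustration of the "permuted misspellings" column eldavojohn describes, the sketch below generates one-edit variants of a keyword and renders them as link rows. Everything in it is an assumption made for illustration: the function names `one_edit_variants` and `link_rows`, the choice of edits (dropped letters, swapped neighbours, vowel substitutions), and the placeholder URL. The comment itself specifies none of these.

```python
from html import escape

VOWELS = "aeiou"


def one_edit_variants(word):
    """Return sorted misspellings that are one simple edit away from word."""
    variants = set()
    for i, ch in enumerate(word):
        # Drop one letter.
        variants.add(word[:i] + word[i + 1:])
        # Replace a vowel with each vowel (the unchanged word is discarded below).
        if ch in VOWELS:
            for v in VOWELS:
                variants.add(word[:i] + v + word[i + 1:])
    for i in range(len(word) - 1):
        # Swap two neighbouring letters.
        variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    variants.discard(word)
    return sorted(variants)


def link_rows(word, url):
    """One HTML row per variant: plain keyword on the left, linked misspelling on the right."""
    rows = []
    for variant in one_edit_variants(word):
        rows.append(
            '<tr><td>{}</td><td><a href="{}">{}</a></td></tr>'.format(
                escape(word), escape(url, quote=True), escape(variant)
            )
        )
    return rows


if __name__ == "__main__":
    # example.org is a placeholder, not a site named in the thread.
    for row in link_rows("tiananmen", "https://example.org/")[:5]:
        print(row)
```

Whether links like these would actually shift search rankings is the commenter's conjecture; the code only shows how such a page could be generated.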
  • Perfect Example... (Score:5, Insightful)

    by oneiron ( 716313 ) on Tuesday January 31, 2006 @08:35AM (#14606188)
    This is a perfect example of why I've been saying all along that google is making the right decision in cooperating with the Chinese Government: http://yro.slashdot.org/comments.pl?sid=175251&cid=14571383 [slashdot.org]
  • Interesting. (Score:5, Interesting)

    by BoneFlower ( 107640 ) <anniethebruce AT gmail DOT com> on Tuesday January 31, 2006 @08:35AM (#14606191) Journal
    Now was this simply a failure of the filter method used, or did google deliberately create a weak filter to subvert the effort?
    • Re:Interesting. (Score:3, Interesting)

      Google have done exactly what they were asked to do.
      It's like when the RIAA/MPAA ask to filter results from torrent sites - the exact request is blocked but variations continue.

      Censorship is futile and those who want the information can get it.
    • Re:Interesting. (Score:4, Insightful)

      by darkmeridian ( 119044 ) <william.chuang@g ... m minus language> on Tuesday January 31, 2006 @08:55AM (#14606302) Homepage
      Google has really good suggested search terms for typos. Hint, hint. Skeet, skeet.
    • WTF did you think Google was doing? Bringing all their technical prowess to bear and create an über-filter?!

      They are complying with the Chinese government's censorship rules, nothing more. They know that it's the only way that they'll get google in China AND they know that there is absolutely no way that the Chinese government's blacklist will block *everything*.

      Rock, hard place; so they chose to go into China knowing that it would be more for the good even though they likely knew that their rabid fanb
  • Tanks (Score:2, Interesting)

    by capnspanky ( 886803 )
    ...search for common mis-spellings such as "Tienanmen" will yield plenty of photos of tanks.

    So I did a Google search and all those pictures of tanks are basically one photo hosted on different sites.
    • Re:Tanks (Score:5, Interesting)

      by magarity ( 164372 ) on Tuesday January 31, 2006 @09:43AM (#14606611)
      It's not just any picture of tanks; it's the picture of that guy who paused on the way home from shopping to stand in front of four tanks. You know, big metal machines that can squash a pedestrian flat without noticing? Amazingly, as famous as this picture is, it is unknown inside China. My Chinese friends in college had never seen it or anything of those ill-fated demonstrations despite being in Beijing when it was happening. The word on the street in town during the protests was simply that 'something is happening' and everybody better stay in their homes if they know what's good for them. The Chinese government's crackdown on the media is impressively (depressingly?) comprehensive.
      • Um. This is not interesting. This is garbage nonsense, and your college friend is utterly retarded, to put it mildly.

        Anybody who was in Beijing during that time would have known about protests and tanks. Now I could understand if they did not know the exact number of deaths/arrests or that sort of thing. But "never seen tanks" just doesn't fly.

        But then I have talked to Japanese teens (who should be in their mid-20's now) who thought Japan won World War II. So there.
        • Uh, okay, "HungWeiLo", but this is pretty funny:

          But then I have talked to Japanese teens (who should be in their mid-20's now) who thought Japan won World War II. So there.

          Well there are American teens who don't know where Native Americans are from (not "Indians", Native Americans). So ha! We're still ahead in the War on Knowing Stuff.
  • by TFGeditor ( 737839 ) on Tuesday January 31, 2006 @08:36AM (#14606196) Homepage
    Who would have thought a technique spammers use to beat filters would have real-world value?

    Is Google's filter Bayesian-based?
    • First - I don't think it would have any "real-world value". Using words like "warez" may have some "real-world value" but I think the moment some misspelled word becomes a dissident symbol, Google would have to filter it out.

      Second - let's all not forget that Chinese don't quite "spell" it when writing. I don't know how well (if at all) bayesian filtering and stuff would work for "kanji" (or how do they call it?)
      • Kanji is the Japanese term for Chinese characters. In Mandarin it is hanzi. For the sake of completeness, it's hanja in Korean.
      • by wumingzi ( 67100 ) on Tuesday January 31, 2006 @10:10AM (#14606829) Homepage Journal
        I don't know how well (if at all) bayesian filtering and stuff would work for "kanji"

        All right, this question has come up several times in the thread.

        The Mandarin dialect has approximately 31 phonetic components. These can be combined as single phoneme, dual phoneme, and triple phoneme groups. Some sounds always stand alone, some combine into triples, some do not. Some phonemes only exist as initials. Some only as finals, etc. etc. The end result is a hundred-odd unique phonetic combinations.

        Then there are tones. Five tones per phonetic combination. There are a few sounds that never appear in certain tone patterns, but this is the exception, and not the rule. So this brings us up into mid 3-digits of total possible sound groupings, including intonation.

        Now, you've probably heard somewhere that there are thousands of characters. So if there are only a few hundred unique sounds, but thousands of characters, of course, you have homonyms everywhere.

        (I was going to do a demo of how this works, but /. doesn't like me writing in hanzi. Go to http://www.zhongwen.com/ [zhongwen.com] and go to the "pronunciation" section of the dictionary. You'll see it as clear as day that way).

        Now, the problem is that there are many characters mapping to each sound. As such, while you can only mess with English words so much before they become unrecognizable (porn, pron, pr0n, prawn, etc.), you can make hundreds of permutations of any common phrase in Chinese simply by swapping out the correct character for a different one.

        I am not aware of a Chinese version of l33t-speak. There's trashy, slang Chinese, sure. But either you have the right character, or you don't. Without a standard nomenclature for screwing up words, it becomes hard to try alternate 'spellings' to work around the filter.
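wumingzi's point that many characters map to each sound can be made concrete with a little counting. The sketch below assumes each syllable of a phrase has a small set of interchangeable characters and multiplies the options together. The labels such as `tian#1` are placeholders standing in for real characters, and the per-syllable counts are made up; neither comes from the comment.

```python
from itertools import product
from math import prod


def count_variants(candidates):
    """Number of distinct phrases when each position may use any listed option."""
    return prod(len(options) for options in candidates)


def all_variants(candidates):
    """Every phrase obtained by picking one option per position."""
    return [" ".join(combo) for combo in product(*candidates)]


if __name__ == "__main__":
    # Placeholder labels stand in for real characters; the counts are invented.
    phrase = [
        ["tian#1", "tian#2", "tian#3", "tian#4"],
        ["an#1", "an#2", "an#3"],
        ["men#1", "men#2"],
    ]
    print(count_variants(phrase))  # 4 * 3 * 2 = 24
    print(all_variants(phrase)[:3])
```

Even with only a handful of options per syllable the count grows multiplicatively, which is the difficulty a filter author would face; without an agreed convention, a searcher faces the same explosion when guessing which variant a page used.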
  • by brxndxn ( 461473 ) on Tuesday January 31, 2006 @08:39AM (#14606216)
    So.. Chinese people speaking the same broken Engrish on the Internet as they typically do elsewhere beats the Great Firewall of China.

    Engrish in the spirit of Freedom!

    • Engrish in the spirit of Freedom!

      All your base are Beijing to us !
  • Not for long (Score:5, Insightful)

    by GoatMonkey2112 ( 875417 ) on Tuesday January 31, 2006 @08:40AM (#14606221)
    It would probably be better to *NOT* point these things out.
    • by lxs ( 131946 )
      Google search results for: "falung kong" 0

      did you mean: "Please report me to the authorities" ?
    • Maybe, but they can't possibly keep up with all the potential mis-spellings. And I wonder if they've considered the slang problem; after all, it would be simple enough to make the content available through slang-terms, written into the "alt" attribute of images or dumped to meta data.
  • Type of filter (Score:4, Interesting)

    by 19061969 ( 939279 ) on Tuesday January 31, 2006 @08:40AM (#14606225)
    So (serious question to those more knowledgeable) does this mean that the Google filters are simple keyword matches then? I'm surprised because I would have thought that they might have used something more complicated like cluster analysis. For example latent semantic analysis [wikipedia.org] could well have noted misspellings of words and clustered them together with the correct spelling, thus allowing the misspellings to be filtered out too.

    LSA is useful for dealing with synonyms, so I cannot see any reason why it wouldn't work with misspellings (assuming that they're common).
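19061969's question contrasts plain keyword matching with something that groups misspellings alongside the correct spelling. The sketch below is not LSA; it swaps in a much simpler technique, Levenshtein edit distance, purely to show why an exact-match blocklist misses "Tienanmen" while a fuzzy match catches it. The blocklist contents, function names, and distance threshold are assumptions, and nothing here reflects how Google's filter actually works.

```python
BLOCKED = {"tiananmen"}  # hypothetical blocklist for illustration only


def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def exact_filter(query):
    """True if any word of the query is literally on the blocklist."""
    return any(word in BLOCKED for word in query.lower().split())


def fuzzy_filter(query, max_distance=1):
    """True if any word is within max_distance edits of a blocked term."""
    return any(
        edit_distance(word, term) <= max_distance
        for word in query.lower().split()
        for term in BLOCKED
    )


if __name__ == "__main__":
    for q in ("Tiananmen", "Tienanmen", "tank man"):
        print(q, exact_filter(q), fuzzy_filter(q))
```

A looser threshold catches more misspellings but also starts blocking unrelated words, which is the over-blocking trade-off raised elsewhere in the thread.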

  • People whining about Google's actions with respect to China fail to realize that the alternatives (even more dreadful Chinese filtering, Google being banned entirely, etc.) are worse for Chinese freedom.
    • I'd rather see Google grand stand about not bowing to China's governmental pressure to assist in forceful suppression of ideas. Yes, that may get Google banned in China. However, Google is so big and powerful everywhere else in the world that news of its existence and popularity would become known to some curious folks in China who would begin to resent their government for banning it. In that resentment you'll find the seeds for a transforming change. That's a more self aware path to change than embracin
  • by ColdCoffee ( 664886 ) on Tuesday January 31, 2006 @08:42AM (#14606236)
    ...and so the weakness of computers is revealed: people and their presumption of perfection.
  • Friedums just anoder werd for nuthin lef 2 looze, and nuthin aint werth nuthin but it's Frie.

    I'm so dam Ronery :(
  • Of course, why do people think l33t sp33k was invented in the first place?
  • "The more you try to out-think the plumbing, the easier it is to stop up the drain." - Cmdr. M. Scott

    If there is one thing that many of us have learned over the course of our internet-connected lives, it is the simple fact that there is a work-around for EVERYTHING.

    There has yet to be a copy protection scheme that hasn't been defeated. There is no internet filter that can't be bypassed, and no blocking that can't be dodged.

    What the Chinese need to learn is that their efforts are as futile as attacking a funny f
    • Re:Whoopsie (Score:2, Insightful)

      by NewWorldDan ( 899800 )
      They aren't necessarily out to defeat the determined. They can however, quickly and easily sanitize the popular perceptions by sweeping things under the rug. To the average citizen, they do a little search and never see anything particularly shocking. Mission accomplished. And as I said, given time, the determined will eventually get their message across. The Internet just adds another layer to a game that's been going on since the dawn of government.
  • I guess this workaround will be quietly blocked at some stage ... until the next workaround emerges. Google are in too deep now, though. Their China venture is a whopping mistake, imho. The company whose business pitch is that we should trust it with the world's information falls at the first hurdle by showing it cannot be trusted with even a part of the world's information if the bribe is large enough.
  • It was good while it lasted.
  • I was going to reply with something along the lines of a resounding "DUH!!!" (remember the last days of Napster?), but Taco's from the see-thats-why-i-misspell-stuff dept. made me laugh out loud and forgot what I wanted to say. Well done :)
  • ...when he said something to the effect:

    "The more you overtake the plumbing, the easier it is to clog the drain."

    China has a Maginot-Line mentality, and their censorship efforts will eventually fail just as miserably.

    (ST flames and corrections, and French jokes, may commence now.)
  • SHUT UP!

    Do you want to ruin it?

    Come on, damnit! Shutupabout it.

    Consider this the "getting your foot kicked under the table" move.

    • Teh ferst rul ov gugul, iz dont talk abowt gugul.
    • by wumingzi ( 67100 ) on Tuesday January 31, 2006 @10:33AM (#14606979) Homepage Journal
      This seems as good a place to bring it up as any.

      Let's do a thought experiment.

      On one side, we have a reasonably interesting search engine company.

      On the other, we have a control-minded, autocratic government.

      The search engine company (that wants to operate in China) is told by the autocratic government "We don't want Bad Things sneaking in through the search engine. Keep Bad Things out."

      The search engine company says "OK. We'll play along. Give us a list of things you don't want to see. We'll get rid of them".

      "Taiwan Independence" returns 0 results.

      "Free Tibet" is delinked.

      Various combinations of Tiananmen, 6 and 4 mysteriously vanish.

      Unfortunately, Bad Things do not fit into nice little boxes. People mis-spell words. While it is easy to come up with a list of sites that contain Bad Things you do not want to see, new sites come up all the time. Is my friend's picture gallery from Tiananmen just some postcards to the folks back home, or is there some subtle political commentary in there? Well, you'll have to read it and find out.

      If I search on (former Taiwanese president) Lee Teng-Hui, does that contain Bad Things? Does it link to Bad Things? How dangerous is a stooped 85 year-old former college professor anyhow?

      Is Gandhi axiomatically Bad? Martin Luther King? Dostoyevsky? The list goes on and on and on.

      The censors can control the obvious things. Ultimately, they will lose.

      The real problem is that China is, for all its faults, a modern country. People come in, people fly out. When I go to China, lots of people ask what's going on in the outside world. I am a little circumspect in what I say, but my memory banks don't magically get erased when I cross over from Hong Kong to Shenzhen. Over 90% of the Chinese students you see toiling away at your local research university will ultimately go home. That's just the way it goes. They too don't forget whatever subversive thoughts may have crept into their heads during five or six years of study abroad.

      The deck is stacked, and the good guys will ultimately win.
  • They're filtering English misspellings, but what about French, Spanish, or German? A Chinese person could just search for what they're looking for under different languages. Granted, English is taught in China in their schools to everyone, but the folks who know other languages can start getting things and spreading it to the others.
  • A) Google guesses what you are trying to spell, and does it very well.

    B) This is an oversight that would be easily corrected.

    C) You just announced it publicly and unignorably.

    D) Most of the people censored don't spell it with Latin characters anyway.
  • As much as many of you would like to think that Google "slipped this in" on purpose I have news. Google announced they shall do business in China, and will do whatever it takes to do so.

    This is no intentional 'hack' of the system. It's a new content filter and there's going to be holes to be patched and creative solutions to be found for creative problems.

    So before you go hail the Google dev team as being revolutionary, maybe you should consider they just missed the mark the first time around and have a l
  • why'd yall have to go and blab about this? don't you think the people who most benefit from this loophole could learn by word of mouth? Now the Chinese govt knows to go beat up Google. can't you just see the RFQ: "prease submit bid to peopers minsitry of truth. We seek bids and proposars for sperring checker prug-ins and key roggers"
  • It looks like slashdot will always be visible in China. :-)

  • Thanks for blowing it for the Chinese...putting a link to some backwater news site on the front page of Slashdot.

    On a more serious note, couldn't people who are not in China put up a little proxy to return Google results? For example, I have a domain hosting a few pages. Could I put a little script to take a query entered at my site and return results obtained from Google?
  • The combination should be quite amusing and effective...
  • by DigDuality ( 918867 ) on Tuesday January 31, 2006 @09:35AM (#14606550)
    Chinese web users can see full, uncensored results for their Google search by replacing "&meta=" with "&meta=cr%3DcountryBR" in the URL. Once the string is replaced, the censorship will not affect the results.

    This is what a Chinese search for Democracy looks like after this method has been applied:

    http://www.google.cn/search?hl=zh-CN&q=democracy+china&btnG=%E6%90%9C%E7%B4%A2&meta=cr%3DcountryBR [google.cn]
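The substitution DigDuality describes is a plain query-string edit. A minimal sketch using Python's standard `urllib.parse`, assuming the search URL carries a `meta` parameter as in the example above; the helper name `with_country_restrict` is hypothetical, and whether the rewritten URL still changes Google's behaviour is not something this code can confirm.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_country_restrict(url, value="cr=countryBR"):
    """Return url with its 'meta' query parameter set to value.

    urlencode percent-encodes the '=' inside value, so the result
    contains meta=cr%3DcountryBR as in the comment.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if any(key == "meta" for key, _ in pairs):
        # Overwrite the existing (possibly empty) meta parameter in place.
        pairs = [(key, value if key == "meta" else val) for key, val in pairs]
    else:
        # No meta parameter present, so append one.
        pairs.append(("meta", value))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


if __name__ == "__main__":
    print(with_country_restrict(
        "http://www.google.cn/search?hl=zh-CN&q=democracy&meta="
    ))
```

Parsing the query rather than doing a raw string replace avoids touching a `meta=` substring that might appear inside another parameter's value.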
  • by saleenS281 ( 859657 ) on Tuesday January 31, 2006 @09:41AM (#14606594) Homepage
    Am I the only one thinking "why are we advertising this so they modify their filters and improve them"? That's great that people are finding ways around the filters... but maybe keep that on the down low??
  • Changes in capitalization also work, for now.
  • by neo ( 4625 ) on Tuesday January 31, 2006 @09:59AM (#14606748)
    Look... as much grief as Google is getting for this, they know hackers are going to get past the wall. The Great Fire Wall of China will work about as well as the original did. It's there to make a point and it's not going to stop anyone.
  • If this was going to be so insightful, you'd think I would've gotten a mod up when I posted this the first time [slashdot.org].
  • I found [metafilter.com] many pictures of tanks the other day, when the news of GIS.cn's censorship was posted on MetaFilter, including with a few Chinese-character queries (such as tian-an-men tan-ke). One of the things to remember is that the Chinese are going to be searching in Chinese characters, not English.

    Searching for something as simple as "tank man" or "tank square" on GIS.cn will get you the pic you're looking for, btw. As long as you don't include "tiananmen" in the query, you'll get it.
  • Could bring a whole new meaning to the expression "spelling/grammar nazi" if the Chicoms decide to start rejecting queries with too many non-OED words.

  • Now please report to the education center for re-Nedification
  • Didn't people in WoW reject chinese players who couldn't spell sentences correctly?

    This reminds me of the phrase: "Your famine is my feast".
  • By publishing the "work arounds" the media (including all youze geeks on /.) have done the fascist pigfuckers running the Chinese .gov a Big Fat Favour - you have saved them hours and hours of research in finding pages TO BLOCK.

    Example: Teeanamen Skware.

    An incorrect spelling like that gets published, say HERE, and is noted by some Chinese equivalent of Winston Smith in the Chinese Minitrue, and it's passed over to the directorate for inclusion on words to ban. Eventually you run out of room to run, even

  • Check back tomorrow, and learn that anything will beat slashdot's dupe filter.
  • It would be easy for Google to start filtering misspellings, since they already have the engine to map misspellings to correct spellings. It will be interesting to see Google's response. Will they plug this hole? If so, they would take on an aura of direct collaboration with communist dictators, as in, "How dare you poke your head up -- ! How dare you read that -- !"

    Google so far has been taking the high ground by saying in effect that the Chinese public now has more information than they previously had

  • The Chinese will want it fixed. This should be the "worst kept secret" not news. :-)
