Poor Spelling Beats Google's China Filter
antifoidulus writes "CNN's money section contains a blurb (among other blurbs) about how poor spelling can beat Google's Chinese filter. The example given in the article is that a search for "Tiananmen" will yield peaceful pictures of the square, but a search for common misspellings such as "Tienanmen" will yield plenty of photos of tanks."
That's unpossible! (Score:5, Funny)
Re:That's unpossible! (Score:5, Funny)
Re:That's unpossible! (Score:5, Funny)
[/longshot]
-nB
Re:That's unpossible! (Score:2)
Re:That's unpossible! (Score:2)
Of course, Roman-alphabet spelling of Chinese has a long history of variants. I'd think that, if they are serious about censoring google in China, they should do a careful study of all the spelling systems in use, and block all of the variants.
Of course, they'll also end up blocking a lot more stuf
Re:That's unpossible! (Score:2, Insightful)
Re:That's unpossible! (Score:2)
But as the previous poster aptly noted (flamebait or not) I never asserted that the US population is educated. In fact, I believe that the general decline of educational rigor in our country is largely responsible for its decline.
-nB
Re:That's unpossible! (Score:2, Insightful)
Bug report successfully submitted (Score:4, Funny)
Sincerely,
Google information liberation management team
Google Inc. "Do no evil."
Obvious (Score:5, Interesting)
Re:Obvious (Score:5, Insightful)
This eliminates certain types of bad spellings, obviously, but opens certain avenues that aren't available in English, such as choosing characters with similar meanings but different sounds, or similar sounds but different meanings.
For the Tiananmen example, the characters for TianAnMen (天安门) mean "Heaven," "Peace," "Gate." Heaven could be replaced with "Sky," which has a completely different sound, or "Money," which (if I recall correctly) is pronounced "Qian" (Q sounds close to English CH). This could also happen with the other two characters in this word, and of course for many other 'bad' words.
The reason that common words like "pr0n" have become associated with porn, or other examples, is that a community of users agreed upon a certain misspelling of those words, and the same can and WILL happen in China to evade whatever filters search engines use. There is no way to have an even semi-open search system that doesn't allow human ingenuity to overcome its filters, and the brief history of the internet in the west indicates that these filters will, ultimately, be only partially and temporarily effective.
Re:Obvious (Score:5, Informative)
I can only add that the Chinese government, with their insistence on the not-at-all-intuitive-to-non-Chinese-speakers romanization system that is Pinyin, have only themselves to blame.
Ask a number of reasonably educated people whose native languages use the Roman alphabet to listen to a Chinese person pronounce "Tiananmen" and then write down what they think the spelling should be. I guarantee many of them will "misspell" it as "Tienanmen", since the vowel in question is pronounced like the sound that most languages express with an "e".
Expect more of this as Pinyin isn't going away any time soon.
(And yes, I do have my flame-retardant jacket, Academic Dispute Wear Edition, all prepared!)
Re:Obvious (Score:2)
Now I know why it reads so funny; it is just meant to use the Roman alphabet, but not in any particularly standard way. E.g. 'x' is like 'SH' -- just because. Hiyaaaaaaah!
Re:Obvious (Score:3, Funny)
ps: I speak Portuguese, that's why X can sound like SH... I don't know about other languages, but I'd guess this happens to other Latin-based ones
Re:Obvious (Score:4, Informative)
Putko, they did of course have standards, but they only make sense if you already speak Chinese.
"Tian" does not rhyme with "fan", but somehow, "duo" and "luo" rhyme with "po" and "fo", which do contain "u" sounds in the middle; they just aren't written because plain "po" doesn't exist.
One of the purposes of pinyin was a potential replacement of the character system with it, so I can understand them not considering the interests of non-native speakers, but if you're going to force it on non-natives too, well, expect to see spelling "errors" become unavoidable when they use Chinese.
Re:Obvious (Score:2)
Re:Obvious (Score:2)
Not all languages pronounce the letters the same. Case in point: "J".
Re:Obvious (Score:2, Insightful)
This gives me an idea... (Score:5, Funny)
Finally, something good comes from spammers! (Score:3, Funny)
-Eric
Re:Finally, something good comes from spammers! (Score:2)
-Eric
Still busted (Score:2)
Nice lighting.
Don't you know... (Score:3, Funny)
Re:Don't you know... (Score:2)
Another fun thing is to continue reading after those biblical verses. A bit further on, in both cases, there are explicit exceptions for locusts, crickets and grasshoppers. Those are all explicitly listed as kosher for human consumption. But I've never seen any Jews or Christians eating them, despite God's advice that they're good.
Religious people sometim
31337 (Score:2)
r33l!, th3 b!7 @b0u7 $p3llin9 !$ n0 $urpr!$3 70 31337 h@><0rz....
Re:This gives me an idea... (Score:2)
Taco, You Got a Great Career Ahead Of You... (Score:5, Funny)
Re:Taco, You Got a Great Career Ahead Of You... (Score:2)
Heh. (Score:5, Insightful)
People who want to get information will get it, and you can't stop them.
Exploiting Google's Page Rank (Score:5, Interesting)
A while back, this was known as Google Bombing [wikipedia.org] and certain individuals exploited Google's system very effectively by linking to pages with words that, by all rights, were not very accurate. After all, do a Google search for the word 'failure' [google.com] and the top site is George W. Bush's Whitehouse domain Biography.
So what do you do to help the Chinese? Perhaps you could make a page with two columns. In one column would be the correct text with no link and the key word. In the other column would be all the permuted misspellings with links to the real sites. You could host this on your website and send it to friends asking them to also host it. They would need to slightly alter it before hosting it, but it would effectively provide the page ranks for the misspellings and give anyone in China (who has access to your page) a key if they need it.
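For what it's worth, the "permuted misspellings" column could be generated rather than typed by hand. Here is a rough sketch in Python, assuming the only kind of misspelling you care about is a single swapped vowel. The function name and the vowel-only rule are my own assumptions for illustration, not anything from the article, and real misspellings are obviously messier.

```python
VOWELS = "aeiou"

def vowel_variants(word):
    """Return every spelling that differs from `word` by exactly one vowel."""
    variants = set()
    for i, ch in enumerate(word):
        if ch in VOWELS:
            for v in VOWELS:
                if v != ch:
                    # Swap the vowel at position i for a different vowel.
                    variants.add(word[:i] + v + word[i + 1:])
    return sorted(variants)

if __name__ == "__main__":
    # One line per candidate misspelling; "tienanmen" is among them.
    for variant in vowel_variants("tiananmen"):
        print(variant)
```

Each variant would then become the link text pointing at the real site, which is all the two-column page needs.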
Re:Exploiting Google's Page Rank (Score:2, Interesting)
They aren't stupid.
Re:Exploiting Google's Page Rank (Score:2)
Hang on, I thought you were going to give an example of googlebombing leading to inaccurate results?
Now, on the other hand, what b3ta did to Damon Albarn really wasn't nice...
Re:Exploiting Google's Page Rank (Score:2, Interesting)
Here, in India, it's still Bush http://www.google.co.in/search?num=100&hl=en&safe=off&as_qdr=all&q=failure&btnG=Search&meta= [google.co.in]
Google has never before given me different search results for google.co.in and google.com
This is the first time I'm seeing different results for these two domains.
Re:Exploiting Google's Page Rank (Score:2)
*brain explodes*
Perfect Example... (Score:5, Insightful)
Re:Perfect Example... (Score:4, Insightful)
Interesting. (Score:5, Interesting)
Re:Interesting. (Score:3, Interesting)
It's like when the RIAA/MPAA ask to filter results from torrent sites - the exact request is blocked but variations continue.
Censorship is futile and those who want the information can get it.
Re:Interesting. (Score:4, Insightful)
Re:Interesting. (Score:2)
They are complying with the Chinese government's censorship rules, nothing more. They know that it's the only way that they'll get google in China AND they know that there is absolutely no way that the Chinese government's blacklist will block *everything*.
Rock, hard place; so they chose to go into China knowing that it would be more for the good even though they likely knew that their rabid fanb
Tanks (Score:2, Interesting)
So I did a Google search and all those pictures of tanks are basically one photo hosted on different sites.
Re:Tanks (Score:5, Interesting)
Re:Tanks (Score:2)
Anybody who was in Beijing during that time would have known about protests and tanks. Now I could understand if they did not know the exact number of deaths/arrests or that sort of thing. But "never seen tanks" just doesn't fly.
But then I have talked to Japanese teens (who should be in their mid-20's now) who thought Japan won World War II. So there.
Re:Tanks (Score:2)
But then I have talked to Japanese teens (who should be in their mid-20's now) who thought Japan won World War II. So there.
Well there are American teens who don't know where Native Americans are from (not "Indians", Native Americans). So ha! We're still ahead in the War on Knowing Stuff.
Valuable Lesson from Spammers (Score:4, Insightful)
Is Google's filter Bayesian based?
Re:Valuable Lesson from Spammers (Score:3, Interesting)
Second - let's all not forget that Chinese don't quite "spell" it when writing. I don't know how well (if at all) Bayesian filtering and stuff would work for "kanji" (or what do they call it?)
Re:Valuable Lesson from Spammers (Score:3, Informative)
Re:Valuable Lesson from Spammers (Score:5, Informative)
All right, this question has come up several times in the thread.
The Mandarin dialect has approximately 31 phonetic components. These can be combined as single phoneme, dual phoneme, and triple phoneme groups. Some sounds always stand alone, some combine into triples, some do not. Some phonemes only exist as initials. Some only as finals, etc. etc. The end result is a hundred-odd unique phonetic combinations.
Then there are tones. Five tones per phonetic combination. There are a few sounds that never appear in certain tone patterns, but this is the exception, and not the rule. So this brings us up into mid 3-digits of total possible sound groupings, including intonation.
Now, you've probably heard somewhere that there are thousands of characters. So if there are only a few hundred unique sounds, but thousands of characters, of course, you have homonyms everywhere.
(I was going to do a demo of how this works, but
Now, the problem is that there are many characters mapping to each sound. As such, while you can only mess with English words so much before they become unrecognizable (porn, pron, pr0n, prawn, etc.), you can make hundreds of permutations of any common phrase in Chinese simply by swapping out the correct character for a different one.
I am not aware of a Chinese version of l33t-speak. There's trashy, slang Chinese, sure. But either you have the right character, or you don't. Without a standard nomenclature for screwing up words, it becomes hard to try alternate 'spellings' to work around the filter.
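To put a number on the character-swapping point above, here is a small Python sketch. The homophone table is a tiny hand-picked illustration (a couple of same-sound, same-tone characters per syllable), not a real dictionary, and the function name is made up for the example. A real table would have far more entries per sound, so the count grows much faster than this.

```python
from itertools import product

# Illustrative only: a few characters that share sound and tone with each
# character of 天安门. A real homophone table would be much larger.
HOMOPHONES = {
    "天": ["天", "添"],        # tian, first tone
    "安": ["安", "氨", "鞍"],  # an, first tone
    "门": ["门", "扪"],        # men, second tone
}

def homophone_variants(phrase):
    """Return every rewrite of `phrase` made by swapping in listed homophones."""
    # Characters missing from the table are kept unchanged.
    choices = [HOMOPHONES.get(ch, [ch]) for ch in phrase]
    return ["".join(combo) for combo in product(*choices)]

if __name__ == "__main__":
    variants = homophone_variants("天安门")
    print(len(variants))  # 2 * 3 * 2 combinations, including the original
```

Even with this toy table a three-character phrase has a dozen readings-alike spellings, and a filter would have to list every one of them.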
Re:Valuable Lesson from Spammers (Score:3, Insightful)
This is irony at best... (Score:5, Funny)
Engrish in the spirit of Freedom!
Re:This is irony at best... (Score:2)
All your base are Beijing to us !
Re:This is irony at best... (Score:2)
I still remember trying to decipher a manual for a Fanuc CNC control computer, the kind of computer that controls the motion of an industrial laser. Never could find the setting for parity, and I spent two hours on the phone with a "Tanglish-only" speaker. God, what a headache I had!
Not for long (Score:5, Insightful)
Re:Not for long (Score:3, Funny)
did you mean: "Please report me to the authorities" ?
Re:Not for long (Score:2)
Type of filter (Score:4, Interesting)
LSA is useful for dealing with synonyms, so I cannot see any reason why it wouldn't work with misspellings (assuming that they're common).
Re:Type of filter (Score:2)
This is exactly why I said Google was good! (Score:2, Redundant)
Re:This is exactly why I said Google was good! (Score:3, Insightful)
The weakness of computers (Score:3, Insightful)
Friedums just anoder werd (Score:2)
I'm so dam Ronery
l33t sp33k (Score:2)
Whoopsie (Score:2)
If there is one thing that many of us have learned over the course of our internet-connected lives, it is the simple fact that there is a work-around for EVERYTHING.
There has yet to be a copy protection scheme that hasn't been defeated. There is no internet filter that can't be bypassed, and no blocking that can't be dodged.
What the Chinese need to learn is that their efforts are as futile as attacking a funny f
Re:Whoopsie (Score:2, Insightful)
Chop searchy (Score:2)
oh well..... (Score:2, Funny)
O rly? (Score:2)
Scotty said it best... (Score:2)
"The more you overtake the plumbing, the easier it is to clog the drain."
China has a Maginot-Line mentality, and their censorship efforts will eventually fail just as miserably.
(ST flames and corrections, and French jokes, may commence now.)
On Behalf of Google, Freedom, and common sense (Score:4, Insightful)
Do you want to ruin it?
Come on, damnit! Shutupabout it.
Consider this the "getting your foot kicked under the table" move.
Re:On Behalf of Google, Freedom, and common sense (Score:2, Funny)
Re:On Behalf of Google, Freedom, and common sense (Score:5, Insightful)
Let's do a thought experiment.
On one side, we have a reasonably interesting search engine company.
On the other, we have a control-minded, autocratic government.
The search engine company (that wants to operate in China) is told by the autocratic government "We don't want Bad Things sneaking in through the search engine. Keep Bad Things out."
The search engine company says "OK. We'll play along. Give us a list of things you don't want to see. We'll get rid of them".
"Taiwan Independence" returns 0 results.
"Free Tibet" is delinked.
Various combinations of Tiananmen, 6 and 4 mysteriously vanish.
Unfortunately, Bad Things do not fit into nice little boxes. People misspell words. While it is easy to come up with a list of sites that contain Bad Things you do not want to see, new sites come up all the time. Is my friend's picture gallery from Tiananmen just some postcards to the folks back home, or is there some subtle political commentary in there? Well, you'll have to read it and find out.
If I search on (former Taiwanese president) Lee Teng-Hui, does that contain Bad Things? Does it link to Bad Things? How dangerous is a stooped 85 year-old former college professor anyhow?
Is Gandhi axiomatically Bad? Martin Luther King? Dostoyevsky? The list goes on and on and on.
The censors can control the obvious things. Ultimately, they will lose.
The real problem is that China is, for all its faults, a modern country. People come in, people fly out. When I go to China, lots of people ask what's going on in the outside world. I am a little circumspect in what I say, but my memory banks don't magically get erased when I cross over from Hong Kong to Shenzhen. Over 90% of the Chinese students you see toiling away at your local research university will ultimately go home. That's just the way it goes. They too don't forget whatever subversive thoughts may have crept into their heads during five or six years of study abroad.
The deck is stacked, and the good guys will ultimately win.
Does Google filter other languages? (Score:2, Interesting)
Uh. (Score:2)
B) This is an oversight that would be easily corrected.
C) You just announced it publicly and unignorably.
D) Most of the people censored don't spell it with Latin characters anyway.
Nothing to see here....SERIOUSLY (Score:2)
This is no intentional 'hack' of the system. It's a new content filter and there's going to be holes to be patched and creative solutions to be found for creative problems.
So before you go hail the Google dev team as being revolutionary, maybe you should consider they just missed the mark the first time around and have a l
Shhh! (Score:2)
Great! (Score:2)
I hope Google doesn't read Slashdot! (Score:2)
Thanks for blowing it for the Chinese...putting a link to some backwater news site on the front page of Slashdot.
On a more serious note, couldn't people who are not in China put up a little proxy to return Google results? For example, I have a domain hosting a few pages. Could I put a little script to take a query entered at my site and return results obtained from Google?
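Something like the following is roughly what I have in mind, as a Python sketch only. The upstream URL, the function names and the `hl` default are my assumptions, and Google may well block or disallow automated queries like this, so treat it as an outline of the idea rather than something known to work.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the uncensored search endpoint we forward queries to.
UPSTREAM = "https://www.google.com/search"

def upstream_url(query, lang="en"):
    """Build the upstream search URL for a query typed into our own page."""
    return UPSTREAM + "?" + urlencode({"q": query, "hl": lang})

def fetch_results(query, lang="en", timeout=10):
    """Fetch the raw result page. Needs network access and may be refused."""
    request = Request(upstream_url(query, lang),
                      headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()

if __name__ == "__main__":
    # Only builds the URL; call fetch_results() to actually hit the network.
    print(upstream_url("tiananmen square"))
```

The script on my domain would call `fetch_results()` with whatever the visitor typed and hand the page back. Of course the proxy itself is then just one more address that can be blocked.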
Re:I hope Google doesn't read Slashdot! (Score:2)
Pidgin English vs Piglatin (Score:2)
How to Hack Google's censor in China (Score:4, Informative)
This is what a chinese search for Democracy looks like after this method has been applied:
http://www.google.cn/search?hl=zh-CN&q=democracy+
Am I the only one... (Score:3, Insightful)
Capitalization (Score:2)
The streets find their own uses for technology. (Score:3, Insightful)
They were scooped by /. comments (Score:2)
Pictures (Score:2)
Searching for something as simple as "tank man" or "tank square" on GIS.cn will get you the pic you're looking for, btw. As long as you don't include "tiananmen" in the query, you'll get it.
Oh, and (Score:2)
Grammar Nazis (Score:2)
Could bring a whole new meaning to the expression "spelling/grammar nazi" if the Chicoms decide to start rejecting queries with too many non-OED words.
Thank you for your bug report. (Score:2)
This is so ironic! (Score:2)
This reminds me of the phrase: "Your famine is my feast".
How western media works vs. chinese free speech (Score:2)
Example: Teeanamen Skware.
An incorrect spelling like that gets published, say HERE, and is noted by some Chinese equivalent of Winston Smith in the Chinese Minitrue, and it's passed over to the directorate for inclusion on words to ban. Eventually you run out of room to run, even
beating the filters? (Score:2)
A negative information flow coming up? (Score:2)
Google so far has been taking the high ground by saying in effect that the Chinese public now has more information than they previously had
Great now that it has been reported... (Score:2)
Re:slashdot (Score:2, Funny)
Re:slashdot (Score:2, Funny)
Re:I get tanks no matter what the search term (Score:2)
Re:I get tanks no matter what the search term (Score:2)
http://images.google.cn/images?q=%E5%A4%A9%E5%AE%
Re:I get tanks no matter what the search term (Score:2)
But you miss my point:
Do we get the same results when searching from inside the great wall as when searching from outside. The search engine can tell what country we come from (even roughly where my ISP P.O.P. is); it would be easy to change the filtering algorithm depending on this.
This would allow China to hide the exact extent of its filtering from the out
Re:I get tanks no matter what the search term (Score:2)
I was playing around with this yesterday after Alec Muffett had discovered something similar early Sunday morning:
http://www.crypticide.com/dropsafe/articles/security/post20060129233439.comments [crypticide.com]
as in life, your view may vary depending on
Re:Poor Spelling? (Score:2)