
News for nerds, stuff that matters

Google Admits to Using Sohu Database

Posted by CowboyNeal on Mon Apr 09, 2007 06:46 PM
from the cut-and-paste dept.
prostoalex writes "A few days ago a Chinese company, Sohu.com, alleged Google improperly tapped its database for its Pinyin IME product, stirring controversy on whether two databases were similar just due to normal research process. Today Google admitted that its new product for Chinese market 'was built leveraging some non-Google database resources.' 'The dictionaries used with both software from Google and Sohu shared several common mistakes, where Chinese characters were matched with the wrong Pinyin equivalents. In addition, both dictionaries listed the names of engineers who had developed Sohu's Sogou Pinyin IME.'"

Related Stories

[+] Google Faces Plagiarism Questions Over Chinese Software 187 comments
yaohua2000 writes "Google's laboratory in China has launched its first product, a Pinyin Input Method Editor. The software allows romanized characters, entered on a QWERTY keyboard, to be translated to more traditional Chinese symbols. Users soon discovered that the data Google used for the product was unusually similar to the data used by a Chinese rival, Sogou. Google has evaded the question about software similarities, reports PC World. 'The similarities, which included an error involving the name of a celebrity, were noted on a Google Labs discussion board about its Pinyin IME. Users noted that entering the Pinyin pinggong into the Google IME incorrectly produced the name of Feng Gong, an actor and comedian.'"
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Is this... (Score:1, Insightful)

    by Hsensei (1055922) on Monday April 09 2007, @06:50PM (#18669419)
    (http://google.com/)
    Google doing evil, or sticking it to evil?
    • Re:Is this... by Anonymous Coward (Score:1) Monday April 09 2007, @07:12PM
    • Re:Is this... by renegadesx (Score:1) Monday April 09 2007, @07:47PM
      • Re:Is this... by Anonymous Coward (Score:1) Monday April 09 2007, @08:08PM
      • Doing evil to combat evil.. by iendedi (Score:2) Wednesday April 11 2007, @05:30AM
      • Re:Is this... by renegadesx (Score:1) Monday April 09 2007, @11:01PM
      • Re:Is this... by jstomel (Score:3) Monday April 09 2007, @11:26PM
        • Re:Is this... (Score:5, Interesting)

          by 808140 (808140) on Tuesday April 10 2007, @12:43AM (#18671923)
          No, actually, "gook" is a term that originated in the Korean war for Korean people. Because many of the soldiers who fought in the Korean war were officers in the Vietnam war, their racial slurs were adopted and modified by a new generation, leading to great confusion about the origins of the term.

          The etymology of the word gook is interesting, because it may be one of the few racial slurs that originated with a people's term for themselves. In Korean, guk means "country" and by extension a country's people; when it is not modified (cf. waiguk, outside country, foreigner) it is understood to be Korea or its peoples. Speakers of Chinese will recognize the word as having Sinitic origin (guó, country, and wàiguó, foreign country, respectively, in Mandarin).

          The term was appropriated by the Americans during the Korean war and used as a racial slur for Korean people in general, which must have been confusing to the Koreans (imagine someone using "American" as a slur for Americans to get an idea). Then, in Vietnam, the old "Asians are all the same" mentality prompted GIs to extend its meaning (imagine "American" being a racial slur for all white people, for example -- yes, I know many Americans aren't white, it's not a perfect analogy, deal with it).
          [ Parent ]
          • Re:Is this... by Anonymous Coward (Score:2) Tuesday April 10 2007, @04:48AM
          • Re:Is this... by AlecC (Score:2) Tuesday April 10 2007, @06:07AM
          • Re:Is this... by Dextrously (Score:1) Tuesday April 10 2007, @10:26AM
    • Re:Is this... by rm69990 (Score:2) Tuesday April 10 2007, @02:12AM
    • Re:Is this... by ajs (Score:2) Tuesday April 10 2007, @07:53AM
    • Re:Is this... by zippthorne (Score:2) Tuesday April 10 2007, @12:51AM
  • Dictionary mistakes. (Score:5, Funny)

    by Tackhead (54550) on Monday April 09 2007, @06:53PM (#18669449)
    > Today Google admitted that its new product for Chinese market 'was built leveraging some non-Google database resources.' The dictionaries used with both software from Google and Sohu shared several common mistakes, where Chinese characters were matched with the wrong Pinyin equivalents.

    ...including the ones for "plagiarize", "research", and apparently a new one for the 2000s under "leverage".

    Leverage! Leverage!
    Let no one else's work cut short your edge,
    Against the truth you can surely hedge,
    So don't cut short your edge,
    But leverage, leverage, leverage!

    (One man deserves the credit! One man deserves the blame!
    And Sergei Brin Ivanovich Lobachevsky is his name!)

  • Google's initial explanation (Score:5, Funny)

    by Anonymous Coward on Monday April 09 2007, @06:55PM (#18669459)
    "In the future, Google invents a time machine that's used by a rogue employee to travel back in time to give Sohu this database. It's clear then that Sohu stole our database."
  • Have no fear! (Score:2)

    by mattgreen (701203) on Monday April 09 2007, @06:55PM (#18669463)
    I'm sure someone will step up and help them save face in this embarrassing situation! When in doubt, you can always try to change the subject, that has worked well in the previous thread. Now that I think about it, we need a RoughlyDrafted-esque site for Google, anyone up to the task?
  • This reminds me of (Score:5, Interesting)

    by Diordna (815458) on Monday April 09 2007, @06:57PM (#18669475)
    (http://filer.case.edu/srj15)
    "Stolen from Apple Computer" (whole story [folklore.org])
  • by slashbob22 (918040) on Monday April 09 2007, @06:58PM (#18669481)
    I guess Google Labs will have to subscribe to Turnitin.com now.
  • So... (Score:5, Interesting)

    by Anonymous Coward on Monday April 09 2007, @06:58PM (#18669491)
    When caught making a mistake, they admit it, work to resolve it, and move on?
    I think there are a few other companies who could learn from that approach ...
  • Cmon Google... (Score:3, Funny)

    by Anonymous Coward on Monday April 09 2007, @07:02PM (#18669519)
    Surely, after helping so many students copy their research papers, you should know the number 1 rule of copying another person's work: Change the F*CKING NAME!
  • I wonder... (Score:2, Interesting)

    by flyboy81 (698817) on Monday April 09 2007, @07:04PM (#18669531)
    Is this a single isolated incident or simply the first one of more coming from the company that does no evil?
    • Re:I wonder... by themushroom (Score:1) Monday April 09 2007, @07:36PM
    • Re:I wonder... by AmberBlackCat (Score:2) Monday April 09 2007, @09:50PM
  • Time for a slogan change? (Score:5, Funny)

    by GFree (853379) on Monday April 09 2007, @07:12PM (#18669587)
    "Do no evil"

    should be changed to

    "Do just a tiny bit of evil"

    which at this rate will probably end up as

    "All your web are belong to us"
  • Car stereo (Score:4, Funny)

    by DogDude (805747) on Monday April 09 2007, @07:17PM (#18669623)
    (http://phydeauxpets.com/)
    So then, did the guy who stole my car stereo, was he "leveraging some non-car thief assets"?
    • Re:Car stereo by iminplaya (Score:2) Monday April 09 2007, @07:57PM
      • Re:Car stereo by iminplaya (Score:1) Tuesday April 10 2007, @12:06AM
  • New tag: copyvio (Score:1)

    by Matt Perry (793115) on Monday April 09 2007, @07:22PM (#18669643)
    I recommend tagging this "copyvio [urbandictionary.com]"
  • Do no evil (Score:5, Insightful)

    by z-j-y (1056250) on Monday April 09 2007, @07:26PM (#18669671)
    Google is going to release a statement that stealing code/data is not evil in China, and Google must fit in local cultures and abide by local laws.

    Seriously, this is just pathetic. I am appalled by the Google apologists on slashdot.

    Chinese input is a well established market; Google Giant forces itself into the market with a product that is very similar to existing ones and offers no innovation. That is not evil enough? They did this by stealing data and who knows what from others. Mind you that the data is not publicly available, so Google must have committed certain crimes to obtain the data.

    For those who don't see what's the big deal: the mapping from ASCII sequence to Chinese character/phrase is not trivial; actually it is what Chinese input is all about.
    • Re:Do no evil by maxume (Score:2) Monday April 09 2007, @07:41PM
      • Re:Do no evil by QuantumG (Score:3) Monday April 09 2007, @07:56PM
        • Re:Do no evil by Nazlfrag (Score:1) Monday April 09 2007, @09:59PM
    • Re:Do no evil by homer_s (Score:2) Monday April 09 2007, @08:01PM
      • Re:Do no evil by The_Wilschon (Score:2) Monday April 09 2007, @09:15PM
    • Re:Do no evil (Score:5, Insightful)

      by ShawnDoc (572959) on Monday April 09 2007, @08:07PM (#18669943)
      (http://www.pornforthemind.com/)
      This is a serious problem when dealing with Chinese companies. Now that Google has opened offices in China and has staffed them with native Chinese people, they're going to have a hard time enforcing Western-style ideas about copyright and what constitutes "doing no evil". It's a problem we've run into in the past with our Chinese operations. The way the problem was "solved", by removing the engineers' names but still clearly using the other company's engine (they didn't remove the identical bugs), is something I have seen happen in the past when dealing with our R&D team in China, when we've found them using code they "borrowed" either from open source code or from an engineer's past employer. I've never seen it handled in public like this, however. Google is going to need to take some serious QA steps in their Chinese offices to keep stuff like this from happening again, or else risk their Chinese office ruining the entire company's reputation.
      [ Parent ]
      • Re:Do no evil by adelord (Score:1) Monday April 09 2007, @11:05PM
        • Re:Do no evil by ioshhdflwuegfh (Score:1) Tuesday April 10 2007, @12:09PM
      • Re:Do no evil by ioshhdflwuegfh (Score:1) Tuesday April 10 2007, @12:22PM
    • Re:Do no evil by ReallyEvilCanine (Score:3) Monday April 09 2007, @08:29PM
      • Re:Do no evil by Achromatic1978 (Score:2) Monday April 09 2007, @09:15PM
      • Re:Do no evil by ioshhdflwuegfh (Score:1) Tuesday April 10 2007, @12:44PM
    • Re:Do no evil by asninn (Score:2) Tuesday April 10 2007, @03:38AM
  • by pcause (209643) on Monday April 09 2007, @07:27PM (#18669673)
    Ok, so we do do some evil, but just with our competitor's code. That isn't so bad, is it?
  • OK, so now that Google has admitted to copying the sohu.com pinyin database... exactly how did they get a copy in the first place? Is there a publicly available file for personal use or was there some sort of web scraping or what?

    I suspect that there's more to this story that we're not hearing.
    • by tooyoung (853621) on Monday April 09 2007, @08:45PM (#18670171)

      > OK, so now that Google has admitted to copying the sohu.com pinyin database... exactly how did they get a copy in the first place? Is there a publicly available file for personal use or was there some sort of web scraping or what?

      > I suspect that there's more to this story that we're not hearing.


      Exactly. Reading 95% of the comments for this story and yesterday's story, everyone seems to think that this is about stealing code. This is about Google using the same data to train an algorithm. Both algorithms make the same mistakes because they were trained using the same data, which contained incorrectly labeled information. Whether or not this data was publicly available is the issue.

      For (a horribly contrived) example: Let's say that I write some handwriting recognition software using a neural net. In order to train my software, I use a large database of handwriting samples that I have found on the web. However, the person that compiled this database made the mistake of labeling all of the sample images of the letter 'n' as the letter 'q', and all of the images of the letter 'q' as the letter 'n'. Person B comes along and uses the same data set to train a naïve-Bayes classifier. Guess what? Both algorithms will make the same mistakes when it comes to the letters 'n' and 'q'. Not because I stole code from Person B, but because we used the same training data.
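A minimal sketch of that point, under loud assumptions: the data is invented, each glyph is reduced to a single made-up feature, and the two learners are a nearest-centroid classifier and a k-nearest-neighbour vote rather than the neural net and naïve-Bayes classifier mentioned above. It only illustrates that unrelated code trained on the same swapped labels repeats the same errors.

```python
# Toy illustration: two different learners trained on the same mislabeled
# data reproduce the same mistakes. All data and names here are made up.
from collections import Counter

# One invented feature per sample: "n" glyphs cluster near 1.0, "q" near 5.0.
true_samples = [(0.8, "n"), (1.0, "n"), (1.2, "n"),
                (4.8, "q"), (5.0, "q"), (5.2, "q")]

# The shared training set has the two labels swapped, as in the example above.
swap = {"n": "q", "q": "n"}
training = [(x, swap[label]) for x, label in true_samples]


def nearest_centroid(train, x):
    """Classify x by the closest per-label mean."""
    sums, counts = Counter(), Counter()
    for value, label in train:
        sums[label] += value
        counts[label] += 1
    centroids = {label: sums[label] / counts[label] for label in counts}
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def k_nearest(train, x, k=3):
    """Classify x by majority vote among the k closest training samples."""
    closest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in closest).most_common(1)[0][0]


tests = [(0.9, "n"), (5.1, "q")]
for x, truth in tests:
    # Columns: true letter, nearest-centroid guess, k-NN guess.
    print(truth, nearest_centroid(training, x), k_nearest(training, x))
# prints:
# n q q
# q n n
```

Neither function shares a line of logic with the other, yet both call every 'n' a 'q' and vice versa, which is the pattern a shared data source would produce.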

      I'm not defending Google at all here. If they stole the data from Sohu, they should get in trouble. Based on the fact that Google is in the web-mining business, I would guess that they just grabbed this data off of the net, and someone forgot to think about if they had the right to use it.
      [ Parent ]
    • Re:Exactly how did they get a copy of the DB? by Hucko (Score:1) Tuesday April 10 2007, @04:16AM
  • this is quite troubling (Score:3, Insightful)

    by martin-boundary (547041) on Monday April 09 2007, @07:34PM (#18669705)
    It is clear from this example that _some_ Google engineers have not the first clue about what clean room engineering [groklaw.net] is and when it should be used. Everyone in the software industry is under pressure to produce; that doesn't mean cutting corners is acceptable.

    This reminds me of the recent story about GPL code found in OpenBSD [slashdot.org]. There too, an OpenBSD developer took someone else's code and started modifying it without keeping the GPL license. He apparently thought it was ok to do this as long as all the offending functions would be renamed in the final release, but was caught checking in unmodified functions by accident.

    Google is well known for using a lot of GPL software, but it is also true that they do not distribute the source code of their flagship programs to the public. Episodes like this make people wonder if they "accidentally" use some GPL code in their distributed products without telling anyone.

  • Ironic (Score:5, Funny)

    by smackt4rd (950154) on Monday April 09 2007, @07:42PM (#18669775)
    So now American companies are pirating Chinese software? Oh the irony! :)
  • Their new spokesperson ... (Score:2, Funny)

    by myster0n (216276) on Monday April 09 2007, @08:14PM (#18669985)
    ... Theo De Raadt says that the Chinese are INHUMAN.

    *ducks*
  • Were the errors intentional? (Score:4, Informative)

    by SuperBanana (662181) on Monday April 09 2007, @08:22PM (#18670025)

    If you ask around in the GIS/mapping community, it's known that the [street] map data providers (Delorme, Garmin, etc.) will insert garbage data here and there. A street name is slightly wrong, or they have a mystery street that doesn't exist in the real world. They use it to try to tell if and when someone steals their data. If Zyugyz Road in Somecity, CA shows up in someone else's map, the legal team fires at will.

    It's kind of weird, considering that most mapping companies do little more than get their hands on town/county/state GIS data for cheap, massage it a bit, then charge assloads of money for it.
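As a rough sketch of how such planted entries could be checked, assuming both datasets can be reduced to plain strings (the function name and the sample entries are hypothetical, reusing the made-up street from above):

```python
def find_trap_hits(own_traps, suspect_entries):
    """Return the planted (fictitious) entries that also appear in a suspect dataset."""
    return sorted(set(own_traps) & set(suspect_entries))


# Entries the provider invented; they exist nowhere in the real world.
own_traps = {"Zyugyz Road, Somecity, CA"}

# A competitor's dataset under suspicion (hypothetical sample).
suspect = ["Main Street, Somecity, CA", "Zyugyz Road, Somecity, CA"]

hits = find_trap_hits(own_traps, suspect)
print(hits)  # any hit is hard to explain by independent surveying
```

A real check would need to normalize spelling and formatting first; exact string matching is the simplest possible stand-in.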

  • Shame! (Score:3, Funny)

    by BluBall (16231) on Monday April 09 2007, @08:24PM (#18670047)
    Following the protocols established by the recent OpenBSD/Linux Broadcom driver fiasco, the proper response would be to denounce Sohu for having been ripped off by Google.

    Shame on you Sohu! This is inhuman!
  • Right! Google is evil! (Score:4, Insightful)

    by SEE (7681) on Monday April 09 2007, @08:33PM (#18670097)
    (http://jargon-file.org/)
    After all, we know that all Google employees are under Total Management Mind Control, and that Google Knows Everything Everyone's Doing. It's not even remotely possible that a handful of Google employees in China could shadily cut corners (using an already-extant database instead of compiling one from their own company's data) without Sergey Brin and Larry Page having personally authorized it from Mountain View, or that it would actually take a bit of time for upper management to investigate an issue when it's uncovered.
  • Please tell me (Score:1)

    by Mazin07 (999269) on Monday April 09 2007, @08:46PM (#18670177)
    (http://www.aztekera.com/)
    How is Google's pinyin IME better than the tons of other pinyin IMEs out there? I tried it, and apart from having a search button, it doesn't seem to be a whole lot better than the Microsoft Pinyin IME that comes with Windows.

    How does Google plan to set themselves apart from the rest of the competition and, even better, how does this fit into the "big picture"? Will the mass of adopters suddenly begin using Google search because it's built into their IME?
  • Tutorial on Chinese input (Score:5, Informative)

    by microbee (682094) on Monday April 09 2007, @09:02PM (#18670255)
    There are a lot of misunderstandings about how an IME works and how Google copied non-public databases. So let me explain.

    IME accepts keyboard input and converts it into certain language characters. There are many different input methods that decide how to generate Chinese characters by using English keyboards, and pinyin is one of them (and the most popular one).

    Pinyin is popular because it's simple and has almost no learning curve. However, it suffers from the problem of aliasing. For example, "shi" under pinyin can convert into many different characters; in general, the same sequence could map to many different words (possibly several dozen), and you usually need to select from them by choosing 1, 2, 3, ... (the input bar displays the candidates to choose from, sometimes requiring page-down). A naive implementation of pinyin is thus very slow and cumbersome to use.

    A good implementation uses the following approaches:
    1. Adjust each word's position by how frequently it has been used in the past, so the most frequently used words are shifted to the front, making selection much faster. Typically they should fit on the first page (no scrolling required).
    2. Allow partial input for common phrases. This inputs a whole phrase at once, with each character requiring only its first English letter. It speeds up input significantly.
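The two approaches above can be sketched in a few lines. This is a toy model, not how Google's or Sohu's IME is actually built: the class name, its methods, and the four dictionary entries are all assumptions chosen for illustration.

```python
from collections import defaultdict


class ToyPinyinIME:
    """A toy candidate list: frequency ranking plus first-letter phrase input."""

    def __init__(self):
        self.candidates = defaultdict(list)  # typed key -> candidate strings
        self.uses = defaultdict(int)         # candidate -> times selected

    def add(self, pinyin, text):
        """Register `text` under its full pinyin and, for phrases, its initials."""
        syllables = pinyin.split()
        keys = {"".join(syllables)}
        if len(syllables) > 1:
            keys.add("".join(s[0] for s in syllables))
        for key in keys:
            if text not in self.candidates[key]:
                self.candidates[key].append(text)

    def lookup(self, typed):
        """Return candidates for `typed`, most frequently selected first."""
        # sorted() is stable, so unused candidates keep their insertion order.
        return sorted(self.candidates.get(typed, []), key=lambda t: -self.uses[t])

    def select(self, text):
        """Record that the user picked `text`, so it ranks higher next time."""
        self.uses[text] += 1


ime = ToyPinyinIME()
for pinyin, text in [("shi", "是"), ("shi", "十"), ("shi", "时"), ("zhong guo", "中国")]:
    ime.add(pinyin, text)

ime.select("时")
print(ime.lookup("shi"))  # "时" moves to the front after being selected
print(ime.lookup("zg"))   # the initials alone retrieve the whole phrase
```

The interesting part is not this code but the dictionary fed into it: with tens of thousands of entries and realistic usage counts, the quality of the data decides how good the guesses are.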

    So the quality of the pinyin method depends heavily on how well the input could guess and prioritize the guesses, and thus the dictionary that is being used. And generating this dictionary (keeping it both contemporary and accurate) takes a lot of time.

    The dictionary is typically distributed together with the input method (or it wouldn't work). You could obtain Sohu's dictionary by just installing its input method, and Google has likely obtained it this way. However, I don't think it's in an open-standard format, so Google has probably done some reverse engineering to be able to actually use it in its own software.

  • That shouldn't be copyrightable (Score:5, Interesting)

    by wrook (134116) on Monday April 09 2007, @09:03PM (#18670265)
    (http://mangahowto.dnsdojo.org/howto/)
    I've been thinking about this. Throwing the evilness of Google aside for a moment, why should someone be able to copyright a listing of the phonetic pronunciation of an alphabet?

    Let's just imagine how I might create this list. I would have to hire people who spoke Chinese. Then I would ask them to record the pronunciation of each character that they know. This is pretty easy because in Chinese each character has only one pronunciation (per dialect, anyway). There are about 3500 characters that you need to know in order to be literate. And all of these people would have learned these at school.

    But how did they learn them? Well, they had a textbook and they memorized the list from the textbook.

    Wait. I can't just memorize a list from one book and put it in another book. That's copyright infringement. In order for it not to be copyright infringement, I need to make sure that my sources all memorized the pronunciations from different sources. That's going to be difficult.

    But let's say I do that. Now I have a list of the 3500 most common characters. And with that, I've probably got 99% of everything that's in a newspaper. But that's probably not good enough. I probably want a list of, say, 60,000 characters. Otherwise it's pretty useless in a general sense. Uncommon characters are uncommon, but you *will* bump into the words over time.

    So where do I find these characters? Can I hire some guy that knows them all? It would be very difficult. The best place to look is in a book. But wait... what am I going to do? Every time I find a character my people don't know, look it up in a book? Why don't I just copy it from the book in the first place? That's just copyright infringement again.

    Really, the task of creating this list authoritatively without infringing copyright is monumental. Probably the *only* way to do it is with a community project where people just submit the pronunciations they know.

    But if I'm going to have a community project like this, what the heck do I need copyright for? What am I protecting? If everyone is going to contribute, everyone should benefit.

    So, personally, I don't think one should have copyright on this kind of material (same thing for spelling). It's just not in the public interest. This goes doubly so now that we have the internet and creating these kinds of projects is very inexpensive.

    OK, I've gone on long enough... But one more rant. What's with this "do no evil" thing? Isn't that setting the bar a little low? If I told my parents that I'd work hard not to be evil, I think they'd be somewhat disappointed in me. If Google wanted to actually "do some good" rather than "do no evil", they could start a community project to collect this data and share it with the world.

    Sigh... I guess we'll have to wait for some guy in his garage (but here's betting that someone has already started something).
  • by DarkLegacy (1027316) on Monday April 09 2007, @09:38PM (#18670461)
    (http://www.darklegacy.us/)
    > In addition, both dictionaries listed the names of engineers who had developed Sohu's Sogou Pinyin IME.

    And you thought Easter Eggs were just there for kicks. ;)

  • by gatkinso (15975) on Monday April 09 2007, @09:38PM (#18670471)
    TURN ABOUT IS FAIR PLAY.

    Ok fine, we have stolen from them before... but Beef and Broccoli don't count.
  • Sorry, I was just leveraging some non-personal resources.
  • About time (Score:1)

    by XCondE (615309) on Monday April 09 2007, @10:11PM (#18670745)
    (http://microrants.blogspot.com/)

    Finally, the first (?) crack on the building appears.

    Am I just going to have to start-up my own evil-free(tm) company?

  • by PassBy (1086365) on Monday April 09 2007, @10:22PM (#18670813)
    The chief of Google China, Kai-Fu Lee, used to be a Microsoft vice president, go figure...
  • by Keith McClary (14340) on Monday April 09 2007, @10:24PM (#18670837)
    In the US, a list of words in lexicographic order is not necessarily copyrightable (eg. phone books).

    Is it also so in China? And does China have laws making databases IP like the US?

    Americans seem to think that their bizarre and extreme notions of IP are universal law.

    Perhaps someone here is an expert on Chinese IP law - did Google-China do anything illegal?

  • Begs teh question. (Score:2, Funny)

    by Anonymous Coward on Monday April 09 2007, @11:37PM (#18671495)
    Sohu cares?
  • Google's response (Score:2, Funny)

    by Loconut1389 (455297) on Tuesday April 10 2007, @06:09AM (#18673153)
    (http://webtrotter.com/blog)
    The person responsible for the copying has been sacked. ...
    The person responsible for the sacking has been sacked...
  • Mistakes are (Score:2)

    by EmbeddedJanitor (597831) on Monday April 09 2007, @07:12PM (#18669585)
    The mistakes were the giveaway. Surely these are "creative works"?
    [ Parent ]
  • Google may be filled with the best engineers, but once you move out of North America, they know nothing about ethics or morality.

    I'm curious how much time you've spent outside of North America, because I'm pretty sure 92% of the world population would disagree with you.
    [ Parent ]
  • Re:Do no evil? (Score:2)

    by mattgreen (701203) on Monday April 09 2007, @08:56PM (#18670237)
    But they SAID they weren't evil, therefore that MUST make them good! Or, at least, that is how I fit into my naive worldview! Everything is either absolutely evil (Microsoft) or absolutely good (Google). There is no in-between.
    [ Parent ]
  • Oblig futurama quote (Score:5, Funny)

    by pedantic bore (740196) on Monday April 09 2007, @09:03PM (#18670263)
    "The internet is about the free exchange of other people's ideas!"
    [ Parent ]
  • Tell you what, grab an M16 and man the borders. What the fuck piece of xenophobic, nationalistic tripe is this? "no more good people left in the world"?
    [ Parent ]
  • Re:Do no evil? (Score:4, Interesting)

    by setagllib (753300) on Monday April 09 2007, @10:00PM (#18670643)
    They're significantly reducing the lock-in to Microsoft products by encouraging, buying, and thereafter funding web application projects that often overlap with what is currently locked in to Microsoft. They even brew some of their own sometimes. They continue the development of Linux and Python with a wide adoption of both. All of these things are creating wealth for everyone, and crippling Microsoft little by little, which we know is what we want. I'd much rather have a Google & Microsoft duopoly if it means Microsoft would finally have to clean up its shit and accommodate whatever open source platform Google would support in that scenario.
    [ Parent ]
  • by PassBy (1086365) on Monday April 09 2007, @10:42PM (#18671033)
    Dare not use your real name, eh, Anonymous Coward? The head of Google China was educated in North America, worked in North America, and was sent back to China by Microsoft. So where did he learn his engineering ethics? Do you want to compare the number of IT lawsuits going on in America and China? I have to give it to you though. That was a quick one! I can't imagine anyone able to strike so low so fast, except for someone who always has this little hate in mind.
    [ Parent ]