A South Korean Chatbot Showed How Sloppy Tech Companies Can Be With User Data (slate.com) 11
A "Science of Love" app analyzed text conversations uploaded by its users to assess the degree of romantic feelings (based on the phrases and emojis used and the average response time). Then after more than four years, its parent company ScatterLab introduced a conversational A.I. chatbot called Lee-Luda — which it said had been trained on 10 billion such conversational logs.
But because it used billions of conversations from real people, its problems soon went beyond sexually explicit comments and "verbally abusive" language: It also became clear that the huge training dataset included personal and sensitive information. This revelation emerged when the chatbot began exposing people's names, nicknames, and home addresses in its responses. The company admitted that its developers "failed to remove some personal information depending on the context," but still claimed that the dataset used to train chatbot Lee-Luda "did not include names, phone numbers, addresses, and emails that could be used to verify an individual." However, A.I. developers in South Korea rebutted the company's statement, asserting that Lee-Luda could not have learned to include such personal information in its responses unless it existed in the training dataset. A.I. researchers have also pointed out that it is possible to recover the training dataset from the chatbot itself. So if personal information existed in the training dataset, it could be extracted by querying the chatbot.
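The researchers' point about recovering training data by querying the model is a standard memorization probe: hand the model a prefix that might precede a memorized record and check whether it completes it verbatim. A rough sketch of the idea, assuming a locally loadable causal language model; the checkpoint name and prefixes below are placeholders, and Lee-Luda itself was a hosted chatbot, not an open model you could load like this:

    # Sketch of a memorization probe against a local causal language model.
    # Model name and prefixes are placeholders for illustration only.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "skt/kogpt2-base-v2"  # placeholder Korean GPT-2 checkpoint
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

    def probe(prefix: str, max_new_tokens: int = 30) -> str:
        """Greedy-decode a completion; verbatim personal details in the
        output suggest the training data contained them."""
        inputs = tokenizer(prefix, return_tensors="pt")
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding makes memorized strings easier to spot
            pad_token_id=tokenizer.eos_token_id,
        )
        return tokenizer.decode(output[0], skip_special_tokens=True)

    # Prefixes mimicking how an address or contact detail might appear in chat logs.
    for prefix in ["My home address is", "You can reach me at"]:
        print(probe(prefix))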
To make things worse, it was also discovered that ScatterLab had, prior to Lee-Luda's release, uploaded a training set of 1,700 sentences, part of the larger dataset it had collected, to Github. Github is an open-source platform that developers use to store and share code and data. This Github training dataset exposed the names of more than 20 people, along with the locations they had been to, their relationship status, and some of their medical information...
[T]his incident highlights the general trend of the A.I. industry, where individuals have little control over how their personal information is processed and used once collected. It took almost five years for users to recognize that their personal data were being used to train a chatbot model without their consent. Nor did they know that ScatterLab shared their private conversations on an open-source platform like Github, where anyone can gain access.
What makes this unusual, the article points out, is how the users became aware of just how much their privacy had actually been compromised. "[B]igger tech companies are usually much better at hiding what they actually do with user data, while restricting users from having control and oversight over their own data."
And "Once you give, there's no taking back."
"Exposed"? (Score:2)
So this chatbot "exposed" personal data *already* "exposed" on the Intertubes...
Re: (Score:3)
It's not clear where they got this training data from, but often it is either scraped from websites or from things like email archives.
While much of it might have been public, that doesn't mean it doesn't need sanitizing first. Just because someone is mentioned by name in a public setting doesn't mean you want their name to be part of the training data.
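For a sense of what "sanitizing first" can mean in practice, here is a minimal redaction pass over chat lines before they become training data; the patterns are illustrative only, and a real pipeline would combine them with NER-based PII detection and human review:

    # Minimal, illustrative PII redaction for chat lines before training.
    # The patterns are examples, not a complete or production-grade filter.
    import re

    PATTERNS = {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "PHONE": re.compile(r"\b(?:\+?\d{1,3}[- ]?)?(?:\d{2,4}[- ]?){2,3}\d{3,4}\b"),
        "URL":   re.compile(r"https?://\S+"),
    }

    def redact(line: str) -> str:
        """Replace each matched span with a bracketed label."""
        for label, pattern in PATTERNS.items():
            line = pattern.sub(f"[{label}]", line)
        return line

    print(redact("call me at 010-1234-5678 or mail kim@example.com"))
    # -> "call me at [PHONE] or mail [EMAIL]"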
Re: Exposed??? (Score:1)
No, not previously exposed. Per TFA, the training set was more than 10 billion chat logs that people had sent to the Science of Love app, paying about $4.50 each for an analysis of whether the counterparty had romantic feelings for them.
A CS bedtime story (Score:3)
Once upon a time, a luser asked me "can you give me a list of all the item numbers that aren't in the system?", and I said "sure, but it's going to take the lifetime of the universe to print out".
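The punchline is that "item numbers that aren't in the system" is the complement of a finite set over an unbounded domain; it only becomes answerable once you bound the range, as in this small sketch (the numbers are made up):

    # Why "all item numbers NOT in the system" is unanswerable without a bound:
    # the complement of a finite set is only finite relative to a finite universe.
    known = {1001, 1002, 1005, 1009}

    # Bounded version: perfectly reasonable.
    missing_in_range = sorted(set(range(1000, 1010)) - known)
    print(missing_in_range)  # [1000, 1003, 1004, 1006, 1007, 1008]

    # Unbounded version: every possible number not in `known` -- effectively
    # infinite, hence "the lifetime of the universe to print out".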
More "personal public information". (Score:1)
"Github is an open-source platform" (Score:2)
No, it isn't.
Imagine The Day (Score:2)
When the discussion you overhear in a cafe is "You put your *real* identity data online - Are You Crazy!?"
meh (Score:1)
My real personal data has been easy to find online since about 1997; I've had the same 2 or 3 usernames across about 30 different platforms since 1999, and I can count on one hand the passwords I've used with slight variations of symbols and numbers added, or not. Literally no one has ever messed with me, probably because I'm a nobody and I have nothing anyone would want. The only theft that's ever happened to me was some gamer managed to buy 30 dollars in Runescape gold via my bank account somehow, which b