

Reddit Sues Perplexity For Scraping Data To Train AI System (reuters.com) 35
An anonymous reader shares a report: Social media platform Reddit sued AI startup Perplexity in New York federal court on Wednesday, accusing it and three other companies of unlawfully scraping its data to train Perplexity's AI-based search engine. Reddit said in the complaint that the data-scraping companies circumvented its data protection measures in order to steal data that Perplexity "desperately needs" to power its "answer engine" system.
Why? (Score:2)
Re: (Score:3)
The AI is just being trained to follow those patterns. It's not about generating answers that are accurate; it's about generating something that looks like a human being might have written it.
Also, sadly, because Google is completely overrun by advertisements and low-quality bot traffic, Reddit is the best place to get accurate information outside of a handful of extremely specific specialty
Re: (Score:2)
LLMs Can Get "Brain Rot"!
https://arxiv.org/abs/2510.139... [arxiv.org]
They use data from X, but I think Reddit comments won't be much different. Outgoing links are a different topic, though.
Re: (Score:1)
Re: Why? (Score:2)
at least three lies (Score:2)
"...the data-scraping companies circumvented its data protection measures in order to steal data that Perplexity "desperately needs" to power its "answer engine" system."
There are at least 3 lies in that one sentence alone.
Re: (Score:3)
"...the data-scraping companies circumvented its data protection measures in order to steal data that Perplexity "desperately needs" to power its "answer engine" system."
Really, what would those be?
Arguably they are in breach of the CFAA. Reddit's robots.txt file:
# Welcome to Reddit's robots.txt
# Reddit believes in an open internet, but not the misuse of public content.
# See https://support.reddithelp.com... [reddithelp.com] Reddit's Public Content Policy for access and use restrictions to Reddit content.
# See https://www.reddit.com/r/reddi... [reddit.com] for details on how Reddit continues to support research and non-commercial use.
# policy: https://support.reddithelp.com... [reddithelp.com]
User-agent: *
Disallow: /
Re: (Score:1)
User-agent: *
Disallow: /
Hmmm. Based on that, anyone using any kind of a web browser shouldn't be viewing their web pages.
Which kind of takes the point out of having a web page...
Re: (Score:1)
Sure. If you completely ignore the context that it is only intended for automated crawlers and misread it as referring to everyone then, yes, I agree that would be ridiculous...
(It's also kind of ridiculous in these days to have an essentially-voluntary anti-bot system and expect it to do anything, but that's another topic.)
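For anyone who wants to see what those two directives mean to a crawler that chooses to honor them, here is a minimal sketch using Python's standard-library urllib.robotparser. The user-agent name "ExampleBot" and the helper crawler_may_fetch are made up for illustration, and the rules are the two lines quoted above rather than a live copy of Reddit's file. Nothing here touches the network.

```python
from urllib.robotparser import RobotFileParser

# The two directives quoted in the thread (not fetched from Reddit).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def crawler_may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Hypothetical helper: would a polite crawler be allowed to fetch url?"""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no HTTP request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is an invented crawler name.
    print(crawler_may_fetch("ExampleBot", "https://www.reddit.com/r/programming/"))
```

Note that the check only happens because the crawler runs it on itself, which is the "essentially-voluntary" part. A bot that never calls can_fetch is not stopped by anything in the file.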
Re: (Score:1)
I still don't see how there's a basis to complain. (Score:2)
Re: (Score:3)
Reddit is claiming copyright to the posts made by their users. Or at least that's the likely legal justification for this.
There are also a whole bunch of weird business laws we never think about that exist to protect businesses from other businesses. Basically stuff lobbyists put in place to protect the interests of their employers.
T
Re: (Score:2)
Re: (Score:3)
Normally I would agree with you that the AC is a troll. But then so is rsilvergun, who is constantly pushing a communist agenda. It doesn't really matter what the topic is, he'll find a way to 1. turn it into a negative 2. tie that negative into being a result of capitalism's failure. I do disagree with the AC on some points though. For one, he's not a Chinese living in Singapore. Guaranteed he would have been deported from Singapore if he set foot. He's not a Chinese agent either, they wouldn't be so obvio
Re: I still don't see how there's a basis to compl (Score:1)
Re: (Score:2)
I suppose they could make it about server resources, especially if they can show that being hammered by LLMs is degrading the service. "You read this the wrong way
Re: (Score:3)
The difference depends on context, of course.
Generally speaking there are several cases to consider:
(1) Site requires agreeing on terms of service before browser can access content. In this case, scraping is a clear violation.
(2) Site terms of service forbid scraping content, but human visitors can view content and ...
(2a) site takes technical measures to exclude bots. In this case scraping is a no-no, but for a different reason: it violates the Computer Fraud and Abuse Act.
(2b) site takes no technical m
Re: (Score:2)
It's easy enough to look at Reddit's robots.txt file:
# Welcome to Reddit's robots.txt
# Reddit believes in an open internet, but not the misuse of public content.
# See https://support.reddithelp.com... [reddithelp.com] Reddit's Public Content Policy for access and use restrictions to Reddit content.
# See https://www.reddit.com/r/reddi... [reddit.com] for details on how Reddit continues to support research and non-commercial use.
# policy: https://support.reddithelp.com... [reddithelp.com]
User-agent: *
Disallow: /
Re: (Score:2)
Right, but as I just said obeying a disallow directive isn't legally mandatory, so it doesn't mean much.
Re: (Score:1)
I am increasingly seeing the argument from this side.
Perplexity and its ilk are just a new kind of web browser that is acting as your agent to pull content from publicly available web sites.
Why isn't Reddit going after web browser makers?
I am thinking that the next escalation in this fight is simply going to be a plugin for your local web browser that your AI chatbot can proxy requests through, or just directly access content through a built-in browser.
Re: (Score:1)
Does Reddit themselves "own" that data? (Score:1)
Re: (Score:3)
Can't they just build the A.I.s to cite their sources whenever they output something that has a definite source, or are we past all that since they've already used all this content as training data?
If a Reddit post amounts to a human summary of a StackOverflow discussion, which itself is a compilation of forum posts on a discussion board and a Wordpress blogger, who got *their* information from man pages and error outputs...who do you cite? Each of them validates the others in order to minimize the amount of "SEO Blogger Spam" that also ended up in the meat grinder somewhere.
The problem with the meat grinder is that the whole point is essentially to make it impossible to trace sources to the point
Re: (Score:2)
Read the Reddit T&Cs, or that of any major website.
You give them a (sometimes limited) copyright permission to use your post by using the service to post anything publicly.
When is slashdot gonna sue? (Score:2)
When is slashdot gonna sue? We can't achieve superintelligence without scraping slashdot*.
*shove all the comments through an inverse function.
Re: (Score:2)
When is slashdot gonna sue?
I figure the new Cloudflare "I am not a robot" challenge has something to do with that. The first time I have to click the pictures with traffic lights, I'm done here.
DMCA part of complaint looks weak (Score:4, Interesting)
Reddit might have a good complaint about terms of service or CFAA or something. I don't know. But at least one part of their complaint looks like garbage:
Ah, DMCA, my old friend. Let's review some DMCA definitions from 1201(a)(3), but I'll add some emphasis:
It is here that I must mention that I happen to have a reddit account, and I am somewhat familiar with that website. And I never, ever authorized any technological measure to limit access to my posts/comments. That doesn't mean reddit can't do it, but reddit never asked me and I never authorized it, so whatever is being circumvented does not, therefore (by DMCA's own words), "effectively control access to a work" because the technological measure was never authorized by the copyright owner. I suspect that no reddit users have authorized this, or at most, only reddit employees have been ordered by their bosses to authorize it.
Furthermore, how do we know that the copyright owners don't authorize anyone to "avoid, bypass, remove, deactivate, or impair a technological measure" on their copyrighted works? I authorize people to do that. (Indeed, my Slashdot sig below is a reference to that.) I don't think I have ever said on reddit that I authorize it (the way I have done here on Slashdot) but if anyone (reddit?!?) ever bothers to ask me...
There seems to be some popular misunderstanding of DMCA, that it prohibits cracking DRM. But that's only true if the copyright owner authorized the DRM in the first place and also if they don't authorize cracking it. Neither of those two required conditions applies in this case.
Re: (Score:2)
What makes that illegal? (Score:2)
If Reddit provides something on the internet, people can access it. Perplexity doesn't really train either, but processes search results to create an answer that is *not* in the model itself.
Yeah, Reddit's stupid "network security" tries to block VPN users, but if they are unable to block Perplexity, it's not Perplexity's problem, is it? They can make Reddit login-only, then someone has to accept the ToS, but as long as it is freely available as long as your IP is not on a blacklist, it's just the open web.
Re: (Score:2)
That's not how copyright has ever worked, by the way.