Reddit Sues Perplexity For Scraping Data To Train AI System (reuters.com) 37

Posted by msmash on Wednesday October 22, 2025 @02:05PM from the tussle-continues dept.

An anonymous reader shares a report: Social media platform Reddit sued AI startup Perplexity in New York federal court on Wednesday, accusing it and three other companies of unlawfully scraping its data to train Perplexity's AI-based search engine. Reddit said in the complaint that the data-scraping companies circumvented its data protection measures in order to steal data that Perplexity "desperately needs" to power its "answer engine" system.

This discussion has been archived. No new comments can be posted.

Why? (Score:2)

by denny_deluxe ( 1693548 ) writes:

Most of the Reddit AI slop 'answers' it gives me in searches are wrong anyway. Let them have their poison data.
- Re: (Score:3)
  
  by rsilvergun ( 571051 ) writes:
  
  It's not about the quality of the content; it's about whether the content follows natural human language patterns.
  
  The AI is just being trained to follow those patterns. It's not about generating answers that are accurate; it's about generating something that looks like a human being might have written it.
  
  Also sadly because Google is completely overrun by advertisements and low quality bot traffic reddit is the best place to get accurate information outside of a handful of extremely specific specialty
  - Re: (Score:2)
    
    by allo ( 1728082 ) writes:
    
    LLMs Can Get "Brain Rot"!
    https://arxiv.org/abs/2510.139... [arxiv.org]
    They use data from X, but I think Reddit comments won't be much different. Outgoing links is a different topic, though.
  - Re: (Score:1)
    
    by buck-yar ( 164658 ) writes:
    
    Maybe the enshittification is a good thing? Back to doing productive things and off this bad-influence internet.
- Re: Why? (Score:2)
  
  by Lobotomy656 ( 7554372 ) writes:
  
  From what I understand if you feed AI data to any type of LLM you get model collapse. Now, if you feed those LLMs data from the internet that's post-ChatGPT it's going to be increasingly riddled with AI. From reddit to Wikipedia, people are going to post what AI spills out. Nobody solves this problem, nobody talks about it and this is crucial to scaling LLMs with more data. But it turns out you need to use data until 2022, everything newer is "contaminated".
at least three lies (Score:2)

by dfghjk ( 711126 ) writes:

"...the data-scraping companies circumvented its data protection measures in order to steal data that Perplexity "desperately needs" to power its "answer engine" system."
There are at least 3 lies in that one sentence alone.
- Re: (Score:3)
  
  by whoever57 ( 658626 ) writes:
  
  "...the data-scraping companies circumvented its data protection measures in order to steal data that Perplexity "desperately needs" to power its "answer engine" system."
  Really, what would those be?
  Arguably they are in breach of the CFAA. Reddit's robots.txt file:
  # Welcome to Reddit's robots.txt
  # Reddit believes in an open internet, but not the misuse of public content.
  # See https://support.reddithelp.com... [reddithelp.com] Reddit's Public Content Policy for access and use restrictions to Reddit content.
  # See https://www.reddit.com/r/reddi... [reddit.com] for details on how Reddit continues to support research and non-commercial use.
  # policy: https://support.reddithelp.com... [reddithelp.com]
  User-agent: *
  Disallow: /
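As an illustration of what those two directives mean to a crawler that chooses to honor them, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The `is_allowed` helper and the `ExampleBot` user-agent name are hypothetical, and the rules are copied from the excerpt above rather than fetched live. It shows only how a compliant client would read the file, not whether ignoring it has any legal consequence.

```python
from urllib.robotparser import RobotFileParser

# The two effective directives from the robots.txt excerpt quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler honoring robots_txt may fetch url.

    Hypothetical helper: parses the rules from a string instead of
    downloading them, so it runs without network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a made-up crawler name used for illustration.
    print(is_allowed("ExampleBot", "https://www.reddit.com/r/python/"))
```

Because the rule applies to every user agent and disallows the root path, a compliant crawler would skip every URL on the site. Nothing in the file technically prevents a non-compliant client from fetching the pages anyway.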
  - - Re: (Score:1)
      
      by retchdog ( 1319261 ) writes:
      
      Sure. If you completely ignore the context that it is only intended for automated crawlers and misread it as referring to everyone then, yes, I agree that would be ridiculous...
      (It's also kind of ridiculous in these days to have an essentially-voluntary anti-bot system and expect it to do anything, but that's another topic.)
  - Re: (Score:1)
    
    by SlydogSZ ( 675605 ) writes:
    
    Robots.txt isn't a legally binding document.
I still don't see how there's a basis to complain. (Score:2)

by sabbede ( 2678435 ) writes:

Something read public pages on reddit. So do I. What's the difference?
- Re: (Score:3)
  
  by rsilvergun ( 571051 ) writes:
  
  Generally courts have said that just because you put something out on the internet doesn't mean you give up copyright to it even if you make it publicly accessible.
  
  Reddit is claiming copyright to the posts made by their users. Or at least that's the likely legal justification for this.
  
  There are also a whole bunch of weird business laws we never think about that exists to protect businesses from other businesses. Basically stuff lobbyists put in place to protect the interests of their employers.
  
  T
  - - Re: (Score:2)
      
      by sabbede ( 2678435 ) writes:
      
      Listen, you AC, lobbing insults at someone I'm trying to talk to doesn't benefit anyone. If you want to sit there and waste our time, have the stones to put a name on it.
      - Re: (Score:2, Funny)
        
        by backslashdot ( 95548 ) writes:
        
        Normally I would agree with you that the AC is a troll. But then so is rsilvergun, who is constantly pushing a communist agenda. It doesn't really matter what the topic is, he'll find a way to 1. turn it into a negative 2. tie that negative into being a result of capitalism's failure. I do disagree with the AC on some points though. For one, he's not a Chinese living in Singapore. Guaranteed he would have been deported from Singapore if he set foot. He's not a Chinese agent either, they wouldn't be so obvio
    - Re: I still don't see how there's a basis to compl (Score:1)
      
      by Disco Ninja ( 7135795 ) writes:
      
      Well actually courts also enforce laws so it seems you do not understand English but not a surprise since you are American. Within the first paragraph, congress dot gov states... several federal district courts have enjoined enforcement of some of the challenged policies. https://www.congress.gov/crs-p... [congress.gov]
  - Re: (Score:2)
    
    by sabbede ( 2678435 ) writes:
    
    Yeah, but how is a copyright being violated? If Reddit can claim they have a copyright on user posts (which seems silly and like they'd screw themselves out of sec.230), how are they going to show those copyrights were violated? Is Perplexity's AI presenting exact copies as its own work, or is it producing inherently non-infringing summaries?
    I suppose they could make it about server resources, especially if they can show that being hammered by LLMs is degrading the service. "You read this the wrong way
- Re: (Score:3)
  
  by hey! ( 33014 ) writes:
  
  The difference depends on context, of course.
  Generally speaking there are several cases to consider:
  (1) Site requires agreeing on terms of service before browser can access content. In this case, scraping is a clear violation.
  (2) Site terms of service forbid scraping content, but human visitors can view content and ...
  (2a) site takes technical measures to exclude bots. In this case scraping is a no-no, but for a different reason: it violates the Computer Fraud and Abuse Act.
  (2b) site takes no technical m
  - Re: (Score:2)
    
    by whoever57 ( 658626 ) writes:
    
    It's easy enough to look at Reddit's robots.txt file:
    # Welcome to Reddit's robots.txt
    # Reddit believes in an open internet, but not the misuse of public content.
    # See https://support.reddithelp.com... [reddithelp.com] Reddit's Public Content Policy for access and use restrictions to Reddit content.
    # See https://www.reddit.com/r/reddi... [reddit.com] for details on how Reddit continues to support research and non-commercial use.
    # policy: https://support.reddithelp.com... [reddithelp.com]
    User-agent: *
    Disallow: /
    - Re: (Score:2)
      
      by hey! ( 33014 ) writes:
      
      Right, but as I just said obeying a disallow directive isn't legally mandatory, so it doesn't mean much.
- Re: (Score:1)
  
  by The-Ixian ( 168184 ) writes:
  
  I am increasingly seeing the argument from this side.
  Perplexity and its ilk are just a new kind of web browser that is acting as your agent to pull content from publicly available web sites.
  Why isn't Reddit going after web browser makers?
  I am thinking that the next escalation in this fight is simply going to be a plugin for your local web browser that your AI chatbot can proxy requests through, or just directly access content through a built-in browser.
- Re: (Score:1)
  
  by timeOday ( 582209 ) writes:
  
  An additional level of this is that Reddit is itself crowdsourced. They didn't pay anybody to write their content. I'm more sympathetic to something like the Encyclopedia Britannica or NYT or the movie studios. (Although even there I agree it's still debatable).
Does Reddit themselves "own" that data? (Score:1)

by PaddirN ( 567657 ) writes:

Since all of that data comes entirely from posts by users, can reddit itself claim to own any of the information that they have on their website (outside of whatever stupid TOS crap they have that says whatever you post is theirs)? Since the public are by and large the originators of all of their content, it's not like they put in the work for that content that Perplexity and others are scraping. The bigger issue it seems like is the lack of attribution, with Perplexity and others frequently not citing whe
- Re: (Score:3)
  
  by Voyager529 ( 1363959 ) writes:
  
  Can't they just build the A.I.s to cite their sources whenever they output something that has a definite source, or are we past all that since they've already used all this content as training data?
  If a Reddit post amounts to a human summary of a StackOverflow discussion, which itself is a compilation from forum posts on a discussion board and a Wordpress blogger, who got *their* information from man pages and error outputs...who do you cite? Each of them validates the others in order to minimize the amount of "SEO Blogger Spam" that also ended up in the meat grinder somewhere.
  The problem with the meat grinder is that the whole point is essentially to make it impossible to trace sources to the point
- Re: (Score:2)
  
  by ledow ( 319597 ) writes:
  
  Read the Reddit T&Cs, or that of any major website.
  You give them a (sometimes limited) copyright permission to use your post by using the service to post anything publicly.
When is slashdot gonna sue? (Score:2)

by backslashdot ( 95548 ) writes:

When is slashdot gonna sue? We can't achieve superintelligence without scraping slashdot*.
*shove all the comments through an inverse function.
- Re: (Score:2)
  
  by CubicleZombie ( 2590497 ) writes:
  
  When is slashdot gonna sue?
  I figure the new Cloudflare "I am not a robot" challenge has something to do with that. The first time I have to click the pictures with traffic lights, I'm done here.
DMCA part of complaint looks weak (Score:4, Interesting)

by Sloppy ( 14984 ) writes: on Wednesday October 22, 2025 @03:22PM (#65743750) Homepage Journal

Reddit might have a good complaint about terms of service or CFAA or something. I don't know. But at least one part of their complaint looks like garbage:
7. Congress has enacted laws to prevent exactly what Defendants are doing:
circumventing or bypassing technological measures that effectively control access to copyrighted
works. See Digital Millennium Copyright Act, 17 U.S.C. 1201, et seq. Each of the Defendants
in this action is profiting by evading technological control measures to access Reddit data it
knows it does not have permission to access or use. Because Reddit has always believed in the
open internet, it takes its role as a steward of its users’ communities, discussions, and authentic
human discourse seriously. Through this action, Reddit seeks to end Defendants’ circumvention
of security measures protecting Reddit data, blatant misuse of Reddit content, and disrespect for
its users’ rights, all of which harm Reddit and its hundreds of thousands of authentic human
communities.
Ah, DMCA, my old friend. Let's review some DMCA definitions from 1201(a)(3), but I'll add some emphasis:
(3) As used in this subsection—
(A) to “circumvent a technological measure” means to descramble a scrambled work, to decrypt an encrypted work, or otherwise to avoid, bypass, remove, deactivate, or impair a technological measure, without the authority of the copyright owner; and
(B) a technological measure “effectively controls access to a work” if the measure, in the ordinary course of its operation, requires the application of information, or a process or a treatment, with the authority of the copyright owner, to gain access to the work.
It is here that I must mention that I happen to have a reddit account, and I am somewhat familiar with that website. And I never, ever authorized any technological measure to limit access to my posts/comments. That doesn't mean reddit can't do it, but reddit never asked me and I never authorized it, so whatever is being circumvented does not, therefore (by DMCA's own words), "effectively control access to a work" because the technological measure was never authorized by the copyright owner. I suspect that no reddit users have authorized this, or at most, only reddit employees have been ordered by their bosses to authorize it.
Furthermore, how do we know that the copyright owners don't authorize anyone to "avoid, bypass, remove, deactivate, or impair a technological measure" protecting their copyrighted works? I authorize people to do that. (Indeed, my Slashdot sig below is a reference to that.) I don't think I have ever said on reddit that I authorize it (the way I have done here on Slashdot) but if anyone (reddit?!?) ever bothers to ask me...
There seems to be some popular misunderstanding of DMCA, that it prohibits cracking DRM. But that's only true if the copyright owner authorized the DRM in the first place and also if they don't authorize cracking it. Neither of those two required conditions applies in this case.

- Re: (Score:2)
  
  by aaronb1138 ( 2035478 ) writes:
  
  If they have the technology to descramble the average reddit post, they're already sitting on trillion-dollar AGI tech.
What makes that illegal? (Score:2)

by allo ( 1728082 ) writes:

If Reddit provides something on the internet, people can access it. Perplexity doesn't really train either, but processes search results to create an answer that is *not* in the model itself.
Yeah, Reddit's stupid "network security" tries to block VPN users, but if they are unable to block Perplexity, it's not Perplexity's problem, is it? They can make Reddit login-only, then someone has to accept the ToS, but as long as it is freely available as long as your IP is not on a blacklist, it's just the open web.
- Re: (Score:2)
  
  by ledow ( 319597 ) writes:
  
  That's not how copyright has ever worked, by the way.
  - Re: (Score:2)
    
    by allo ( 1728082 ) writes:
    
    That's not what copyright works like for consumption. But LLM training doesn't need licenses because it is transformative use (There are ongoing court cases, but up to now no judge ruled otherwise), so you are allowed to use anything you can legally obtain. That's why Anthropic got into trouble not for training, but for torrenting.
    - Re: (Score:2)
      
      by ledow ( 319597 ) writes:
      
      One jurisdiction does not form international law.
