AI The Courts

Reddit Sues Perplexity For Scraping Data To Train AI System (reuters.com) 35

An anonymous reader shares a report: Social media platform Reddit sued AI startup Perplexity in New York federal court on Wednesday, accusing it and three other companies of unlawfully scraping its data to train Perplexity's AI-based search engine. Reddit said in the complaint that the data-scraping companies circumvented its data protection measures in order to steal data that Perplexity "desperately needs" to power its "answer engine" system.

  • Most of the Reddit AI slop 'answers' it gives me in searches are wrong anyway. Let them have their poison data.
    • It's not about the quality of the content; it's about whether the content follows natural human language patterns.

      The AI is just being trained to follow those patterns. It's not about generating answers that are accurate; it's about generating something that looks like a human being might have written it.

      Also sadly because Google is completely overrun by advertisements and low quality bot traffic reddit is the best place to get accurate information outside of a handful of extremely specific specialty
    • From what I understand if you feed AI data to any type of LLM you get model collapse. Now, if you feed those LLMs data from the internet that's post-ChatGPT it's going to be increasingly riddled with AI. From reddit to Wikipedia, people are going to post what AI spills out. Nobody solves this problem, nobody talks about it and this is crucial to scaling LLMs with more data. But it turns out you need to use data until 2022, everything newer is "contaminated".
  • "...the data-scraping companies circumvented its data protection measures in order to steal data that Perplexity "desperately needs" to power its "answer engine" system."

    There are at least 3 lies in that one sentence alone.

    • "...the data-scraping companies circumvented its data protection measures in order to steal data that Perplexity "desperately needs" to power its "answer engine" system."

      Really, what would those be?

      Arguably they are in breach of the CFAA. Reddit's robots.txt file:
      # Welcome to Reddit's robots.txt
      # Reddit believes in an open internet, but not the misuse of public content.
      # See https://support.reddithelp.com... [reddithelp.com] Reddit's Public Content Policy for access and use restrictions to Reddit content.
      # See https://www.reddit.com/r/reddi... [reddit.com] for details on how Reddit continues to support research and non-commercial use.
      # policy: https://support.reddithelp.com... [reddithelp.com]

      User-agent: *
      Disallow: /

      • User-agent: *
        Disallow: /

        Hmmm. Based on that, anyone using any kind of a web browser shouldn't be viewing their web pages.

        Which kind of takes the point out of having a web page...

        • Sure. If you completely ignore the context that it is only intended for automated crawlers and misread it as referring to everyone then, yes, I agree that would be ridiculous...

          (It's also kind of ridiculous in these days to have an essentially-voluntary anti-bot system and expect it to do anything, but that's another topic.)

      • Robots.txt isn't a legally binding document.
  • Something read public pages on reddit. So do I. What's the difference?
    • Generally courts have said that just because you put something out on the internet doesn't mean you give up copyright to it even if you make it publicly accessible.

      Reddit is claiming copyright to the posts made by their users. Or at least that's the likely legal justification for this.

      There are also a whole bunch of weird business laws we never think about that exist to protect businesses from other businesses. Basically stuff lobbyists put in place to protect the interests of their employers.

      T
      • Yeah, but how is a copyright being violated? If Reddit can claim they have a copyright on user posts (which seems silly and like they'd screw themselves out of sec.230), how are they going to show those copyrights were violated? Is Perplexity's AI presenting exact copies as its own work, or is it producing inherently non-infringing summaries?

        I suppose they could make it about server resources, especially if they can show that being hammered by LLMs is degrading the service. "You read this the wrong way

    • by hey! ( 33014 )

      The difference depends on context, of course.

      Generally speaking there are several cases to consider:

      (1) Site requires agreeing on terms of service before browser can access content. In this case, scraping is a clear violation.

      (2) Site terms of service forbid scraping content, but human visitors can view content and ...
      (2a) site takes technical measures to exclude bots. In this case scraping is a no-no, but for a different reason: it violates the Computer Fraud and Abuse Act.
      (2b) site takes no technical m

    • I am increasingly seeing the argument from this side.

      Perplexity and its ilk are just a new kind of web browser that is acting as your agent to pull content from publicly available web sites.

      Why isn't Reddit going after web browser makers?

      I am thinking that the next escalation in this fight is simply going to be a plugin for your local web browser that your AI chatbot can proxy requests through, or just directly access content through a built-in browser.

    • An additional level of this is that Reddit is itself crowdsourced. They didn't pay anybody to write their content. I'm more sympathetic to something like the Encyclopedia Britannica or NYT or the movie studios. (Although even there I agree it's still debatable).
  • Since all of that data comes entirely from posts by users, can reddit itself claim to own any of the information that they have on their website (outside of whatever stupid TOS crap they have that says whatever you post is theirs)? Since the public are by and large the originators of all of their content, it's not like they put in the work for that content that Perplexity and others are scraping. The bigger issue it seems like is the lack of attribution, with Perplexity and others frequently not citing whe
    • Can't they just build the A.I.s to cite their sources whenever they output something that has a definite source, or are we past all that since they've already used all this content as training data?

      If a Reddit post amounts to a human summary of a StackOverflow discussion, which itself is a compilation from forum posts on a discussion board and a Wordpress blogger, who got *their* information from man pages and error outputs...who do you cite? Each of them validates the others in order to minimize the amount of "SEO Blogger Spam" that also ended up in the meat grinder somewhere.

      The problem with the meat grinder is that the whole point is essentially to make it impossible to trace sources to the point

    • by ledow ( 319597 )

      Read the Reddit T&Cs, or that of any major website.

      You give them a (sometimes limited) copyright permission to use your post by using the service to post anything publicly.

  • When is slashdot gonna sue? We can't achieve superintelligence without scraping slashdot*.

    *shove all the comments through an inverse function.

    • When is slashdot gonna sue?

      I figure the new Cloudflare "I am not a robot" challenge has something to do with that. The first time I have to click the pictures with traffic lights, I'm done here.

  • by Sloppy ( 14984 ) on Wednesday October 22, 2025 @03:22PM (#65743750) Homepage Journal

    Reddit might have a good complaint about terms of service or CFAA or something. I don't know. But at least one part of their complaint looks like garbage:

    7. Congress has enacted laws to prevent exactly what Defendants are doing:
    circumventing or bypassing technological measures that effectively control access to copyrighted
    works. See Digital Millennium Copyright Act, 17 U.S.C. 1201, et seq. Each of the Defendants
    in this action is profiting by evading technological control measures to access Reddit data it
    knows it does not have permission to access or use. Because Reddit has always believed in the
    open internet, it takes its role as a steward of its users’ communities, discussions, and authentic
    human discourse seriously. Through this action, Reddit seeks to end Defendants’ circumvention
    of security measures protecting Reddit data, blatant misuse of Reddit content, and disrespect for
    its users’ rights, all of which harm Reddit and its hundreds of thousands of authentic human
    communities.

    Ah, DMCA, my old friend. Let's review some DMCA definitions from 1201(a)(3), but I'll add some emphasis:

    (3) As used in this subsection—
    (A) to “circumvent a technological measure” means to descramble a scrambled work, to decrypt an encrypted work, or otherwise to avoid, bypass, remove, deactivate, or impair a technological measure, without the authority of the copyright owner; and

    (B) a technological measure “effectively controls access to a work” if the measure, in the ordinary course of its operation, requires the application of information, or a process or a treatment, with the authority of the copyright owner, to gain access to the work.

    It is here that I must mention that I happen to have a reddit account, and I am somewhat familiar with that website. And I never, ever authorized any technological measure to limit access to my posts/comments. That doesn't mean reddit can't do it, but reddit never asked me and I never authorized it, so whatever is being circumvented does not, therefore (by DMCA's own words), "effectively control access to a work" because the technological measure was never authorized by the copyright owner. I suspect that no reddit users have authorized this, or at most, only reddit employees have been ordered by their bosses to authorize it.

    Furthermore, how do we know that the copyright owners don't authorize anyone to "avoid, bypass, remove, deactivate, or impair a technological measure" on their copyrighted works? I authorize people to do that. (Indeed, my Slashdot sig below is a reference to that.) I don't think I have ever said on reddit that I authorize it (the way I have done here on Slashdot) but if anyone (reddit?!?) ever bothers to ask me...

    There seems to be some popular misunderstanding of DMCA, that it prohibits cracking DRM. But that's only true if the copyright owner authorized the DRM in the first place and also if they don't authorize cracking it. Neither of those two required conditions applies in this case.

    • If they have the technology to descramble the average reddit post, they're already sitting on trillion-dollar AGI tech.
  • If Reddit provides something on the internet, people can access it. Perplexity doesn't really train either, but processes search results to create an answer that is *not* in the model itself.
    Yeah, Reddit's stupid "network security" tries to block VPN users, but if they are unable to block Perplexity, it's not Perplexity's problem, is it? They can make Reddit login only, then someone has to accept ToS, but as long as it is freely available as long as your IP is not on a blacklist, it's just the open web.
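
For readers following the robots.txt sub-thread, here is a minimal sketch of how a well-behaved crawler might interpret the two directives quoted there (`User-agent: *` / `Disallow: /`), using Python's standard-library `urllib.robotparser`. The helper name `is_allowed`, the user-agent strings, and the example URL are illustrative assumptions, not anything taken from Reddit or Perplexity. The sketch only shows what the file asks of software that chooses to consult it; it says nothing about enforcement or legality.

```python
from urllib.robotparser import RobotFileParser

# The two directives quoted in the thread (comment lines omitted).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Assumed example URL and agent names, chosen only for illustration.
    url = "https://www.reddit.com/r/example/"
    for agent in ("ExampleBot", "Mozilla/5.0"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

The parser gives the same answer for any user-agent string, including a browser-like one, because the file itself does not distinguish people from bots. The distinction comes from the convention that only automated crawlers consult robots.txt, and compliance with the result is voluntary.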
