Copyright Law | Data Ownership | Technology
Introduction: A Fight for the Future of AI Data
In a landmark case that could redefine the boundaries of data ownership in the age of artificial intelligence, Reddit, Inc. has filed suit against Perplexity AI, Inc., accusing the startup of “industrial-scale scraping” of Reddit’s user-generated content to train and operate its AI-powered “answer engine.”
Filed in the U.S. District Court for the Southern District of New York on October 22, 2025, the complaint alleges that Perplexity and several data-scraping intermediaries—including Oxylabs UAB, AWMProxy, and SerpApi, Inc.—violated copyright law, anti-hacking statutes, and unfair-competition principles by illicitly harvesting Reddit content.
Reddit’s complaint characterizes the conduct as “systematic theft” of data that Perplexity “desperately needs” to compete in the AI marketplace.
The Allegations: Scraping, Circumvention, and Commercial Use
According to court filings and statements reported by Reuters and The Verge, Reddit alleges that Perplexity:
- Bypassed technical barriers (including rate limits, CAPTCHAs, and API restrictions) to harvest massive volumes of Reddit posts and comments;
- Used third-party proxy and data-scraping firms to conceal its identity and avoid detection;
- Integrated the scraped material into its AI “answer engine,” which provides detailed, human-like responses to user queries; and
- Profited commercially from that data while refusing to enter a licensing agreement with Reddit.
The complaint asserts that Reddit sent Perplexity a cease-and-desist letter in May 2024, yet instances of Perplexity citing Reddit content allegedly increased forty-fold afterward. Reddit seeks monetary damages, disgorgement of profits, and injunctive relief to bar further unauthorized access and use of its data.
Perplexity, for its part, has denied wrongdoing, claiming it “will always fight vigorously for users’ rights to freely and fairly access public knowledge.” The company maintains that it does not train its foundational models on Reddit data.
Legal Claims and Theories
1. Copyright Infringement and DMCA Violations
Reddit asserts that its database of user content constitutes a protected compilation under the U.S. Copyright Act. The defendants’ alleged scraping and reproduction of that content for commercial purposes, Reddit says, amounts to direct and contributory infringement.
Additionally, the complaint invokes the anti-circumvention provisions of the Digital Millennium Copyright Act (17 U.S.C. § 1201), arguing that the defendants bypassed technological measures designed to control access to Reddit’s copyrighted material.
2. Breach of Contract and Computer Fraud
Reddit claims that by using automated tools in violation of its Terms of Service, the defendants breached their contract with Reddit and may also have accessed its systems without authorization under the Computer Fraud and Abuse Act (CFAA). Courts have been divided on whether data scraping violates the CFAA. In the most notable case, hiQ Labs, Inc. v. LinkedIn Corp. (9th Cir. 2022), the Ninth Circuit held that scraping publicly accessible data likely does not amount to access “without authorization.” Reddit’s complaint aims to push the boundary toward stricter enforcement.
3. Unfair Competition and Unjust Enrichment
Reddit alleges that Perplexity gained an unfair commercial advantage by exploiting Reddit’s high-quality, human-generated discussions to improve its AI product—without bearing the cost of acquiring or licensing that data. The complaint describes this as “unjust enrichment through misappropriation of proprietary data assets.”
Legal and Policy Context
The lawsuit comes more than a year after Reddit announced multi-million-dollar data-licensing deals with Google and OpenAI in 2024, which cemented its content as one of the most valuable sources for natural-language AI training.
By contrast, Perplexity has positioned itself as a “search alternative,” relying on data aggregation to fuel its conversational AI. The company reportedly raised over $100 million in funding in 2025 and has aggressively marketed itself as a privacy-respecting, ad-free search experience.
This case therefore touches on several critical legal and policy questions:
- Who owns publicly visible data—the platform, the user, or the public?
- Can scraping of public websites be legally restricted, even when no login or paywall is involved?
- Does fair use protect the ingestion of online text by AI models, or does it constitute commercial exploitation requiring a license?
Potential Defenses
Perplexity and its co-defendants are expected to raise several key defenses:
- Public Availability: The data at issue was publicly accessible without authentication, making it fair game for automated collection.
- Fair Use: Even if copyright applies, the use was transformative—repurposed to generate insights rather than substitute for Reddit’s content.
- No Proven Damages: Reddit has not shown the actual harm, lost licensing revenue, or competitive injury it must prove to recover damages.
- Lack of Standing: Because Reddit’s users, not Reddit itself, authored the posts, Reddit holds only the broad license granted under its user agreement, which may limit its standing to sue for copyright infringement.
The Broader Stakes: Data Governance in the AI Economy
Legal observers note that Reddit v. Perplexity could become a bellwether case for the AI industry.
If Reddit prevails, platforms may gain strong legal backing to monetize their data through exclusive licensing deals and to block scraping via litigation. AI startups could face mounting costs or reduced access to training material, which would consolidate power among firms that can afford paid data.
If Perplexity wins, the decision could reinforce the principle that publicly accessible data remains part of the open internet, potentially insulating AI firms from a wave of content-owner lawsuits.
“This case sits at the heart of the tension between open access and proprietary control,” says IP attorney Dr. Marcus Feldman of Columbia Law School. “The court’s treatment of fair use, circumvention, and contract enforcement could set the tone for the next decade of AI litigation.”
Conclusion: A Defining Case for the AI Era
At its core, Reddit v. Perplexity is about more than web scraping—it is about who controls the raw material of intelligence itself. Reddit argues that without legal safeguards, companies like Perplexity will profit from unpaid labor and erode incentives for content creation. Perplexity counters that restricting access to public information undermines the very openness that built the internet.
Whatever the outcome, this case will likely shape the emerging body of AI data jurisprudence, determining whether data is treated as a commons to be shared—or a commodity to be licensed and defended.
The first pretrial conference is expected later this year, and the tech world will be watching closely. As generative AI expands, the law is being forced to answer an unprecedented question: Where does the public internet end—and private property begin?