Technology Law

Is Your Site’s Robots.txt Giving Content to AI Models for Free?

The Robots Exclusion Protocol file, colloquially known as robots.txt, does just what its name suggests: it instructs automated bots whether and how they may access a website. As several recent lawsuits show, this unassuming text file, with origins in the early internet, is proving central to strategies for protecting valuable online content and negotiating AI licensing deals.

Historically, robots.txt’s purpose was to help search engines index sites appropriately so they would appear in relevant search results. Now, it’s on the front lines of the AI training wars. Businesses should audit their robots.txt and terms of service to make sure this text file supports efforts to protect content—not offer it up on a silver platter.

What Is the Robots Exclusion Protocol?

Robots.txt is a simple text file at the root of your domain (example.com/robots.txt) that communicates with the legions of automated bots crawling the internet. It can:

  • Tell bots whether they can access the site at all.

  • Limit which parts of the site they may access.

  • Point them to a sitemap to facilitate indexing.

  • Address specific bots by name (e.g., a particular search engine’s or AI’s crawler) with tailored instructions.

For example, robots.txt could exclude a specific bot like so:

User-agent: CCBot
Disallow: /
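
To illustrate the other capabilities listed above, a slightly fuller file might combine path-level limits for all bots, tailored instructions for a named AI crawler, and a sitemap pointer. The directives below are a generic sketch: example.com is a placeholder, and the named user agents are illustrative, since each crawler operator publishes its own tokens.

User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml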

As bots that hoover up internet data have proliferated, robots.txt has become a way to “blacklist” specific bots or categories of bots—functioning as a “no trespassing” sign for automated agents. Coordinated with a website’s terms of service, it can be a defense against unauthorized use of protected content—or, if silent or overly permissive, an unintended welcome mat.

From Implied License to Offensive Use

Until recently, case law involving robots.txt mainly addressed implied license questions: when a site’s robots.txt did not block a search engine or other bot, did that silence imply permission to access content? Field v. Google, 412 F. Supp. 2d 1106 (D. Nev. 2006), for example, treated the absence of disallow rules as evidence that the website owner had implicitly consented to search indexing and caching. Robots.txt’s primary legal role was defensive.

The new generation of AI‑scraping cases has turned that dynamic around. Online companies are now using robots.txt affirmatively, alongside terms of service and other technical measures, to argue that AI companies lacked authorization to scrape, knew they lacked authorization, and yet chose to proceed anyway. Two main theories are emerging: (1) circumvention in violation of the Digital Millennium Copyright Act (“DMCA”) and (2) breach of contract.

Theory No. 1: DMCA Anti-Circumvention

Early signals from courts show vulnerabilities in the DMCA theory, but there is a potential workaround. In Ziff Davis v. OpenAI, No. 1:25-cv-04315 (S.D.N.Y. 2025), Ziff Davis alleges that OpenAI’s bot ignored Ziff Davis’s robots.txt and that doing so violated the DMCA’s anti‑circumvention rules (17 U.S.C. § 1201).

The court disagreed. On a motion to dismiss, it ruled that robots.txt is not a “technological measure that effectively controls access” to copyrighted works because it is more akin to a sign than a barrier: compliance is effectively voluntary, since a bot must be configured to look for and follow the file’s instructions. This ruling will not be the last word, however.
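
To see why the court described robots.txt as a sign rather than a barrier, it helps to look at what honoring the file actually involves on the crawler’s side. The sketch below is illustrative only; it uses Python’s standard urllib.robotparser and a placeholder domain, and the point is that the check happens only because the crawler’s own code chooses to perform it.

from urllib import robotparser

# A well-behaved crawler fetches and parses the site's robots.txt itself...
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder domain
rp.read()

# ...and consults it before requesting a page.
if rp.can_fetch("CCBot", "https://example.com/articles/page-1"):
    print("robots.txt permits this fetch")
else:
    print("robots.txt disallows this fetch; a compliant crawler stops here")

# Nothing in the protocol forces this step: a crawler that never performs the
# check can simply request the page anyway.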

Reddit’s lawsuit against SerpApi and others augments robots.txt with other measures. Reddit v. SerpApi (S.D.N.Y. 2025). Reddit’s theory potentially cures the Ziff Davis problem by pointing to a third-party technological measure rather than relying on robots.txt alone. Reddit alleges that the defendants bypassed not just Reddit’s own anti-scraping measures (including robots.txt) but also a search engine’s technological protections for snippets of Reddit’s content in search results. Those search-engine access controls are the bypassed “technological measure,” and robots.txt mainly serves as evidence that the scrapers knew Reddit did not authorize access to its content.

The lesson: robots.txt alone probably won’t support a DMCA claim under current case law, but it remains evidence of lack of authorization when combined with other technical measures.

Theory No. 2: Breach of Contract

An alternative path for robots.txt lies in contract law. In another Reddit lawsuit, this one against Anthropic, Reddit’s lead claim is for breach of its terms of service. Reddit v. Anthropic, No. 3:25-cv-05643 (N.D. Cal. 2025). It is not directly a copyright claim. Reddit alleges that its terms of service prohibit scraping and certain automated uses that conflict with its robots.txt and its approved APIs. Reddit further alleges that bots accessing Reddit are subject to those terms, and Anthropic knowingly breached those terms.

A central issue here will be whether Anthropic assented to Reddit’s terms of service. Courts have been highly skeptical of browsewrap agreements like the one Reddit alleges, especially in the consumer protection context. But Reddit’s theory—that a sophisticated commercial AI company with actual notice cannot claim ignorance—may fare better.

Reddit’s theory gains force from allegations that it offers an authorized alternative to scraping: its API. When a company provides a legitimate access channel and a scraper bypasses it, courts may view that as evidence of a knowing violation rather than innocent access of public content.

A Checklist for Auditing Your Robots.txt File

For any business with valuable web content or AI licensing ambitions, robots.txt is no longer purely operational—it’s part of your legal posture. Consider:

  1. Audit your robots.txt file (a short audit sketch follows this checklist).

    • Does it block or restrict scraping and AI crawlers by name?

    • Or is it effectively silent, potentially signaling openness to all bots?

  2. Align it with your terms of service.

    • Expressly prohibit unlicensed scraping and AI training uses.

    • Cross-reference robots.txt and API terms (e.g., provide that access inconsistent with robots.txt or outside approved APIs is prohibited).

  3. Support it with logging and licensing structure.

    • Keep records that could show a particular bot ignored robots.txt.

    • If you have, or want, AI licensing deals, make sure you can tell a coherent story about authorized channels (APIs, feeds, search‑snippet licenses) versus unauthorized scraping.
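
As a starting point for item 1 (the sketch promised above), a short script can show whether your robots.txt says anything at all about commonly discussed AI crawlers. This is a hedged illustration using Python’s standard urllib.robotparser: example.com is a placeholder, and the bot names are examples rather than an exhaustive or current list, so confirm each operator’s user-agent tokens from its own documentation.

from urllib import robotparser

SITE = "https://example.com"  # replace with your domain
AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]  # illustrative names only

rp = robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()

for bot in AI_BOTS:
    if rp.can_fetch(bot, SITE + "/"):
        print(bot + ": not restricted (robots.txt is silent or permissive for this bot)")
    else:
        print(bot + ": disallowed")

A “not restricted” result for a crawler you never intended to welcome is the kind of gap this checklist is meant to surface; checking your server logs for the same bot names can then support item 3 by showing whether a disallowed bot kept requesting pages anyway.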

Robots.txt will not, by itself, make scrapers liable under the DMCA or a site’s terms of service. But it can be powerful evidence that no implied license to crawl or train exists, and that scrapers who cross the line do so at their own peril.

Tags

technology law updates