As the pace of state AI governance legislation stabilizes, training data transparency has emerged as a regulatory focus. California’s AB 2013 provides an early framework, requiring businesses that develop, substantially modify, or, in some cases, deploy public-facing generative AI systems to satisfy significant disclosure and documentation obligations beginning January 1, 2026. Given the potential for private litigation, businesses hosting chatbots and other consumer-facing AI services should ensure required notices are compliant before the new year.
AB 2013 requires developers of generative AI systems or services to post detailed documentation on their websites about the data used to train those systems. “Developer” includes any entity that designs, codes, produces, or substantially modifies an AI system or service for use by members of the public. A “substantial modification” includes any new version, release, or other update that materially changes functionality or performance, including changes resulting from retraining or fine-tuning. This broad framework extends the law beyond traditional model developers: any entity that develops a fine-tuned or derivative model—and, in some cases, even a wrapper—on top of a licensed frontier model may fall within scope.
Businesses subject to the law must publish a high-level summary on their website (the statute does not prescribe a specific location) covering the following:
- The sources or owners of the datasets.
- A description of how the datasets further the intended purpose of the AI system or service.
- The number of data points in the datasets, which may be expressed as general ranges, with estimated figures for dynamic datasets.
- A description of the types of data points within the datasets. For labeled datasets, this means the types of labels used; for unlabeled datasets, it means the general characteristics.
- Whether the datasets include any data protected by copyright, trademark, or patent, or whether the datasets are entirely in the public domain.
- Whether the datasets were purchased or licensed by the developer.
- Whether the datasets include personal information, as defined in CCPA § 1798.140(v).
- Whether the datasets include aggregate consumer information, as defined in CCPA § 1798.140(b).
- Whether the developer cleaned, processed, or otherwise modified the datasets, including the intended purpose of those efforts in relation to the AI system or service.
- The time period during which the data in the datasets were collected, including a notice if the data collection is ongoing.
- The dates the datasets were first used during the development of the AI system or service.
- Whether the system or service used or continuously uses synthetic data generation in its development, with an optional description of the functional purpose of the synthetic data.
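For compliance teams translating the statutory list above into an internal process, the elements can be tracked as a simple per-dataset checklist. The sketch below is a hypothetical Python structure — the field names are our own shorthand, not terms prescribed by AB 2013 or any regulator — that a team might use to flag datasets whose documentation is incomplete:

```python
# Hypothetical checklist of AB 2013 disclosure elements, one record per dataset.
# Field names are illustrative only; the statute does not prescribe a format.
REQUIRED_FIELDS = [
    "sources_or_owners",
    "purpose_description",
    "data_point_count",                  # may be a general range or estimate
    "data_point_types",                  # label types, or general characteristics if unlabeled
    "ip_protected_or_public_domain",     # copyright, trademark, patent, or public domain
    "purchased_or_licensed",
    "contains_personal_information",     # as defined in CCPA § 1798.140(v)
    "contains_aggregate_consumer_info",  # as defined in CCPA § 1798.140(b)
    "cleaning_or_processing",            # including the intended purpose of those efforts
    "collection_period",                 # note if collection is ongoing
    "first_used_date",
    "synthetic_data_use",                # optional description of functional purpose
]

def missing_disclosures(record: dict) -> list[str]:
    """Return the required disclosure elements absent or empty in a dataset record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
```

A record with every field populated returns an empty list; any gap surfaces the missing element by name, which can feed an internal remediation workflow before the summary is published.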
These requirements are likely to apply retrospectively, meaning developers should ensure they can document and disclose historical training data practices.
Legislative analysis indicates that AB 2013 is enforceable through California’s Unfair Competition Law (UCL), enabling both public and private enforcement. If consumers or businesses can satisfy the UCL’s standing and injury requirements, AB 2013 may drive an increase in AI-related private litigation.
