AI Training Data, Copyright, and Transparency: A Comparative Legal Analysis

The question is how copyright law treats copyrighted works used as AI training data; how licenses should allocate rights in training, fine-tuning, embeddings, retrieval, model weights, and outputs; who can own AI-generated or AI-assisted outputs; and how training-data transparency duties or litigation disclosure obligations can be reconciled with trade-secret protection.

The answer is jurisdiction-specific. In the United States, the central training-data issue is usually whether intermediate copying and model-development uses are infringing acts excused by fair use. In the EU, the analysis turns heavily on the text-and-data-mining exceptions in Directive 2019/790 and the separate public-summary/copyright-compliance obligations imposed on general-purpose AI model providers by the AI Act. In the U.K., the enacted text-and-data-mining exception remains limited to non-commercial research, while the U.K. also retains a statutory “computer-generated” works concept that differs materially from the U.S. and EU position.

Executive Summary

United States.

U.S. AI training commonly implicates the copyright owner’s exclusive rights because § 106 includes the rights to reproduce, prepare derivative works, distribute, perform, and display copyrighted works. The principal defense is fair use under § 107, which requires a four-factor analysis rather than a categorical rule that AI training is always lawful or always infringing.

The U.S. Copyright Office’s AI training report treats generative-AI training as fact-specific: uses for non-substitutive research or analysis are more likely to be fair, while copying pirate-source material to generate competing expressive content where licensing is available is high-risk. That report is official administrative material, not binding case law.

U.S. output ownership should be drafted as a contractual allocation of whatever rights may exist, not as an unconditional copyright warranty. Copyright protection requires original human authorship. Thaler v. Perlmutter is final D.C. Circuit authority after Supreme Court certiorari was denied on 2026-03-02, but it is not a Supreme Court merits holding.

California now imposes a public-facing generative-AI training-data transparency rule. A developer making a covered generative-AI system available to Californians must post documentation about training data, including a high-level dataset summary, sources or owners, whether datasets include copyright/trademark/patent-protected data or public-domain data, and whether datasets were purchased or licensed.

European Union.

In the EU, the DSM Directive creates TDM exceptions. Article 3 covers research organizations and cultural heritage institutions for scientific research. Article 4 covers reproductions and extractions of lawfully accessible works for TDM more broadly, but only where rightholders have not expressly reserved rights in an appropriate manner, including machine-readable means for online content.

The EU AI Act adds direct obligations for general-purpose AI model providers: technical documentation, downstream-provider information subject to IP and trade-secret protection, a copyright-compliance policy, compliance with DSM Article 4(3) rights reservations, and a public summary of training content. Those GPAI provisions apply from 2025-08-02.

EU training-data transparency is not full public corpus disclosure. AI Act recitals state that the summary should be generally comprehensive, not technically detailed, and should account for trade secrets and confidential business information; Article 78 separately imposes confidentiality duties on authorities handling protected information.

United Kingdom.

The U.K. position is more licensing-dependent for commercial AI training. CDPA s. 29A permits copying for computational analysis only for non-commercial research by a person with lawful access, and contract terms preventing such permitted copying are unenforceable only within that limited exception. The U.K. government’s 2026 report confirms that no broad commercial AI-training exception has been enacted.

The U.K. differs from the U.S. and EU on AI outputs because CDPA s. 9(3) treats the author of a computer-generated literary, dramatic, musical, or artistic work as the person by whom the arrangements necessary for creation are undertaken, and CDPA s. 178 defines “computer-generated” as generated by computer where there is no human author.

Trade secrets and disclosure.

Trade-secret protection is a shield against unnecessary public disclosure, not a complete answer to transparency or discovery. In U.S. litigation, Rule 26 permits discovery of relevant, proportional nonprivileged matter but allows protective orders requiring trade secrets or confidential commercial information not to be revealed or to be revealed only in a specified way. In the EU, the AI Act and Trade Secrets Directive expressly preserve confidentiality mechanisms while allowing legally required disclosures. In the U.K., the Trade Secrets Regulations define trade secrets and preserve confidentiality in proceedings.

Jurisdiction Profile

United States. Binding authority consists of federal copyright statutes, Supreme Court decisions, and lower federal court decisions only within their jurisdictional limits. U.S. Copyright Office registration guidance and AI reports are official administrative materials and useful statements of current agency views, but they are not statutes and do not bind courts on fair use. Title 17 as published by the Copyright Office is stated to include amendments through 2025-12-18; the Copyright Office’s Title 17 page is therefore used here as the current official federal copyright reference. California Civil Code §§ 3110–3111 are included as a U.S. state-level public-facing training-data documentation duty for covered generative-AI developers.

European Union. The principal primary sources are EU regulations and directives published in the Official Journal/EUR-Lex, and CJEU case law. The AI Act is directly applicable, with Article 53 and Article 78 applicable from 2025-08-02; the Regulation applies generally from 2026-08-02, but those GPAI and confidentiality provisions already apply as of the date of this memo. Directives, including the DSM Directive and Trade Secrets Directive, bind Member States as to result and require national implementation.

United Kingdom. The primary copyright source is the Copyright, Designs and Patents Act 1988 as maintained on legislation.gov.uk. The CDPA page reports that it is up to date with all changes known to be in force on or before 2026-05-25. U.K. government consultations, progress statements, and reports are official administrative/policy materials, but they are not themselves binding legal authority unless they implement or explain enacted law. The U.K. government’s section 136 report was published on 2026-03-18; it confirms that the broad exception-with-opt-out proposal is no longer the preferred option and that no broad commercial AI-training exception has been enacted.

Issue 1 — Does AI training on copyrighted material require permission?

Conclusion

In the U.S., there is no general statutory AI-training exception. Training that makes copies can implicate § 106, and the outcome generally turns on fair use under § 107. In the EU, the DSM Directive provides structured TDM exceptions, but Article 4 is conditioned on lawful access and no effective rights reservation. In the U.K., the enacted TDM exception is limited to non-commercial research, so commercial training generally remains licensing-dependent.

United States

Rule

Section 106 gives the copyright owner exclusive rights, including reproduction, derivative-work preparation, distribution, public performance, public display, and, for sound recordings, public performance by digital audio transmission. Section 107 then provides that fair use “is not an infringement” and lists the four factors: purpose and character, nature of the work, amount and substantiality, and market effect.

The Supreme Court’s fair-use cases frame the AI-training analysis but do not decide it. Google v. Oracle found fair use where Google copied portions of an API to allow programmers to use accumulated skills in a new transformative program, while Warhol held that where the challenged use shares the same or highly similar purpose as the original and is commercial, the first factor can weigh against fair use unless there is a distinct justification.

The Copyright Office’s Part 3 report treats training as a fair-use problem rather than a categorical rule. It identifies several copyright-relevant stages in generative-AI development, including copying for acquisition, curation, training, and output-related processes, and concludes that some training uses are likely fair while others are not. The Office’s high-risk example is copying from pirate sources to generate competing content where licensing is reasonably available; its lower-risk example is non-commercial research or analysis that does not meaningfully substitute for expression in the inputs.

Application

For a commercial model developer, the highest-risk U.S. training profile is: copyrighted expressive works; copied in full; obtained from unauthorized or pirate sources; with no provenance record; used to create a model that outputs close substitutes, style-imitative works, or memorized passages; with no controls against memorization or near-verbatim output; and deployed into the same licensing market as the input works.

The lower-risk, more defensible profile is: licensed, public-domain, permissively licensed, or otherwise lawfully accessed data; use for non-substitutive research, classification, safety, or analytic functions; documented necessity for copying; robust deduplication and memorization (regurgitation) controls; no output of substantially similar protected expression; and documentation showing why the model use is functionally different from the original market.

The 2025 Thomson Reuters v. ROSS decision is an important adverse data-training authority, but its scope should not be overstated. The court described ROSS as using Westlaw headnotes “as AI data” for a competing legal-research tool, stated that the AI at issue was not generative AI, and held that ROSS’s use was not transformative because it used the copied material to develop a competing legal-research product. Because the case involved a non-generative tool and a direct competitor relationship, it is only persuasive authority outside that court and should be treated as a warning signal, not as a Supreme Court rule for all foundation-model training.

Limitations and counterarguments

Developers will rely on Google, the intermediate-copying cases, the non-expressive and functional character of statistical learning, transformative use, public benefit, lack of expressive storage, and lack of output substitution. Rightholders will rely on Warhol, same-purpose market substitution, licensing-market harm including the availability of training-data licenses, verbatim memorization, and bad-faith acquisition.

The strongest unresolved question is whether courts will treat model training as sufficiently transformative when the model is a general-purpose system rather than a product competing in the same market as the copied works.

European Union

Rule

The DSM Directive defines text and data mining as an automated analytical technique for analysing text and data in digital form to generate information, including patterns, trends, and correlations. Article 3 provides a TDM exception for research organizations and cultural heritage institutions for scientific research where they have lawful access. Article 4 permits reproductions and extractions of lawfully accessible works for TDM generally, but only where the rightholder has not expressly reserved use in an appropriate manner, including machine-readable means for online content.

AI Act Article 53 requires GPAI model providers to maintain technical documentation, provide downstream providers with needed information subject to protection of IP and trade secrets, put in place a policy to comply with Union copyright law, identify and comply with rights reservations under DSM Article 4(3), and make publicly available a sufficiently detailed summary of training content.

Application

A public-web dataset is not automatically safe merely because it is publicly reachable. The developer must classify whether the material was lawfully accessible and whether the rightholder made an effective rights reservation. For online works, the compliance file should preserve evidence of machine-readable rights-reservation checks, crawler rules, metadata, terms, exclusion lists, and subsequent withdrawal handling.
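The evidence-preservation step described above can be sketched in Python. This is a minimal illustration, not a compliance tool: it assumes the crawler consults robots.txt as one machine-readable signal, and whether any particular signal amounts to an effective rights reservation under DSM Article 4(3) remains a legal question the code does not answer. The crawler name `ExampleTrainingBot`, the `ReservationCheck` record, and its field names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass(frozen=True)
class ReservationCheck:
    """One preserved record of a machine-readable crawler-rule check (hypothetical schema)."""
    url: str
    user_agent: str
    allowed: bool          # what the robots.txt rules said for this agent and URL
    robots_sha256: str     # fingerprint of the exact robots.txt text that was evaluated
    checked_at: str        # caller-supplied timestamp, e.g. ISO 8601


def check_reservation(robots_txt: str, url: str, user_agent: str, checked_at: str) -> ReservationCheck:
    """Evaluate already-fetched robots.txt text and return an evidence record."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return ReservationCheck(
        url=url,
        user_agent=user_agent,
        allowed=parser.can_fetch(user_agent, url),
        robots_sha256=hashlib.sha256(robots_txt.encode("utf-8")).hexdigest(),
        checked_at=checked_at,
    )
```

Storing the hash of the rules actually evaluated, together with the time of the check, lets the compliance file show what the site signalled when the material was collected, which also supports later handling of withdrawals. Metadata, site terms, and exclusion lists would need their own records.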

For research institutions, Article 3 may support TDM even where contract terms would otherwise obstruct the exception, because Article 7 makes contractual provisions contrary to Articles 3, 5, and 6 unenforceable; Article 4 is not listed in that contractual override.

For EU deployment, therefore, the operational question is not simply whether training data was “publicly available.” Beyond lawful access and effective rights reservations, the relevant questions are whether the provider has a rights-reservation detection process, whether the Article 53 summary is sufficiently detailed under the AI Office template, and whether downstream information can be supplied without unnecessary exposure of trade secrets.

Limitations and counterarguments

The EU rule does not require work-by-work public disclosure of every training item. AI Act recitals say the summary should be generally comprehensive but not technically detailed, and should account for trade secrets and confidential business information. At the same time, a trade-secret objection cannot nullify Article 53: confidentiality affects the level of granularity and the disclosure channel, not the existence of the obligation.

United Kingdom

Rule

CDPA s. 16 gives copyright owners exclusive rights in the U.K. to copy, issue copies, rent or lend, perform/show/play, communicate to the public, and adapt the work; infringement occurs where a person does or authorizes a restricted act without the copyright owner’s licence.

CDPA s. 29A permits a person with lawful access to make a copy so that the person may carry out computational analysis of anything recorded in the work, but only for the sole purpose of non-commercial research and subject to acknowledgment requirements where practical. The section further states that a copy made under the exception becomes infringing if transferred or used for another purpose, and that contract terms preventing or restricting such copies are unenforceable.

Application

A U.K. university or research institution may have a stronger statutory position for non-commercial research TDM. A commercial foundation-model developer training on copyrighted works for product deployment generally should not rely on s. 29A. It should obtain licenses, use public-domain or permissively licensed data, or isolate U.K. training and deployment risks.

Licences for commercial model training should not be drafted as if the U.K. has an EU-style broad commercial TDM exception. A U.K.-targeted training licence should expressly authorize copying, extraction, retention, indexing, training, fine-tuning, validation, and post-termination use or deletion of derived artifacts.

Limitations and counterarguments

The Data (Use and Access) Act 2025 required government reporting on copyright and AI, including the effect of copyright on access to and use of data by AI developers. The U.K. government published its section 136 report on 2026-03-18. The report states that the previously preferred broad copyright exception with opt-out and transparency is no longer the preferred option, that there was no consensus, and that the government will not introduce reforms until it is confident they work in practice. It also records that the U.K. has no statutory licensing scheme for copyright works used to train AI models and no specific AI/copyright regulator.

Issue 2 — How should training-data licences be drafted?

Conclusion

A robust AI training-data licence should be technology-specific, corpus-specific, and jurisdiction-aware. It should not rely on generic language such as “AI rights included.” It must cover the actual technical acts and allocate responsibility for provenance, rights reservations, transparency, confidentiality, output similarity, and post-termination artifacts.

Rule

The licence must map to restricted acts. In the U.S., that means at least reproduction, derivative-work risk, distribution where copies are shared, display/performance where relevant, and ownership-transfer formalities where copyright ownership is transferred. Under § 201, copyright initially vests in the author, works made for hire have special authorship rules, ownership may be transferred, and ownership of a material object is distinct from copyright ownership; under § 204, a copyright transfer generally requires a signed writing.

In the EU, licence drafting must account for DSM Article 4 rights reservations and AI Act Article 53 obligations. Article 53 requires a GPAI provider to put in place a policy to comply with Union copyright law and to identify and comply with rights reservations expressed under DSM Article 4(3).

In the U.K., because commercial TDM is not covered by s. 29A, licence drafting should be explicit about commercial model development and deployment.

Limitations

Contract cannot make an unauthorized source lawful if the licensor lacks rights. Contract cannot defeat mandatory transparency duties. Contract cannot create copyright in AI-only output where the governing law denies copyright. Contract also cannot bind nonparties unless structured through chain-of-title, sublicensing, platform terms, or contributor agreements. In the U.S., license terms may reduce fair-use uncertainty but should not be drafted to admit that unlicensed use is necessarily infringing. In the EU, private confidentiality terms cannot override AI Act Article 53 public-summary duties.

Issue 3 — Who owns AI-generated and AI-assisted outputs?

Conclusion

U.S. law requires human authorship; EU law requires an author’s own intellectual creation; U.K. law uniquely recognizes statutory authorship for certain computer-generated works. Contracts should allocate use rights and economic control, but should not overstate copyright ownership in uncopyrightable output.

United States

Rule

Section 102 protects original works of authorship fixed in a tangible medium and excludes ideas, procedures, processes, systems, methods of operation, concepts, principles, and discoveries. Section 201 starts from authorship as the basis of initial ownership, subject to works-made-for-hire and transfer rules.

The Copyright Office’s AI copyrightability report states that existing law is adequate, material generated wholly by AI is not copyrightable, and works using AI as an assistive tool may be copyrightable where the resulting work contains sufficient human-authored expressive elements.

The Office also treats prompts cautiously. It recognizes that a prompt can itself be copyrightable if sufficiently expressive, but ordinary prompts are often closer to ideas, instructions, or methods than protectable authorship of the resulting AI output. Human selection, coordination, arrangement, and expressive modification can support protection, but protection does not extend to AI-generated elements standing alone.

In Thaler v. Perlmutter, the D.C. Circuit affirmed denial of registration where the work was claimed to be created by an AI system, and the opinion treated human authorship as required by the Copyright Act’s structure. The D.C. Circuit’s human-authorship decision is final after denial of certiorari, although not Supreme Court merits precedent.

Application

A company using generative AI to draft marketing copy, software, images, or reports should separate four layers: the user’s prompt; the raw output; human edits and arrangement; and the final integrated work. Copyright may attach to the human-authored prompt if expressive, to human edits if original, and to the final compilation if selection and arrangement are sufficiently creative. Copyright will not attach to uncopyrightable AI-only text or images merely because a contract says the user “owns” them.

Draft output terms in layers:

Inputs: Customer retains ownership of customer prompts, documents, code, images, and other inputs, subject to licences needed for processing.
Raw AI output: Provider assigns or disclaims any rights it may have, but the clause should say protection depends on applicable law.
Human edits: Human-authored revisions, selection, coordination, arrangement, and integration can support copyright where sufficiently original.
Final work product: Ownership should vest in the customer or employer through normal work-product, employment, assignment, or services-agreement mechanics.

For enterprise procurement, the correct ownership clause is not “customer owns all IP in outputs” without qualification. A better clause says the provider assigns or grants all rights it may have in outputs to the customer, disclaims ownership in customer inputs, and acknowledges that copyrightability is determined by applicable law. The customer should also receive contractual exclusivity or confidentiality only where technically and commercially feasible.

Counterarguments and limits

The Copyright Office’s reports are not binding judicial law. But they are controlling for registration practice and align with Thaler v. Perlmutter, where the D.C. Circuit rejected registration for a work attributed to an AI system rather than a human author.

European Union

Rule

CJEU originality doctrine requires a work to be the author’s own intellectual creation; Painer states that an intellectual creation is the author’s own if it reflects the author’s personality, and CJEU cases such as Cofemel treat copyright protection as limited to subject matter that is original in that sense.

Application

Raw AI-only output is unlikely to be protected as a copyright work if no human made free and creative choices in the expression. EU-facing output clauses should therefore avoid asserting that all outputs are protected works. They should allocate contractual use, exclusivity if commercially offered, confidentiality, indemnity, and non-assertion covenants, while preserving the possibility that human-authored selection or editing may be protected.

For cross-border licensing, an EU-facing contract should not assume that the same output ownership rule applies across all Member States or that AI-only outputs have copyright status.

United Kingdom

Rule

CDPA s. 9(3) provides that for computer-generated literary, dramatic, musical, or artistic works, the author is taken to be the person by whom the arrangements necessary for creation of the work are undertaken. CDPA s. 178 defines “computer-generated” as generated by computer in circumstances such that there is no human author. CDPA s. 12(7) provides a 50-year term for computer-generated works from the end of the calendar year in which the work was made.

Application

In a simple prompt-to-output workflow, the U.K. result may favor the user more than U.S. law does, because the user may be characterized as making the arrangements necessary for creation. But a user, employer, provider, or system operator may each argue that it undertook those arrangements, and where a provider pre-configures system prompts, controls model parameters, supplies retrieval context, applies safety filters, or automates generation, authorship may be contested, or there may be no qualifying owner, depending on the facts.

The correct drafting response is to allocate ownership contractually among user, provider, and enterprise customer, while recognizing that statutory authorship is fact-sensitive.

Issue 4 — What training-data transparency duties apply, and how do they interact with trade secrets?

Conclusion

Transparency duties are increasing, but they do not generally require full public disclosure of exact corpus contents. The strongest current public disclosure obligations are California Civil Code § 3111 for covered public-facing generative-AI systems and EU AI Act Article 53 for GPAI providers. Trade secrets support confidential treatment and staged disclosure, but they do not nullify statutory transparency or litigation discovery.

United States federal law

Rule

Federal copyright law, as reviewed here, does not impose a general public training-corpus disclosure duty. But litigation discovery is different. Rule 26(b)(1) allows discovery of nonprivileged matter relevant to a claim or defense and proportional to the needs of the case. Rule 26(c) allows protective orders for good cause, including orders specifying disclosure terms and requiring that trade secrets or confidential commercial information not be revealed or be revealed only in a specified way.

The DTSA defines a trade secret as information for which the owner has taken reasonable measures to keep it secret and that derives independent economic value from not being generally known or readily ascertainable. FOIA Exemption 4 protects trade secrets and confidential commercial or financial information obtained from a person.

Application

In copyright litigation, dataset identity, source legality, licence status, opt-out compliance, and output memorization may be central to liability, fair use, market harm, or damages. A developer should therefore not rely on a blanket “trade secret” refusal. It should instead treat the full corpus, source weights, deduplication rules, quality filters, data recipes, evaluation sets, red-team results, and memorization mitigations as separate confidentiality classes.

The better litigation position is a staged protective order: public pleadings at high level; confidential summaries; outside-counsel-only access; expert-only inspection; sampling; hashed manifests; escrowed corpus review; and sealed exhibits where necessary.
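The hashed-manifest element of staged disclosure can be illustrated with a short Python sketch. It assumes a simple design in which the developer discloses SHA-256 digests of normalized corpus items rather than the items themselves, and the opposing side hashes its own works the same way to test for presence. The function names, the whitespace-and-case normalization rule, and the item identifiers are all hypothetical choices for illustration.

```python
import hashlib
from typing import Dict, Iterable, Optional, Tuple


def normalize(text: str) -> bytes:
    """Collapse whitespace and lowercase so trivial formatting differences still match."""
    return " ".join(text.split()).lower().encode("utf-8")


def build_manifest(items: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each item's SHA-256 digest to its internal identifier.

    `items` yields (item_id, text) pairs; only digests and ids leave the corpus.
    """
    return {hashlib.sha256(normalize(text)).hexdigest(): item_id for item_id, text in items}


def match_works(manifest: Dict[str, str], works: Iterable[Tuple[str, str]]) -> Dict[str, Optional[str]]:
    """For each (work_id, text), return the matching corpus item id, or None if absent."""
    return {
        work_id: manifest.get(hashlib.sha256(normalize(text)).hexdigest())
        for work_id, text in works
    }
```

This kind of manifest only detects copies that are identical after normalization; excerpts, reformatted editions, or near-duplicates would need other techniques, which is one reason sampling and expert inspection appear alongside it. Digests of short or guessable texts can also be reversed by brute force, so a manifest is a disclosure-limiting device under a protective order rather than a confidentiality guarantee on its own.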

Counterarguments and limits

Plaintiffs will argue that dataset identity, source legality, opt-out compliance, memorization, and licensing-market harm are central to copyright claims. Courts may compel enough information to test those claims. Trade-secret protection affects how disclosure occurs; it rarely eliminates discovery entirely where the information is central and no less intrusive substitute exists.

California

Rule

California Civil Code § 3111 requires covered developers to post training-data documentation on the developer’s website on or before 2026-01-01 and before later public release of a generative-AI system or substantial modification made available to Californians.

Required content includes a high-level dataset summary; sources or owners; intended purpose; number and types of data points; whether datasets include copyright, trademark, patent, or public-domain data; whether datasets were purchased or licensed; personal-information and aggregate-consumer-information indicators; cleaning/processing/modification; collection period; first-use dates; and synthetic-data use. The listed exceptions cover security-and-integrity-only systems, aircraft operation, and national-security/military/defense systems made available only to a federal entity.
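A developer tracking these items per dataset might keep a structured record and check it for gaps before publication. The Python sketch below is only an illustration of that bookkeeping: the class name, the field names, and the grouping of items are hypothetical paraphrases of the content listed above, not statutory language or an official form.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DatasetDisclosure:
    """Per-dataset documentation record; None means the item has not been completed."""
    summary: Optional[str] = None
    sources_or_owners: Optional[str] = None
    intended_purpose: Optional[str] = None
    data_point_count_and_types: Optional[str] = None
    ip_or_public_domain_status: Optional[str] = None
    purchased_or_licensed: Optional[bool] = None
    includes_personal_information: Optional[bool] = None
    includes_aggregate_consumer_information: Optional[bool] = None
    processing_description: Optional[str] = None
    collection_period: Optional[str] = None
    first_use_date: Optional[str] = None
    uses_synthetic_data: Optional[bool] = None


def missing_fields(disclosure: DatasetDisclosure) -> List[str]:
    """Return the names of items still unset, in declaration order."""
    return [f.name for f in fields(disclosure) if getattr(disclosure, f.name) is None]
```

Using `None` for "not yet answered" keeps a deliberate `False` (for example, no synthetic data used) distinct from an omission, which matters when the record is later used to show that each item was actually considered.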

Application

California Civil Code § 3111 is a significant U.S. transparency rule, but it is a documentation duty, not a copyright licence, and it does not itself make training lawful. It does not require a work-by-work public inventory, but it does require enough information to answer copyright, licensing, and public-domain status at dataset level.

European Union

Rule

AI Act Article 53 requires GPAI providers to maintain technical documentation, provide downstream-provider information subject to protection of IP and trade secrets, implement a copyright-compliance policy, identify and comply with DSM Article 4(3) rights reservations, and make publicly available a sufficiently detailed summary of training content using the AI Office template.

Annex XI separately requires technical documentation to include information about training, testing, and validation data, including type, provenance, curation, number of data points, scope, and main characteristics.

AI Act Article 78 requires covered authorities and bodies to respect confidentiality of information and data obtained in carrying out AI Act tasks, including IP rights, confidential business information, trade secrets, and source code. It also requires data requests to be limited to what is strictly necessary to assess compliance.

The AI Act recitals clarify the intended granularity: the public summary should be generally comprehensive in scope, not technically detailed, and should account for trade secrets and confidential business information. It should list main datasets or data collections used for training and provide a narrative explanation about other data sources.

The Trade Secrets Directive defines a trade secret as information that is secret, has commercial value because it is secret, and has been subject to reasonable steps to keep it secret. It also requires Member States to preserve confidentiality of trade secrets in legal proceedings, including by restricting access to documents, hearings, and non-confidential versions of decisions where appropriate.

Application

The correct EU documentation architecture is two-tiered. The public Article 53 summary should identify main datasets and data collections, data categories, modalities, collection periods, curation principles, and corpus-level provenance at the template-required level. A confidential regulator-facing package should preserve exact source lists, dataset weights, filtering logic and recipes, model-card details, opt-out logs, rights-reservation evidence, evaluation sets, and safety artifacts for Article 78 treatment.

The provider should also maintain an internal legal basis map: licensed data, public-domain data, Article 4 TDM data with no effective reservation, opt-out excluded data, and data retained under research or safety exceptions.
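An internal legal basis map of this kind could be kept as a simple classification over source records. The Python sketch below uses the five categories named above plus a hypothetical "needs review" bucket. The record fields and, in particular, the order in which the checks are applied are illustrative assumptions, not a statement of how the legal bases rank against each other.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class LegalBasis(Enum):
    LICENSED = "licensed"
    PUBLIC_DOMAIN = "public_domain"
    ART4_TDM_NO_RESERVATION = "article_4_tdm_no_effective_reservation"
    OPT_OUT_EXCLUDED = "opt_out_excluded"
    RESEARCH_OR_SAFETY = "research_or_safety_exception"
    UNCLASSIFIED = "unclassified_needs_review"  # hypothetical catch-all, not from the memo


@dataclass(frozen=True)
class SourceRecord:
    source_id: str
    has_licence: bool = False
    public_domain: bool = False
    research_or_safety_only: bool = False
    rights_reserved: bool = False
    lawfully_accessible: bool = False


def classify(rec: SourceRecord) -> LegalBasis:
    """Assign one basis per source; the precedence below is an assumption for illustration."""
    if rec.has_licence:
        return LegalBasis.LICENSED
    if rec.public_domain:
        return LegalBasis.PUBLIC_DOMAIN
    if rec.research_or_safety_only:
        return LegalBasis.RESEARCH_OR_SAFETY
    if rec.rights_reserved:
        return LegalBasis.OPT_OUT_EXCLUDED
    if rec.lawfully_accessible:
        return LegalBasis.ART4_TDM_NO_RESERVATION
    return LegalBasis.UNCLASSIFIED


def summarize(records: Iterable[SourceRecord]) -> Dict[str, int]:
    """Count sources per basis, e.g. as an input to corpus-level reporting."""
    return dict(Counter(classify(r).value for r in records))
```

Routing anything that fits no category to a review bucket, rather than defaulting it to the Article 4 category, keeps unexamined sources from being counted as covered by an exception.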

The Commission published the general-purpose AI Code of Practice on 2025-07-10, the Commission Guidelines on GPAI model obligations on 2025-07-18, and the explanatory notice/template for public summaries of GPAI training content on 2025-07-24. The Commission describes the Code as a voluntary tool for compliance with Article 53 transparency and copyright obligations, while the guidelines are non-binding Commission interpretation.

Counterarguments and limits

A provider cannot use “trade secret” to avoid the public summary altogether. Conversely, rightholders cannot assume Article 53 gives them full work-by-work discovery of a corpus. The AI Act compromises through summarized public transparency, confidential supervisory access, and copyright-compliance documentation.

United Kingdom

Rule

The Data (Use and Access) Act 2025 requires a report on the effect of copyright law on access to and use of data by AI developers, including text and data mining, and requires a progress statement before the end of six months.

The U.K. Trade Secrets Regulations define a trade secret as information that is secret, has commercial value because it is secret, and has been subject to reasonable steps to keep it secret. Regulation 10 requires participants with access to trade secrets or alleged trade secrets in proceedings not to use or disclose them, subject to the regulation’s conditions.

Application

The 2026 government report confirms transparency remains under active policy consideration, including disclosure by AI developers, technical measures, licensing, and enforcement. But the report does not itself enact a developer-facing public training-summary duty.

A U.K.-only model developer should not assume it must publish an EU-style training summary unless it is also an EU AI Act GPAI provider or otherwise subject to non-U.K. obligations. It should still prepare an audit-ready data provenance file. The statutory report process expressly covers disclosure, licensing, web crawlers, and AI systems developed outside the U.K., and legislative reform, litigation, procurement audits, and investor diligence may each require defensible internal records.

Getty Images v. Stability AI is also relevant to confidentiality handling: the High Court allowed only limited redactions for technical trade-secret information after confidentiality review, reflecting the tension between open justice and protected technical information. It should not be cited as a merits ruling that training was lawful or unlawful because the training/development copyright claim and output copyright claim were abandoned.

Conclusions

United States — training data. This analysis identified no federal statutory rule that generative-AI training is categorically lawful or categorically infringing. The controlling statute remains 17 U.S.C. §§ 106 and 107: training that creates copies can implicate exclusive rights, and the principal defense is fair use. The Copyright Office’s Part 3 report remains official administrative material and frames generative-AI training as fact-specific, not categorical. The Supreme Court’s Google v. Oracle and Warhol decisions supply controlling fair-use principles, but neither decides foundation-model training.

United States — output ownership. U.S. copyright protection still requires human authorship. AI-generated material without sufficient human authorship is not registrable as a copyrighted work under the Copyright Office’s position and the final D.C. Circuit judgment in Thaler. Contracts may assign or disclaim whatever rights a provider may have, but they cannot create copyright in uncopyrightable AI-only material.

United States — litigation transparency and trade secrets. This analysis identified no general federal copyright-law duty to disclose a training corpus publicly. In litigation, however, training data, source records, licensing records, and model-development records may be discoverable if relevant and proportional. Rule 26(c) authorizes protective orders for trade secrets and confidential commercial information, so the correct litigation posture is staged protected disclosure, not categorical refusal.

California — public training-data documentation. For covered generative-AI systems made available to Californians, California Civil Code §§ 3110–3111 impose a public-facing training-data documentation obligation. This is a transparency duty, not a copyright license or infringement safe harbor.

European Union — training data. The DSM Directive provides the central TDM structure. Article 3 covers scientific-research TDM by research organizations and cultural heritage institutions with lawful access. Article 4 provides a broader TDM exception for lawfully accessible works, but it is unavailable where rightholders have expressly reserved rights in an appropriate manner, including machine-readable means for online content.

European Union — GPAI transparency and copyright compliance. AI Act Article 53 requires GPAI model providers to maintain documentation, provide downstream information subject to IP/trade-secret protection, implement a Union copyright-law compliance policy, identify and comply with DSM Article 4(3) rights reservations, and publish a sufficiently detailed summary of training content. Article 78 separately protects confidential information, including intellectual property, confidential business information, trade secrets, and source code.

United Kingdom — training data. CDPA s. 29A remains narrow: it permits copying for computational analysis only for non-commercial research by a lawful-access user. The U.K. has not enacted a broad commercial AI-training exception. The 2026 government report confirms that the broad exception-with-opt-out proposal is no longer preferred and that further evidence-gathering is planned.

United Kingdom — output ownership. The U.K. remains distinct because CDPA s. 9(3) treats the author of a computer-generated literary, dramatic, musical, or artistic work as the person by whom the arrangements necessary for creation are undertaken, and CDPA s. 178 defines computer-generated works as generated by computer where there is no human author. This differs materially from the U.S. human-authorship rule.
