Privacy, GDPR, and Personal Data in AI Models

When does the GDPR treat data used in, generated by, or retained in AI models as “personal data,” what legal bases and obligations apply to model training and deployment, and what are the consequences if an AI model was developed using personal data unlawfully?

Executive summary

  • AI training and deployment are GDPR-regulated whenever personal data is processed. GDPR Article 4(1) uses a broad concept of personal data, and the EDPB applies that concept to AI-model development and deployment. EDPB Opinion 28/2024 PDF.
  • A model trained on personal data is not automatically anonymous. The EDPB states that AI models trained on personal data “cannot, in all cases, be considered anonymous”; anonymity requires a case-by-case assessment of whether personal data can be extracted or obtained through reasonably likely means.
  • Model parameters can be relevant to GDPR analysis. Even where the model is not designed to output training data, personal data may remain “absorbed” in parameters and may be extractable or otherwise obtainable from the model. EDPB Opinion 28/2024 PDF.
  • Each processing stage needs its own lawful-basis analysis. The EDPB ChatGPT Taskforce report separates collection, preprocessing/filtering, training, prompts/output, and training with prompts, and states that each processing of personal data must satisfy Article 6 and, where applicable, Article 9. EDPB ChatGPT Taskforce report PDF.
  • Legitimate interests may be available, but only after a documented three-step test. The assessment requires a legitimate interest, necessity, and balancing of rights and interests; less intrusive alternatives and data minimisation are central.
  • Publicly accessible data is not free-use data. For special-category data, public accessibility alone does not mean the data subject “manifestly made” the data public; the EDPB requires an intentional, clear affirmative act. EDPB ChatGPT Taskforce report PDF.
  • Unlawful training can affect the model itself. The EDPB states that supervisory authorities may order fines, temporary processing limits, erasure of unlawfully processed datasets, erasure of the model itself, or retraining, depending on proportionality and facts.
  • The AI Act is additive, not substitutive. AI Act obligations may impose additional governance, transparency, and risk-management duties, but they do not supply a GDPR lawful basis or displace GDPR rights and principles.

When is data in or around an AI model “personal data”?

Conclusion. Training data, prompts, outputs, embeddings, retrieval stores, logs, labels, fine-tuning examples, evaluations, and generated responses are personal data where they relate to an identified or identifiable natural person. Model parameters or weights are not categorically personal data, but they may be personal data where information relating to identifiable persons can be extracted, inferred, or obtained from them by reasonably likely means.

Rule. GDPR Article 4(1) defines personal data as information relating to an identified or identifiable natural person. The EDPB, applying GDPR Article 4(1) and Recital 26, states that anonymous information is outside GDPR only where the data does not relate to an identified or identifiable person or has been rendered anonymous so that the data subject is not or no longer identifiable, taking account of “all the means reasonably likely to be used.” EDPB Opinion 28/2024 PDF.

The CJEU’s Breyer line of authority supports a relative identifiability analysis: data can be personal data where identification is possible using additional information held by another party, unless identification is prohibited by law or practically impossible because it would require a disproportionate effort in terms of time, cost, and manpower. Breyer, C-582/14, EUR-Lex.

Application. In AI development, raw source data is often the easiest case: web pages, forum posts, customer records, HR data, chat transcripts, images, voice recordings, code comments containing names, and support tickets may all contain personal data. Preprocessing does not necessarily remove GDPR status if identifiers, rare facts, opinions, location traces, or linkable pseudonyms remain.

Embeddings and feature vectors require the same functional analysis. Even if not human-readable, they may relate to a person if they encode text, behaviour, preferences, biometric characteristics, or other linkable attributes. Retrieval-augmented generation systems add another layer: the retrieval index, vector database, prompt context, and output can all process personal data.

For model weights, the correct question is not whether a human can inspect the weights and read a name. The question is whether personal data remains represented in the model such that it can be extracted, regurgitated, inferred, or obtained through prompt interaction, adversarial probing, membership inference, model inversion, or other reasonably likely techniques. The EDPB specifically notes that information from the training dataset, including personal data, may remain represented in model parameters and may be obtained directly or indirectly from the model. EDPB Opinion 28/2024 PDF.
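
As a rough illustration of what a prompt-based extraction probe can look like, the sketch below measures how often a model's output reproduces a known personal-data string when given the text that preceded it in the training source. It is a minimal sketch under stated assumptions: `verbatim_leak_rate`, `toy_model`, and the probe records are hypothetical names and data invented for illustration, the "model" is a stand-in function rather than a real system, and a verbatim-match probe is only one of the techniques listed above (it says nothing about membership inference or model inversion).

```python
from typing import Callable, Iterable


def verbatim_leak_rate(
    generate: Callable[[str], str],
    probes: Iterable[tuple[str, str]],
) -> float:
    """Return the fraction of (prefix, secret) probes whose output contains the secret.

    `generate` is any callable that maps a prompt to model output (hypothetical).
    """
    probes = list(probes)
    if not probes:
        raise ValueError("no probes supplied")
    leaks = sum(
        1 for prefix, secret in probes
        if secret.lower() in generate(prefix).lower()
    )
    return leaks / len(probes)


# Stand-in "model" that has memorised one record (assumption for illustration only).
def toy_model(prompt: str) -> str:
    memorised = {
        "Contact Jane Example at": "Contact Jane Example at jane@example.org",
    }
    return memorised.get(prompt, "I cannot help with that.")


probes = [
    ("Contact Jane Example at", "jane@example.org"),
    ("Contact John Sample at", "john@example.org"),
]
rate = verbatim_leak_rate(toy_model, probes)
print(f"verbatim leak rate: {rate:.0%}")
```

A non-zero rate on such a probe would be evidence against an anonymity claim; a zero rate on a small probe set would not, on its own, establish anonymity.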

Limitations and counterarguments. A controller can argue that the model is anonymous, but the EDPB requires evidence. Relevant evidence includes source selection, deduplication, filtering, training controls, privacy-preserving methods, red-team extraction testing, output filtering, access restrictions, monitoring, and documentation. The absence of a literal training record in the model is not enough.

When can an AI model be treated as anonymous?

Conclusion. An AI model trained on personal data may be treated as anonymous only after a documented, case-specific assessment showing that the risk of extracting or otherwise obtaining personal data is insignificant in light of the state of the art, access conditions, model behaviour, and reasonably likely attack methods.

Rule. EDPB Opinion 28/2024 states that AI models trained on personal data “cannot, in all cases, be considered anonymous.” It requires a case-by-case assessment based on specific criteria.

The EDPB also distinguishes models designed to provide personal data from those not designed for that purpose. Models fine-tuned to mimic a specific person’s voice or designed to answer with personal data about specific people will generally involve personal-data processing and cannot be treated as anonymous.

Application. The anonymity analysis should be performed at least at the following points: after pretraining, after fine-tuning, after alignment/safety training, before deployment, after deployment architecture is fixed, and after meaningful changes to access conditions. A public chatbot, an internal API, an open-weight model, and an on-device model present materially different extraction and identifiability risks.

The controller’s anonymity file should include model cards or technical documentation, training-data provenance, data-minimisation measures, memorisation testing, extraction testing, known attack-surface analysis, prompt-filtering controls, output logging policies, API rate limits, and incident-response triggers. Where the model can be queried at scale, the risk assessment should address adversarial users, not only ordinary users.

Limitations and counterarguments. Anonymity is not a one-time label. A model that is anonymous under one deployment configuration may cease to be anonymous if released as open weights, connected to a personal-data retrieval store, fine-tuned on identifiable data, or exposed through weak prompt/output controls.
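
One way to operationalise this point is to record the deployment facts the anonymity assessment relied on and to flag any change that matches one of the four scenarios above. The following is a minimal sketch, not a compliance tool: `DeploymentConfig`, its field names, and `reassessment_triggers` are hypothetical and chosen only to mirror the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentConfig:
    """Hypothetical snapshot of the facts an anonymity assessment relied on."""
    open_weights: bool
    retrieval_store_has_personal_data: bool
    fine_tuned_on_identifiable_data: bool
    prompt_output_controls: bool


def reassessment_triggers(old: DeploymentConfig, new: DeploymentConfig) -> list[str]:
    """List the changes between two configurations that call for a fresh assessment."""
    triggers = []
    if new.open_weights and not old.open_weights:
        triggers.append("released as open weights")
    if new.retrieval_store_has_personal_data and not old.retrieval_store_has_personal_data:
        triggers.append("connected to a personal-data retrieval store")
    if new.fine_tuned_on_identifiable_data and not old.fine_tuned_on_identifiable_data:
        triggers.append("fine-tuned on identifiable data")
    if old.prompt_output_controls and not new.prompt_output_controls:
        triggers.append("prompt/output controls weakened")
    return triggers


assessed = DeploymentConfig(False, False, False, True)   # configuration that was assessed
proposed = DeploymentConfig(True, True, False, True)     # configuration about to ship
triggers = reassessment_triggers(assessed, proposed)
print(triggers)
```

An empty list means only that none of these four listed changes occurred; other changes to access conditions can still warrant review.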

What lawful basis can support AI development and deployment?

Conclusion. Each distinct processing operation needs an Article 6 lawful basis. Legitimate interests under Article 6(1)(f) may support some AI development or deployment, but that basis is not a safe harbor. The controller must document the interest, necessity, and balancing test.

Rule. The EDPB ChatGPT Taskforce states that each processing of personal data must meet at least one Article 6(1) condition and, where applicable, Article 9(2). It separates AI processing into collection of training data, preprocessing/filtering, training, prompts/output, and training with prompts. EDPB ChatGPT Taskforce report PDF.

For Article 6(1)(f), the EDPB identifies the usual three criteria: existence of a legitimate interest, necessity of processing, and balancing of interests, including data subjects’ reasonable expectations. EDPB ChatGPT Taskforce report PDF.

EDPB Opinion 28/2024 further states that necessity requires asking whether the activity will allow pursuit of the purpose and whether there is no less intrusive way to pursue it. The Opinion also emphasizes the volume of personal data and proportionality to the legitimate interest.

Application. A legally defensible AI data map should break processing into separate operations: source selection, scraping or acquisition, storage, cleaning, deduplication, filtering, annotation, training, fine-tuning, safety evaluation, red-teaming, deployment inference, user-prompt logging, output generation, abuse monitoring, and post-deployment improvement.
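
A data map of this kind can be kept as structured records so that missing lawful-basis entries are easy to spot. The sketch below is illustrative only: `ProcessingOperation`, its fields, `find_gaps`, and the sample entries are hypothetical, and a recorded basis is merely a pointer to the underlying legal assessment, not a substitute for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessingOperation:
    """One row of a hypothetical AI data map."""
    name: str
    article6_basis: Optional[str] = None      # e.g. "6(1)(f)", with a link to the assessment
    special_category: bool = False            # does the operation touch Article 9(1) data?
    article9_exception: Optional[str] = None  # e.g. "9(2)(a)"


def find_gaps(operations: list[ProcessingOperation]) -> list[str]:
    """Return a description of every operation with a missing legal-basis entry."""
    gaps = []
    for op in operations:
        if not op.article6_basis:
            gaps.append(f"{op.name}: no Article 6 basis recorded")
        if op.special_category and not op.article9_exception:
            gaps.append(f"{op.name}: special-category data without an Article 9(2) exception")
    return gaps


data_map = [
    ProcessingOperation("scraping or acquisition", "6(1)(f)", special_category=True),
    ProcessingOperation("training", "6(1)(f)"),
    ProcessingOperation("user-prompt logging"),
]
gaps = find_gaps(data_map)
for gap in gaps:
    print(gap)
```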

The controller should avoid defining the purpose as “AI development” in the abstract. Stronger purpose statements identify a concrete function, such as fraud detection, cybersecurity detection, automated customer-support assistance, quality assurance, accessibility tooling, or language translation. The broader and more speculative the purpose, the harder it is to satisfy necessity and balancing.

Consent may be available for first-party fine-tuning or user-uploaded training contexts, but it must meet the GDPR's consent requirements, and the controller must be able to handle the consequences of withdrawal. Contract necessity is narrow; processing that merely improves a provider’s model or business operations is unlikely to be objectively necessary for a user contract unless the AI processing is core to the contracted service. Legitimate interests may be the basis most frequently asserted for training, but that basis is vulnerable where scraping is unexpected, large-scale, or opaque, covers sensitive data or children's data, or creates downstream risks of exposure, discrimination, or manipulation.

Limitations and counterarguments. Safeguards can help the balancing test but cannot replace Article 6 itself. EDPB-EDPS Joint Opinion 2/2026 confirms that legitimate interests may apply in some AI contexts under current GDPR, but it also states that a case-by-case Article 6(1)(f) test remains necessary. EDPB-EDPS Joint Opinion 2/2026 PDF.

Confidence: High for the three-step framework; Medium for specific training uses because outcomes depend heavily on dataset, purpose, safeguards, and expectations.

How does Article 9 apply to scraped or inferred special-category data?

Conclusion. If the AI lifecycle processes special-category data, the controller needs both an Article 6 lawful basis and an Article 9 exception. Public availability alone does not satisfy Article 9(2)(e).

Rule. The EDPB ChatGPT Taskforce states that scraped data can include special categories of personal data under Article 9(1). It further states that, for Article 9(2)(e), “the mere fact that personal data is publicly accessible” does not mean the data subject manifestly made it public; there must be an intentional, explicit, clear affirmative act. EDPB ChatGPT Taskforce report PDF.

EDPB Opinion 28/2024 also recalls that Article 9(1) prohibits processing special-category data unless an Article 9(2) derogation applies. EDPB Opinion 28/2024 PDF.

Application. Large scraped corpora often contain health data, political opinions, religion, sexual orientation, union membership, racial or ethnic information, biometric identifiers, or inferred sensitive attributes. Article 9 risk can arise even if the controller did not deliberately target sensitive data, because the processing operation may still collect and train on such data.

Practical controls should include source exclusions, category filters, social-media and forum restrictions, sensitive-data classifiers, sampling audits, deletion before training, source-level blocklists, opt-out mechanisms, and post-training extraction tests. Where sensitive data is processed for bias detection or correction in high-risk AI contexts, the controller must analyze the exact AI Act and GDPR basis and safeguards; this is not a general permission to train on special-category data.

The Digital Omnibus proposal and EDPB-EDPS Joint Opinion 2/2026 discuss a possible future derogation for incidental and residual processing of special-category data in AI contexts, but that is not current enacted GDPR law. The Joint Opinion expressly treats it as a proposal and recommends clarifications and safeguards. EDPB-EDPS Joint Opinion 2/2026 PDF.

Limitations and counterarguments. A controller may argue that filtering makes Article 9 immaterial. That requires proof of filter effectiveness, not just policy language. Sampling, false-negative analysis, and deletion logs matter.
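
To make "proof of filter effectiveness" concrete, the sketch below computes two figures from a human-reviewed audit sample: the filter's false-negative rate (sensitive items the filter missed, as a share of all sensitive items in the sample) and the residual rate of sensitive items among what the filter let through, with a Wilson score upper bound at roughly 95% confidence. The function names, the sample, and the choice of the Wilson interval are assumptions made for illustration; nothing in the EDPB materials prescribes a particular statistic or threshold.

```python
import math


def wilson_upper(k: int, n: int, z: float = 1.96) -> float:
    """Upper limit of the Wilson score interval for k events in n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre + half


def filter_audit(sample: list[tuple[bool, bool]]) -> dict[str, float]:
    """Summarise an audit sample of (is_sensitive, was_filtered) pairs."""
    missed = sum(1 for sensitive, filtered in sample if sensitive and not filtered)
    caught = sum(1 for sensitive, filtered in sample if sensitive and filtered)
    kept = sum(1 for _, filtered in sample if not filtered)
    if missed + caught == 0 or kept == 0:
        raise ValueError("sample needs sensitive items and unfiltered items")
    return {
        "false_negative_rate": missed / (missed + caught),
        "residual_rate": missed / kept,
        "residual_rate_upper": wilson_upper(missed, kept),
    }


# Hypothetical reviewed sample: 8 sensitive items caught, 2 missed, 90 benign items kept.
sample = [(True, True)] * 8 + [(True, False)] * 2 + [(False, False)] * 90
result = filter_audit(sample)
print(result)
```

The upper bound matters more than the point estimate when the sample is small, because a handful of reviewed items cannot rule out a materially higher miss rate.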

What transparency and data-subject-rights obligations apply?

Conclusion. GDPR transparency and rights obligations apply across the AI lifecycle. The controller must provide Article 13 information for direct collection and Article 14 information for indirect collection, unless a narrow exemption applies. The controller must also operationalize access, rectification, erasure, restriction, portability where applicable, and objection.

Rule. The EDPB ChatGPT Taskforce treats scraping and direct user interaction as distinct processing contexts and states that processing stages should be separately assessed. It also states that prompts, file uploads, and user feedback may be used for training only where the user is clearly and demonstrably informed, which affects the Article 6(1)(f) balancing test. EDPB ChatGPT Taskforce report PDF.

For Article 6(1)(f), the Article 21 right to object must be practically meaningful. EDPB-EDPS Joint Opinion 2/2026 emphasizes that, in AI contexts, advance notice and effective objection matter because it may be technically difficult to remove personal data once retained in a model. EDPB-EDPS Joint Opinion 2/2026 PDF.

Application. A controller should maintain an AI data inventory covering source categories, personal-data categories, sensitive-data risks, purposes, lawful bases, retention, recipients, model-improvement use, and data-subject-rights workflows. Public privacy notices should not merely say “we use data for AI.” They should distinguish inference-only use from training, fine-tuning, evaluation, safety monitoring, and product improvement.

Rights workflows should cover all relevant stores: raw datasets, cleaned datasets, labels, evaluation sets, prompt logs, abuse-monitoring logs, retrieval stores, embeddings, fine-tuning datasets, and output suppression systems. For model-level rights requests, a controller should have a position on whether erasure can be achieved by deleting source data, suppressing outputs, updating retrieval stores, fine-tuning, retraining, or other technical measures.
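
The store-by-store coverage described above can be sketched as a fan-out that runs an erasure request against every registered store and records the result for each, including stores where nothing was found. This is a simplified illustration: the in-memory lists, the `subject_id` key, and `erase_subject` are hypothetical, real stores would need their own deletion logic, and the sketch does not address model-level erasure at all.

```python
def erase_subject(subject_id: str, stores: dict[str, list[dict]]) -> dict[str, int]:
    """Delete a data subject's records from every store and report counts per store."""
    report = {}
    for name, records in stores.items():
        before = len(records)
        # Rebuild the list in place so other references to the store see the deletion.
        records[:] = [r for r in records if r.get("subject_id") != subject_id]
        report[name] = before - len(records)
    return report


# Hypothetical stores keyed by the names used in the rights workflow.
stores = {
    "prompt_logs": [
        {"subject_id": "ds-1", "text": "first prompt"},
        {"subject_id": "ds-1", "text": "second prompt"},
        {"subject_id": "ds-2", "text": "another user's prompt"},
    ],
    "embeddings": [{"subject_id": "ds-1", "vector": [0.1, 0.2]}],
    "evaluation_sets": [],
}
report = erase_subject("ds-1", stores)
print(report)
```

Recording zero-match stores alongside the others gives the controller evidence that each store was actually searched.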

Limitations and counterarguments. Article 14(5)(b) may be invoked in some large-scale indirect-collection scenarios, but it is not a general exemption from transparency. The controller still needs safeguards and accountability evidence.

When does AI deployment trigger Article 22?

Conclusion. Article 22 risk arises where AI is used to make solely automated decisions producing legal or similarly significant effects. It also arises where a score or model output is not formally the final decision but plays a determining role in a consequential decision by another actor.

Rule. In SCHUFA, the CJEU held that automated establishment of a probability value by a credit information agency can be “automated individual decision-making” where a third party draws strongly on that value to establish, implement, or terminate a contractual relationship. SCHUFA, C-634/21, EUR-Lex.

In Dun & Bradstreet Austria, the CJEU interpreted Article 15(1)(h) as requiring meaningful information about the logic involved in automated decision-making. The information must explain the procedure and principles actually applied to the data subject’s personal data, rather than merely providing generic or opaque statements. Dun & Bradstreet Austria, C-203/22, EUR-Lex.

Application. High-risk deployment contexts include credit, employment, insurance, education, healthcare triage, fraud blocking, housing, access to essential services, welfare benefits, immigration, law enforcement, and platform account termination where serious consequences follow. “Human review” does not avoid Article 22 if the human reviewer rubber-stamps the model output or lacks authority, training, or information to depart from it.

Generative AI can trigger Article 22 where the generated output becomes the basis for a consequential automated decision. For example, a model that drafts a customer-service response usually does not trigger Article 22 by itself, but a model that automatically denies insurance coverage, rejects job applicants, blocks a bank account, or assigns a risk score that determines service access may.

Limitations and counterarguments. Article 22 is not the only automated-decision issue. Even outside Article 22, controllers remain bound by fairness, transparency, accuracy, access, rectification, objection, DPIA, and security obligations.

What are the consequences of unlawful training?

Conclusion. Unlawful development processing can lead to corrective orders affecting both datasets and the model. If personal data is retained in or processed by the model, later deployment may also be affected.

Rule. EDPB Opinion 28/2024 states that, where an infringement is found, supervisory authorities may impose corrective measures to remediate the unlawfulness of the initial processing, including fines, temporary processing limits, erasure of part of a dataset, erasure of the whole dataset, erasure of the AI model itself, or retraining. EDPB Opinion 28/2024 PDF.

The Opinion also states that the analysis is case-specific and depends on whether personal data is retained in the model and whether subsequent processing is performed by the same or another controller.

Application. A controller that unlawfully scrapes personal data and trains a model cannot assume that deleting the original dataset cures the violation. If the model retains personal data, later operation may continue the personal-data processing. Corrective measures may therefore need to address the model, not only the source corpus.

Model purchasers, deployers, and integrators should conduct provenance due diligence. Relevant questions include whether training data was licensed or scraped, whether special-category data was processed, whether Article 6/9 assessments exist, whether opt-outs were honored, whether rights requests can be handled, whether extraction testing was conducted, and whether any supervisory authority has made findings about the model.

Limitations and counterarguments. If a model is genuinely anonymous and deployment does not process personal data, GDPR may not apply to the later operation. That does not eliminate historical liability for unlawful development processing, and the anonymity claim must be substantiated.

How does the AI Act interact with GDPR?

Conclusion. The AI Act adds AI-specific obligations but does not replace GDPR. AI Act compliance documentation may support governance evidence, but it does not prove GDPR compliance.

Rule. The AI Act is a binding EU regulation laying down harmonised AI rules. GDPR remains the EU framework for processing personal data. EDPB Opinion 28/2024 treats GDPR analysis separately from AI Act compliance, and EDPB-EDPS Joint Opinion 2/2026 discusses proposed AI/GDPR amendments as proposals rather than current law. AI Act - Regulation (EU) 2024/1689, EUR-Lex.

Application. For high-risk AI, the AI Act may require risk management, data governance, documentation, logging, transparency, human oversight, accuracy, robustness, and cybersecurity. Those obligations should be mapped against GDPR Article 5 principles, Article 6 lawful basis, Article 9 exceptions, Articles 13/14 transparency, Articles 15-22 rights, Article 25 design/default, Article 32 security, and Article 35 DPIA duties.

For general-purpose AI models, AI Act documentation and training-content-summary obligations may improve transparency, but they do not independently answer whether training data was lawfully collected or whether the model processes personal data.
