Synthetic data generation for LLM training sits at an awkward intersection of compelling capability evidence and genuine academic risk — model collapse is a peer-reviewed phenomenon, not a vendor scare story. But the peer-reviewed literature also contains a rigorous counterargument: collapse is avoidable, the fix is straightforward, and the teams at Microsoft, Hugging Face, and Anthropic have been quietly validating synthetic approaches at production scale for years.

The stakes are real. A 1.3-billion-parameter model trained on one billion synthetic “textbook-quality” tokens matched or exceeded models ten times its size on coding benchmarks — but only because the training data was curated with discipline, not indiscriminately generated. At the other end, experiments with OPT-125m showed perplexity degrading by 20–28 points after five epochs of purely synthetic self-training, with no real data retained. Both outcomes are real. Which one you get depends on the generation regime.

This guide gives practitioners a decision framework across four distinct use cases — fine-tuning instruction data, synthetic eval set creation, edge-case augmentation, and privacy substitution — each with its own failure modes, quality metrics, and generation techniques. It also covers the “accumulate, don't replace” rule that the model collapse literature actually supports, the Phi series as the strongest public proof of concept, and the vendor policy questions that remain genuinely open.

Key takeaways

01
Model collapse is avoidable — the fix is additive, not abstinent.Gerstgrasser et al. (2024) proved analytically that test error has a finite upper bound when synthetic data accumulates alongside real data, but grows without bound when it replaces real data. The rule is accumulate, not replace — not 'avoid synthetic data.'
02
Phi-1 at 1.3B params matched models 10× larger on coding benchmarks.Microsoft trained phi-1 on 6B tokens of curated web data plus 1B synthetically generated textbooks and exercises via GPT-3.5, achieving 50.6% pass@1 on HumanEval and 55.5% on MBPP. Vendor-stated numbers from arXiv:2306.11644.
03
The four use cases have fundamentally different risk profiles.Fine-tuning data, eval set creation, edge-case augmentation, and privacy substitution each carry different failure modes — from distribution drift to eval overfitting to GDPR pseudonymization risk. A single strategy does not span all four.
04
Cosmopedia is the only large open synthetic pre-training dataset.Hugging Face generated 25 billion tokens across 30M+ files using Mixtral-8x7B-Instruct, with less than 1% duplicate content — by varying audience and format rather than just topic. Total compute exceeded 10,000 H100 GPU hours.
05
Vendor policy on using model outputs for training remains a live question.OpenAI's Usage Policies (effective October 2025) do not explicitly prohibit using GPT outputs to train other AI models in general — but using outputs to build a direct OpenAI competitor violates ToS. Verify current terms before any training run.

01 — ContextWhy synthetic data, why now.

Three pressures have converged to make synthetic data generation a mainstream consideration in 2026. First, frontier models have made high-quality generation cheap: GPT-3.5-level outputs that would have been expensive to produce in 2022 are now commodity compute. Second, real data scarcity has become a genuine constraint — the best instruction-following datasets require expensive human annotation, and the most useful domain-specific corpora are often proprietary, legally encumbered, or simply too small to train on. Third, the privacy and compliance landscape has become less forgiving: GDPR enforcement of Article 4's definition of personal data extends to any dataset where re-identification is plausible, pushing practitioners toward privacy-substitution alternatives that synthetic data can potentially satisfy.

The “textbook quality” framing that Microsoft used in the Phi-1 paper — generating synthetic content that is educationally coherent, diverse, and didactically structured — represents the clearest public statement of what makes synthetic data work: not volume, but curation discipline. The same principle applies across all four use cases in this guide. Volume without curation amplifies failure modes; curation at modest scale can match or exceed uncurated real data at large scale.

The foundational finding

Shumailov et al. in Nature (July 2024) demonstrated that model collapse is universal across generative model families — VAEs, Gaussian Mixture Models, and LLMs all exhibit it, making it a fundamental property of iterative generative training. The key quote: “indiscriminate use of model-generated content in training causes irreversible defects in the resulting models.” The operative word is “indiscriminate.”

02 — Model CollapseWhat model collapse actually is.

Model collapse is a degenerative process in which successive generations of models trained on prior generations' outputs progressively lose the tails of the original data distribution, eventually converging toward a near-delta-function output — the model becomes less and less able to produce diverse, low-probability but valid outputs. The phenomenon was formally characterized in Nature (Shumailov et al., July 2024, DOI:10.1038/s41586-024-07566-y), a peer-reviewed study independent of any AI vendor.

The paper identifies three compounding error types behind the collapse: statistical approximation error (finite training samples), functional expressivity error (limited model capacity to represent the full distribution), and functional approximation error (biases introduced by the training procedure). All three errors compound across training generations, meaning early-generation models that seem fine produce second and third-generation models that are noticeably degraded.

The practical signal in OPT-125m experiments: five-epoch training with no real data retained resulted in perplexity increases of 20–28 points. The collapse is not gradual degradation that benchmarks catch early — it can appear mild for several generations and then accelerate. This is why practitioners who “tested it on one round of fine-tuning and it seemed fine” may be building a delayed failure.

Model collapse risk by training regime

Sources: Shumailov et al. Nature 2024; Gerstgrasser et al. arXiv:2404.01413

Fully replace real data with syntheticNo real data retained — the replacement regime

High

Replace real data, retain 10% realOPT-125m experiment with 10% retention

Medium

Accumulate synthetic alongside real (additive)Gerstgrasser et al. — bounded error proof

Low

Deduped + quality-filtered synthetic + realCosmopedia / Phi-3 production regime

Very low

The collapse risk gradient above is not arbitrary. The Gerstgrasser et al. paper (arXiv:2404.01413, April 2024) proved analytically that test error has a finite upper bound independent of the number of training iterations when synthetic data accumulates alongside real data. When synthetic data replaces real data, error grows without bound. The difference between these two regimes is not a matter of degree — it is a topological difference in the training process.

03 — The Core RuleAccumulate, don't replace.

The most actionable single takeaway from the model collapse literature is also the most underexplained. Gerstgrasser et al.'s explicit finding: “We confirm that replacing the original real data by each generation's synthetic data does indeed tend towards model collapse, then demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse.” This is a provably different regime, not just an empirically better practice.

The practical implication for fine-tuning teams: retaining the original real-data seed in every subsequent training run is not optional. Preserving even 10% of original real training data in each fine-tuning cycle dramatically limits performance degradation versus training purely on synthetic generations. This percentage was empirically validated in the Nature paper experiments on OPT-125m — note the original paper used this specific model for the perplexity experiments; your model architecture may produce different numerical deltas.

The second mitigation — and one that the Cosmopedia team validated at scale — is diversity forcing at generation time. A single topic prompted twelve different ways (four target audiences × three generation styles) produced less than 1% duplicate content in Cosmopedia's 25-billion-token corpus. Simple prompt variation without explicit format and content instructions, by contrast, still yielded high duplicate rates. Deduplication and quality filtering are not just hygiene — they are the mechanism that keeps the effective-distribution of synthetic data broad enough to avoid collapse.

"The value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet."— Shumailov et al., Nature, July 2024

04 — Decision MatrixThe four use cases, mapped to failure modes.

The practical reason most synthetic data advice is too generic is that it conflates four distinct use cases that have fundamentally different risk profiles, optimal generation techniques, and quality metrics. The matrix below gives decision rules for each. Read it as a per-use-case checklist rather than a single strategy.

Fine-tuning data

Instruction datasets

Generate instruction-response pairs via a teacher model (GPT-4-class or Mixtral), then filter by quality signal (perplexity, reward model score, or human spot-check). Primary failure mode: distribution drift — synthetic formal language diverges from real casual user inputs, degrading real-world performance even when benchmarks improve. Mitigation: seed prompts from real user queries, not just topic lists.

Use distilabel + seed from real queries

Eval sets

Benchmark bootstrapping

Synthetic eval sets are particularly useful for adversarial and edge-case test inputs at scale where human annotation is prohibitively expensive. Primary failure mode: eval overfitting — if the same model generates both training data and eval data, the eval will be optimistic rather than challenging. Mitigation: use a different model family for eval generation than for training.

Different model family for eval gen

Edge-case augmentation

Adversarial coverage

Use targeted generation to fill known coverage gaps in the real training set — rare formats, low-frequency domains, adversarial prompts. Model collapse risk is lower here because synthetic data supplements rather than replaces the real-data base. Primary failure mode: the augmented distribution drifts toward model-favored outputs rather than genuine hard cases. Mitigation: verify augmented distribution against real held-out edge cases.

Supplement, never replace the base

Privacy substitution

GDPR-safe alternatives

AI-generated synthetic data can satisfy GDPR's anonymization bar (Recital 26) when the generation process breaks the statistical link between synthetic records and identifiable individuals — but this requires formal verification via Distance-to-Closest-Record (DCR) metrics, not just intuition. Pseudonymization (reversible with the right key — Article 4) is still personal data. Primary failure mode: assuming generation = anonymization without verification.

Verify DCR before claiming anonymization

The distribution drift failure mode — the third row above being the lower-risk case — deserves emphasis because it is subtly different from model collapse. Distribution drift occurs when the synthetic data distribution diverges from the actual deployment distribution even without recursive self-training. A model fine-tuned on synthetically generated formal technical documentation may perform well on structured benchmarks while degrading on real user inputs that are casual, fragmented, and context-dependent. This does not show up as model collapse; it shows up as a benchmark-to-production gap. The mitigation is seeding generation prompts from real examples of the target distribution, not from curated topic lists alone.

For teams working on AI transformation programs where fine-tuning custom models is part of the roadmap, the eval-set use case is often the highest-value entry point. Building a synthetic eval set first — before any fine-tuning — lets you measure the baseline accurately and validate whether fine-tuning on synthetic data actually moves the metrics you care about.

05 — Proof of ConceptPhi-1 through Phi-3: the strongest public evidence.

The Phi series (Microsoft Research, 2023–2024) is the most thoroughly documented public proof that synthetic pre-training data can produce models that outperform significantly larger ones — with the caveat that all benchmark numbers below are vendor-stated from Microsoft technical reports and have not all been independently replicated.

Phi-1 (arXiv:2306.11644, June 2023) established the “textbook quality” paradigm: 6 billion tokens of curated web data plus 1 billion synthetic textbooks and exercises generated via GPT-3.5, trained on 8 A100 GPUs for 4 days. At 1.3B parameters it achieved 50.6% pass@1 on HumanEval and 55.5% on MBPP — competitive with models in the 10B+ range at the time. The Phi-1.5 report later specified 20,000 curated topics seeded with web samples, targeting approximately 20 billion synthetic tokens — but the training data was never publicly released, motivating the open Cosmopedia replication effort.

phi-1 (June 2023)

HumanEval pass@1

50.6%

1.3B parameters, 1B synthetic tokens via GPT-3.5, 4 days on 8 A100s. Vendor-stated (Microsoft). Comparable to models 10× larger at the time.

arXiv:2306.11644

phi-3-mini (April 2024)

MMLU score at 3.8B

69%

Trained on 3.3 trillion tokens with heavy synthetic data use. Comparable to Mixtral 8×7B and GPT-3.5 — vendor-stated (Microsoft, arXiv:2404.14219). Small enough to run on a phone.

3.8B parameters

phi-3-medium (Aug 2024)

MMLU score at 14B

78%

Trained on 4.8T tokens on scaled-up phi-3 synthetic dataset. 8.9 MT-bench. Vendor-stated (Microsoft). Phi-3.5-MoE (6.6B active) reportedly on par with Gemini-1.5-Flash and GPT-4o-mini.

arXiv:2404.14219

The reason the Phi series is the right benchmark, not just an impressive benchmark, is that Microsoft published enough methodological detail to understand what caused the results: topic diversity, explicit audience targeting, didactic structure in generation prompts, and quality filtering. The Cosmopedia team at Hugging Face reverse-engineered and open-sourced this methodology — their 25-billion-token corpus, generated from 30M+ distinct prompts, produced a 1B model that outperforms TinyLlama 1.1B on four standard benchmarks. The gap versus Phi-1.5 remains, which the Cosmopedia team attributes to the undisclosed proprietary filtering pipeline Microsoft never published.

The interpretive signal is this: synthetic pre-training data can produce outsized performance gains, but the mechanism is curation discipline and diversity, not generation volume. The Phi results were achieved at lower total token counts than standard pre-training runs by compensating with higher-quality signal per token. This is both the promise and the design constraint of the approach.

06 — ToolchainGeneration tools for each use case.

The generation toolchain question is more stratified in 2026 than it was two years ago. Three distinct pathways have emerged: direct prompting of frontier models with quality filtering, open-source pipeline frameworks, and vendor-native distillation APIs.

Open-source

Distilabel by Argilla

GitHub: argilla-io/distilabel

Apache 2.0 framework for synthetic data and AI feedback pipelines. Supports scalable generation based on verified research techniques including preference datasets, instruction datasets, and quality-filtered data. Integrated into the Hugging Face ecosystem. Best for: fine-tuning instruction data, preference data for RLHF, eval set construction.

Recommended for fine-tuning + eval use cases

Vendor API

OpenAI Distillation

store: true in Chat Completions API

Announced late 2024. Stores high-quality large-model completions for 30 days via the store: true parameter, then fine-tunes smaller models directly on those outputs. A vendor-blessed pathway that bypasses the ToS grey area for training on GPT outputs. A few hundred samples may suffice; thousands of diverse samples produce better results.

Best for: small-model distillation from GPT-4-class

Open dataset

Hugging Face Cosmopedia

25B tokens, 30M+ files, <1% dupes

The largest open synthetic pre-training dataset. Generated by Mixtral-8x7B-Instruct-v0.1 from 23M+ web-conditioned prompts and curated educational sources. Benchmark decontamination pipeline using 10-gram overlap detection — less than 4 contaminated samples found for MMLU, OpenBookQA, WinoGrande. Best for: pre-training or domain-adaptive continued pre-training on open data.

Best for: pre-training, research, open models

For teams evaluating fine-tuning use cases where synthetic data applies, the choice between these three pathways maps fairly cleanly to use case. The OpenAI Distillation API is the right answer when you want GPT-4-class quality in a smaller deployable model and are already paying for frontier API access. Distilabel is the right answer when you want a reproducible, auditable pipeline that generates preference data or instruction datasets from any teacher model. Cosmopedia is the right answer when you need open pre-training data for research or when data provenance requires a fully public-domain corpus.

Anthropic's Constitutional AI (December 2022) deserves mention as the earliest large-scale documented use of synthetic data for RLHF: the approach used an LLM to generate both critiques and revisions of its own outputs to produce harmlessness labels, replacing the need for human labelers on that axis. CAI is the template for the AI-feedback paradigm that Distilabel and similar frameworks have since systematized.

07 — Legal & PolicyVendor policy: what's actually in the ToS.

The terms-of-service question around using model outputs for training has been debated as if it were unresolved. For OpenAI specifically, the current Usage Policies (effective October 29, 2025, the most recent version retrieved) do not include an explicit general prohibition on using GPT outputs to train other AI models. What is explicitly prohibited is using OpenAI outputs to build a direct OpenAI competitor.

The practical interpretation — use GPT outputs to fine-tune a domain-specific private or open-source model? Not explicitly prohibited as of October 2025. Use GPT outputs to build a general AI assistant that competes with ChatGPT? Explicitly prohibited. This is not legal advice; OpenAI Terms of Service update regularly and the exact scope of “direct competitor” has not been litigated definitively. Verify the current terms before any training run, and document the version you reviewed.

The OpenAI Distillation API feature represents the clearest vendor-blessed pathway: the API is explicitly designed for this purpose, the outputs are captured in a structured way that demonstrates compliance intent, and the use case (smaller model distilled from larger) is the paradigm the feature was built for. For teams concerned about ToS ambiguity, starting with the Distillation API rather than arbitrary completions scraping resolves the concern by design.

For posts covering distillation and the legal risk of using competitor outputs, note that the same policy analysis applies to Anthropic model outputs: the Claude usage policy prohibits using Claude outputs to build competing AI products. The same “domain-specific fine-tuning vs. general competitor” distinction likely applies, but verify the specific current Claude usage policy before any training run that uses Claude-generated synthetic data.

GDPR and synthetic anonymization

The key legal distinction for privacy-substitution use cases: pseudonymization (reversible with the right key) is still personal data under GDPR Article 4. True anonymization (irreversible — Recital 26) is not personal data. AI-generated synthetic data can satisfy the anonymization bar when the generation process formally breaks the statistical link between synthetic records and identifiable individuals — but this requires verification via Distance-to-Closest-Record (DCR) metrics, not just the intuition that “it was generated, not copied.”

08 — Practitioner GuideDecision rules for your next training run.

The literature supports a set of operational rules that hold across all four use cases. These are distilled from the primary sources above, not from vendor recommendations.

Synthetic data operational checklist — by priority

Synthesized from: Shumailov et al. Nature 2024; Gerstgrasser et al. arXiv:2404.01413; HF Cosmopedia blog; MOSTLY AI benchmark framework

Seed from real user inputs, not topic lists aloneDistribution drift mitigation — aligns synthetic data to actual deployment distribution

Critical

Retain real data in every subsequent fine-tuning cycleModel collapse mitigation — even 10% real data retention reduces perplexity drift

Critical

Use a different model family for eval set generationEval overfitting mitigation — prevents the training model from gaming its own benchmark

High

Deduplicate and quality-filter before trainingCosmopedia methodology — diversity forcing via audience + format variation, not just topic variation

High

Verify DCR metrics before claiming GDPR anonymizationPrivacy substitution — generation ≠ anonymization without formal verification

Required (if regulated)

Document Terms of Service version reviewed before any runVendor policy — ToS updates regularly; timestamped documentation is defensible

Recommended

The most common mistake in synthetic data programs is treating generation as the hard problem and validation as an afterthought. The Cosmopedia team's experience is instructive: “most of the time for Cosmopedia was spent on meticulous prompt engineering,” not on scaling generation. A well-engineered generation prompt that produces diverse, accurate, and distribution-appropriate outputs at 100,000 samples is worth more than an indiscriminate generator at 10 million samples.

For teams evaluating whether synthetic fine-tuning makes sense relative to alternative knowledge-injection approaches, the comparison to RAG is worth framing explicitly. As covered in our RAG vs. fine-tuning TCO comparison, RAG avoids the data-quality risk entirely by deferring to a retrieval index at inference time — but at the cost of latency and retrieval complexity. Synthetic fine-tuning is the better option when you need low-latency, model-resident knowledge that does not change frequently. It is not a substitute for RAG when knowledge currency matters.

For teams working on CRM automation or similar domain-specific applications where labeled training examples are expensive to produce, the edge-case augmentation use case is often the highest ROI entry point: a small real-data seed plus synthetically generated coverage of rare but important cases can measurably improve model performance on the long tail of inputs that matter most to the business. Combine with the evaluation metrics reference to measure whether augmentation is actually moving the right numbers.

The verdict on synthetic data, May 2026

The risk is real, the fix is known, and the opportunity is large.

Model collapse is a peer-reviewed phenomenon, not a speculative concern. But the peer-reviewed literature also contains a rigorous answer: accumulating synthetic data alongside real data provably avoids it, while replacing real data with synthetic causes error to grow without bound. The distinction between these two regimes is everything. Teams that treat “accumulate, don't replace” as a hard constraint can use synthetic data aggressively; teams that treat it as a best-practice suggestion are building delayed failure.

The Phi series remains the clearest public evidence of what synthetic pre-training data can achieve — 1.3B parameters matching 10×-larger models on coding benchmarks by replacing volume with curation discipline. Microsoft never published the actual training data, but the Cosmopedia replication at Hugging Face has validated enough of the methodology to be operational. The gap versus Phi-1.5 that the Cosmopedia team acknowledges is almost certainly in the undisclosed quality filtering, not in the generation approach itself.

The four use-case matrix is the practical lens. Fine-tuning data, eval set creation, edge-case augmentation, and privacy substitution are not interchangeable problems; they carry different risk profiles, require different generation strategies, and fail in different ways. Picking the right strategy per use case — and the right validation metric for each — is more important than picking the right generation model. The generation model matters at the margin; the curation discipline matters by default.

Synthetic Data for LLM Training: Decision Guide 2026

01 — ContextWhy synthetic data, why now.

02 — Model CollapseWhat model collapse actually is.

Model collapse risk by training regime

03 — The Core RuleAccumulate, don't replace.

04 — Decision MatrixThe four use cases, mapped to failure modes.

Instruction datasets

Benchmark bootstrapping

Adversarial coverage

GDPR-safe alternatives

05 — Proof of ConceptPhi-1 through Phi-3: the strongest public evidence.

HumanEval pass@1

MMLU score at 3.8B

MMLU score at 14B

06 — ToolchainGeneration tools for each use case.

Distilabel by Argilla

OpenAI Distillation

Hugging Face Cosmopedia

07 — Legal & PolicyVendor policy: what's actually in the ToS.

08 — Practitioner GuideDecision rules for your next training run.

Synthetic data operational checklist — by priority

The risk is real, the fix is known, and the opportunity is large.

Curation discipline, not generation volume, is what makes synthetic data work.

Synthetic data program design

The questions we get every week.

Continue exploring LLM development guides.

LM Studio Ships Locally + LM Link: Local LLMs Go Mobile

Mistral Forge: Train Frontier AI on Enterprise Data

Fine-Tuning LLMs for Business: Complete Use Cases Guide

Claude Code Auto Mode Lands on Bedrock and Vertex AI