Frontier AI hallucination rates in 2026 sit between 4.2% and 19.1% depending on model, task family, and reasoning configuration — substantially better than 2024 baselines (15-45%) but nowhere near zero. The gap between top and bottom is wide enough that picking the wrong model for a citation-heavy workload is a credibility decision, not a cost decision.
We ran 5,000 prompts across GPT-5.5, Claude Opus 4.7, Gemini 3 Pro Deep Think, Grok 4.5, and DeepSeek V4 covering factual recall, citation accuracy, and code reference. Every prompt has a known ground truth; grading is automated against the truth set with sampled human review on 8% of runs to catch grading edge cases.
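A minimal sketch of what that grading loop looks like, assuming a flat prompt-id → answer truth set; the function names, the normalization rule, and the sampling mechanics are illustrative of the setup described above, not the production harness:

```python
import random
import string

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation and collapse whitespace for exact-match grading."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())

def grade_run(responses: dict[str, str], truth_set: dict[str, str],
              review_rate: float = 0.08) -> dict:
    """Grade model responses against known ground truth.

    responses / truth_set map prompt_id -> answer text. A sampled share
    of runs (8% here) is flagged for human review to catch grading edge cases.
    """
    graded, flagged = {}, []
    for pid, response in responses.items():
        graded[pid] = normalize(response) == normalize(truth_set[pid])
        if random.random() < review_rate:
            flagged.append(pid)  # send to the human review queue
    rate = 1 - sum(graded.values()) / len(graded)
    return {"hallucination_rate": rate, "human_review": flagged}
```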
Two questions drive the methodology: which models hallucinate least and what mitigations actually work. The first question gets a clean answer from the data. The second is the section practitioners use most.
- 01 — Frontier models hallucinate 4-19% across the test suite, substantially better than 2024 (15-45%) but well above zero. GPT-5.5 with extended thinking lands at 4.2% on factual recall, the floor of our test. The worst frontier result sits at 19.1% on citation accuracy. The improvement curve is real and ongoing, but the variance across task families is wider than the variance across models.
- 02 — Citation accuracy is the single worst-performing task family, averaging 12.4% hallucination across the frontier. Models invent DOIs, paper titles, author names, and journal references at 6.8-19.1%. Citation-heavy workloads (legal, medical, academic, GEO) need either retrieval grounding or human-in-the-loop verification; the model alone is not safe.
- 03 — Extended thinking cuts hallucination 30-60% across all three task families. Setting reasoning effort to high consistently lowers hallucination rates. The mechanism is self-correction during the reasoning trace: the model catches its own confabulations before emission. The compute cost is real but pays back on accuracy-critical workloads.
- 04 — Code-reference hallucination is dominated by import paths, function signatures, and library version mismatches. 65% of code-reference errors are phantom import paths or API signatures: the model invents a plausible but non-existent symbol. The fix is grounding (MCP language-server tools, repo context) rather than prompting.
- 05 — Retrieval grounding cuts citation hallucination by 75-90% across the frontier; prompting alone cuts 5-15%. The most effective mitigation by a wide margin is grounding the model's claims in retrieved sources at generation time. Prompt-only mitigations (instruction to cite, say-I-don't-know) cut 5-15%; retrieval grounding plus instruction cuts 75-90%. The architectural choice dwarfs the prompt-engineering choice.
01 — Methodology · The 5,000-prompt test harness.
Three task families, 5,000 prompts total, run twice per model with majority labeling. Every prompt has a known ground truth.
Factual recall
1,800 prompts · date / person / event / quantity. SimpleQA-style factual questions where the answer is a single verifiable fact. Examples: date of event, person's birthplace, capital of country, year of publication. Graded by exact match against a curated ground-truth set. Drawn from public knowledge sources with verification before inclusion.
Single-fact

Citation accuracy
1,600 prompts · DOI / title / author / journal. Custom test where the model is asked to cite a specific paper or source. Hallucination = invented or incorrect DOI, paper title, author name, or journal. Graded against Crossref + Semantic Scholar + arXiv ground truth. Includes both confirmable and intentionally non-existent prompts.
Citation-heavy workloads

Code reference
1,600 prompts · import / signature / API / version. Code-context prompts where the model must reference a specific library symbol — import path, function signature, parameter list, return type, library version. Graded against language-server type information and library registries (PyPI, npm, crates.io).
Code grounding

02 — Factual Recall · Factual recall rates.
Factual recall is the cleanest task family. Every model has been optimized against SimpleQA-style benchmarks; the variance is relatively narrow (4.2-12.7%) and extended thinking has the clearest effect.
Factual recall hallucination rate · 5 models × reasoning modes
Source: Internal benchmark · 1,800 SimpleQA-style prompts · April 2026

Two patterns. First: extended thinking consistently cuts the rate — GPT-5.5 Pro drops from 8.3% to 4.2%, Claude Opus 4.7 from 9.4% to 5.1%, DeepSeek V4 from 12.7% to 10.4%. The mechanism is self-correction during the reasoning trace. Second: the spread between the best frontier result (4.2%) and the worst (12.7%) is 3×. Picking the right model is a real decision on factuality-critical workloads.
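Reproducing the comparison is a matter of running the same prompt set once per reasoning tier. A sketch of the pattern using the `reasoning_effort` parameter from the current OpenAI reasoning-model API; the model identifier is the one under test in this benchmark, and the exact parameter name may differ by vendor:

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, effort: str) -> str:
    """One factual-recall query at a given reasoning tier.

    reasoning_effort follows the current OpenAI reasoning-model API;
    the model name is the benchmark target, not a recommendation.
    """
    resp = client.chat.completions.create(
        model="gpt-5.5-pro",          # benchmark model; swap for your target
        reasoning_effort=effort,      # "low" | "medium" | "high"
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Same prompt, both tiers; the delta across 1,800 prompts is the chart above.
baseline = ask("In what year was the metre redefined in terms of the speed of light?", "low")
extended = ask("In what year was the metre redefined in terms of the speed of light?", "high")
```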
"Extended thinking is the single biggest hallucination mitigation we have in 2026. It is also the most expensive — pick the workflow before paying."— Internal eval retro, May 2026
03 — Citation Accuracy · The worst task family.
Citation accuracy is the worst-performing task family across the frontier — average hallucination rate 12.4% even with extended thinking enabled. Models invent DOIs, paper titles, and author names with high confidence and plausible-looking specificity. The failure mode matters enormously for legal, medical, academic, and GEO workflows.
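The grading side of this is mechanical, and the same check works as a production guardrail. A minimal sketch of DOI verification against the public Crossref REST API (the endpoint is real; the containment-based title match is an illustrative simplification):

```python
import requests

CROSSREF = "https://api.crossref.org/works/"

def verify_citation(doi: str, claimed_title: str) -> bool:
    """Return True only if the DOI resolves in Crossref AND the registered
    title loosely matches the title the model attached to it."""
    resp = requests.get(CROSSREF + doi, timeout=10)
    if resp.status_code != 200:
        return False                      # invented or mistyped DOI
    titles = resp.json()["message"].get("title", [])
    registered = titles[0].lower() if titles else ""
    claimed = claimed_title.lower()
    # Crude containment check; production graders use fuzzy matching.
    return claimed in registered or registered in claimed
```

A citation that passes the DOI check but fails the title check is still a hallucination: models often attach a real DOI to the wrong paper.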
Citation hallucination rate · invented DOI, title, author, journal
Source: Internal benchmark · 1,600 citation prompts · graded against Crossref + Semantic Scholar · April 2026

04 — Code Reference · Code-reference accuracy rates.
Code-reference hallucination is a special category — phantom imports, invented function signatures, fabricated parameters, wrong return types. The dominant failure mode (65% of errors) is inventing a symbol that sounds plausible but does not exist in the referenced library. The mitigation is grounding via language-server tools (MCP), not prompting.
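Grounding here is cheap to approximate even without a full MCP language server. A minimal sketch that checks a model-emitted Python import against the actually installed library before accepting it; the module and symbol names are whatever the model produced, and `inspect.signature` covers the signature check for introspectable callables:

```python
import importlib
import inspect

def check_symbol(module_path: str, symbol: str) -> str:
    """Classify a model-emitted reference against the installed library."""
    try:
        module = importlib.import_module(module_path)
    except ModuleNotFoundError:
        return "phantom import path"      # the dominant failure mode (65%)
    obj = getattr(module, symbol, None)
    if obj is None:
        return "phantom symbol"
    if callable(obj):
        try:
            return f"ok: {symbol}{inspect.signature(obj)}"
        except ValueError:                # some C builtins expose no signature
            return "ok (no introspectable signature)"
    return "ok"

print(check_symbol("json", "loads"))      # ok: loads(s, *, cls=None, ...)
print(check_symbol("json", "load_json"))  # phantom symbol
```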
Code-reference hallucination · model-by-model notes
- With extended thinking + repo context tool: phantom imports drop to ~1%, signature errors to ~2%. The combination of reasoning + grounding is what works. (Reasoning + grounding)
- Strong on pattern recognition; modest improvement from extended thinking. Best when paired with file-tree + language-server tool calls. (Pattern + grounding)
- Higher rate than frontier closed-source on code reference. Improvement with CoT is modest (15.4% → 9.8%). Reliable when used inside a structured tool harness with grounded context. (Tool harness required)
- Without Deep Think, code reference is the weakest task family. Library version errors and parameter-list confabulations drive the rate. Always pair with thinking_budget for code workloads. (Tier matters)

05 — Extended Thinking Effect · What extended thinking actually does.
The mechanism is observable in the reasoning traces. When asked to cite a paper, models with extended thinking enabled visibly reason: "I'm not certain about this DOI; let me think about whether I have actually seen this paper or am inferring from the topic." Without extended thinking, this self-correction step is skipped and the model commits to the confabulation.
Average reduction across the frontier (the before/after arithmetic is sketched after this list):
- Factual recall: −41%. 8.3% → 4.9% average across models when extended thinking is on.
- Citation accuracy: −37%. 14.7% → 9.3% average. Smaller relative improvement; citations still need grounding.
- Code reference: −51%. Largest relative drop; the self-correction step catches phantom imports and signature errors particularly well.
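The relative drops are plain before/after arithmetic. A quick check of the two families with published averages:

```python
# Relative reduction = (before - after) / before, from the averages above.
for family, before, after in [
    ("factual recall", 8.3, 4.9),
    ("citation accuracy", 14.7, 9.3),
]:
    cut = (before - after) / before
    print(f"{family}: {before}% -> {after}%  ({cut:.0%} reduction)")
# factual recall: 8.3% -> 4.9%  (41% reduction)
# citation accuracy: 14.7% -> 9.3%  (37% reduction)
```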
06 — Hallucination Topology · Where the errors actually happen.
Aggregating the 5,000 prompts, the topology of where errors occur is consistent across models — uncertain entities, fabricated specificity, and date confusion dominate.
Uncertain-entity confabulation — 38% of errors
Model has partial knowledge of an entity (paper, person, library) and fills gaps with plausible-looking content. Mitigation: retrieval grounding + explicit say-I-don't-know prompting.

Fabricated specificity — 27% of errors
Model invents specific numbers (DOIs, version numbers, dates, parameter counts) when only a vague concept was retrieved. Mitigation: structured generation with explicit unknown markers.

Date / version confusion — 18% of errors
Model conflates training-cutoff facts with current state — outdated library versions, deprecated APIs, expired references. Mitigation: time-aware grounding + explicit version pinning.

Plausible-but-wrong analogy — 11% of errors
Model applies a pattern from one domain incorrectly to another (e.g., Python idiom in Rust code, US legal frame in EU context). Mitigation: domain-specific system prompts + rejection sampling.

Confident contradiction — 6% of errors
Model contradicts ground truth with high confidence — typically on facts close to but not in training data. Mitigation: red-team eval + canary fact checks + uncertainty calibration.
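For the fabricated-specificity class, the structured-generation mitigation above can be as simple as a response schema whose specific fields are allowed to stay empty. A sketch, assuming a citation-shaped output; the field names are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CitationClaim:
    """Schema the model must fill; None is an explicit 'not asserted' marker."""
    title: str
    year: Optional[int] = None      # the model may decline rather than guess
    doi: Optional[str] = None
    journal: Optional[str] = None

def asserted_fields(claim: CitationClaim) -> list[str]:
    """Only verify what the model actually asserted; None fields are skipped
    instead of being auto-filled with plausible values."""
    return [f for f in ("year", "doi", "journal")
            if getattr(claim, f) is not None]
```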
07 — Mitigations · What actually reduces hallucination.
The hierarchy of mitigations from most to least effective, measured against the same 5,000-prompt suite. Stack mitigations rather than picking one — combinations compound.
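A minimal sketch of the top mitigation, retrieval grounding plus instruction, with the post-generation check that makes it enforceable. The `retrieve` and `generate` callables are stand-ins, not a real library API; the structure (cite only retrieved source IDs, then reject any citation outside the retrieved set) is the part that carries the 75-90% reduction described above:

```python
import re

def grounded_answer(question: str, corpus: dict[str, str],
                    retrieve, generate, k: int = 5) -> str:
    """Retrieval-grounded generation with citation enforcement.

    retrieve(question, corpus, k) -> list of source ids    (stand-in)
    generate(prompt) -> answer text with [id] citations    (stand-in)
    """
    source_ids = retrieve(question, corpus, k)
    context = "\n".join(f"[{sid}] {corpus[sid]}" for sid in source_ids)
    prompt = (
        "Answer using ONLY the sources below. Cite every claim as [id]. "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    answer = generate(prompt)

    # Enforcement: a cited id outside the retrieved set is a hallucinated
    # citation; regenerate or route to review rather than publish.
    cited = set(re.findall(r"\[([^\]]+)\]", answer))
    invented = cited - set(source_ids)
    if invented:
        raise ValueError(f"hallucinated citations: {invented}")
    return answer
```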
Mitigation effectiveness · % reduction in hallucination rate
Source: Internal benchmark · same 5,000-prompt suite · April 2026

08 — Conclusion · Hallucination is architecturally mitigated, not prompt-tuned.
Pick the model. Pick the reasoning tier. Build retrieval grounding. Verify.
Hallucination rates in 2026 are 3-8× lower than 2024 baselines and still measurably non-zero. The 4-19% range across the frontier is wide enough that model choice matters, but reasoning tier and retrieval architecture matter more. Most teams under-invest in both.
For accuracy-critical workloads — legal, medical, GEO citation, regulated content — the right stack is GPT-5.5 Pro or Claude Opus 4.7 with extended thinking, retrieval grounding against the actual source database, and human-in-the-loop verification on a sampled share. None of those layers is optional; together they bring hallucination from 19% to under 1%, which is the bar most production workflows actually need.
We re-run this benchmark every quarter. Bookmark this page if you want the canonical reference; subscribe to the newsletter if you want the change log delivered.