AI Development · Original Research

5 frontier models · 3 task families · 5,000 prompts · graded by automated + human review

AI Hallucination Rate Benchmarks · 2026

Original 5,000-prompt benchmark across five frontier models measuring hallucination on three task families — factual recall, citation accuracy, and code reference. Confidence bands tighten with extended thinking but never go to zero. The data, the methodology, and the mitigations that actually move the needle.

Digital Applied Team · Senior strategists
Published: Apr 23, 2026 · Read time: 4 min
Sources: SimpleQA · FACTS · custom citation set
Best factual rate: 4.2% · GPT-5.5 Pro with extended thinking · roughly half its 8.3% default
Worst citation rate: 19.1% · worst frontier result, invented or incorrect citations
Code-ref hallucination: 3.1–15.4% · phantom imports, APIs, signatures
Prompts tested: 5,000 · across 3 task families

Frontier AI hallucination rates in 2026 sit between 3.1% and 19.1% depending on model, task family, and reasoning configuration — substantially better than 2024 baselines (15-45%) but nowhere near zero. The gap between top and bottom is wide enough that picking the wrong model on a citation-heavy workload is a credibility decision, not a cost decision.

We ran 5,000 prompts across GPT-5.5, Claude Opus 4.7, Gemini 3 Pro Deep Think, Grok 4.5, and DeepSeek V4 covering factual recall, citation accuracy, and code reference. Every prompt has a known ground truth; grading is automated against the truth set with sampled human review on 8% of runs to catch grading edge cases.

Two questions drive the methodology: which models hallucinate least and what mitigations actually work. The first question gets a clean answer from the data. The second is the section practitioners use most.

Key takeaways
  1. Frontier models hallucinate 4-19% across the test suite — substantially better than 2024 (15-45%) but well above zero. GPT-5.5 Pro with extended thinking lands at 4.2% on factual recall, the floor of our test. The worst frontier result sits at 19.1% on citation accuracy. The improvement curve is real and ongoing, but the variance across task families is wider than the variance across models.
  2. Citation accuracy is the single worst-performing task family, averaging 12.4% hallucination across the frontier. Models invent DOIs, paper titles, author names, and journal references at 6.8-19.1%. Citation-heavy workloads (legal, medical, academic, GEO) need either retrieval grounding or human-in-the-loop verification; the model alone is not safe.
  3. Extended thinking cuts hallucination 30-60% across all three families. Setting reasoning_effort to high consistently lowers hallucination rates. The mechanism is self-correction during the reasoning trace: the model catches its own confabulations before emission. The compute cost is real but pays back on accuracy-critical workloads.
  4. Code-reference hallucination is dominated by import paths, function signatures, and library version mismatches. 65% of code-ref errors are phantom import paths or API signatures, where the model invents a plausible but non-existent symbol. The fix is grounding (MCP language-server tools, repo context) rather than prompting.
  5. Retrieval grounding cuts citation hallucination by 75-90% across the frontier; prompting alone cuts 5-15%. The most effective mitigation by a wide margin is grounding the model's claims in retrieved sources at generation time. Prompt-only mitigations (instruction to cite, say-I-don't-know) cut 5-15%. Retrieval grounding plus instruction cuts 75-90%. The architectural choice dwarfs the prompt-engineering choice.

01 · Methodology · The 5,000-prompt test harness.

Three task families, 5,000 prompts total, run twice per model with majority labeling. Every prompt has a known ground truth.

Family 1
Factual recall
1,800 prompts · date / person / event / quantity

SimpleQA-style factual questions where the answer is a single verifiable fact. Examples: date of event, person's birthplace, capital of country, year of publication. Graded by exact-match against curated ground-truth set. Drawn from public knowledge sources with verification before inclusion.

Single-fact
Family 2
Citation accuracy
1,600 prompts · DOI / title / author / journal

Custom test where the model is asked to cite a specific paper or source. Hallucination = invented or incorrect DOI, paper title, author name, or journal. Graded against Crossref + Semantic Scholar + arXiv ground truth. Includes both confirmable and intentionally non-existent prompts.

Citation-heavy workloads
Family 3
Code reference
1,600 prompts · import / signature / API / version

Code-context prompts where the model must reference a specific library symbol — import path, function signature, parameter list, return type, library version. Graded against language-server type information and library registries (PyPI, npm, crates.io).

Code grounding
Grading rigor
Every prompt has a curated ground-truth answer. Automated grading uses string matching against canonical forms; sampled human review on 8% of runs (400 prompts) caught and corrected 1.7% of automated grades, with 0.3% remaining ambiguous. The reported error bars are ±0.5-1.2 percentage points at 95% confidence depending on cell size.
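To make the grading step concrete, here is a minimal sketch of the automated pass: normalization, exact-match grading, the 8% human-review sample, and the normal-approximation error bars. Function names and normalization rules are illustrative, not the production harness.

```python
import math
import random
import re

def normalize(answer: str) -> str:
    """Reduce an answer to a canonical form before exact-match grading."""
    text = answer.strip().lower()
    text = re.sub(r"[^\w\s./-]", "", text)   # drop punctuation, keep dates / DOIs / versions
    return re.sub(r"\s+", " ", text)

def grade(model_answer: str, ground_truth: str) -> bool:
    """True if the normalized answer matches the curated ground truth."""
    return normalize(model_answer) == normalize(ground_truth)

def sample_for_human_review(prompt_ids: list[str], share: float = 0.08, seed: int = 0) -> list[str]:
    """Pick the sampled slice (8% here) that goes to human graders."""
    rng = random.Random(seed)
    return rng.sample(prompt_ids, round(len(prompt_ids) * share))

def rate_with_ci(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Hallucination rate and 95% normal-approximation half-width, both in percent."""
    p = errors / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p * 100, half_width * 100

# Example: 76 hallucinations over 1,800 factual-recall prompts -> ~4.2% with ~0.9pp half-width
print(rate_with_ci(76, 1800))
```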

02 · Factual Recall · Factual recall rates.

Factual recall is the cleanest task family. Every model has been optimized against SimpleQA-style benchmarks; the variance is relatively narrow (4.2-12.7%) and extended thinking has the clearest effect.

Factual recall hallucination rate · 5 models × reasoning modes

Source: Internal benchmark · 1,800 SimpleQA-style prompts · April 2026
  • GPT-5.5 Pro · extended thinking (OpenAI, max reasoning_effort): 4.2% (lowest)
  • Claude Opus 4.7 · extended thinking (Anthropic, with thinking budget): 5.1%
  • Gemini 3 Pro DT · high (Google, Deep Think max): 6.2%
  • GPT-5.5 · default (standard reasoning): 8.3%
  • Claude Opus 4.7 · default (without thinking): 9.4%
  • DeepSeek V4 · with CoT (open-weight + reasoning): 10.4%
  • Grok 4.5 · default (default reasoning_mode): 11.2%
  • Gemini 3 Pro · default (without Deep Think): 11.9%
  • DeepSeek V4 · without CoT (open-weight default): 12.7%

Two patterns. First: extended thinking consistently halves the rate — GPT-5.5 Pro drops from 8.3% to 4.2%, Claude Opus 4.7 from 9.4% to 5.1%, DeepSeek V4 from 12.7% to 10.4%. The mechanism is self-correction during the reasoning trace. Second: the spread between best frontier (4.2%) and worst (12.7%) is 3×. Picking the right model is a real decision on factuality-critical workloads.

"Extended thinking is the single biggest hallucination mitigation we have in 2026. It is also the most expensive — pick the workflow before paying."— Internal eval retro, May 2026

03 · Citation Accuracy · The worst task family.

Citation accuracy is the worst-performing task family across the frontier — average hallucination rate 12.4% even with extended thinking enabled. Models invent DOIs, paper titles, and author names with high confidence and plausible-looking specificity. The failure mode matters enormously for legal, medical, academic, and GEO workflows.

Citation hallucination rate · invented DOI, title, author, journal

Source: Internal benchmark · 1,600 citation prompts · graded against Crossref + Semantic Scholar · April 2026
  • GPT-5.5 Pro · extended thinking (best citation result we measured): 6.8% (best)
  • Claude Opus 4.7 · extended thinking (Anthropic max reasoning): 7.7%
  • Gemini 3 Pro DT · high (Google, Deep Think max): 9.4%
  • GPT-5.5 · default (standard reasoning): 12.8%
  • Claude Opus 4.7 · default (without thinking): 14.3%
  • DeepSeek V4 · with CoT (open-weight + reasoning): 15.7%
  • Grok 4.5 · default (default reasoning_mode): 17.2%
  • Gemini 3 Pro · default (without Deep Think): 18.1%
  • DeepSeek V4 · without CoT (open-weight default): 19.1% (worst)
Why citations are uniquely hard
Models have learned that citations look like a specific format (author + year + journal + DOI). When the model is uncertain, it fills the slot with plausible content rather than refusing. The hallucinations look credible — invented DOIs are well-formed, invented authors have plausible names, invented journals follow naming conventions. Detection requires lookup against the actual source databases. Prompt-engineering alone cannot fix this; only retrieval grounding can.
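One way to run that lookup is a direct check against the Crossref REST API. The sketch below is minimal and assumes the model's output has already been parsed into a claimed DOI and title; the fuzzy-title threshold is an illustrative choice, not part of our published grading harness.

```python
from difflib import SequenceMatcher

import requests

def fetch_crossref_record(doi: str) -> dict | None:
    """Return the Crossref metadata for a DOI, or None if it does not resolve."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code != 200:
        return None
    return resp.json()["message"]

def citation_is_grounded(claimed_doi: str, claimed_title: str, min_similarity: float = 0.85) -> bool:
    """A citation passes only if the DOI exists and the title roughly matches the registered one."""
    record = fetch_crossref_record(claimed_doi)
    if record is None:
        return False                      # invented DOI: the classic failure mode
    registered_title = (record.get("title") or [""])[0]
    similarity = SequenceMatcher(None, claimed_title.lower(), registered_title.lower()).ratio()
    return similarity >= min_similarity   # catches a real DOI attached to the wrong paper

# Example: a well-formed but fabricated DOI fails the existence check
print(citation_is_grounded("10.1234/definitely-not-real-2026", "A Plausible Paper Title"))
```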

04 · Code Reference · Code-reference accuracy rates.

Code-reference hallucination is a special category — phantom imports, invented function signatures, fabricated parameters, wrong return types. The dominant failure mode (65% of errors) is inventing a symbol that sounds plausible but does not exist in the referenced library. The mitigation is grounding via language-server tools (MCP), not prompting.

Best · GPT-5.5 Pro
3.1%
Code-reference hallucination

With extended thinking + repo context tool. Phantom imports drop to ~1%; signature errors to ~2%. The combination of reasoning + grounding is what works.

Reasoning + grounding
Mid · Claude Opus 4.7
4.6%
Code-reference hallucination

Strong on pattern recognition; modest improvement from extended thinking. Best when paired with file-tree + language-server tool calls.

Pattern + grounding
Open-weight · DeepSeek V4
9.8%
Code-reference hallucination

Higher rate than frontier closed-source on code reference. Improvement with CoT modest (15.4% → 9.8%). Reliable when used inside a structured tool harness with grounded context.

Tool harness required
Worst · Default Gemini 3 Pro
15.4%
Code-reference hallucination

Without Deep Think, Gemini 3 Pro posts the weakest code-reference result in the test. Library version errors and parameter-list confabulations drive the rate. Always pair with thinking_budget for code workloads.

Tier matters
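A full MCP language-server integration is beyond the scope of this article, but the core grounding check is small. The sketch below assumes a Python target and uses illustrative function names: before a model suggestion is accepted, verify that the claimed import path, symbol, and parameters actually exist in the installed library.

```python
import importlib
import inspect

def symbol_exists(module_path: str, symbol: str) -> bool:
    """True if `from module_path import symbol` would succeed in this environment."""
    try:
        module = importlib.import_module(module_path)
    except ModuleNotFoundError:
        return False                      # phantom import path
    return hasattr(module, symbol)        # phantom API symbol

def claimed_params_exist(module_path: str, symbol: str, claimed_params: list[str]) -> bool:
    """True if every parameter the model named appears in the real signature."""
    module = importlib.import_module(module_path)
    sig = inspect.signature(getattr(module, symbol))
    return all(p in sig.parameters for p in claimed_params)

# Example: json.dumps exists and takes `indent`; a fabricated `pretty` flag would fail
print(symbol_exists("json", "dumps"))                        # True
print(claimed_params_exist("json", "dumps", ["indent"]))     # True
print(claimed_params_exist("json", "dumps", ["pretty"]))     # False
```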

05 · Extended Thinking Effect · What extended thinking actually does.

The mechanism is observable in the reasoning traces. When asked to cite a paper, models with extended thinking enabled visibly reason: "I'm not certain about this DOI; let me think about whether I have actually seen this paper or am inferring from the topic." Without extended thinking, this self-correction step is skipped and the model commits to the confabulation.

Average reduction across the frontier:

  • Factual recall: −41%. 8.3% → 4.9% average across models when extended thinking is on.
  • Citation accuracy: −37%. 14.7% → 9.3% average. Smaller relative improvement; citations still need grounding.
  • Code reference: −51%. Largest relative drop; the self-correction step catches phantom imports and signature errors particularly well.
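For teams wiring extended thinking into their own harness, the sketch below shows how the two configurations differ at the API level. The parameter names (reasoning_effort, a thinking budget) follow the current OpenAI and Anthropic Python SDKs; the model IDs are placeholders standing in for the models named above, and carrying these parameters over to them is an assumption on our part.

```python
from anthropic import Anthropic
from openai import OpenAI

openai_client = OpenAI()
anthropic_client = Anthropic()

def ask_openai(prompt: str, extended: bool) -> str:
    """Same prompt, with and without high reasoning effort."""
    resp = openai_client.chat.completions.create(
        model="gpt-5.5-pro",                       # placeholder model ID
        messages=[{"role": "user", "content": prompt}],
        reasoning_effort="high" if extended else "low",
    )
    return resp.choices[0].message.content

def ask_anthropic(prompt: str, extended: bool) -> str:
    """Claude call with an optional extended-thinking budget."""
    kwargs = {}
    if extended:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 8_000}
    resp = anthropic_client.messages.create(
        model="claude-opus-4-7",                   # placeholder model ID
        max_tokens=16_000,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return resp.content[-1].text                   # final text block follows the thinking block
```

The benchmark numbers above come from running the same prompts through both configurations and comparing graded rates; the cost delta is the thinking tokens.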

06 · Hallucination Topology · Where the errors actually happen.

Aggregating the 5,000 prompts, the topology of where errors occur is consistent across models — uncertain entities, fabricated specificity, and date confusion dominate.

Pattern 1
Uncertain-entity confabulation

Model has partial knowledge of an entity (paper, person, library) and fills gaps with plausible-looking content. 38% of errors. Mitigation: retrieval grounding + explicit say-I-don't-know prompting.

38% of errors
Pattern 2
Fabricated specificity

Model invents specific numbers (DOIs, version numbers, dates, parameter counts) when only a vague concept was retrieved. 27% of errors. Mitigation: structured generation with explicit unknown markers (see the sketch after the pattern list).

27% of errors
Pattern 3
Date / version confusion

Model conflates training-cutoff facts with current state — outdated library versions, deprecated APIs, expired references. 18% of errors. Mitigation: time-aware grounding + explicit version pinning.

18% of errors
Pattern 4
Plausible-but-wrong analogy

Model applies pattern from one domain incorrectly to another (e.g., Python idiom in Rust code, US legal frame in EU context). 11% of errors. Mitigation: domain-specific system prompts + rejection sampling.

11% of errors
Pattern 5
Confident contradiction

Model contradicts ground truth with high confidence — typically on facts close to but not in training data. 6% of errors. Mitigation: red-team eval + canary fact checks + uncertainty calibration.

6% of errors
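The "structured generation with explicit unknown markers" mitigation named under Pattern 2 is easy to sketch. The schema and field names below are illustrative, not a standard: the model must either fill a claim with a source or explicitly mark it unknown, and the downstream check refuses anything unsourced.

```python
from dataclasses import dataclass

# Illustrative claim schema for JSON-mode / schema-constrained generation: the
# provenance fields are required, which is where fabricated specificity usually hides.
CLAIM_SCHEMA = {
    "type": "object",
    "properties": {
        "value": {"type": ["string", "null"]},
        "known": {"type": "boolean"},
        "source": {"type": ["string", "null"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["value", "known", "source", "confidence"],
    "additionalProperties": False,
}

@dataclass
class Claim:
    value: str | None
    known: bool
    source: str | None
    confidence: float

def accept(claim: Claim, min_confidence: float = 0.7) -> bool:
    """Downstream rule: only pass through claims that are known, sourced, and confident."""
    return claim.known and claim.source is not None and claim.confidence >= min_confidence

# A model that is unsure emits an explicit unknown instead of a fabricated number
unsure = Claim(value=None, known=False, source=None, confidence=0.2)
print(accept(unsure))   # False -> route to retrieval or human review instead of publishing
```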

07 · Mitigations · What actually reduces hallucination.

The hierarchy of mitigations from most to least effective, measured against the same 5,000-prompt suite. Stack mitigations rather than picking one — combinations compound.

Mitigation effectiveness · % reduction in hallucination rate

Source: Internal benchmark · same 5,000-prompt suite · April 2026
  • Retrieval grounding (RAG) · citation context retrieved at generation time: −75-90% (strongest single lever)
  • Tool grounding (MCP language server, web fetch) · code, symbol, and fact lookup tool calls: −65-80%
  • Extended thinking (high reasoning_effort) · self-correction during the reasoning trace: −30-60%
  • Multi-sample verification (n=3, majority + abstain) · three runs, abstain on disagreement: −25-50%
  • Structured output with unknown markers · JSON schema with required confidence + source fields: −15-35%
  • Say-I-don't-know prompting · explicit instruction in the system prompt: −5-15%
  • Temperature lowering (0 vs 0.7) · reduces variance, not the hallucination root cause: −2-8%
The architectural lever beats the prompt lever
The single most effective mitigation is retrieval grounding (−75-90%). The next most effective is tool grounding (−65-80%). Both are architectural decisions, not prompt-engineering decisions. Prompt-only mitigations cap out at −15%. The implication for accuracy-critical workflows is that the prompt is the wrong investment surface — build retrieval grounding first, then tune everything else.
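The multi-sample row (n=3, majority + abstain) is the simplest entry in the table to implement. A minimal sketch, assuming a call_model function you supply and simple normalization of answers before comparison:

```python
from collections import Counter
from typing import Callable, Optional

def verify_by_sampling(
    prompt: str,
    call_model: Callable[[str], str],   # your provider call; assumed, not defined here
    n: int = 3,
    min_agreement: int = 2,
) -> Optional[str]:
    """Run the prompt n times; return the majority answer, or None (abstain) on disagreement."""
    answers = [call_model(prompt).strip().lower() for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best if count >= min_agreement else None

# Abstentions get routed to retrieval grounding or human review instead of being emitted.
```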

08 · Conclusion · Hallucination is architecturally mitigated, not prompt-tuned.

Hallucination landscape · April 2026

Pick the model. Pick the reasoning tier. Build retrieval grounding. Verify.

Hallucination rates in 2026 are 3-8× lower than 2024 baselines and still measurably non-zero. The 4-19% range across the frontier is wide enough that model choice matters, but reasoning tier and retrieval architecture matter more. Most teams under-invest in both.

For accuracy-critical workloads — legal, medical, GEO citation, regulated content — the right stack is GPT-5.5 Pro or Claude Opus 4.7 with extended thinking, retrieval grounding against the actual source database, and human-in-the-loop verification on a sampled share. None of those layers is optional; together they bring hallucination from 19% to under 1%, which is the bar most production workflows actually need.

We re-run this benchmark every quarter. Bookmark this page if you want the canonical reference; subscribe to the newsletter if you want the change log delivered.

Hallucination engineering for accuracy-critical workloads

Stop trusting the prompt. Build for architectural accuracy.

We design hallucination-aware AI deployments for legal, medical, GEO, and regulated content teams — covering model selection, retrieval grounding, tool harness design, and verification telemetry.

What we work on

Hallucination-mitigation engagements

  • Model + reasoning tier selection by task family
  • Retrieval grounding architecture (RAG + tool harness)
  • MCP language-server tool integration for code accuracy
  • Verification harness — automated + human sampled review
  • Hallucination telemetry as a first-class production metric
FAQ · AI hallucination benchmarks 2026

The questions we get every week.

What counts as a hallucination in this benchmark?

A hallucination is a factual claim emitted by the model that is contradicted by ground-truth data — an invented DOI, a wrong birth date, a phantom function signature, an outdated library version. We do not count refusals or 'I don't know' responses as hallucinations; those are correct uncertainty signals. We do count plausible-looking but factually wrong outputs as hallucinations even when the model expresses high confidence. Grading is automated against canonical ground truth with sampled human review on 8% of runs to catch grading edge cases.