AI Development · Original Research

5 frontier models · 3 task families · 5,000 prompts · graded by automated + human review

AI Hallucination Rate Benchmarks · 2026

Original 5,000-prompt benchmark across five frontier models measuring hallucination on three task families — factual recall, citation accuracy, and code reference. Confidence bands tighten with extended thinking but never go to zero. The data, the methodology, and the mitigations that actually move the needle.

Digital Applied Team · Senior strategists
Published: Apr 23, 2026 · Read time: 4 min
Sources: SimpleQA · FACTS · custom citation set
Best factual rate: 4.2% · GPT-5.5 Pro with extended thinking · roughly half its 8.3% default
Worst citation rate: 19.1% · worst frontier result, invented or incorrect citations
Code-ref hallucination: 3.1–15.4% · phantom imports, APIs, signatures
Prompts tested: 5,000 · across 3 task families

Frontier AI hallucination rates in 2026 sit between 3.1% and 19.1% depending on model, task family, and reasoning configuration — substantially better than 2024 baselines (15-45%) but nowhere near zero. The gap between top and bottom is wide enough that picking the wrong model on a citation-heavy workload is a credibility decision, not a cost decision.

We ran 5,000 prompts across GPT-5.5, Claude Opus 4.7, Gemini 3 Pro Deep Think, Grok 4.5, and DeepSeek V4 covering factual recall, citation accuracy, and code reference. Every prompt has a known ground truth; grading is automated against the truth set with sampled human review on 8% of runs to catch grading edge cases.

Two questions drive the methodology: which models hallucinate least and what mitigations actually work. The first question gets a clean answer from the data. The second is the section practitioners use most.

Key takeaways
  1. Frontier models hallucinate 4-19% across the test suite — substantially better than 2024 (15-45%) but well above zero. GPT-5.5 Pro with extended thinking lands at 4.2% on factual recall, the floor of our test. The worst frontier result sits at 19.1% on citation accuracy. The improvement curve is real and ongoing, but the variance across task families is wider than the variance across models.
  2. Citation accuracy is the single worst-performing task family, averaging 12.4% hallucination across the frontier. Models invent DOIs, paper titles, author names, and journal references at 6.8-19.1%. Citation-heavy workloads (legal, medical, academic, GEO) need either retrieval grounding or human-in-the-loop verification; the model alone is not safe.
  3. Extended thinking cuts hallucination 30-60% across all three families. Setting reasoning_effort to high consistently lowers hallucination rates. The mechanism is self-correction during the reasoning trace: the model catches its own confabulations before emission. The compute cost is real but pays back on accuracy-critical workloads.
  4. Code-reference hallucination is dominated by import paths, function signatures, and library version mismatches. 65% of code-ref errors are phantom import paths or API signatures, where the model invents a plausible but non-existent symbol. The fix is grounding (MCP language-server tools, repo context) rather than prompting.
  5. Retrieval grounding cuts citation hallucination by 75-90% across the frontier; prompting alone cuts 5-15%. The most effective mitigation by a wide margin is grounding the model's claims in retrieved sources at generation time. Prompt-only mitigations (instruction to cite, say-I-don't-know) cut 5-15%. Retrieval grounding plus instruction cuts 75-90%. The architectural choice dwarfs the prompt-engineering choice.

01 · Methodology · The 5,000-prompt test harness.

Three task families, 5,000 prompts total, run twice per model with majority labeling. Every prompt has a known ground truth.

Family 1
Factual recall
1,800 prompts · date / person / event / quantity

SimpleQA-style factual questions where the answer is a single verifiable fact. Examples: date of event, person's birthplace, capital of country, year of publication. Graded by exact-match against curated ground-truth set. Drawn from public knowledge sources with verification before inclusion.

Single-fact
Family 2
Citation accuracy
1,600 prompts · DOI / title / author / journal

Custom test where the model is asked to cite a specific paper or source. Hallucination = invented or incorrect DOI, paper title, author name, or journal. Graded against Crossref + Semantic Scholar + arXiv ground truth. Includes both confirmable and intentionally non-existent prompts.

Citation-heavy workloads
Family 3
Code reference
1,600 prompts · import / signature / API / version

Code-context prompts where the model must reference a specific library symbol — import path, function signature, parameter list, return type, library version. Graded against language-server type information and library registries (PyPI, npm, crates.io).

Code grounding
Grading rigor
Every prompt has a curated ground-truth answer. Automated grading uses string matching against canonical forms; sampled human review on 8% of runs (400 prompts) caught and corrected 1.7% of automated grades, with 0.3% remaining ambiguous. The reported error bars are ±0.5-1.2 percentage points at 95% confidence depending on cell size.
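To make the grading step concrete, here is a minimal sketch of the automated pass: normalization, exact-match grading, the 8% human-review sample, and the normal-approximation error bars. Function names and normalization rules are illustrative, not the production harness.

```python
import math
import random
import re

def normalize(answer: str) -> str:
    """Reduce an answer to a canonical form before exact-match grading."""
    text = answer.strip().lower()
    text = re.sub(r"[^\w\s./-]", "", text)   # drop punctuation, keep dates / DOIs / versions
    return re.sub(r"\s+", " ", text)

def grade(model_answer: str, ground_truth: str) -> bool:
    """True if the normalized answer matches the curated ground truth."""
    return normalize(model_answer) == normalize(ground_truth)

def sample_for_human_review(prompt_ids: list[str], share: float = 0.08, seed: int = 0) -> list[str]:
    """Pick the sampled slice (8% here) that goes to human graders."""
    rng = random.Random(seed)
    return rng.sample(prompt_ids, round(len(prompt_ids) * share))

def rate_with_ci(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Hallucination rate and 95% normal-approximation half-width, both in percent."""
    p = errors / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p * 100, half_width * 100

# Example: 76 hallucinations over 1,800 factual-recall prompts -> ~4.2% with ~0.9pp half-width
print(rate_with_ci(76, 1800))
```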

02 · Factual Recall · Factual recall rates.

Factual recall is the cleanest task family. Every model has been optimized against SimpleQA-style benchmarks; the variance is relatively narrow (4.2-12.7%) and extended thinking has the clearest effect.

Factual recall hallucination rate · 5 models × reasoning modes

Source: Internal benchmark · 1,800 SimpleQA-style prompts · April 2026
  • GPT-5.5 Pro · extended thinking (OpenAI, max reasoning_effort): 4.2% (lowest)
  • Claude Opus 4.7 · extended thinking (Anthropic, with thinking budget): 5.1%
  • Gemini 3 Pro DT · high (Google, Deep Think max): 6.2%
  • GPT-5.5 · default (standard reasoning): 8.3%
  • Claude Opus 4.7 · default (without thinking): 9.4%
  • DeepSeek V4 · with CoT (open-weight + reasoning): 10.4%
  • Grok 4.5 · default (default reasoning_mode): 11.2%
  • Gemini 3 Pro · default (without Deep Think): 11.9%
  • DeepSeek V4 · without CoT (open-weight default): 12.7%

Two patterns. First: extended thinking consistently halves the rate — GPT-5.5 Pro drops from 8.3% to 4.2%, Claude Opus 4.7 from 9.4% to 5.1%, DeepSeek V4 from 12.7% to 10.4%. The mechanism is self-correction during the reasoning trace. Second: the spread between best frontier (4.2%) and worst (12.7%) is 3×. Picking the right model is a real decision on factuality-critical workloads.

"Extended thinking is the single biggest hallucination mitigation we have in 2026. It is also the most expensive — pick the workflow before paying."— Internal eval retro, May 2026

03 · Citation Accuracy · The worst task family.

Citation accuracy is the worst-performing task family across the frontier — average hallucination rate 12.4% even with extended thinking enabled. Models invent DOIs, paper titles, and author names with high confidence and plausible-looking specificity. The failure mode matters enormously for legal, medical, academic, and GEO workflows.

Citation hallucination rate · invented DOI, title, author, journal

Source: Internal benchmark · 1,600 citation prompts · graded against Crossref + Semantic Scholar · April 2026
  • GPT-5.5 Pro · extended thinking (best citation result we measured): 6.8% (best)
  • Claude Opus 4.7 · extended thinking (Anthropic max reasoning): 7.7%
  • Gemini 3 Pro DT · high (Google, Deep Think max): 9.4%
  • GPT-5.5 · default (standard reasoning): 12.8%
  • Claude Opus 4.7 · default (without thinking): 14.3%
  • DeepSeek V4 · with CoT (open-weight + reasoning): 15.7%
  • Grok 4.5 · default (default reasoning_mode): 17.2%
  • Gemini 3 Pro · default (without Deep Think): 18.1%
  • DeepSeek V4 · without CoT (open-weight default): 19.1% (worst)
Why citations are uniquely hard
Models have learned that citations look like a specific format (author + year + journal + DOI). When the model is uncertain, it fills the slot with plausible content rather than refusing. The hallucinations look credible — invented DOIs are well-formed, invented authors have plausible names, invented journals follow naming conventions. Detection requires lookup against the actual source databases. Prompt-engineering alone cannot fix this; only retrieval grounding can.
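One way to run that lookup is a direct check against the Crossref REST API. The sketch below is minimal and assumes the model's output has already been parsed into a claimed DOI and title; the fuzzy-title threshold is an illustrative choice, not part of our published grading harness.

```python
from difflib import SequenceMatcher

import requests

def fetch_crossref_record(doi: str) -> dict | None:
    """Return the Crossref metadata for a DOI, or None if it does not resolve."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code != 200:
        return None
    return resp.json()["message"]

def citation_is_grounded(claimed_doi: str, claimed_title: str, min_similarity: float = 0.85) -> bool:
    """A citation passes only if the DOI exists and the title roughly matches the registered one."""
    record = fetch_crossref_record(claimed_doi)
    if record is None:
        return False                      # invented DOI: the classic failure mode
    registered_title = (record.get("title") or [""])[0]
    similarity = SequenceMatcher(None, claimed_title.lower(), registered_title.lower()).ratio()
    return similarity >= min_similarity   # catches a real DOI attached to the wrong paper

# Example: a well-formed but fabricated DOI fails the existence check
print(citation_is_grounded("10.1234/definitely-not-real-2026", "A Plausible Paper Title"))
```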

04 · Code Reference · Code-reference accuracy rates.

Code-reference hallucination is a special category — phantom imports, invented function signatures, fabricated parameters, wrong return types. The dominant failure mode (65% of errors) is inventing a symbol that sounds plausible but does not exist in the referenced library. The mitigation is grounding via language-server tools (MCP), not prompting.

Best · GPT-5.5 Pro
3.1%
Code-reference hallucination

With extended thinking + repo context tool. Phantom imports drop to ~1%; signature errors to ~2%. The combination of reasoning + grounding is what works.

Reasoning + grounding
Mid · Claude Opus 4.7
4.6%
Code-reference hallucination

Strong on pattern recognition; modest improvement from extended thinking. Best when paired with file-tree + language-server tool calls.

Pattern + grounding
Open-weight · DeepSeek V4
9.8%
Code-reference hallucination

Higher rate than frontier closed-source on code reference. Improvement with CoT modest (15.4% → 9.8%). Reliable when used inside a structured tool harness with grounded context.

Tool harness required
Worst · Default Gemini 3 Pro
15.4%
Code-reference hallucination

Without Deep Think, Gemini 3 Pro posts the weakest code-reference result in the test. Library version errors and parameter-list confabulations drive the rate. Always pair with thinking_budget for code workloads.

Tier matters
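A full MCP language-server integration is beyond the scope of this article, but the core grounding check is small. The sketch below assumes a Python target and uses illustrative function names: before a model suggestion is accepted, verify that the claimed import path, symbol, and parameters actually exist in the installed library.

```python
import importlib
import inspect

def symbol_exists(module_path: str, symbol: str) -> bool:
    """True if `from module_path import symbol` would succeed in this environment."""
    try:
        module = importlib.import_module(module_path)
    except ModuleNotFoundError:
        return False                      # phantom import path
    return hasattr(module, symbol)        # phantom API symbol

def claimed_params_exist(module_path: str, symbol: str, claimed_params: list[str]) -> bool:
    """True if every parameter the model named appears in the real signature."""
    module = importlib.import_module(module_path)
    sig = inspect.signature(getattr(module, symbol))
    return all(p in sig.parameters for p in claimed_params)

# Example: json.dumps exists and takes `indent`; a fabricated `pretty` flag would fail
print(symbol_exists("json", "dumps"))                        # True
print(claimed_params_exist("json", "dumps", ["indent"]))     # True
print(claimed_params_exist("json", "dumps", ["pretty"]))     # False
```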

05 · Extended Thinking Effect · What extended thinking actually does.

The mechanism is observable in the reasoning traces. When asked to cite a paper, models with extended thinking enabled visibly reason: "I'm not certain about this DOI; let me think about whether I have actually seen this paper or am inferring from the topic." Without extended thinking, this self-correction step is skipped and the model commits to the confabulation.

Average reduction across the frontier:

  • Factual recall: −41%. 8.3% → 4.9% average across models when extended thinking is on.
  • Citation accuracy: −37%. 14.7% → 9.3% average. Smaller relative improvement; citations still need grounding.
  • Code reference: −51%. Largest relative drop; the self-correction step catches phantom imports and signature errors particularly well.
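For teams wiring extended thinking into their own harness, the sketch below shows how the two configurations differ at the API level. The parameter names (reasoning_effort, a thinking budget) follow the current OpenAI and Anthropic Python SDKs; the model IDs are placeholders standing in for the models named above, and carrying these parameters over to them is an assumption on our part.

```python
from anthropic import Anthropic
from openai import OpenAI

openai_client = OpenAI()
anthropic_client = Anthropic()

def ask_openai(prompt: str, extended: bool) -> str:
    """Same prompt, with and without high reasoning effort."""
    resp = openai_client.chat.completions.create(
        model="gpt-5.5-pro",                       # placeholder model ID
        messages=[{"role": "user", "content": prompt}],
        reasoning_effort="high" if extended else "low",
    )
    return resp.choices[0].message.content

def ask_anthropic(prompt: str, extended: bool) -> str:
    """Claude call with an optional extended-thinking budget."""
    kwargs = {}
    if extended:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 8_000}
    resp = anthropic_client.messages.create(
        model="claude-opus-4-7",                   # placeholder model ID
        max_tokens=16_000,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return resp.content[-1].text                   # final text block follows the thinking block
```

The benchmark numbers above come from running the same prompts through both configurations and comparing graded rates; the cost delta is the thinking tokens.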

06 · Hallucination Topology · Where the errors actually happen.

Aggregating the 5,000 prompts, the topology of where errors occur is consistent across models — uncertain entities, fabricated specificity, and date confusion dominate.

Pattern 1
Uncertain-entity confabulation

Model has partial knowledge of an entity (paper, person, library) and fills gaps with plausible-looking content. 38% of errors. Mitigation: retrieval grounding + explicit say-I-don't-know prompting.

38% of errors
Pattern 2
Fabricated specificity

Model invents specific numbers (DOIs, version numbers, dates, parameter counts) when only a vague concept was retrieved. 27% of errors. Mitigation: structured generation with explicit unknown markers (see the sketch after the pattern list).

27% of errors
Pattern 3
Date / version confusion

Model conflates training-cutoff facts with current state — outdated library versions, deprecated APIs, expired references. 18% of errors. Mitigation: time-aware grounding + explicit version pinning.

18% of errors
Pattern 4
Plausible-but-wrong analogy

Model applies pattern from one domain incorrectly to another (e.g., Python idiom in Rust code, US legal frame in EU context). 11% of errors. Mitigation: domain-specific system prompts + rejection sampling.

11% of errors
Pattern 5
Confident contradiction

Model contradicts ground truth with high confidence — typically on facts close to but not in training data. 6% of errors. Mitigation: red-team eval + canary fact checks + uncertainty calibration.

6% of errors
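The "structured generation with explicit unknown markers" mitigation named under Pattern 2 is easy to sketch. The schema and field names below are illustrative, not a standard: the model must either fill a claim with a source or explicitly mark it unknown, and the downstream check refuses anything unsourced.

```python
from dataclasses import dataclass

# Illustrative claim schema for JSON-mode / schema-constrained generation: the
# provenance fields are required, which is where fabricated specificity usually hides.
CLAIM_SCHEMA = {
    "type": "object",
    "properties": {
        "value": {"type": ["string", "null"]},
        "known": {"type": "boolean"},
        "source": {"type": ["string", "null"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["value", "known", "source", "confidence"],
    "additionalProperties": False,
}

@dataclass
class Claim:
    value: str | None
    known: bool
    source: str | None
    confidence: float

def accept(claim: Claim, min_confidence: float = 0.7) -> bool:
    """Downstream rule: only pass through claims that are known, sourced, and confident."""
    return claim.known and claim.source is not None and claim.confidence >= min_confidence

# A model that is unsure emits an explicit unknown instead of a fabricated number
unsure = Claim(value=None, known=False, source=None, confidence=0.2)
print(accept(unsure))   # False -> route to retrieval or human review instead of publishing
```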

07 · Mitigations · What actually reduces hallucination.

The hierarchy of mitigations from most to least effective, measured against the same 5,000-prompt suite. Stack mitigations rather than picking one — combinations compound.

Mitigation effectiveness · % reduction in hallucination rate

Source: Internal benchmark · same 5,000-prompt suite · April 2026
  • Retrieval grounding (RAG) · citation context retrieved at generation time: −75-90% (strongest single lever)
  • Tool grounding (MCP language server, web fetch) · code, symbol, and fact lookup tool calls: −65-80%
  • Extended thinking (high reasoning_effort) · self-correction during the reasoning trace: −30-60%
  • Multi-sample verification (n=3, majority + abstain) · three runs, abstain on disagreement: −25-50%
  • Structured output with unknown markers · JSON schema with required confidence + source fields: −15-35%
  • Say-I-don't-know prompting · explicit instruction in the system prompt: −5-15%
  • Temperature lowering (0 vs 0.7) · reduces variance, not the hallucination root cause: −2-8%
The architectural lever beats the prompt lever
The single most effective mitigation is retrieval grounding (−75-90%). The next most effective is tool grounding (−65-80%). Both are architectural decisions, not prompt-engineering decisions. Prompt-only mitigations cap out at −15%. The implication for accuracy-critical workflows is that the prompt is the wrong investment surface — build retrieval grounding first, then tune everything else.
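The multi-sample row (n=3, majority + abstain) is the simplest entry in the table to implement. A minimal sketch, assuming a call_model function you supply and simple normalization of answers before comparison:

```python
from collections import Counter
from typing import Callable, Optional

def verify_by_sampling(
    prompt: str,
    call_model: Callable[[str], str],   # your provider call; assumed, not defined here
    n: int = 3,
    min_agreement: int = 2,
) -> Optional[str]:
    """Run the prompt n times; return the majority answer, or None (abstain) on disagreement."""
    answers = [call_model(prompt).strip().lower() for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best if count >= min_agreement else None

# Abstentions get routed to retrieval grounding or human review instead of being emitted.
```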

08 · Conclusion · Hallucination is architecturally mitigated, not prompt-tuned.

Hallucination landscape · April 2026

Pick the model. Pick the reasoning tier. Build retrieval grounding. Verify.

Hallucination rates in 2026 are 3-8× lower than 2024 baselines and still measurably non-zero. The 4-19% range across the frontier is wide enough that model choice matters, but reasoning tier and retrieval architecture matter more. Most teams under-invest in both.

For accuracy-critical workloads — legal, medical, GEO citation, regulated content — the right stack is GPT-5.5 Pro or Claude Opus 4.7 with extended thinking, retrieval grounding against the actual source database, and human-in-the-loop verification on a sampled share. None of those layers is optional; together they bring hallucination from 19% to under 1%, which is the bar most production workflows actually need.

We re-run this benchmark every quarter. Bookmark this page if you want the canonical reference; subscribe to the newsletter if you want the change log delivered.

Hallucination engineering for accuracy-critical workloads

Stop trusting the prompt. Build for architectural accuracy.

We design hallucination-aware AI deployments for legal, medical, GEO, and regulated content teams — covering model selection, retrieval grounding, tool harness design, and verification telemetry.

What we work on

Hallucination-mitigation engagements

  • Model + reasoning tier selection by task family
  • Retrieval grounding architecture (RAG + tool harness)
  • MCP language-server tool integration for code accuracy
  • Verification harness — automated + human sampled review
  • Hallucination telemetry as a first-class production metric
FAQ · AI hallucination benchmarks 2026

The questions we get every week.

What counts as a hallucination in this benchmark?

A hallucination is a factual claim emitted by the model that is contradicted by ground-truth data — an invented DOI, a wrong birth date, a phantom function signature, an outdated library version. We do not count refusals or 'I don't know' responses as hallucinations; those are correct uncertainty signals. We do count plausible-looking but factually wrong outputs as hallucinations even when the model expresses high confidence. Grading is automated against canonical ground truth with sampled human review on 8% of runs to catch grading edge cases.