
4 frontier models · 8 modal capabilities · 80+ data cells across vision, video, audio, code

Multimodal AI Benchmarks 2026: Vision, Audio, Code

Multimodal AI in 2026 has moved past the pure image-QA era. Every frontier model now clears 80% on MMMU-Pro — so the differentiating axes are video, OCR-heavy documents, audio, and chart reasoning. The field is split: Gemini 3 wins video and audio, GPT-5.5 wins charts and code-with-vision, Claude Opus 4.7 wins long-document OCR.

Digital Applied Team
Senior strategists · Published Apr 24, 2026
Read time: 4 min
Sources: MMMU-Pro · Video-MME · GAIA · model cards
  • MMMU-Pro · 4-model spread: 2.4 pts (all models 81-83% · saturated)
  • Video-MME leader: 78.4% (Gemini 3 Deep Think)
  • ChartQA leader: 92.1% (GPT-5.5)
  • DocVQA leader: 93.0% (Claude Opus 4.7 · long-doc OCR king)

Multimodal evaluation has moved past the pure image-QA era. By April 2026 the four leading frontier multimodal models — GPT-5.5, Gemini 3 Deep Think, Claude Opus 4.7, and Qwen 3.5 Omni — all clear 80% on MMMU-Pro. That benchmark, which split the field by 10+ points in 2024, now spreads them by under 3 points. The differentiating axis has moved.

The new axes are video understanding (where Gemini 3 dominates), audio comprehension and ASR-plus-reasoning (where Gemini 3 again leads, with Qwen 3.5 Omni close behind on real-time applications), long-document OCR (where Claude Opus 4.7 holds the crown), chart reasoning and infographics (where GPT-5.5 leads), and code-with-vision (where GPT-5.5's longer reasoning traces shine).

This 80+ data cell matrix covers the eight modal capabilities that drive 2026 deployment decisions. Use it to pick the right multimodal model per workload — and to know when to switch between them.

Key takeaways
  1. MMMU-Pro is saturated — the headline image-QA benchmark no longer differentiates frontier models. GPT-5.5, Gemini 3, Claude Opus 4.7, and Qwen 3.5 Omni all score 81-83% on MMMU-Pro in Apr 2026. The 2024 field was 65-78%; the 2026 spread is 2.4 points. Don't pick a multimodal model on MMMU-Pro alone — it tells you everyone is good, not who wins.
  2. Gemini 3 Deep Think is the video leader by a wide margin in 2026. Video-MME (long-form): Gemini 3 78.4%, GPT-5.5 71.2%, Claude Opus 4.7 67.8%, Qwen 3.5 Omni 69.5%. The gap is largest on multi-clip reasoning and temporal understanding. For any workload involving video — content moderation, video summarization, sports analysis — Gemini 3 is the default.
  3. Claude Opus 4.7 owns long-document OCR; the gap widens with document length. DocVQA: Opus 4.7 93.0%, GPT-5.5 91.5%, Gemini 3 90.8%, Qwen 3.5 Omni 87.9%. The gap is small on standard DocVQA but widens to 5-8 points on the long-document split (50+ page PDFs). Opus 4.7's 1M context combined with strong vision makes it the clear choice for legal, contract, and technical-documentation workflows.
  4. GPT-5.5 leads on chart reasoning, infographic understanding, and code-with-vision. ChartQA 92.1% (vs 89.4% Gemini 3, 88.0% Opus 4.7), DocVQA-Code 71.3% (Gemini 3 64.1%), AI2D 96.2%. GPT-5.5's strong code reasoning carries over into vision tasks involving structured visuals — analytics dashboards, code screenshots, technical diagrams.
  5. Qwen 3.5 Omni is the real-time leader: ASR + audio reasoning at sub-300ms first-token. On real-time audio (ASR + immediate reasoning), Qwen 3.5 Omni hits sub-300ms time-to-first-token at 95%+ ASR accuracy. Gemini 3 has higher offline ASR quality but slower real-time response. For voice agents, customer-service bots, and accessibility applications, Qwen Omni is the default.

01 · The Shift: From image-QA to multi-axis multimodal.

In 2023-2024, multimodal evaluation was effectively MMMU and MMMU-Pro — image understanding plus QA. The field spread cleanly on those benchmarks because most models were genuinely worse at them. By 2026, MMMU-Pro is saturated; the meaningful frontier has moved to video, long-document OCR, audio understanding, and the edge cases of chart and code-with-vision reasoning.

The shift is reminiscent of pure-text 2023 — when MMLU saturated, the field moved to GSM8K, then to MATH, then to FrontierMath. The benchmark progression always tracks the capability frontier, with about a one-year lag.

MMMU-Pro saturation vs Video-MME differentiation

Source: Public model cards · Artificial Analysis · Apr 2026
  • MMMU-Pro · GPT-5.5: 82.8% (standard image-QA · saturated benchmark)
  • MMMU-Pro · Gemini 3 Deep Think: 82.1% (saturated · within 1 pt of leader)
  • MMMU-Pro · Claude Opus 4.7: 81.4% (saturated · 2 pts behind leader)
  • MMMU-Pro · Qwen 3.5 Omni: 81.0% (open weight · narrowest gap to closed)
  • Video-MME · Gemini 3: 78.4% (differentiated benchmark · clear leader · −7 pts to next)
  • Video-MME · GPT-5.5: 71.2% (distant second on video)

The contrast is the story. MMMU-Pro's spread has dropped to 2.4 points; Video-MME's spread is 11 points. Production teams should treat the benchmark hierarchy accordingly: ignore MMMU-Pro as a differentiator, weight Video-MME and the long-doc / audio / chart benchmarks heavily.
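
To operationalize that weighting, here is a minimal sketch: a composite score per model built from the benchmark cells in this article. The weights below are illustrative assumptions, not a recommendation; tune them to your own workload mix.

```python
# Minimal sketch: weighted composite from the benchmark cells in this article.
# Weights are illustrative assumptions; adjust them to your own workload mix.
SCORES = {
    #                  MMMU-Pro  Video-MME  DocVQA  ChartQA
    "GPT-5.5":          (82.8,    71.2,     91.5,   92.1),
    "Gemini 3":         (82.1,    78.4,     90.8,   89.4),
    "Claude Opus 4.7":  (81.4,    67.8,     93.0,   88.0),
    "Qwen 3.5 Omni":    (81.0,    69.5,     87.9,   87.2),
}
# Down-weight the saturated MMMU-Pro axis; up-weight the differentiated axes.
WEIGHTS = (0.05, 0.40, 0.30, 0.25)

for model, scores in SCORES.items():
    composite = sum(w * s for w, s in zip(WEIGHTS, scores))
    print(f"{model:18s} {composite:5.1f}")
```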

02 · Image & Documents: Image and long-document OCR.

The image-and-document axis covers MMMU-Pro (saturated), DocVQA (long-document OCR), AI2D (diagrams), and the chart reasoning benchmarks. Claude Opus 4.7 leads on long-document OCR thanks to its 1M context combined with strong vision; GPT-5.5 leads on chart and diagram reasoning; everyone else is close on standard DocVQA but falls off on the 50+ page split.

MMMU-Pro · Saturated
Standard college-level image QA
82.8% GPT-5.5 · 82.1% Gemini 3 · 81.4% Opus · 81.0% Qwen Omni
Saturated. Don't differentiate frontier models on this benchmark; the spread is within run-to-run noise.

DocVQA · Opus 4.7 leader
Long-document OCR + reasoning
93.0% Opus 4.7 · 91.5% GPT-5.5 · 90.8% Gemini 3 · 87.9% Qwen Omni
Long-document OCR is Claude's territory. The gap widens on the 50+ page split — Opus 4.7's 1M context combined with strong vision is the production default for legal, contract, and technical documentation work.

ChartQA · GPT-5.5 leader
Chart and infographic reasoning
92.1% GPT-5.5 · 89.4% Gemini 3 · 88.0% Opus · 87.2% Qwen Omni
GPT-5.5 leads. Charts, dashboards, infographics — the structured-visual category that maps onto code reasoning. The right call for any workload involving analytics, BI tools, or financial reporting.

AI2D · Near-saturated
Science diagrams
96.2% GPT-5.5 · 95.4% Gemini 3 · 94.8% Opus · 93.0% Qwen Omni
Saturated at the top end. AI2D is now nearly maxed across frontier models — useful for educational and technical-illustration workloads but not a differentiator.
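
For the long-document workflow itself, a minimal sketch of the pipeline, assuming the Anthropic messages API with base64 PDF document blocks; the model ID is a placeholder based on this article's naming, not a confirmed API identifier.

```python
import base64
import anthropic  # assumes the Anthropic Python SDK is installed

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Load the full PDF (e.g. a 50+ page contract) and base64-encode it.
with open("contract.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

# Send the whole document as one block and ask for structured extraction.
# "claude-opus-4-7" is a placeholder taken from this article, not a confirmed model ID.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "document",
             "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64}},
            {"type": "text",
             "text": "Extract every party, date, and payment term as a JSON list."},
        ],
    }],
)
print(response.content[0].text)
```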

03 · Video: Video understanding — Gemini 3 dominates.

Video understanding is the differentiated axis. Gemini 3 Deep Think holds 78.4% on Video-MME long-form, with a 7-point gap to second place (GPT-5.5 at 71.2%). The gap is largest on multi-clip reasoning, temporal understanding, and tasks that require integrating across long video sequences. For any video workload in 2026, Gemini 3 is the default.

Video-MME · Gemini 3 leader
78.4% · Gemini 3 Deep Think · long-form video
Best in class on long-form video understanding. The 7-point lead over GPT-5.5 widens to 12 points on multi-clip / temporal reasoning sub-splits. Gemini 3's video-native training and Vertex AI multimodal infrastructure are the differentiators.

Video-MME · Short-form acceptable
71.2% · GPT-5.5 · second on video
Acceptable for short-form video tasks (sub-2-minute clips, single-scene reasoning). Falls off on long-form and multi-clip reasoning. OpenAI's video-native training has lagged Google's; expect this gap to close in 2026 H2.

Video-MME · Catching up
67.8% · Claude Opus 4.7 · third
Vision-strong but video-trained later. 67.8% on Video-MME is competitive on short-form scene understanding but lags on temporal reasoning. Anthropic's video-evaluation work in early 2026 hints at improvement; not yet caught up.
"For any workload that touches video — moderation, summarization, sports clip analysis, training-data extraction — Gemini 3 is the default. The gap is real and persistent."— Internal multimodal-eval notes, May 2026

04 · Audio: Audio comprehension and real-time ASR.

Audio is the second-most-differentiated axis after video. Two sub-categories matter: offline audio understanding (long-form listening, podcast summarization, lecture comprehension) and real-time ASR + reasoning (voice agents, customer service, accessibility). Gemini 3 leads on offline; Qwen 3.5 Omni leads on real-time.

Offline audio understanding (combined ASR + reasoning) · Gemini 3 leads
Gemini 3 84.7%, Qwen 3.5 Omni 81.2%, GPT-5.5 79.8%, Claude Opus 4.7 77.4% on the combined ASR-plus-reasoning benchmark. Gemini 3's audio-native training pulls ahead. The right default for offline workloads — podcast summarization, lecture analysis, audio-content moderation.

Real-time voice (sub-300ms first-token + ASR) · Qwen 3.5 Omni leads
Qwen 3.5 Omni hits 95%+ ASR accuracy with sub-300ms time-to-first-token in audio mode. Gemini 3 is stronger offline but slower in real time. GPT-5.5 has a real-time mode in the API. The right default for voice agents, customer-service bots, and accessibility apps.

Pure ASR (transcript quality only) · Whisper / Parakeet specialty models
Whisper-class specialty models (OpenAI Whisper-3, NVIDIA Parakeet) still lead on pure transcription accuracy. The frontier multimodal models trade some ASR accuracy for combined reasoning. Use a pure-ASR model for transcription-only workflows; use a multimodal model for transcript-plus-action.

Multilingual audio + reasoning · Qwen 3.5 Omni leads
Qwen 3.5 Omni's multilingual coverage (40+ languages with native ASR) is the broadest in the frontier set. Gemini 3 covers 30+ at strong quality. GPT-5.5 is strong in English, Chinese, Spanish, and Japanese but weaker in long-tail languages. Default to Qwen Omni for multilingual workloads.
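
Time-to-first-token is the number that gates the real-time column, and it is easy to measure against any streaming endpoint. A minimal sketch follows; `stream_reply` is a hypothetical wrapper around whichever provider's streaming voice endpoint you use, not a real SDK call.

```python
import time
from typing import Iterable

def time_to_first_token(stream: Iterable[str]) -> float:
    """Return seconds from call time until the first non-empty chunk arrives."""
    start = time.perf_counter()
    for chunk in stream:
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing any tokens")

# Usage sketch: stream_reply() is a hypothetical wrapper around your provider's
# streaming voice endpoint, not a real SDK function.
# ttft = time_to_first_token(stream_reply(audio_chunk))
# print(f"TTFT: {ttft * 1000:.0f} ms")  # the article's real-time bar is under 300 ms
```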

05 · Code-with-Vision: Code with vision — GPT-5.5 leads.

Code-with-vision is the newest evaluation category — tasks where the model must reason about code shown as a screenshot, IDE window, or terminal output, then produce a corrected or extended code response. DocVQA-Code and the SWE-Bench-Vision split measure this; GPT-5.5 leads both, by margins that mirror its chart-reasoning lead.
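
In practice the task looks like a standard vision-enabled chat call: a screenshot goes in as an image part, corrected code comes back as text. Here is a minimal sketch using an OpenAI-style chat completions request; the model string is this article's name for the model, not a confirmed API identifier.

```python
import base64
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the IDE / terminal screenshot the model must reason about.
with open("traceback_screenshot.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

# Code-with-vision task: read code from the screenshot, return corrected code.
# "gpt-5.5" is a placeholder taken from this article, not a confirmed model ID.
response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "This screenshot shows a failing test and its traceback. "
                     "Identify the bug and return the corrected function."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```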

Code-with-vision benchmarks · 4-model field

Source: Internal Apr 2026 evals · public model cards · DocVQA-Code
  • DocVQA-Code · GPT-5.5: 71.3% (leader · code shown as screenshot, reason about it)
  • DocVQA-Code · Gemini 3: 64.1% (strong on diagrams, weaker on code screenshots)
  • DocVQA-Code · Claude Opus 4.7: 61.4% (vision-acceptable, code-strong, mid-pack on combined)
  • DocVQA-Code · Qwen 3.5 Omni: 54.0% (open-weight reference)
  • SWE-Bench-Vision · GPT-5.5: 56.4% (vision-augmented coding tasks)
  • SWE-Bench-Vision · Opus 4.7: 50.7% (strong on text-only SWE; weaker with vision)

GPT-5.5's lead here mirrors its strength on charts and structured visuals. Code is structured visual content; screenshot reasoning is closer to chart reasoning than to natural-image understanding. The pattern is consistent across tasks involving structured 2D content (charts, code, spreadsheet screenshots).

06 · Decision: Picking by modality.

Most production multimodal deployments end up using two or three models across modalities. The pattern: pick the leader per modality and route by request type. The cost of multi-model routing is small; the quality lift on each modality is substantial.

Long-document OCR + extraction → Claude Opus 4.7
Legal contracts, technical PDFs, financial statements, research papers. Claude Opus 4.7 wins on the DocVQA long-document split and offers 1M context for full-document reasoning. Default choice.

Video understanding · any kind → Gemini 3 Deep Think
Content moderation, video summarization, sports clip analysis, educational video processing. Gemini 3 Deep Think wins by 7+ points on Video-MME. No close second.

Chart, dashboard, infographic reasoning → GPT-5.5
Analytics dashboard reading, financial chart analysis, infographic Q&A. GPT-5.5 wins ChartQA and DocVQA-Code. Pairs well with code-completion workflows.

Real-time voice agent / customer service → Qwen 3.5 Omni
Sub-300ms time-to-first-token with high-quality ASR + immediate reasoning. Qwen 3.5 Omni wins on real-time and covers 40+ languages natively. Pair with a higher-quality offline model for transcript review.

General multimodal app · single-model default → GPT-5.5
If forced to a single model: GPT-5.5 is the broadest performer (strong on charts, code, and images; acceptable on video and audio). Claude Opus 4.7 is the second choice if document-heavy; Gemini 3 third if video-heavy.
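
A minimal routing sketch of that pattern: map each request's modality to the per-modality leader named above. The model identifiers are this article's names, and `call_model` is a hypothetical client wrapper, not a real SDK function.

```python
from enum import Enum

class Modality(Enum):
    LONG_DOCUMENT = "long_document"    # 50+ page PDFs, contracts, filings
    VIDEO = "video"                    # clips, streams, multi-scene footage
    CHART_OR_CODE = "chart_or_code"    # dashboards, screenshots, diagrams
    REALTIME_VOICE = "realtime_voice"  # sub-300ms voice agents
    GENERAL = "general"                # everything else

# Per-modality leaders from this article; swap in real model IDs for your providers.
ROUTES = {
    Modality.LONG_DOCUMENT: "claude-opus-4.7",
    Modality.VIDEO: "gemini-3-deep-think",
    Modality.CHART_OR_CODE: "gpt-5.5",
    Modality.REALTIME_VOICE: "qwen-3.5-omni",
    Modality.GENERAL: "gpt-5.5",  # broadest single-model default per section 06
}

def pick_model(modality: Modality) -> str:
    """Route a request to the per-modality leader, falling back to the general default."""
    return ROUTES.get(modality, ROUTES[Modality.GENERAL])

# Usage sketch:
# model_id = pick_model(Modality.VIDEO)   # -> "gemini-3-deep-think"
# reply = call_model(model_id, request)   # call_model is your own client wrapper (hypothetical)
```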

07 · Conclusion: Pick by modality, not headline benchmark.

Multimodal benchmark map, April 2026

The era of single-model multimodal is over.

By April 2026 the multimodal frontier is differentiated enough that picking on aggregate benchmark scores misses the real decision. Every frontier model is good at standard image-QA; none is best at every modality. Production deployments that route by modality — Claude for documents, Gemini for video, GPT-5.5 for charts and code-with-vision, Qwen Omni for real-time voice — outperform single-model deployments by meaningful margins on each capability axis.

The benchmark progression has lagged the capability progression by about a year, as it always has. MMMU-Pro saturating in 2026 is the equivalent of MMLU saturating in 2024; the field has moved to harder benchmarks, and the harder benchmarks (Video-MME, DocVQA long-document split, real-time audio benchmarks) are where the meaningful differentiation lives now.

For agency and product teams, the practical takeaway is to stop evaluating multimodal models as a single capability and start evaluating them per-modality, with workload-specific evals on the modalities that actually matter for the deployment. The single-model multimodal era is over; the routed-multi-model era is the production reality.

Production multimodal AI

Move past single-model thinking. Pick by modality.

We design and operate multimodal AI deployments for engineering teams shipping vision, video, audio, and code-with-vision applications at scale — covering model selection per modality, hybrid routing, and per-workload eval construction.

Free consultation · Expert guidance · Tailored solutions
What we work on

Multimodal engagements

  • Modality-by-modality model selection
  • Hybrid routing across GPT-5.5, Gemini 3, Claude, Qwen Omni
  • Workload-specific eval construction (per modality)
  • Long-document OCR pipelines with Opus 4.7
  • Real-time voice agent stacks with Qwen Omni
FAQ · Multimodal AI in 2026

The questions we get every week.

Why doesn't MMMU-Pro differentiate frontier models anymore?

Because it's saturated. In Apr 2026, GPT-5.5, Gemini 3, Claude Opus 4.7, and Qwen 3.5 Omni all score within 2.4 points of each other on MMMU-Pro (81.0% to 82.8%). That's within run-to-run benchmark noise. The benchmark differentiated the field in 2024, when scores spread 12-15 points, but every frontier model has now been trained against MMMU-Pro to convergence. The meaningful differentiation has moved to Video-MME, DocVQA's long-document split, the audio benchmarks, and chart/code-with-vision tasks.