Multimodal evaluation has moved past the pure image-QA era. By April 2026 the four leading frontier multimodal models — GPT-5.5, Gemini 3 Deep Think, Claude Opus 4.7, and Qwen 3.5 Omni — all clear 80% on MMMU-Pro. That benchmark, which split the field by 10+ points in 2024, now spreads them by under 3 points. The differentiating axis has moved.
The new axes are video understanding (where Gemini 3 dominates), audio comprehension and ASR-plus-reasoning (where Gemini 3 again leads, with Qwen 3.5 Omni close behind on real-time applications), long-document OCR (where Claude Opus 4.7 holds the crown), chart reasoning and infographics (where GPT-5.5 leads), and code-with-vision (where GPT-5.5's longer reasoning traces shine).
This matrix of 80+ data cells covers the eight modal capabilities that drive 2026 deployment decisions. Use it to pick the right multimodal model per workload — and to know when to switch between them.
- 01 — MMMU-Pro is saturated: the headline image-QA benchmark no longer differentiates frontier models. GPT-5.5, Gemini 3, Claude Opus 4.7, and Qwen 3.5 Omni all score 81-83% on MMMU-Pro in Apr 2026. The 2024 field was 65-78%; the 2026 spread is 1.8 points. Don't pick a multimodal model on MMMU-Pro alone — it tells you everyone is good, not who wins.
- 02 — Gemini 3 Deep Think is the video leader by a wide margin in 2026. Video-MME (long-form): Gemini 3 78.4%, GPT-5.5 71.2%, Qwen 3.5 Omni 69.5%, Claude Opus 4.7 67.8%. The gap is largest on multi-clip reasoning and temporal understanding. For any workload involving video — content moderation, video summarization, sports analysis — Gemini 3 is the default.
- 03 — Claude Opus 4.7 owns long-document OCR; the gap widens with document length. DocVQA: Opus 4.7 93.0%, GPT-5.5 91.5%, Gemini 3 90.8%, Qwen 3.5 Omni 87.9%. The gap is small on standard DocVQA but widens to 5-8 points on the long-document split (50+ page PDFs). Opus 4.7's 1M context combined with strong vision makes it the clear choice for legal, contract, and technical-documentation workflows.
- 04 — GPT-5.5 leads on chart reasoning, infographic understanding, and code-with-vision. ChartQA 92.1% (vs 89.4% Gemini 3, 88.0% Opus 4.7), DocVQA-Code 71.3% (Gemini 3 64.1%), AI2D 96.2%. GPT-5.5's strong code reasoning carries over into vision tasks involving structured visuals — analytics dashboards, code screenshots, technical diagrams.
- 05 — Qwen 3.5 Omni is the real-time leader: ASR + audio reasoning at sub-300ms first-token. On real-time audio (ASR + immediate reasoning), Qwen 3.5 Omni hits sub-300ms time-to-first-token at 95%+ ASR accuracy. Gemini 3 has higher offline ASR quality but slower real-time response. For voice agents, customer-service bots, and accessibility applications, Qwen Omni is the default.
01 — The Shift
From image-QA to multi-axis multimodal.
In 2023-2024, multimodal evaluation was effectively MMMU and MMMU-Pro — image understanding plus QA. The field spread cleanly on those benchmarks because most models were genuinely worse at them. By 2026, MMMU-Pro is saturated; the meaningful frontier has moved to video, long-document OCR, audio understanding, and the edge cases of chart and code-with-vision reasoning.
The shift is reminiscent of pure-text 2023 — when MMLU saturated, the field moved to GSM8K, then to MATH, then to FrontierMath. The benchmark progression always tracks the capability frontier, with about a one-year lag.
MMMU-Pro saturation vs Video-MME differentiation
Source: Public model cards · Artificial Analysis · Apr 2026

The contrast is the story. MMMU-Pro's spread has dropped to 1.8 points; Video-MME's spread is 10.6 points. Production teams should treat the benchmark hierarchy accordingly: ignore MMMU-Pro as a differentiator, and weight Video-MME and the long-document, audio, and chart benchmarks heavily.
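The spread figures are simple best-minus-worst arithmetic over the per-model scores quoted in this article's benchmark cards; headline spread numbers elsewhere may differ slightly by rounding or run choice. A quick sketch (figures are the article's own, used here purely as illustration):

```python
# Per-model scores as quoted in this article (Apr 2026).
mmmu_pro = {"GPT-5.5": 82.8, "Gemini 3": 82.1,
            "Claude Opus 4.7": 81.4, "Qwen 3.5 Omni": 81.0}
video_mme = {"Gemini 3": 78.4, "GPT-5.5": 71.2,
             "Qwen 3.5 Omni": 69.5, "Claude Opus 4.7": 67.8}

def spread(scores: dict) -> float:
    """Benchmark spread: best score minus worst score, in points."""
    return round(max(scores.values()) - min(scores.values()), 1)

print(spread(mmmu_pro))   # 1.8  -> everyone is good; no signal
print(spread(video_mme))  # 10.6 -> real differentiation
```

A saturated benchmark shows up immediately as a spread inside run-to-run noise, which is the practical test for dropping it from a model-selection rubric.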
02 — Image & Documents
Image and long-document OCR.
The image-and-document axis covers MMMU-Pro (saturated), DocVQA (long-document OCR), AI2D (diagrams), and the chart reasoning benchmarks. Claude Opus 4.7 leads on long-document OCR thanks to its 1M context combined with strong vision; GPT-5.5 leads on chart and diagram reasoning; everyone else is close on standard DocVQA but falls off on the 50+ page split.
Standard college-level image QA (MMMU-Pro)
82.8% GPT-5.5 · 82.1% Gemini 3 · 81.4% Opus · 81.0% Qwen Omni
Saturated. Don't differentiate frontier models on this benchmark; the spread is within run-to-run noise.
Verdict: Saturated

Long-document OCR + reasoning (DocVQA)
93.0% Opus 4.7 · 91.5% GPT-5.5 · 90.8% Gemini 3 · 87.9% Qwen Omni
Long-document OCR is Claude's territory. The gap widens on the 50+ page split — Opus 4.7's 1M context combined with strong vision makes it the production default for legal, contract, and technical-documentation work.
Verdict: Opus 4.7 leader

Chart and infographic reasoning (ChartQA)
92.1% GPT-5.5 · 89.4% Gemini 3 · 88.0% Opus · 87.2% Qwen Omni
GPT-5.5 leads. Charts, dashboards, infographics — the structured-visual category that maps onto code reasoning. The right call for any workload involving analytics, BI tools, or financial reporting.
Verdict: GPT-5.5 leader

Science diagrams (AI2D)
96.2% GPT-5.5 · 95.4% Gemini 3 · 94.8% Opus · 93.0% Qwen Omni
Saturated at the top end. AI2D is now nearly maxed across frontier models — useful for educational and technical-illustration workloads, but not a differentiator.
Verdict: Near-saturated

03 — Video
Video understanding — Gemini 3 dominates.
Video understanding is the differentiated axis. Gemini 3 Deep Think holds 78.4% on Video-MME long-form, with a 7-point gap to second place (GPT-5.5 at 71.2%). The gap is largest on multi-clip reasoning, temporal understanding, and tasks that require integrating across long video sequences. For any video workload in 2026, Gemini 3 is the default.
Gemini 3 Deep Think · long-form video
Best in class on long-form video understanding. The 7-point lead over GPT-5.5 widens to 12 points on the multi-clip / temporal-reasoning sub-splits. Gemini 3's video-native training and Vertex AI multimodal infrastructure are the differentiators.
Verdict: Gemini 3 leader

GPT-5.5 · second on video
Acceptable for short-form video tasks (sub-2-minute clips, single-scene reasoning). Falls off on long-form and multi-clip reasoning. OpenAI's video-native training has lagged Google's; expect this gap to close in 2026 H2.
Verdict: Short-form acceptable

Claude Opus 4.7 · third
Vision-strong but video-trained later. 67.8% on Video-MME is competitive on short-form scene understanding but lags on temporal reasoning. Anthropic's video-evaluation work in early 2026 hints at improvement; not yet caught up.
Verdict: Catching up

"For any workload that touches video — moderation, summarization, sports clip analysis, training-data extraction — Gemini 3 is the default. The gap is real and persistent."
— Internal multimodal-eval notes, May 2026
04 — Audio
Audio comprehension and real-time ASR.
Audio is the second-most-differentiated axis after video. Two sub-categories matter: offline audio understanding (long-form listening, podcast summarization, lecture comprehension) and real-time ASR + reasoning (voice agents, customer service, accessibility). Gemini 3 leads on offline; Qwen 3.5 Omni leads on real-time.
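Time-to-first-token, the metric behind the real-time comparisons, is measured from request dispatch to the arrival of the first streamed token. A minimal sketch, using a stand-in generator in place of any real provider's streaming client (the 50ms sleep is a simulated latency, not a measured one):

```python
import time
from typing import Iterable, Iterator

def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming model response; a real client would
    yield tokens from the provider's streaming API instead."""
    time.sleep(0.05)  # simulated network + model latency
    yield from ["Hello", ",", " world"]

def time_to_first_token(stream: Iterable[str]) -> float:
    """Seconds from iteration start until the first token arrives."""
    start = time.monotonic()
    for _ in stream:
        return time.monotonic() - start
    raise RuntimeError("stream produced no tokens")

ttft = time_to_first_token(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms")
```

Benchmarking against a sub-300ms budget means running this measurement many times and checking a high percentile, not the mean; a voice agent that misses the budget on p95 feels slow regardless of its average.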
Offline audio understanding (combined ASR + reasoning)
Gemini 3 84.7%, Qwen 3.5 Omni 81.2%, GPT-5.5 79.8%, Claude Opus 4.7 77.4% on the combined ASR-plus-reasoning benchmark. Gemini 3's audio-native training pulls ahead. The right default for offline workloads — podcast summarization, lecture analysis, audio-content moderation.
Verdict: Gemini 3 · offline

Real-time voice (sub-300ms first-token + ASR)
Qwen 3.5 Omni hits 95%+ ASR accuracy with sub-300ms time-to-first-token in audio mode. Gemini 3 is stronger offline but slower in real time. GPT-5.5 has a real-time mode in the API. The right default for voice agents, customer-service bots, and accessibility apps.
Verdict: Qwen 3.5 Omni · real-time

Pure ASR (transcript quality only)
Whisper-class specialty models (OpenAI Whisper-3, NVIDIA Parakeet) still lead on pure transcription accuracy. The frontier multimodal models trade some ASR accuracy for combined reasoning. Use a pure-ASR model for transcription-only workflows; use a multimodal model for transcript-plus-action.
Verdict: Whisper / Parakeet specialty

Multilingual audio + reasoning
Qwen 3.5 Omni's multilingual coverage (40+ languages with native ASR) is the broadest in the frontier set. Gemini 3 covers 30+ at strong quality. GPT-5.5 is strong in English, Chinese, Spanish, and Japanese but weaker in long-tail languages. Default to Qwen Omni for multilingual work.
Verdict: Qwen 3.5 Omni · multilingual

05 — Code-with-Vision
Code with vision — GPT-5.5 leads.
Code-with-vision is the newest evaluation category — tasks where the model must reason about code shown as a screenshot, IDE window, or terminal output, then produce a corrected or extended code response. DocVQA-Code and the SWE-Bench-Vision split measure this; GPT-5.5 leads both, by margins that mirror its chart-reasoning lead.
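Mechanically, a code-with-vision request is just a multimodal chat message pairing the screenshot with a code question. A sketch of the payload in the OpenAI-style `image_url` content-part shape (whether your provider accepts this exact shape, and which model names to use, are assumptions to verify against its documentation):

```python
import base64

def code_screenshot_message(image_bytes: bytes, question: str) -> dict:
    """Pair a code screenshot with a question as one user message,
    using the OpenAI-style image_url content-part format."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }

# Illustrative call; real usage would pass the bytes of an actual PNG.
msg = code_screenshot_message(
    b"<png bytes>",
    "Find the off-by-one error in this editor screenshot.",
)
```

The benchmark tasks then score the model's returned code against the ground-truth fix, which is why code-reasoning strength transfers so directly to this category.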
Code-with-vision benchmarks · 4-model field
Source: Internal Apr 2026 evals · public model cards · DocVQA-Code

GPT-5.5's lead here mirrors its strength on charts and structured visuals. Code is structured visual content; screenshot reasoning is closer to chart reasoning than to natural-image understanding. The pattern is consistent across tasks involving structured 2D content (charts, code, spreadsheet screenshots).
06 — Decision
Picking by modality.
Most production multimodal deployments end up using two or three models across modalities. The pattern: pick the leader per modality and route by request type. The cost of multi-model routing is small; the quality lift on each modality is substantial.
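The routing layer is a small lookup table, not an ML problem. A minimal sketch (the model identifiers are illustrative stand-ins, not guaranteed API model names):

```python
# Per-modality leaders; identifiers are illustrative stand-ins,
# not real provider API names.
MODALITY_ROUTES = {
    "document": "claude-opus-4.7",
    "video": "gemini-3-deep-think",
    "chart": "gpt-5.5",
    "code_screenshot": "gpt-5.5",
    "realtime_voice": "qwen-3.5-omni",
}
DEFAULT_MODEL = "gpt-5.5"  # broadest single-model fallback

def route(modality: str) -> str:
    """Return the per-modality leader, falling back to the generalist."""
    return MODALITY_ROUTES.get(modality, DEFAULT_MODEL)

print(route("video"))          # gemini-3-deep-think
print(route("mixed_unknown"))  # gpt-5.5
```

In production the `modality` key typically comes from request metadata (MIME type, endpoint, or an upstream classifier), and the table is config, so re-benchmarking a modality means editing one entry rather than redeploying code.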
Long-document OCR + extraction
Legal contracts, technical PDFs, financial statements, research papers. Claude Opus 4.7 wins on the DocVQA long-document split and offers 1M context for full-document reasoning. Default choice.
Pick: Claude Opus 4.7

Video understanding · any kind
Content moderation, video summarization, sports clip analysis, educational video processing. Gemini 3 Deep Think wins by 7+ points on Video-MME. No close second.
Pick: Gemini 3 Deep Think

Chart, dashboard, infographic reasoning
Analytics dashboard reading, financial chart analysis, infographic Q&A. GPT-5.5 wins ChartQA and DocVQA-Code. Pairs well with code-completion workflows.
Pick: GPT-5.5

Real-time voice agent / customer service
Sub-300ms time-to-first-token with high-quality ASR and immediate reasoning. Qwen 3.5 Omni wins on real-time and covers 40+ languages natively. Pair it with a higher-quality offline model for transcript review.
Pick: Qwen 3.5 Omni

General multimodal app · single-model default
If forced to a single model: GPT-5.5 is the broadest performer (strong on charts, code, and images; acceptable on video and audio). Claude Opus 4.7 is the second choice if document-heavy; Gemini 3 third if video-heavy.
Pick: GPT-5.5 default

07 — Conclusion
Pick by modality, not headline benchmark.
The era of single-model multimodal is over.
By April 2026 the multimodal frontier is differentiated enough that picking on aggregate benchmark scores misses the real decision. Every frontier model is good at standard image-QA; none is best at every modality. Production deployments that route by modality — Claude for documents, Gemini for video, GPT-5.5 for charts and code-with-vision, Qwen Omni for real-time voice — outperform single-model deployments by meaningful margins on each capability axis.
The benchmark progression has lagged the capability progression by about a year, as it always has. MMMU-Pro saturating in 2026 is the equivalent of MMLU saturating in 2024; the field has moved to harder benchmarks, and the harder benchmarks (Video-MME, DocVQA long-document split, real-time audio benchmarks) are where the meaningful differentiation lives now.
For agency and product teams, the practical takeaway is to stop evaluating multimodal models as a single capability and start evaluating them per-modality, with workload-specific evals on the modalities that actually matter for the deployment. The single-model multimodal era is over; the routed-multi-model era is the production reality.