AI Development · Decision Matrix · 13 min read · Published May 28, 2026

An honest decision guide for the May 2026 frontier.

Claude Opus 4.8 vs GPT-5.5: benchmarks, pricing, and which to pick.

Both labs now field 1M-token flagships with agentic coding as their primary pitch. Here is where each model wins, where each loses, and which to pick for each production workload — including the pricing asymmetry that changes the economics at scale.

Digital Applied Team
Senior strategists · Published May 28, 2026
Sources: 3 official
SWE-bench Pro: 58.6% / 69.2% (GPT-5.5 / Opus 4.8)
Terminal-Bench 2.1: 78.2% / 74.6% (GPT-5.5 / Opus 4.8)
Pricing (in / out per 1M): $5 / $25 (Opus 4.8 flat, no surcharge)
GraphWalks 1M (BFS): 45.4% / 68.1% (GPT-5.5 / Opus 4.8)

Anthropic shipped Claude Opus 4.8 on May 28, 2026 — the same day as this guide — posting 69.2% on SWE-bench Pro and 88.6% on SWE-bench Verified, according to the Anthropic system card. GPT-5.5, the current OpenAI flagship, remains the strongest competitor on terminal-centric coding benchmarks and for workloads that stay under 272K input tokens. This guide is a head-to-head decision matrix: where each model leads, where the margins are narrow, and how the flat-versus-surcharge pricing structure changes the economics at scale.

The stakes are concrete. Both models are positioned as the production choice for agentic coding, long-context reasoning, and complex tool use — the highest-value commercial AI workloads of mid-2026. Teams routing incorrectly by benchmark headline rather than workload shape can leave meaningful performance and cost on the table. For context on the Opus 4.8 launch itself, see our Opus 4.8 release and dynamic workflows guide; for the prior-generation matchup, see GPT-5.5 vs Claude Opus 4.7.

All benchmark numbers in this post are sourced directly from the Opus 4.8 system card and published third-party leaderboards. Where a head-to-head number was not available (e.g., GPT-5.5 on SWE-bench Verified), this guide notes the absence rather than fabricating a figure. This is a decision guide, not a marketing sheet.

Key takeaways
  1. Opus 4.8 leads SWE-bench Pro by 10.6 points. According to the Anthropic system card, Opus 4.8 scores 69.2% on SWE-bench Pro versus 58.6% for GPT-5.5, a meaningful gap on codebase-resolution tasks. Opus 4.8 also posts 88.6% on SWE-bench Verified, though the card did not publish a comparable GPT-5.5 figure on that harness.
  2. GPT-5.5 wins Terminal-Bench 2.1, and harness choice matters. On the Terminus-2 public harness, GPT-5.5 scores 78.2% versus 74.6% for Opus 4.8 run at high effort. GPT-5.5 reaches 83.4% under its own Codex CLI harness. The gap is real; it is also harness-dependent. Terminal-centric and latency-sensitive pipelines favour GPT-5.5.
  3. Opus 4.8 is flat-priced; GPT-5.5 has a long-context surcharge. Opus 4.8 charges $5/$25 per million input/output tokens regardless of context length, up to its 1M-token window. GPT-5.5 is $5/$30 under 272K input tokens, and a long-context surcharge applies above that threshold: roughly 2× input and 1.5× output for the whole session. On these rates Opus 4.8 is already slightly cheaper on output below 272K, and for frequent use of 272K+ contexts it is clearly the lower-cost model.
  4. Opus 4.8 leads long-context retrieval by a large margin. On GraphWalks long-context F1, Opus 4.8 leads GPT-5.5 by 12.2 points at 256K (BFS), 22.7 points at 1M (BFS), and 24.8 points at 1M (Parents). These gaps are large enough to be architecturally decisive for workloads that routinely reason over entire codebases or multi-document corpora.
  5. Most remaining margins are single-digit; routing by task type is smarter than picking one model. On reasoning (HLE, ArXivMath) and finance (Finance Agent v2), the margins are narrow enough that task shape, ecosystem fit, and economics should drive the decision, not headline benchmark deltas.

01 · Release Snapshot
Two flagships, both at 1M context, both agentic-first.

Before the benchmarks, the structural profile. Both models carry 1M-token context windows and are marketed as the best option for agentic coding from their respective labs. The meaningful structural differences sit in three places: the effort/reasoning model (Opus 4.8 defaults to high, with extra/xhigh/max selectable; GPT-5.5 uses a Thinking default plus a Pro tier), the pricing model (Opus 4.8 flat; GPT-5.5 has a surcharge above 272K input), and the cloud distribution (Opus 4.8 ships GA on Anthropic API, Bedrock, Vertex, and Foundry on day one; GPT-5.5 is live in ChatGPT and Codex, with the API rolling out).

GPT-5.5 · OpenAI
Current GPT flagship
API IDs: gpt-5.5 / gpt-5.5-pro
1M-token context; Thinking mode default in ChatGPT. Pricing: $5/$30 per 1M input/output under 272K tokens; a long-context surcharge applies above that threshold (~2× input, 1.5× output). Pro tier: $30/$180. Leads Terminal-Bench 2.1 under Codex CLI harness (83.4%); competitive on ArXivMath.

Claude Opus 4.8 · Anthropic
Shipped May 28, 2026
API ID: claude-opus-4-8
1M-token context; flat $5/$25 pricing regardless of context length. Fast mode at $10/$50 for 2.5× speed. Defaults to high effort; extra/xhigh/max selectable. Leads SWE-bench Pro (69.2%), OSWorld-Verified (83.4%), MCP-Atlas (82.2%), and all GraphWalks long-context retrievals.

Spec sheet

Side-by-side specification

Spec | GPT-5.5 | Claude Opus 4.8
Ship date | April 23, 2026 | May 28, 2026
API model ID | gpt-5.5 / gpt-5.5-pro | claude-opus-4-8
Context window | 1M tokens | 1M tokens (flat pricing)
Pricing (in / out per 1M) | $5 / $30 (under 272K); surcharge above | $5 / $25 (flat, no surcharge)
Premium tier | GPT-5.5 Pro: $30 / $180 | Fast mode: $10 / $50 (2.5× speed)
Effort / reasoning | Thinking (default), Pro | High (default); extra / xhigh / max
Cloud availability (GA) | OpenAI API (rolling), ChatGPT, Codex | Anthropic API + Bedrock + Vertex AI + Foundry
SWE-bench Pro | 58.6% | 69.2%
Terminal-Bench 2.1 | 78.2% (Terminus-2) / 83.4% (Codex CLI) | 74.6% (high effort, Terminus-2)

02 · Coding & Agents
SWE-bench, Terminal-Bench, and the harness story.

Coding and agentic evaluation is the most contested category — and the one where the harness choice most visibly affects the results. Opus 4.8 leads SWE-bench Pro (69.2% vs 58.6%), the benchmark that tests resolving real GitHub issues. It also leads OSWorld-Verified computer use (83.4% vs 78.7%) and MCP-Atlas tool use (82.2% vs 75.3%). GPT-5.5 leads Terminal-Bench 2.1 under its native Codex CLI harness (83.4%); when both models are run on the Terminus-2 public harness, GPT-5.5 scores 78.2% and Opus 4.8 scores 74.6% — a smaller but still real gap that reflects genuine strength on terminal-centric, latency-sensitive pipelines.

The harness caveat matters here in a way that affects procurement decisions. GPT-5.5's 83.4% Terminal-Bench figure is on the Codex CLI harness — Anthropic's own harness for Opus 4.8 may produce different numbers on the same tasks, just as GPT-5.5's Terminus-2 figure (78.2%) differs from its Codex CLI result. The gap is real; its absolute magnitude depends on your evaluation environment. Teams considering Terminal-Bench performance as a primary signal should test both models on their own actual pipelines before committing.

On SWE-bench Verified, Opus 4.8 scores 88.6% per the system card — a strong result. The card did not publish a GPT-5.5 figure on this harness. AutomationBench (Zapier integrations) shows both models at modest levels — Opus 4.8 at 15.5%, GPT-5.5 at 12.9% — indicating that complex cross-tool automation remains a frontier challenge for either model.

A partner assessment cited on the Opus 4.8 announcement page reported that Opus 4.8 was the only model to complete every case end-to-end on the partner's Super-Agent benchmark, and described it as beating prior Opus models and GPT-5.5 at cost parity. This is a proprietary, non-public benchmark, so treat it as an indicative signal rather than a third-party verifiable score.

Coding & agentic benchmarks

Benchmark | GPT-5.5 | Opus 4.8 | Lead
SWE-bench Pro | 58.6% | 69.2% | +10.6 · Opus 4.8
SWE-bench Verified | Not published | 88.6% | Opus 4.8 only
Terminal-Bench 2.1 (Terminus-2 harness) | 78.2% | 74.6% | +3.6 · GPT-5.5
OSWorld-Verified (computer use) | 78.7% | 83.4% | +4.7 · Opus 4.8
MCP-Atlas (tool use) | 75.3% | 82.2% | +6.9 · Opus 4.8
AutomationBench (Zapier) | 12.9% | 15.5% | +2.6 · Opus 4.8

Coding verdict: Opus 4.8 leads codebase-resolution evals (SWE-bench Pro +10.6 pts, Verified 88.6%), computer use (OSWorld +4.7 pts), and tool orchestration (MCP-Atlas +6.9 pts). GPT-5.5 leads Terminal-Bench 2.1 by 3.6 points on the shared Terminus-2 harness and reaches 83.4% under its native Codex CLI harness, a genuine win for terminal-centric and latency-sensitive coding pipelines. For broad agentic coverage, Opus 4.8 has the stronger profile; for Codex CLI-native terminal work, GPT-5.5 is the sharper tool.

Harness methodology — Terminal-Bench 2.1

Terminal-Bench 2.1 figures are harness-dependent. Opus 4.8 ran at high effort via the Terminus-2 public harness (74.6%). GPT-5.5 scored 78.2% on Terminus-2 and 83.4% via its native Codex CLI harness. The higher Codex CLI number reflects GPT-5.5's deeper integration with its own tooling — not an apples-to-apples comparison with the Terminus-2 Opus 4.8 figure. Treat both as directional signals; run your own workloads to get task-specific numbers.
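
One way to act on the advice to run your own workloads is a paired comparison: run both models on the same task set, then look at the difference in pass rate together with its uncertainty. The sketch below assumes you already have per-task pass/fail results. The function name and the sample data are hypothetical, and the normal-approximation standard error is only a rough guide for small task sets.

```python
import math

def paired_pass_rate_diff(results_a, results_b):
    """Return (rate_a, rate_b, diff, stderr) for paired pass/fail lists.

    results_a[i] and results_b[i] must be outcomes on the same task i.
    """
    if len(results_a) != len(results_b) or len(results_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 tasks")
    n = len(results_a)
    rate_a = sum(results_a) / n
    rate_b = sum(results_b) / n
    # Per-task difference: +1 if only A passed, -1 if only B passed, else 0.
    diffs = [int(a) - int(b) for a, b in zip(results_a, results_b)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return rate_a, rate_b, mean, math.sqrt(var / n)

# Hypothetical results on 10 in-house terminal tasks (not benchmark data).
model_a = [True] * 8 + [False] * 2
model_b = [True] * 6 + [False] * 4
rate_a, rate_b, diff, stderr = paired_pass_rate_diff(model_a, model_b)
print(f"A={rate_a:.0%} B={rate_b:.0%} diff={diff:+.2f} stderr={stderr:.2f}")
```

In this made-up example the 20-point difference comes with a standard error of about 13 points, so ten tasks cannot separate the two models. Confirming a gap of a few points locally needs a much larger task set.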

03 · Reasoning & Knowledge
HLE, ArXivMath, GDPval: a mixed picture.

On reasoning and knowledge-work evals, Opus 4.8 leads on most benchmarks but the margins vary considerably. Humanity's Last Exam without tools: 49.8% vs 41.4% — an 8.4-point lead that is the largest gap in this category. With tools: 57.9% vs 52.2% — 5.7 points. On ArXivMath, with GPT-5.5 running at xhigh effort, the scores are 71.82% vs 71.48% — effectively tied, within any reasonable noise threshold.

GDPval-AA, the knowledge-work ELO benchmark covering diverse professional tasks, shows Opus 4.8 leading GPT-5.5's xhigh setting by approximately 121 ELO points, corresponding to roughly a 66.7% pairwise win rate per the system card. The bar chart below uses relative ELO percentages for visual comparison; the raw ELO scores are 1890 (Opus 4.8) vs 1769 (GPT-5.5 xhigh).
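
The win-rate figure is consistent with the standard Elo expected-score formula, which maps a rating gap to a pairwise win probability. The sketch below assumes GDPval-AA uses the conventional 400-point logistic scale; the system card remains the authority on the exact method.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

opus_elo, gpt_elo = 1890, 1769  # raw GDPval-AA scores quoted above
win_rate = elo_expected_score(opus_elo, gpt_elo)
print(f"gap={opus_elo - gpt_elo} expected win rate={win_rate:.1%}")
```

A 121-point gap on this scale works out to roughly a two-in-three expected win rate, which matches the 66.7% quoted from the system card.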

Finance Agent v2, a Vals AI benchmark for financial analysis tasks, is one of the closer calls: Opus 4.8 at 53.9% vs GPT-5.5 at 51.8%. The v2 harness is deliberately harder than v1, and both models sit in the low-50s — a reminder that many professional knowledge-work benchmarks remain genuinely difficult for current frontier models. For teams evaluating either model for financial workflow automation, the agent coding cost and multi-model composition guide covers production cost modelling for stacks that combine both models.

The pattern across this category: Opus 4.8 has a meaningful lead on multi-domain professional reasoning (HLE, GDPval-AA), while math (ArXivMath) sits at parity. Neither model dominates the other on every axis, which means the decision should hinge on which workloads dominate your use case — not on any single headline number.

Reasoning & knowledge-work benchmarks

Benchmark | GPT-5.5 | Opus 4.8 | Lead
Humanity's Last Exam (no tools) | 41.4% | 49.8% | +8.4 · Opus 4.8
Humanity's Last Exam (with tools) | 52.2% | 57.9% | +5.7 · Opus 4.8
ArXivMath (GPT-5.5 at xhigh) | 71.48% | 71.82% | Effectively tied
GDPval-AA (knowledge-work ELO, relative bar values) | 65% (1769 ELO) | 76% (1890 ELO) | +121 ELO · Opus 4.8
Finance Agent v2 (Vals AI) | 51.8% | 53.9% | +2.1 · Opus 4.8

Reasoning verdict: Opus 4.8 leads multi-domain professional reasoning (HLE +8.4 pts without tools, +5.7 pts with tools; GDPval-AA +121 ELO) and narrowly edges Finance Agent v2. ArXivMath is a statistical tie. For demanding knowledge-work pipelines where breadth of expertise matters (research, analysis, professional consulting contexts), Opus 4.8 has the better profile.

"On GDPval-AA, Opus 4.8 leads GPT-5.5 xhigh by approximately 121 ELO points, roughly a 66.7% pairwise win rate. That is the kind of gap that translates to visible quality differences in knowledge-work-heavy production deployments."
Digital Applied analysis based on the Anthropic Opus 4.8 system card, May 28, 2026

04 · Long Context & Cost
Flat pricing vs surcharge: the economics of long context.

Long-context performance and pricing are the two axes most likely to drive the economics of agentic pipelines at scale — and they are deeply connected in this comparison. Both models have 1M-token context windows. The headline is parity. The retrieval reality and the pricing reality are not.

On GraphWalks long-context F1 — a benchmark that tests factual retrieval over very large context windows using graph traversal tasks — Opus 4.8 leads GPT-5.5 across every configuration tested. At BFS 256K, the lead is 12.2 points (85.9% vs 73.7%). At BFS 1M, it widens to 22.7 points (68.1% vs 45.4%). At Parents 1M, it is 24.8 points (83.3% vs 58.5%). These are large enough gaps to be architecturally decisive: for workloads that routinely reason over entire codebases, multi-document research corpora, or long agent execution traces, Opus 4.8 can be expected to retrieve more reliably at the upper end of the context window.

The pricing structure compounds this advantage for long-context workloads. Opus 4.8 charges a flat $5 per million input tokens and $25 per million output tokens regardless of where in the context window you are operating. GPT-5.5 charges $5/$30 for sessions under 272K input tokens, a competitive rate, but applies a long-context surcharge above that threshold: reportedly about 2× the input rate and 1.5× the output rate, applied to the whole session.

The inflection point is roughly the 272K-token session boundary. For workloads that consistently stay below 272K input tokens, GPT-5.5's $5/$30 base rate makes it modestly more expensive on output than Opus 4.8's $5/$25, but the gap is manageable. Above 272K, GPT-5.5's effective per-session cost rises sharply while Opus 4.8's stays constant. For teams doing frequent full-codebase reasoning or long-document analysis, the Opus 4.8 flat-rate model can become materially cheaper even at equal or lower individual benchmark scores — though actual savings depend on your session composition.
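
The threshold behaviour can be sketched as a per-session cost function. The rates are the ones quoted in this guide. The surcharge multipliers (2× input, 1.5× output, applied to the whole session once input exceeds 272K tokens) are the approximation used here rather than confirmed billing rules, and the function and dictionary names are illustrative.

```python
SURCHARGE_THRESHOLD = 272_000  # input tokens; applies to GPT-5.5 only

# USD per 1M tokens (input, output), as quoted in this guide.
RATES = {
    "opus-4.8": (5.0, 25.0),
    "gpt-5.5": (5.0, 30.0),
}

def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Approximate USD cost of one session under the guide's pricing assumptions."""
    in_rate, out_rate = RATES[model]
    if model == "gpt-5.5" and input_tokens > SURCHARGE_THRESHOLD:
        # Assumed surcharge: ~2x input and ~1.5x output for the whole session.
        in_rate, out_rate = in_rate * 2.0, out_rate * 1.5
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

for tokens_in in (200_000, 300_000):
    for model in RATES:
        print(model, tokens_in, round(session_cost(model, tokens_in, 10_000), 2))
```

Under these assumptions, a session with 200K input and 10K output tokens costs about $1.25 on Opus 4.8 and $1.30 on GPT-5.5, while a 300K-input session costs about $1.75 versus $3.45.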

The fast mode tier ($10/$50 per 1M) also deserves mention. Opus 4.8 offers a 2.5× speed increase at 2× the standard price — useful for latency-sensitive agentic loops where raw throughput matters more than cost per token. This tier sits between the GPT-5.5 base tier and the GPT-5.5 Pro tier ($30/$180) in both price and capability positioning.

GraphWalks long-context F1: Opus 4.8 vs GPT-5.5

Source: Opus 4.8 system card

Configuration | GPT-5.5 | Opus 4.8 | Opus 4.8 lead
BFS 256K | 73.7% | 85.9% | +12.2
BFS 1M | 45.4% | 68.1% | +22.7
Parents 1M | 58.5% | 83.3% | +24.8

Why this matters: Context-window parity (1M vs 1M) does not mean retrieval parity. The 22.7- and 24.8-point leads Opus 4.8 holds at 1M tokens are a qualitative difference: at the same context size, GPT-5.5's F1 is roughly 30% to 33% lower than Opus 4.8's in relative terms.

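
The GraphWalks gaps can be read two ways: as absolute F1 points, or as GPT-5.5's shortfall relative to Opus 4.8's score. The snippet below computes both from the figures reported in this section. F1 blends precision and recall, so the relative shortfall is only a rough proxy for the share of facts missed.

```python
# GraphWalks long-context F1 (%), as reported in this guide: (GPT-5.5, Opus 4.8).
GRAPHWALKS_F1 = {
    "BFS 256K": (73.7, 85.9),
    "BFS 1M": (45.4, 68.1),
    "Parents 1M": (58.5, 83.3),
}

def gaps(gpt: float, opus: float) -> tuple[float, float]:
    """Return (absolute gap in points, GPT-5.5 shortfall relative to Opus 4.8)."""
    return round(opus - gpt, 1), round(1 - gpt / opus, 3)

for config, (gpt, opus) in GRAPHWALKS_F1.items():
    absolute, relative = gaps(gpt, opus)
    print(f"{config}: +{absolute} pts, GPT-5.5 {relative:.1%} lower in relative terms")
```

At 1M tokens the relative shortfall is about 33% on BFS and about 30% on Parents, compared with about 14% at 256K.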
Cost illustration: 50M input tokens / 5M output tokens per month

Tier | Monthly cost
GPT-5.5, under 272K input | $400
Opus 4.8, flat rate (any context) | $375
GPT-5.5, above 272K input (surcharge) | $725
Opus 4.8 fast mode | $750
GPT-5.5 Pro, premium tier | $2,400

Illustrative monthly cost at 50M input + 5M output tokens. GPT-5.5 long-context surcharge approximated at 2× input / 1.5× output for sessions above 272K input. Actual costs depend on session composition and API pricing at time of purchase; verify current rates before procurement. Fast-mode Opus 4.8 pricing at $10/$50 per 1M.
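
The illustration's dollar figures can be reproduced with per-million arithmetic at 50M input and 5M output tokens per month. The sketch below uses the rates quoted in this guide and the approximate surcharge multipliers; tier labels and variable names are illustrative.

```python
INPUT_M, OUTPUT_M = 50, 5  # monthly volume in millions of tokens

# USD per 1M tokens (input, output). The surcharge row assumes ~2x input / 1.5x output.
TIERS = {
    "GPT-5.5, under 272K input": (5, 30),
    "Opus 4.8, flat rate": (5, 25),
    "GPT-5.5, above 272K input": (5 * 2, 30 * 1.5),
    "Opus 4.8 fast mode": (10, 50),
    "GPT-5.5 Pro": (30, 180),
}

def monthly_cost(in_rate: float, out_rate: float) -> float:
    """Monthly USD cost for the fixed volume above."""
    return INPUT_M * in_rate + OUTPUT_M * out_rate

costs = {tier: monthly_cost(*rates) for tier, rates in TIERS.items()}
for tier, usd in costs.items():
    print(f"{tier}: ${usd:,.0f}")
```

Setting `OUTPUT_M` to 25 instead gives $1,000, $875, $1,625, $1,750, and $6,000 respectively, so output-heavy workloads widen the absolute gaps between tiers.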

05 · Ecosystem & Workflow
Claude Code dynamic workflows vs OpenAI Codex tooling.

Benchmark scores only capture part of the production decision. Ecosystem fit (how well a model integrates with your existing tooling, deployment targets, and developer workflows) often determines which model a team actually ships with, even when benchmark margins are narrow.

Anthropic ecosystem. Opus 4.8 launched alongside dynamic workflows in Claude Code (research preview) — a capability that allows Opus 4.8 to plan and execute multi-step agentic tasks while adapting its tool-use strategy mid-execution. This is a meaningful workflow-level advantage for teams already on Claude Code. Opus 4.8 is available on Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry from day one, covering the major enterprise deployment paths. The Opus 4.8 launch guide covers the dynamic workflows capability in depth.

OpenAI ecosystem. GPT-5.5 is the default model in ChatGPT for Plus/Pro/Business/Enterprise subscribers and ships with deep Codex CLI integration, the same harness that produced its 83.4% Terminal-Bench result. For teams whose coding pipeline is built around Codex, that integration is not just a benchmark number; it is a toolchain compatibility advantage. API access for GPT-5.5 was described as rolling out after launch, with additional safety and security work needed for API-scale serving.

Multi-model patterns. The emerging production pattern among engineering-led teams is not to pick one model but to route by task shape. The three-model agentic coding cost comparison (Gemini 3.5 Flash / GPT-5.5 / Opus 4.7) documented this routing approach at the previous model generation. With Opus 4.8, the router logic is updated: long-context and broad agentic work routes to Opus 4.8; terminal-centric coding routes to GPT-5.5 under Codex; bulk and cost-sensitive tasks route to smaller models. Our AI transformation service work with engineering teams increasingly centers on building these multi-model orchestration layers — the benchmark spread between labs has made single-vendor strategies genuinely suboptimal for most large workloads.
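
The routing logic described above can be sketched as a small dispatch function. Everything here is illustrative: the task categories, the `Task` shape, the `small-model` placeholder, and the ordering of the rules are assumptions rather than a documented router. The model IDs are the ones listed in this guide.

```python
from dataclasses import dataclass

LONG_CONTEXT_THRESHOLD = 272_000  # input tokens; GPT-5.5 surcharge boundary per this guide

@dataclass
class Task:
    kind: str          # hypothetical labels: "terminal", "agentic", "bulk", ...
    input_tokens: int

def route(task: Task) -> str:
    """Pick a model ID for a task using the guide's routing heuristics."""
    if task.input_tokens > LONG_CONTEXT_THRESHOLD:
        return "claude-opus-4-8"   # long context: flat pricing, retrieval lead
    if task.kind == "terminal":
        return "gpt-5.5"           # terminal-centric coding under Codex
    if task.kind == "bulk":
        return "small-model"       # placeholder for a cheaper model
    return "claude-opus-4-8"       # broad agentic default

print(route(Task("terminal", 40_000)))
print(route(Task("terminal", 400_000)))
print(route(Task("bulk", 10_000)))
```

Checking context length first reflects this guide's view that the surcharge and the retrieval gap dominate above 272K. Teams that weight terminal performance more heavily may prefer to reorder the rules.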

One area worth tracking as MCP adoption grows: Anthropic introduced the Model Context Protocol and Opus 4.8's MCP-Atlas lead (82.2% vs 75.3%) appears to reflect deeper native integration. Teams whose agent stacks are MCP-heavy should weight that benchmark accordingly; teams whose agents are primarily built around OpenAI function calling or the Assistants API may not see the same MCP-advantage in production.

06 · Verdict
Pick by workload, not headline score.

Both models are frontier-tier. On the benchmarks where they diverge, the margins range from narrow (ArXivMath: a tie; Finance Agent v2: 2.1 points) to substantial (GraphWalks 1M Parents: 24.8 points; SWE-bench Pro: 10.6 points). The routing guidance below reflects where each model's strengths translate to production workload outcomes — not just benchmark leaderboard rankings.

Pick Opus 4.8
Long context, broad agentic, MCP-heavy work

Use Opus 4.8 for workloads that routinely exceed 272K tokens (flat pricing + larger retrieval lead makes it both better and cheaper); codebase-resolution and PR-fix tasks (SWE-bench Pro +10.6 pts); tool-orchestration pipelines built on MCP (MCP-Atlas +6.9 pts); computer use and GUI automation (OSWorld +4.7 pts); professional knowledge-work reasoning (GDPval-AA +121 ELO); and teams already on Claude Code who want dynamic workflows.

Opus 4.8: broad agentic default

Pick GPT-5.5
Terminal agents, Codex-native, short context

Use GPT-5.5 for terminal-centric agentic coding under the Codex CLI harness (83.4% Terminal-Bench); workloads that stay under 272K input tokens where the $5/$30 base rate is competitive; teams with existing OpenAI Codex or GPT-4o tooling that would require migration cost to switch; and workflows where GPT-5.5 Pro ($30/$180) is warranted for very high accuracy on constrained tasks.

GPT-5.5: terminal and Codex-native

Either model
Math reasoning, mid-size coding, general tasks

Either model works for ArXivMath and general math reasoning (benchmarks suggest they are tied); for most mid-size coding tasks where neither the SWE-bench Pro margin nor the Terminal-Bench margin dominates your task distribution; and for general text and knowledge work where Finance Agent v2-level margins (about 2 points) are smaller than the variation from your own prompt engineering or system design. In these cases, ecosystem fit and pricing should be the deciding factors.

Either: route on ecosystem + cost

Practical routing table, May 2026

Workload | Typical tasks | Pick
Long context | Entire codebases, multi-document research, long agent traces at 272K+ | Opus 4.8
Codebase resolution | SWE-bench-style PRs, bug fixes, refactors; Opus leads by 10.6 pts | Opus 4.8
Terminal agents | Terminal-centric, command-line agents; GPT-5.5 leads under Codex harness | GPT-5.5
MCP / tool use | Heavy tool orchestration via MCP; Opus leads MCP-Atlas by 6.9 pts | Opus 4.8
Computer use | Browser automation, GUI agents; Opus 4.8 leads OSWorld by 4.7 pts | Opus 4.8
Short context, budget | Under 272K tokens; GPT-5.5 base at $5/$30 is competitive | GPT-5.5
Math / reasoning | ArXivMath is a tie; HLE favours Opus 4.8; use either | Either model
OpenAI ecosystem | Teams invested in Codex CLI, GPT-4o tooling, or OpenAI-native stacks | GPT-5.5

Routing based on published benchmark data as of May 28, 2026. Task shape, prompt design, and system architecture all affect real-world outcomes — validate against your own workloads before finalising a routing strategy. See the Claude Opus 4.7 complete guide for the prior generation's routing patterns.

Final verdict · May 2026
Opus 4.8 is the stronger default for broad agentic and long-context work: it leads SWE-bench Pro, OSWorld, MCP-Atlas, HLE, GDPval-AA, and all GraphWalks configurations, and its flat pricing makes it cheaper at scale when sessions routinely exceed 272K tokens. GPT-5.5 holds a real, harness-dependent lead on Terminal-Bench and is the right choice for teams whose pipelines are built around Codex CLI, or whose workloads stay under the 272K surcharge threshold. Neither model wins every category; a multi-model routing layer is the highest-return architecture for production stacks with diverse workload shapes.

Conclusion

Opus 4.8 is the broader agentic default — but GPT-5.5 keeps its terminal niche.

Claude Opus 4.8 arrives with the widest benchmark lead over GPT-5.5 of any Opus generation to date: double-digit gaps on SWE-bench Pro and GraphWalks long-context retrieval, single-digit leads on OSWorld (4.7 points) and MCP-Atlas (6.9 points), and a flat pricing model that makes it materially cheaper for workloads that routinely exceed 272K tokens. For teams doing broad agentic work across codebase resolution, tool orchestration, computer use, and multi-document reasoning, the combination of benchmark and economic advantage points clearly toward Opus 4.8 as the default choice.

GPT-5.5 retains a genuine and meaningful win on Terminal-Bench 2.1, particularly under its native Codex CLI harness. That is not a footnote; for terminal-centric pipeline teams, a 4–9 point Terminal-Bench lead translates to real task completion rate differences. The OpenAI ecosystem advantage — ChatGPT integration, Codex toolchain depth, the Pro tier — also represents real switching cost for teams already invested there. The honest framing is that GPT-5.5 is the better choice for a specific and important subset of agentic coding workloads, even as Opus 4.8 holds the edge across the broader benchmark landscape.

The forward projection is that single-model strategies are becoming suboptimal. The benchmark spread between Opus 4.8 and GPT-5.5 is wide enough, and complementary enough in direction, that production stacks with diverse workload shapes can capture measurable gains from routing by task type. The teams that move earliest to build disciplined multi-model orchestration layers — rather than picking a winner and deploying uniformly — are likely to hold a compounding quality and cost advantage as both labs continue to improve their models on different axes. Our AI transformation service helps engineering teams design and implement exactly these orchestration architectures.

FAQ · Claude Opus 4.8 vs GPT-5.5

Questions on Claude Opus 4.8 vs GPT-5.5.

It depends on the workload. Opus 4.8 leads on SWE-bench Pro (69.2% vs 58.6%), OSWorld-Verified computer use (83.4% vs 78.7%), MCP-Atlas tool use (82.2% vs 75.3%), Humanity's Last Exam, GDPval-AA, and all GraphWalks long-context retrieval configurations. GPT-5.5 leads Terminal-Bench 2.1 — scoring 78.2% on the Terminus-2 public harness and 83.4% under its native Codex CLI — and is competitively priced for sessions under 272K input tokens. For broad agentic and long-context work, Opus 4.8 has the stronger profile. For terminal-centric coding under the Codex harness, GPT-5.5 is the stronger choice. On general math reasoning (ArXivMath), they are effectively tied.