GPT-5.5 Complete Guide: Thinking, Pro & 1M Context
OpenAI's GPT-5.5 ships April 23, 2026 with 1M context, Thinking and Pro variants, 82.7% Terminal-Bench, and same latency as GPT-5.4. Pricing inside.
At a glance: 82.7% on Terminal-Bench 2.0 (SOTA) · 1M API context tokens · 78.7% on OSWorld-Verified · 84.9% on GDPval (wins or ties)
Key Takeaways
OpenAI released GPT-5.5 on April 23, 2026, the next default frontier model in ChatGPT and Codex and the first OpenAI model to ship with a 1M-token API context window. GPT-5.5 leads agentic-coding benchmarks at 82.7% on Terminal-Bench 2.0, hits 73.1% on the internal Expert-SWE long-horizon benchmark, and reaches 84.9% on GDPval — all while matching GPT-5.4 per-token latency in real-world serving and using significantly fewer tokens to complete the same Codex tasks.
Alongside the standard model, OpenAI shipped GPT-5.5 Pro for the hardest research, math, and retrieval work, and announced that API availability on the Responses and Chat Completions endpoints is coming shortly. The release also moves cybersecurity into the model's High risk tier under OpenAI's Preparedness Framework, with new safeguards and an expanded Trusted Access for Cyber program. For teams that already deployed GPT-5.4, the migration story is straightforward — same API surface, lower token spend per task, and a meaningful jump in agentic capability.
GPT-5.5 Release Overview
GPT-5.5 is positioned as a step change in agentic capability rather than a pure benchmark refresh. OpenAI's framing is consistent across the launch materials: the model understands user intent faster, uses tools more efficiently, and stays coherent across long multi-step tasks — coding, browsing, computer operation, document and spreadsheet work, and early scientific research. On Artificial Analysis's Coding Agent Index, OpenAI reports GPT-5.5 delivering state-of-the-art intelligence at roughly half the cost of competing frontier coding models on a token-spend basis.
Per-token latency parity: Larger, more capable models are typically slower to serve. GPT-5.5 matches GPT-5.4 per-token latency in real-world serving while operating at a higher level of intelligence — a result of co-design with NVIDIA GB200/GB300 NVL72 systems and inference improvements landed with help from Codex itself.
The release lands one month after the GPT-5.4 family rollout covered in our GPT-5.4 complete guide, and continues the cadence of frequent frontier-model updates documented in our twelve-models-in-a-week analysis. GPT-5.5 is also available immediately in Codex with a 400K-token window across Plus, Pro, Business, Enterprise, Edu, and Go plans, and a Fast mode that generates tokens 1.5x faster at 2.5x the cost.
Variants: Thinking, Pro, and Fast Mode
GPT-5.5 ships in two API SKUs and a few different surface configurations. In ChatGPT, the standard model is exposed as GPT-5.5 Thinking for Plus, Pro, Business, and Enterprise users — the general-purpose tier optimized for everyday work that benefits from reasoning. GPT-5.5 Pro is reserved for Pro, Business, and Enterprise users in ChatGPT and is targeted at the hardest questions and highest-accuracy outputs. In Codex, GPT-5.5 is available across all paid plans with a 400K context window, with an optional Fast mode for users who want lower latency at higher cost.
GPT-5.5 (standard): The standard frontier model. $5 per 1M input, $30 per 1M output in the API. Faster, more concise answers than GPT-5.4 with state-of-the-art agentic coding and computer use. The right default for most production workloads.
GPT-5.5 Pro: The maximum-accuracy variant. $30 per 1M input, $180 per 1M output. Leads BrowseComp at 90.1% and FrontierMath Tier 4 at 39.6%. Best for deep research, technical analysis, and any workflow where the cost of a wrong answer dwarfs the cost of the call.
Codex Fast mode: In Codex, an optional Fast mode generates tokens 1.5x faster at 2.5x the cost. Useful for tight feedback loops in interactive coding sessions where latency dominates the developer experience and cost-per-task is bounded.
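Whether Fast mode pays off is simple arithmetic: the engineer time saved while waiting on generations versus the extra spend. Below is a rough back-of-envelope sketch in Python; the 1.5x and 2.5x multipliers come from the release, while the task profile, generation speed, hourly rate, and the use of the API output list price as a proxy for Codex usage cost are all illustrative assumptions:

```python
# Back-of-envelope: is Codex Fast mode worth it for an interactive session?
# The 1.5x speed and 2.5x cost multipliers are from the GPT-5.5 release;
# every other number below is an illustrative assumption, including using
# the API output list price as a proxy for Codex usage cost.

TOKENS_PER_TASK = 20_000       # assumed output tokens per coding task
BASE_TOKENS_PER_SEC = 100      # assumed standard-mode generation speed
COST_PER_1M_OUT = 30.00        # gpt-5.5 output list price, USD per 1M
DEV_COST_PER_HOUR = 120.00     # assumed loaded engineer cost

base_seconds = TOKENS_PER_TASK / BASE_TOKENS_PER_SEC
fast_seconds = base_seconds / 1.5                    # 1.5x faster generation
time_saved_value = (base_seconds - fast_seconds) / 3600 * DEV_COST_PER_HOUR

base_cost = TOKENS_PER_TASK / 1e6 * COST_PER_1M_OUT
extra_token_cost = base_cost * (2.5 - 1)             # 2.5x the cost

print(f"Time saved per task: {base_seconds - fast_seconds:.0f}s "
      f"(worth ~${time_saved_value:.2f})")
print(f"Extra token cost per task: ${extra_token_cost:.2f}")
print("Fast mode pays off" if time_saved_value > extra_token_cost
      else "Stick with standard speed")
```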
For teams already running GPT-5.4 or earlier, the practical default is to swap to standard GPT-5.5 and reserve Pro for specific high-stakes pipelines — research synthesis, deep BrowseComp-style retrieval, multi-step math, or complex legal/financial reasoning. Pro shows clear gains on those evals: 90.1% BrowseComp (vs 84.4% standard), 52.4% FrontierMath Tier 1–3 (vs 51.7%), and 39.6% FrontierMath Tier 4 (vs 35.4%). For everything else, the standard model already leads on agentic coding, GDPval, OSWorld, and CyberGym.
Agentic Coding: 82.7% Terminal-Bench Lead
Agentic coding is where GPT-5.5 separates most clearly from prior generations and from competitors. On Terminal-Bench 2.0 — which tests complex command-line workflows requiring planning, iteration, and tool coordination — GPT-5.5 hits 82.7%, well ahead of GPT-5.4 at 75.1%, Claude Opus 4.7 at 69.4%, and Gemini 3.1 Pro at 68.5%. On the internal Expert-SWE eval, which targets long-horizon coding tasks with a median estimated human completion time of 20 hours, GPT-5.5 reaches 73.1% versus 68.5% for GPT-5.4. Across all three coding evals OpenAI publishes, GPT-5.5 improves on GPT-5.4 while using fewer tokens.
| Benchmark | GPT-5.5 | GPT-5.4 | Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 75.1% | 69.4% | 68.5% |
| Expert-SWE (Internal) | 73.1% | 68.5% | — | — |
| SWE-Bench Pro | 58.6% | 57.7% | 64.3%* | 54.2% |
*Anthropic reported signs of memorization on a subset of SWE-Bench Pro problems for Claude Opus 4.7.
The qualitative reports from early testers are consistent with the numbers. Dan Shipper of Every called GPT-5.5 "the first coding model I've used that has serious conceptual clarity," after using it to reproduce the kind of architectural rewrite a senior engineer had previously needed days to land. Pietro Schirano of MagicPath described GPT-5.5 merging a branch with hundreds of frontend and refactor changes into a substantially changed main branch in one shot, in about 20 minutes. Senior engineers who tested the model said GPT-5.5 was noticeably stronger than GPT-5.4 and Claude Opus 4.7 at reasoning, autonomy, and anticipating testing and review needs without explicit prompting.
For agencies and product teams comparing options, the broader agentic-coding picture is covered in our Claude Opus 4.7 vs GPT-5.4 agentic coding analysis — the GPT-5.5 numbers extend that gap on Terminal-Bench and Expert-SWE, while Anthropic's SWE-Bench Pro lead remains the main counterpoint, with the memorization caveat noted in Anthropic's own release.
Computer Use and Knowledge Work
GPT-5.5 extends its lead beyond pure coding into the broader knowledge-work loop: finding information, understanding what matters, using tools, checking outputs, and turning raw material into something useful. On OSWorld-Verified, the standard evaluation for computer-use agents, GPT-5.5 reaches 78.7% — up from 75.0% on GPT-5.4 and ahead of Claude Opus 4.7 at 78.0%. On GDPval, which measures agents producing well-specified knowledge work across 44 occupations, GPT-5.5 hits 84.9% (vs 83.0% GPT-5.4, 80.3% Opus 4.7, 67.3% Gemini 3.1 Pro).
Research and browsing: 84.4% on BrowseComp (Pro: 90.1%). Strong web research, multi-source synthesis, and citation chains, with full tool support across Search, URL Context, Code Execution, and File Search.
Computer use: Native ability to see what's on screen, click, type, and navigate browser and desktop interfaces. 78.7% on OSWorld-Verified brings reliable computer use into production-viable territory for many internal workflows.
Tool orchestration: 98.0% on Tau2-bench Telecom (without prompt tuning) and 55.6% on Toolathlon. The model understands task intent better and is meaningfully more token-efficient than predecessors on customer-service and tool-orchestration workflows.
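For teams wiring this up, it helps to see what a tool-orchestration call actually looks like. Here is a minimal sketch using the OpenAI Python SDK's standard function-calling surface; the gpt-5.5 model name comes from the pricing table below, the lookup_account tool is hypothetical, and since API access is still rolling out, treat this as a shape rather than a tested integration:

```python
# Minimal tool-orchestration sketch with the OpenAI Python SDK.
# "gpt-5.5" is the model name from the pricing table; API access is
# still rolling out, and the lookup_account tool is hypothetical.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_account",   # hypothetical customer-service tool
        "description": "Fetch a customer's plan and usage by account ID.",
        "parameters": {
            "type": "object",
            "properties": {"account_id": {"type": "string"}},
            "required": ["account_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Why did my bill go up this month?"}],
    tools=tools,
)

# The model decides whether to call the tool; inspect the first choice.
calls = response.choices[0].message.tool_calls
if calls:
    print(calls[0].function.name, json.loads(calls[0].function.arguments))
```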
OpenAI shared concrete internal usage examples that frame the knowledge-work upside. Their Comms team built an automated Slack agent for low-risk speaking-request triage after using GPT-5.5 in Codex to analyze six months of historical data and design a scoring framework. Finance used Codex to review 24,771 K-1 tax forms (71,637 pages) — excluding personal information from the workflow — accelerating the task by two weeks compared to the prior year. A Go-to-Market employee automated weekly business reports, saving 5–10 hours per week. OpenAI states that more than 85% of the company uses Codex weekly across software engineering, finance, comms, marketing, data science, and product.
The 1M-token context window is what makes many of these workflows tractable. On long-context retrieval evals, GPT-5.5 jumps to 74.0% on OpenAI MRCR v2 8-needle 512K-1M (up from 36.6% on GPT-5.4 and 32.2% on Claude Opus 4.7), 81.5% at 256K-512K, and 87.5% at 128K-256K. For agencies running AI transformation programs, the implication is direct: full-codebase analysis, entire-policy-corpus reasoning, and multi-document research start to behave like normal model calls rather than exotic capabilities that need careful chunking strategies.
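To make the no-chunking point concrete, the single-call pattern looks like this: read an entire repository into one prompt and let the window absorb it. A minimal sketch, assuming the gpt-5.5 model name from the pricing table and the standard Chat Completions surface (API access is still rolling out, so this is illustrative, not tested):

```python
# Single-call full-codebase analysis instead of chunk-and-stitch.
# Assumes the "gpt-5.5" model name from the pricing table and the
# standard Chat Completions surface; the API is still rolling out.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Concatenate an entire small-to-mid repo. At roughly 4 characters per
# token, a 1M-token window fits about 3-4 MB of source, so guard the
# size rather than chunking.
repo = Path("./my-service")
corpus = "\n\n".join(
    f"### {p}\n{p.read_text(errors='ignore')}"
    for p in sorted(repo.rglob("*.py"))
)
assert len(corpus) < 3_000_000, "still too big; fall back to chunking"

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[
        {"role": "system", "content": "You are a code auditor."},
        {"role": "user", "content": f"Find cross-module dead code:\n{corpus}"},
    ],
)
print(response.choices[0].message.content)
```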
Scientific Research Capabilities
Scientific research is the most surprising area of progress in the GPT-5.5 release. The model shows clear gains on multi-stage data analysis workflows that look more like multi-day research projects than standalone Q&A. On GeneBench — a new evaluation of multi-stage scientific data analysis in genetics and quantitative biology — GPT-5.5 scores 25.0% (Pro: 33.2%) versus 19.0% on GPT-5.4. On BixBench, designed around real-world bioinformatics and data analysis, GPT-5.5 reaches 80.5% (vs 74.0%), the leading published score among major frontier models.
OpenAI also disclosed that an internal version of GPT-5.5 with a custom harness contributed to a new asymptotic proof about off-diagonal Ramsey numbers — a longstanding combinatorics result later verified in Lean. Outside the lab, Bartosz Naskręcki, an assistant professor of mathematics at Adam Mickiewicz University, used GPT-5.5 in Codex to build an algebraic-geometry app from a single prompt in 11 minutes, visualizing the intersection of quadratic surfaces and converting the resulting curve into a Weierstrass model. Derya Unutmaz of the Jackson Laboratory used GPT-5.5 Pro to analyze a 62-sample, ~28,000-gene expression dataset and produce a research report he said would have taken his team months.
Brandon White, co-founder of Axiom Bio, summarized the shift: "It's incredibly energizing to use OpenAI's new GPT-5.5 model in our harness, have it reason over massive biochemical datasets to predict human drug outcomes, and then see it deliver significant accuracy gains on our hardest drug discovery evals." For agencies and operators in regulated, knowledge-dense industries, the implication is that GPT-5.5 Pro becomes a credible co-analyst on structured technical work — not just a copywriter or chatbot.
Inference Efficiency: NVIDIA GB200/GB300 Co-Design
The headline production fact about GPT-5.5 is that it serves at GPT-5.4 per-token latency despite being a more capable model. That isn't an accident of compiler tuning — it's the result of co-designing the model for, training it with, and serving it on NVIDIA GB200 and GB300 NVL72 systems. OpenAI describes inference here as an integrated system rather than a set of point optimizations, and explicitly credits Codex and GPT-5.5 itself with helping land a number of the key improvements in the serving stack.
One specific improvement OpenAI called out: load balancing and partitioning heuristics for serving. Before GPT-5.5, OpenAI split requests on an accelerator into a fixed number of chunks so big and small requests could share the same GPU efficiently. A pre-determined static split is not optimal for all traffic shapes, so Codex was used to analyze weeks of production traffic patterns and write custom heuristic algorithms to optimally partition and batch work. The result was a token-generation speedup of more than 20%.
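OpenAI hasn't published the heuristics themselves, but the underlying problem is easy to sketch: instead of splitting every request into a fixed number of chunks, derive the chunk size from the traffic you actually observe. The toy illustration below is our own reconstruction of the idea, not OpenAI's algorithm:

```python
# Toy illustration of traffic-aware request chunking, not OpenAI's
# actual serving heuristics. The idea: rather than a fixed chunk count,
# derive chunk size from the observed request-length distribution so
# large and small requests batch together with less padding waste.
from statistics import quantiles

def choose_chunk_size(recent_lengths: list[int],
                      min_chunk: int = 256,
                      max_chunk: int = 4096) -> int:
    """Pick a chunk size near the 75th percentile of recent prefill lengths."""
    if len(recent_lengths) < 2:
        return min_chunk
    p75 = quantiles(recent_lengths, n=4)[2]
    return max(min_chunk, min(max_chunk, int(p75)))

def partition(request_tokens: int, chunk: int) -> list[int]:
    """Split one request into chunk-sized pieces that can share a batch."""
    full, rem = divmod(request_tokens, chunk)
    return [chunk] * full + ([rem] if rem else [])

# Example: a mixed traffic window of short chats and long agentic prompts.
window = [300, 450, 280, 12_000, 900, 35_000, 600]
chunk = choose_chunk_size(window)
print(chunk, partition(35_000, chunk))
```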
The reflexive loop is the more interesting story: Codex helped the team move faster from idea to benchmarkable implementation, while GPT-5.5 found and implemented key improvements in the stack itself. The model effectively helped optimize the infrastructure that serves it.
From an integration standpoint, this matters for cost modeling. GPT-5.5 at $5/$30 per 1M tokens is more expensive than GPT-5.4 on paper, but the company reports that in Codex it has carefully tuned the experience so GPT-5.5 delivers better results with fewer tokens than GPT-5.4 for most users — meaning real per-task spend often falls. For teams building on top of the API, the practical advice is to A/B test on representative tasks rather than extrapolate from per-token list price.
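A minimal harness for that A/B test needs nothing more than the token counts your API responses already report. In the sketch below, the GPT-5.5 prices are the published list rates, the GPT-5.4 prices are derived from the 2x relationship stated in the pricing section, and the usage numbers are hypothetical stand-ins for your own eval runs:

```python
# Compare real per-task spend across models using the token counts the
# API's usage field reports. GPT-5.5 prices are list rates; GPT-5.4
# prices are derived from the stated 2x relationship; usage numbers
# are hypothetical placeholders for your own task set.
PRICES = {                      # (input, output) USD per 1M tokens
    "gpt-5.4": (2.50, 15.00),   # derived: half the gpt-5.5 rates
    "gpt-5.5": (5.00, 30.00),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p_in, p_out = PRICES[model]
    return input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out

# Hypothetical: GPT-5.5 finishing the same task in fewer output tokens.
runs = [("gpt-5.4", 40_000, 22_000), ("gpt-5.5", 40_000, 14_000)]
for model, tok_in, tok_out in runs:
    print(f"{model}: ${task_cost(model, tok_in, tok_out):.3f} per task")
```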
Cybersecurity Capabilities and Safeguards
OpenAI is treating GPT-5.5's biological/chemical and cybersecurity capabilities as High under its Preparedness Framework. While GPT-5.5 didn't reach the framework's Critical level for cyber, the evaluation results show meaningful step-ups: 81.8% on CyberGym (vs 79.0% on GPT-5.4 and 73.1% on Claude Opus 4.7) and 88.1% on the internal expanded Capture-the-Flag challenge tasks (vs 83.7%). The company shipped tighter classifiers around higher-risk activity and sensitive cyber requests, plus protections against repeated misuse, with the explicit acknowledgement that some users will notice the stricter posture as it's tuned over time.
Trusted Access for Cyber: Verified defenders meeting trust signals can access cyber-permissive capabilities through Codex with fewer restrictions for legitimate defensive work. Organizations defending critical infrastructure can apply to access cyber-permissive models like GPT-5.4-Cyber under stricter security requirements. The aim is to democratize defensive capability while keeping the most dual-use workflows behind verification.
For agencies and platforms building security products, the architectural takeaway is that GPT-5.5 is now genuinely useful for triage, vulnerability scanning, fix-suggestion, and SOC workflows — but expect more refusals on edge-case prompts that resemble offensive testing, especially in the early weeks as classifiers are tuned. Teams with legitimate defensive use cases should evaluate Trusted Access via Codex rather than fighting against standard-tier guardrails.
Pricing, 1M Context, and Availability
GPT-5.5's API pricing positions it as the premium standard frontier tier rather than a cost-leader. At $5 per 1M input tokens and $30 per 1M output tokens, it's double GPT-5.4 Standard's price on both input and output. Pro at $30/$180 per 1M tokens sits at the same headline rate as GPT-5.4 Pro. Both ship with a 1M-token context window, Batch and Flex pricing at half the standard rate, and Priority processing at 2.5x.
| Model | Input / Output (per 1M) | Context Window | Best Fit |
|---|---|---|---|
| gpt-5.5 | $5.00 / $30.00 | 1M | Default for agentic coding, computer use, knowledge work |
| gpt-5.5-pro | $30.00 / $180.00 | 1M | Deep research, math, BrowseComp-style retrieval |
| Codex (GPT-5.5) | Subscription tiers | 400K | Interactive coding; Fast mode 1.5x speed at 2.5x cost |
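Putting the tier rules from the pricing paragraph into code makes budgeting concrete: Batch and Flex at half the standard rate, Priority at 2.5x. A small calculator sketch; the token counts in the example are illustrative:

```python
# Cost across service tiers, per the release's stated pricing rules:
# Batch and Flex at half the standard rate, Priority at 2.5x.
TIER_MULT = {"standard": 1.0, "batch": 0.5, "flex": 0.5, "priority": 2.5}
LIST = {"gpt-5.5": (5.00, 30.00), "gpt-5.5-pro": (30.00, 180.00)}

def call_cost(model: str, tier: str, tok_in: int, tok_out: int) -> float:
    """USD cost of one call at the given service tier."""
    p_in, p_out = LIST[model]
    return (tok_in / 1e6 * p_in + tok_out / 1e6 * p_out) * TIER_MULT[tier]

# Example: a 200K-in / 10K-out research call on Pro, batched overnight.
print(f"${call_cost('gpt-5.5-pro', 'batch', 200_000, 10_000):.2f}")  # $3.90
```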
Today, GPT-5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex. GPT-5.5 Pro is rolling out to Pro, Business, and Enterprise users in ChatGPT. API access on the Responses and Chat Completions endpoints is coming shortly — OpenAI cited additional safety and security work needed before serving the model at API scale, especially for partners integrating it into agent platforms. For Codex specifically, the new model is available across Plus, Pro, Business, Enterprise, Edu, and Go plans with a 400K-token context window.
For developers already integrated with Codex, the recommended starting point is documented in our Codex for almost everything release guide — same surface, lower token spend per task with GPT-5.5, and the option to flip Fast mode on for tight feedback loops.
Choosing GPT-5.5 vs. Alternatives
The frontier-model choice is increasingly task-shaped rather than vendor-shaped. GPT-5.5 leads on agentic coding, computer use, and cybersecurity; Claude Opus 4.7 remains strong on SWE-Bench Pro and certain autonomy-heavy refactors; Gemini 3.1 Pro leads on raw ARC-AGI-1 and competes hard on price for large-context workloads. Our broader frontier-model comparison is documented in the GPT-5.4 vs Opus 4.6 vs Gemini 3.1 Pro analysis — most of those decisions hold, with GPT-5.5 strengthening OpenAI's position on the agentic and computer-use axes.
For agentic coding: GPT-5.5 standard, in Codex. State-of-the-art Terminal-Bench 2.0 and Expert-SWE scores, fewer tokens per task than GPT-5.4, optional Fast mode for interactive sessions, and the broadest production deployment story.
For deep research and high-stakes analysis: GPT-5.5 Pro. Leads BrowseComp at 90.1%, FrontierMath Tier 4 at 39.6%, and GeneBench at 33.2%. Best when the cost of an incorrect answer dwarfs the call cost — research synthesis, technical analysis, regulated-domain decisions.
For computer use: GPT-5.5 standard. 78.7% OSWorld-Verified, native browser and desktop operation, and the strongest tool-orchestration scores OpenAI has published. Pair with sandboxing and human-in-the-loop checkpoints for production rollout.
For multi-model stacks: Most production stacks land on a router that sends tasks to the best-fit model. GPT-5.5 is the default for agentic coding and computer use; Opus 4.7 stays a strong second opinion on certain SWE tasks; Gemini 3.1 Pro covers cost-sensitive long-context retrieval.
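A router like that can start as a lookup table and grow from there. A minimal sketch; the routing keys are illustrative, and the non-OpenAI model IDs are placeholders for whatever identifiers your providers actually expose:

```python
# Minimal task-shaped router along the lines described above. The
# routing table is illustrative; the non-OpenAI model IDs are
# placeholders, not real provider identifiers.
ROUTES = {
    "agentic_coding": "gpt-5.5",
    "computer_use": "gpt-5.5",
    "deep_research": "gpt-5.5-pro",
    "swe_second_opinion": "claude-opus-4.7",   # placeholder model ID
    "bulk_long_context": "gemini-3.1-pro",     # placeholder model ID
}

def route(task_type: str) -> str:
    """Return the best-fit model for a task type, defaulting to gpt-5.5."""
    return ROUTES.get(task_type, "gpt-5.5")

print(route("deep_research"))   # -> gpt-5.5-pro
```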
For teams currently on GPT-5.4, the migration is straightforward — same API contract, same Codex surface, lower per-task token spend on most workflows, and a meaningful jump in agentic capability. For teams primarily on Claude or Gemini, the question is whether GPT-5.5's lead on Terminal-Bench, Expert-SWE, GDPval, and OSWorld translates to lift on your specific evals — the answer is usually yes for agentic coding and computer use, often more nuanced for code generation and long-context retrieval where individual model strengths and price points still matter.
Conclusion
GPT-5.5 is the most consequential frontier-model release of the quarter. State-of-the-art agentic-coding scores, a 1M-token context window with strong long-context retrieval, native computer use that competes with the best published numbers, and per-token latency parity with GPT-5.4 add up to a model that materially changes what production agent systems can do — without changing their cost shape much for most workflows. Codex itself helped land the inference improvements that make this possible, which is increasingly the pattern: frontier models built and served with help from the previous generation of frontier models.
For most teams, the practical move is simple: standardize on GPT-5.5 for agentic coding, computer use, and knowledge-work agents, reserve GPT-5.5 Pro for deep research and the hardest evaluation-grade tasks, and keep a multi-model router in place to cover edge cases where Claude or Gemini still win on a specific metric. The cybersecurity posture change — High under Preparedness, stricter classifiers, Trusted Access for Cyber — is worth flagging to security teams now so they can route legitimate defensive use through the right channel rather than fight standard-tier guardrails. For an extended look at OpenAI's recent direction, our GPT-5.4 complete guide and Claude Opus 4.7 complete guide give the surrounding context.
Ready to Deploy Frontier AI in Production?
Choosing the right frontier model — and routing the right tasks to it — is now an architecture decision with measurable cost and capability impact. Our team helps businesses evaluate, integrate, and operate frontier models for agentic coding, computer use, and knowledge-work automation.