GPT-5.5 Complete Guide: Thinking, Pro & 1M Context
OpenAI's GPT-5.5 ships April 23, 2026 with 1M context, Thinking and Pro variants, 82.7% Terminal-Bench, and same latency as GPT-5.4. Pricing inside.
At a glance: 82.7% on Terminal-Bench 2.0 (SOTA) · 1M API context tokens · 78.7% on OSWorld-Verified · 84.9% on GDPval (wins or ties)
Key Takeaways
OpenAI released GPT-5.5 on April 23, 2026, the next default frontier model in ChatGPT and Codex and the first OpenAI model to ship with a 1M-token API context window. GPT-5.5 leads agentic-coding benchmarks at 82.7% on Terminal-Bench 2.0, hits 73.1% on the internal Expert-SWE long-horizon benchmark, and reaches 84.9% on GDPval — all while matching GPT-5.4 per-token latency in real-world serving and using significantly fewer tokens to complete the same Codex tasks.
Alongside the standard model, OpenAI shipped GPT-5.5 Pro for the hardest research, math, and retrieval work, and announced that API availability on the Responses and Chat Completions endpoints is coming shortly. The release also moves cybersecurity into the model's High risk tier under OpenAI's Preparedness Framework, with new safeguards and an expanded Trusted Access for Cyber program. For teams that already deployed GPT-5.4, the migration story is straightforward — same API surface, lower token spend per task, and a meaningful jump in agentic capability.
GPT-5.5 Release Overview
GPT-5.5 is positioned as a step change in agentic capability rather than a pure benchmark refresh. OpenAI's framing is consistent across the launch materials: the model understands user intent faster, uses tools more efficiently, and stays coherent across long multi-step tasks — coding, browsing, computer operation, document and spreadsheet work, and early scientific research. On Artificial Analysis's Coding Agent Index, OpenAI reports GPT-5.5 delivering state-of-the-art intelligence at roughly half the cost of competing frontier coding models on a token-spend basis.
Per-token latency parity: Larger, more capable models are typically slower to serve. GPT-5.5 matches GPT-5.4 per-token latency in real-world serving while operating at a higher level of intelligence — a result of co-design with NVIDIA GB200/GB300 NVL72 systems and inference improvements landed with help from Codex itself.
The release lands one month after the GPT-5.4 family rollout covered in our GPT-5.4 complete guide, and continues the cadence of frequent frontier-model updates documented in our twelve-models-in-a-week analysis. GPT-5.5 is also available immediately in Codex with a 400K-token window across Plus, Pro, Business, Enterprise, Edu, and Go plans, and a Fast mode that generates tokens 1.5x faster at 2.5x the cost.
Variants: Thinking, Pro, and Fast Mode
GPT-5.5 ships in two API SKUs and a few different surface configurations. In ChatGPT, the standard model is exposed as GPT-5.5 Thinking for Plus, Pro, Business, and Enterprise users — the general-purpose tier optimized for everyday work that benefits from reasoning. GPT-5.5 Pro is reserved for Pro, Business, and Enterprise users in ChatGPT and is targeted at the hardest questions and highest-accuracy outputs. In Codex, GPT-5.5 is available across all paid plans with a 400K context window, with an optional Fast mode for users who want lower latency at higher cost.
GPT-5.5 (standard): The standard frontier model. $5 per 1M input, $30 per 1M output in the API. Faster, more concise answers than GPT-5.4 with state-of-the-art agentic coding and computer use. The right default for most production workloads.
GPT-5.5 Pro: The maximum-accuracy variant. $30 per 1M input, $180 per 1M output. Leads BrowseComp at 90.1% and FrontierMath Tier 4 at 39.6%. Best for deep research, technical analysis, and any workflow where the cost of a wrong answer dwarfs the cost of the call.
Codex Fast mode: In Codex, an optional Fast mode generates tokens 1.5x faster at 2.5x the cost. Useful for tight feedback loops in interactive coding sessions where latency dominates the developer experience and cost-per-task is bounded.
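Whether Fast mode pays off is simple arithmetic: the engineer time saved while waiting on generations versus the extra spend. Below is a rough back-of-envelope sketch in Python; the 1.5x and 2.5x multipliers come from the release, while the task profile, generation speed, hourly rate, and the use of the API output list price as a proxy for Codex usage cost are all illustrative assumptions:

```python
# Back-of-envelope: is Codex Fast mode worth it for an interactive session?
# The 1.5x speed and 2.5x cost multipliers are from the GPT-5.5 release;
# every other number below is an illustrative assumption, including using
# the API output list price as a proxy for Codex usage cost.

TOKENS_PER_TASK = 20_000       # assumed output tokens per coding task
BASE_TOKENS_PER_SEC = 100      # assumed standard-mode generation speed
COST_PER_1M_OUT = 30.00        # gpt-5.5 output list price, USD per 1M
DEV_COST_PER_HOUR = 120.00     # assumed loaded engineer cost

base_seconds = TOKENS_PER_TASK / BASE_TOKENS_PER_SEC
fast_seconds = base_seconds / 1.5                    # 1.5x faster generation
time_saved_value = (base_seconds - fast_seconds) / 3600 * DEV_COST_PER_HOUR

base_cost = TOKENS_PER_TASK / 1e6 * COST_PER_1M_OUT
extra_token_cost = base_cost * (2.5 - 1)             # 2.5x the cost

print(f"Time saved per task: {base_seconds - fast_seconds:.0f}s "
      f"(worth ~${time_saved_value:.2f})")
print(f"Extra token cost per task: ${extra_token_cost:.2f}")
print("Fast mode pays off" if time_saved_value > extra_token_cost
      else "Stick with standard speed")
```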
For teams already running GPT-5.4 or earlier, the practical default is to swap to standard GPT-5.5 and reserve Pro for specific high-stakes pipelines — research synthesis, deep BrowseComp-style retrieval, multi-step math, or complex legal/financial reasoning. Pro shows clear gains on those evals: 90.1% BrowseComp (vs 84.4% standard), 52.4% FrontierMath Tier 1–3 (vs 51.7%), and 39.6% FrontierMath Tier 4 (vs 35.4%). For everything else, the standard model already leads on agentic coding, GDPval, OSWorld, and CyberGym.
Agentic Coding: 82.7% Terminal-Bench Lead
Agentic coding is where GPT-5.5 separates most clearly from prior generations and from competitors. On Terminal-Bench 2.0 — which tests complex command-line workflows requiring planning, iteration, and tool coordination — GPT-5.5 hits 82.7%, well ahead of GPT-5.4 at 75.1%, Claude Opus 4.7 at 69.4%, and Gemini 3.1 Pro at 68.5%. On the internal Expert-SWE eval, which targets long-horizon coding tasks with a median estimated human completion time of 20 hours, GPT-5.5 reaches 73.1% versus 68.5% for GPT-5.4. Across all three coding evals OpenAI publishes, GPT-5.5 improves on GPT-5.4 while using fewer tokens.
| Benchmark | GPT-5.5 | GPT-5.4 | Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 75.1% | 69.4% | 68.5% |
| Expert-SWE (Internal) | 73.1% | 68.5% | — | — |
| SWE-Bench Pro | 58.6% | 57.7% | 64.3%* | 54.2% |
*Anthropic reported signs of memorization on a subset of SWE-Bench Pro problems for Claude Opus 4.7.
The qualitative reports from early testers are consistent with the numbers. Dan Shipper of Every called GPT-5.5 "the first coding model I've used that has serious conceptual clarity," after using it to reproduce the kind of architectural rewrite a senior engineer had previously needed days to land. Pietro Schirano of MagicPath described GPT-5.5 merging a branch with hundreds of frontend and refactor changes into a substantially changed main branch in one shot, in about 20 minutes. Senior engineers who tested the model said GPT-5.5 was noticeably stronger than GPT-5.4 and Claude Opus 4.7 at reasoning, autonomy, and anticipating testing and review needs without explicit prompting.
For agencies and product teams comparing options, the broader agentic-coding picture is covered in our Claude Opus 4.7 vs GPT-5.4 agentic coding analysis — the GPT-5.5 numbers extend that gap on Terminal-Bench and Expert-SWE, while Anthropic's SWE-Bench Pro lead remains the main counterpoint, with the memorization caveat noted in Anthropic's own release.
Computer Use and Knowledge Work
GPT-5.5 extends its lead beyond pure coding into the broader knowledge-work loop: finding information, understanding what matters, using tools, checking outputs, and turning raw material into something useful. On OSWorld-Verified, the standard evaluation for computer-use agents, GPT-5.5 reaches 78.7% — up from 75.0% on GPT-5.4 and ahead of Claude Opus 4.7 at 78.0%. On GDPval, which measures agents producing well-specified knowledge work across 44 occupations, GPT-5.5 hits 84.9% (vs 83.0% GPT-5.4, 80.3% Opus 4.7, 67.3% Gemini 3.1 Pro).
Research and browsing: 84.4% on BrowseComp (Pro: 90.1%). Strong web research, multi-source synthesis, and citation chains, with full tool support across Search, URL Context, Code Execution, and File Search.
Computer use: Native ability to see what's on screen, click, type, and navigate browser and desktop interfaces. 78.7% on OSWorld-Verified brings reliable computer use into production-viable territory for many internal workflows.
Tool orchestration: 98.0% on Tau2-bench Telecom (without prompt tuning) and 55.6% on Toolathlon. The model understands task intent better and is meaningfully more token-efficient than predecessors on customer-service and tool-orchestration workflows.
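For teams wiring this up, it helps to see what a tool-orchestration call actually looks like. Here is a minimal sketch using the OpenAI Python SDK's standard function-calling surface; the gpt-5.5 model name comes from the pricing table below, the lookup_account tool is hypothetical, and since API access is still rolling out, treat this as a shape rather than a tested integration:

```python
# Minimal tool-orchestration sketch with the OpenAI Python SDK.
# "gpt-5.5" is the model name from the pricing table; API access is
# still rolling out, and the lookup_account tool is hypothetical.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_account",   # hypothetical customer-service tool
        "description": "Fetch a customer's plan and usage by account ID.",
        "parameters": {
            "type": "object",
            "properties": {"account_id": {"type": "string"}},
            "required": ["account_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Why did my bill go up this month?"}],
    tools=tools,
)

# The model decides whether to call the tool; inspect the first choice.
calls = response.choices[0].message.tool_calls
if calls:
    print(calls[0].function.name, json.loads(calls[0].function.arguments))
```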
OpenAI shared concrete internal usage examples that frame the knowledge-work upside. Their Comms team built an automated Slack agent for low-risk speaking-request triage after using GPT-5.5 in Codex to analyze six months of historical data and design a scoring framework. Finance used Codex to review 24,771 K-1 tax forms (71,637 pages) — excluding personal information from the workflow — accelerating the task by two weeks compared to the prior year. A Go-to-Market employee automated weekly business reports, saving 5–10 hours per week. OpenAI states that more than 85% of the company uses Codex weekly across software engineering, finance, comms, marketing, data science, and product.
The 1M-token context window is what makes many of these workflows tractable. On long-context retrieval evals, GPT-5.5 jumps to 74.0% on OpenAI MRCR v2 8-needle 512K-1M (up from 36.6% on GPT-5.4 and 32.2% on Claude Opus 4.7), 81.5% at 256K-512K, and 87.5% at 128K-256K. For agencies running AI transformation programs, the implication is direct: full-codebase analysis, entire-policy-corpus reasoning, and multi-document research start to behave like normal model calls rather than exotic capabilities that need careful chunking strategies.
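To make the no-chunking point concrete, the single-call pattern looks like this: read an entire repository into one prompt and let the window absorb it. A minimal sketch, assuming the gpt-5.5 model name from the pricing table and the standard Chat Completions surface (API access is still rolling out, so this is illustrative, not tested):

```python
# Single-call full-codebase analysis instead of chunk-and-stitch.
# Assumes the "gpt-5.5" model name from the pricing table and the
# standard Chat Completions surface; the API is still rolling out.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Concatenate an entire small-to-mid repo. At roughly 4 characters per
# token, a 1M-token window fits about 3-4 MB of source, so guard the
# size rather than chunking.
repo = Path("./my-service")
corpus = "\n\n".join(
    f"### {p}\n{p.read_text(errors='ignore')}"
    for p in sorted(repo.rglob("*.py"))
)
assert len(corpus) < 3_000_000, "still too big; fall back to chunking"

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[
        {"role": "system", "content": "You are a code auditor."},
        {"role": "user", "content": f"Find cross-module dead code:\n{corpus}"},
    ],
)
print(response.choices[0].message.content)
```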
Scientific Research Capabilities
Scientific research is the most surprising area of progress in the GPT-5.5 release. The model shows clear gains on multi-stage data analysis workflows that look more like multi-day research projects than standalone Q&A. On GeneBench — a new evaluation of multi-stage scientific data analysis in genetics and quantitative biology — GPT-5.5 scores 25.0% (Pro: 33.2%) versus 19.0% on GPT-5.4. On BixBench, designed around real-world bioinformatics and data analysis, GPT-5.5 reaches 80.5% (vs 74.0%), the leading published score among major frontier models.
OpenAI also disclosed that an internal version of GPT-5.5 with a custom harness contributed to a new asymptotic proof about off-diagonal Ramsey numbers — a longstanding combinatorics result later verified in Lean. Outside the lab, Bartosz Naskręcki, an assistant professor of mathematics at Adam Mickiewicz University, used GPT-5.5 in Codex to build an algebraic-geometry app from a single prompt in 11 minutes, visualizing the intersection of quadratic surfaces and converting the resulting curve into a Weierstrass model. Derya Unutmaz of the Jackson Laboratory used GPT-5.5 Pro to analyze a 62-sample, ~28,000-gene expression dataset and produce a research report he said would have taken his team months.
Brandon White, co-founder of Axiom Bio, summarized the shift: "It's incredibly energizing to use OpenAI's new GPT-5.5 model in our harness, have it reason over massive biochemical datasets to predict human drug outcomes, and then see it deliver significant accuracy gains on our hardest drug discovery evals." For agencies and operators in regulated, knowledge-dense industries, the implication is that GPT-5.5 Pro becomes a credible co-analyst on structured technical work — not just a copywriter or chatbot.
Inference Efficiency: NVIDIA GB200/GB300 Co-Design
The headline production fact about GPT-5.5 is that it serves at GPT-5.4 per-token latency despite being a more capable model. That isn't an accident of compiler tuning — it's the result of co-designing the model for, training it with, and serving it on NVIDIA GB200 and GB300 NVL72 systems. OpenAI describes inference here as an integrated system rather than a set of point optimizations, and explicitly credits Codex and GPT-5.5 itself with helping land a number of the key improvements in the serving stack.
One specific improvement OpenAI called out: load balancing and partitioning heuristics for serving. Before GPT-5.5, OpenAI split requests on an accelerator into a fixed number of chunks so big and small requests could share the same GPU efficiently. A pre-determined static split is not optimal for all traffic shapes, so Codex was used to analyze weeks of production traffic patterns and write custom heuristic algorithms to optimally partition and batch work. The result was a token-generation speedup of more than 20%.
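OpenAI hasn't published the heuristics themselves, but the underlying problem is easy to sketch: instead of splitting every request into a fixed number of chunks, derive the chunk size from the traffic you actually observe. The toy illustration below is our own reconstruction of the idea, not OpenAI's algorithm:

```python
# Toy illustration of traffic-aware request chunking, not OpenAI's
# actual serving heuristics. The idea: rather than a fixed chunk count,
# derive chunk size from the observed request-length distribution so
# large and small requests batch together with less padding waste.
from statistics import quantiles

def choose_chunk_size(recent_lengths: list[int],
                      min_chunk: int = 256,
                      max_chunk: int = 4096) -> int:
    """Pick a chunk size near the 75th percentile of recent prefill lengths."""
    if len(recent_lengths) < 2:
        return min_chunk
    p75 = quantiles(recent_lengths, n=4)[2]
    return max(min_chunk, min(max_chunk, int(p75)))

def partition(request_tokens: int, chunk: int) -> list[int]:
    """Split one request into chunk-sized pieces that can share a batch."""
    full, rem = divmod(request_tokens, chunk)
    return [chunk] * full + ([rem] if rem else [])

# Example: a mixed traffic window of short chats and long agentic prompts.
window = [300, 450, 280, 12_000, 900, 35_000, 600]
chunk = choose_chunk_size(window)
print(chunk, partition(35_000, chunk))
```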
The reflexive loop is the more interesting story: Codex helped the team move faster from idea to benchmarkable implementation, while GPT-5.5 found and implemented key improvements in the stack itself. The model effectively helped optimize the infrastructure that serves it.
From an integration standpoint, this matters for cost modeling. GPT-5.5 at $5/$30 per 1M tokens is more expensive than GPT-5.4 on paper, but the company reports that in Codex it has carefully tuned the experience so GPT-5.5 delivers better results with fewer tokens than GPT-5.4 for most users — meaning real per-task spend often falls. For teams building on top of the API, the practical advice is to A/B test on representative tasks rather than extrapolate from per-token list price.
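A minimal harness for that A/B test needs nothing more than the token counts your API responses already report. In the sketch below, the GPT-5.5 prices are the published list rates, the GPT-5.4 prices are derived from the 2x relationship stated in the pricing section, and the usage numbers are hypothetical stand-ins for your own eval runs:

```python
# Compare real per-task spend across models using the token counts the
# API's usage field reports. GPT-5.5 prices are list rates; GPT-5.4
# prices are derived from the stated 2x relationship; usage numbers
# are hypothetical placeholders for your own task set.
PRICES = {                      # (input, output) USD per 1M tokens
    "gpt-5.4": (2.50, 15.00),   # derived: half the gpt-5.5 rates
    "gpt-5.5": (5.00, 30.00),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p_in, p_out = PRICES[model]
    return input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out

# Hypothetical: GPT-5.5 finishing the same task in fewer output tokens.
runs = [("gpt-5.4", 40_000, 22_000), ("gpt-5.5", 40_000, 14_000)]
for model, tok_in, tok_out in runs:
    print(f"{model}: ${task_cost(model, tok_in, tok_out):.3f} per task")
```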
Cybersecurity Capabilities and Safeguards
OpenAI is treating GPT-5.5's biological/chemical and cybersecurity capabilities as High under its Preparedness Framework. While GPT-5.5 didn't reach the framework's Critical level for cyber, the evaluation results show meaningful step-ups: 81.8% on CyberGym (vs 79.0% on GPT-5.4 and 73.1% on Claude Opus 4.7) and 88.1% on the internal expanded Capture-the-Flag challenge tasks (vs 83.7%). The company shipped tighter classifiers around higher-risk activity and sensitive cyber requests, plus protections against repeated misuse, with the explicit acknowledgement that some users will notice the stricter posture as it's tuned over time.
Trusted Access for Cyber: Verified defenders meeting trust signals can access cyber-permissive capabilities through Codex with fewer restrictions for legitimate defensive work. Organizations defending critical infrastructure can apply to access cyber-permissive models like GPT-5.4-Cyber under stricter security requirements. The aim is to democratize defensive capability while keeping the most dual-use workflows behind verification.
For agencies and platforms building security products, the architectural takeaway is that GPT-5.5 is now genuinely useful for triage, vulnerability scanning, fix-suggestion, and SOC workflows — but expect more refusals on edge-case prompts that resemble offensive testing, especially in the early weeks as classifiers are tuned. Teams with legitimate defensive use cases should evaluate Trusted Access via Codex rather than fighting against standard-tier guardrails.
Pricing, 1M Context, and Availability
GPT-5.5's API pricing positions it as the premium standard frontier tier rather than a cost-leader. At $5 per 1M input tokens and $30 per 1M output tokens, it's double GPT-5.4 Standard's price on both input and output. Pro at $30/$180 per 1M tokens sits at the same headline rate as GPT-5.4 Pro. Both ship with a 1M-token context window, Batch and Flex pricing at half the standard rate, and Priority processing at 2.5x.
| Model | Input / Output (per 1M) | Context Window | Best Fit |
|---|---|---|---|
| gpt-5.5 | $5.00 / $30.00 | 1M | Default for agentic coding, computer use, knowledge work |
| gpt-5.5-pro | $30.00 / $180.00 | 1M | Deep research, math, BrowseComp-style retrieval |
| Codex (GPT-5.5) | Subscription tiers | 400K | Interactive coding; Fast mode 1.5x speed at 2.5x cost |
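Putting the tier rules from the pricing paragraph into code makes budgeting concrete: Batch and Flex at half the standard rate, Priority at 2.5x. A small calculator sketch; the token counts in the example are illustrative:

```python
# Cost across service tiers, per the release's stated pricing rules:
# Batch and Flex at half the standard rate, Priority at 2.5x.
TIER_MULT = {"standard": 1.0, "batch": 0.5, "flex": 0.5, "priority": 2.5}
LIST = {"gpt-5.5": (5.00, 30.00), "gpt-5.5-pro": (30.00, 180.00)}

def call_cost(model: str, tier: str, tok_in: int, tok_out: int) -> float:
    """USD cost of one call at the given service tier."""
    p_in, p_out = LIST[model]
    return (tok_in / 1e6 * p_in + tok_out / 1e6 * p_out) * TIER_MULT[tier]

# Example: a 200K-in / 10K-out research call on Pro, batched overnight.
print(f"${call_cost('gpt-5.5-pro', 'batch', 200_000, 10_000):.2f}")  # $3.90
```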
Today, GPT-5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex. GPT-5.5 Pro is rolling out to Pro, Business, and Enterprise users in ChatGPT. API access on the Responses and Chat Completions endpoints is coming shortly — OpenAI cited additional safety and security work needed before serving the model at API scale, especially for partners integrating it into agent platforms. For Codex specifically, the new model is available across Plus, Pro, Business, Enterprise, Edu, and Go plans with a 400K-token context window.
For developers already integrated with Codex, the recommended starting point is documented in our Codex for almost everything release guide — same surface, lower token spend per task with GPT-5.5, and the option to flip Fast mode on for tight feedback loops.
Choosing GPT-5.5 vs. Alternatives
The frontier-model choice is increasingly task-shaped rather than vendor-shaped. GPT-5.5 leads on agentic coding, computer use, and cybersecurity; Claude Opus 4.7 remains strong on SWE-Bench Pro and certain autonomy-heavy refactors; Gemini 3.1 Pro leads on raw ARC-AGI-1 and competes hard on price for large-context workloads. Our broader frontier-model comparison is documented in the GPT-5.4 vs Opus 4.6 vs Gemini 3.1 Pro analysis — most of those decisions hold, with GPT-5.5 strengthening OpenAI's position on the agentic and computer-use axes.
For agentic coding: GPT-5.5 standard, in Codex. State-of-the-art Terminal-Bench 2.0 and Expert-SWE scores, fewer tokens per task than GPT-5.4, optional Fast mode for interactive sessions, and the broadest production deployment story.
For deep research and high-stakes analysis: GPT-5.5 Pro. Leads BrowseComp at 90.1%, FrontierMath Tier 4 at 39.6%, and GeneBench at 33.2%. Best when the cost of an incorrect answer dwarfs the call cost — research synthesis, technical analysis, regulated-domain decisions.
For computer use: GPT-5.5 standard. 78.7% OSWorld-Verified, native browser and desktop operation, and the strongest tool-orchestration scores OpenAI has published. Pair with sandboxing and human-in-the-loop checkpoints for production rollout.
For multi-model stacks: Most production stacks land on a router that sends tasks to the best-fit model. GPT-5.5 is the default for agentic coding and computer use; Opus 4.7 stays a strong second opinion on certain SWE tasks; Gemini 3.1 Pro covers cost-sensitive long-context retrieval.
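A router like that can start as a lookup table and grow from there. A minimal sketch; the routing keys are illustrative, and the non-OpenAI model IDs are placeholders for whatever identifiers your providers actually expose:

```python
# Minimal task-shaped router along the lines described above. The
# routing table is illustrative; the non-OpenAI model IDs are
# placeholders, not real provider identifiers.
ROUTES = {
    "agentic_coding": "gpt-5.5",
    "computer_use": "gpt-5.5",
    "deep_research": "gpt-5.5-pro",
    "swe_second_opinion": "claude-opus-4.7",   # placeholder model ID
    "bulk_long_context": "gemini-3.1-pro",     # placeholder model ID
}

def route(task_type: str) -> str:
    """Return the best-fit model for a task type, defaulting to gpt-5.5."""
    return ROUTES.get(task_type, "gpt-5.5")

print(route("deep_research"))   # -> gpt-5.5-pro
```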
For teams currently on GPT-5.4, the migration is straightforward — same API contract, same Codex surface, lower per-task token spend on most workflows, and a meaningful jump in agentic capability. For teams primarily on Claude or Gemini, the question is whether GPT-5.5's lead on Terminal-Bench, Expert-SWE, GDPval, and OSWorld translates to lift on your specific evals — the answer is usually yes for agentic coding and computer use, often more nuanced for code generation and long-context retrieval where individual model strengths and price points still matter.
Conclusion
GPT-5.5 is the most consequential frontier-model release of the quarter. State-of-the-art agentic-coding scores, a 1M-token context window with strong long-context retrieval, native computer use that competes with the best published numbers, and per-token latency parity with GPT-5.4 add up to a model that materially changes what production agent systems can do — without changing their cost shape much for most workflows. Codex itself helped land the inference improvements that make this possible, which is increasingly the pattern: frontier models built and served with help from the previous generation of frontier models.
For most teams, the practical move is simple: standardize on GPT-5.5 for agentic coding, computer use, and knowledge-work agents, reserve GPT-5.5 Pro for deep research and the hardest evaluation-grade tasks, and keep a multi-model router in place to cover edge cases where Claude or Gemini still win on a specific metric. The cybersecurity posture change — High under Preparedness, stricter classifiers, Trusted Access for Cyber — is worth flagging to security teams now so they can route legitimate defensive use through the right channel rather than fight standard-tier guardrails. For an extended look at OpenAI's recent direction, our GPT-5.4 complete guide and Claude Opus 4.7 complete guide give the surrounding context.
Ready to Deploy Frontier AI in Production?
Choosing the right frontier model — and routing the right tasks to it — is now an architecture decision with measurable cost and capability impact. Our team helps businesses evaluate, integrate, and operate frontier models for agentic coding, computer use, and knowledge-work automation.