GPT-5.4: Computer Use, Tool Search, Benchmarks, Pricing
OpenAI releases GPT-5.4 with native computer use, 1M context, and tool search reducing tokens by 47%. Complete benchmarks, pricing, and developer guide.
Key Takeaways
OpenAI released GPT-5.4 on March 5, 2026, calling it their most capable and efficient frontier model for professional work. Available across ChatGPT, the API, and Codex, GPT-5.4 brings together reasoning, coding, computer use, and agentic workflows into a single model that marks a significant step beyond GPT-5.2 and the specialized GPT-5.3-Codex.
The headline capabilities include native computer use that surpasses human performance, a tool search feature that cuts token costs by nearly half for complex agent workflows, and a 1 million token context window in Codex. For developers and businesses building AI-powered applications, GPT-5.4 represents the strongest general-purpose model OpenAI has shipped. This guide covers the benchmarks, pricing, architecture changes, and migration considerations in detail. If you followed our GPT-5.4 preview analysis, this is the full release breakdown.
What Is GPT-5.4
GPT-5.4 is OpenAI's new flagship reasoning model, designed for professional work that requires sustained tool use, multi-step execution, and high-quality outputs across documents, code, and software environments. It replaces GPT-5.2 Thinking in ChatGPT and unifies the coding strengths of GPT-5.3-Codex with broader knowledge work and computer-use capabilities.
- Native computer use via Playwright code and screenshot-based mouse/keyboard commands
- 1M token context window in Codex (272K standard, 2x billing beyond)
- Tool search for efficient large-scale tool ecosystems and MCP servers
- Most token-efficient reasoning model yet (fewer tokens than GPT-5.2 for equivalent tasks)
- Mid-response steerability with upfront plan preambles in ChatGPT
- GPT-5.4 Pro variant for maximum performance on complex tasks
The naming jump from GPT-5.3-Codex to GPT-5.4 reflects the scope of improvement. OpenAI explicitly states this is their first mainline reasoning model that incorporates frontier coding capabilities while being deployed across all surfaces simultaneously. In the API, it is available as gpt-5.4 with a Pro variant at gpt-5.4-pro.
Knowledge Work Performance
GPT-5.4 sets a new state of the art on GDPval, OpenAI's benchmark that tests agents' abilities to produce well-specified knowledge work across 44 occupations from the top 9 industries contributing to U.S. GDP. Tasks include real work products such as sales presentations, accounting spreadsheets, urgent care schedules, manufacturing diagrams, and short videos.
| Benchmark | GPT-5.2 | GPT-5.4 | Change |
|---|---|---|---|
| GDPval (wins or ties) | 70.9% | 83.0% | +12.1 pts |
| IB Modeling Tasks (Internal) | 68.4% | 87.3% | +18.9 pts |
| Presentation Preference | baseline | 68.0% preferred | - |
| False Claims Reduction | baseline | 33% fewer | - |
| OfficeQA | 63.1% | 68.1% | +5.0 pts |
OpenAI put particular focus on improving spreadsheets, presentations, and documents. On an internal benchmark of spreadsheet modeling tasks a junior investment banking analyst might handle, GPT-5.4 achieved 87.3% compared to 68.4% for GPT-5.2. Human raters preferred GPT-5.4 presentations 68% of the time due to stronger aesthetics, greater visual variety, and more effective use of image generation.
GPT-5.4 is also the most factual model OpenAI has released. On a set of de-identified prompts where users flagged factual errors, individual claims are 33% less likely to be false and full responses are 18% less likely to contain any errors compared to GPT-5.2. For businesses that rely on accurate AI-generated content, this is a meaningful improvement in production reliability.
Computer Use and Vision
GPT-5.4 is the first general-purpose model OpenAI has released with native computer-use capabilities. It can write code to operate computers via libraries like Playwright and issue mouse and keyboard commands in response to screenshots. This dual approach makes it the strongest model currently available for developers building agents that complete real tasks across websites and software systems.
| Benchmark | GPT-5.2 | GPT-5.4 | Human Baseline |
|---|---|---|---|
| OSWorld-Verified (desktop) | 47.3% | 75.0% | 72.4% |
| WebArena-Verified (browser) | 65.4% | 67.3% | - |
| Online-Mind2Web (screenshots) | - | 92.8% | - |
| MMMU Pro (no tools) | 79.5% | 81.2% | - |
| OmniDocBench (avg error) | 0.140 | 0.109 | - |
The OSWorld result is particularly notable: GPT-5.4's 75.0% on desktop navigation tasks involving screenshots and keyboard/mouse actions surpasses the human baseline of 72.4%. The jump from GPT-5.2's 47.3% represents a generational improvement in agentic desktop capabilities.
Image input limits have also expanded across detail settings:
- Original detail: full fidelity up to 10.24M total pixels or a 6000px max dimension
- High detail: now supports up to 2.56M total pixels or a 2048px max dimension
- Early testing shows strong gains in localization, image understanding, and click accuracy
Developers can access computer-use capabilities through the updated computer tool in the API. Behavior is steerable via developer messages, meaning agents can be tuned for specific use cases. Developers can also configure custom confirmation policies that specify different levels of risk tolerance for automated actions.
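OpenAI has not published the exact shape of these confirmation policies, but the idea can be sketched as a simple risk-tier lookup. Everything here is illustrative: `RiskTier`, `CONFIRMATION_POLICY`, and `requires_confirmation` are hypothetical names, not part of the OpenAI API.

```python
from enum import Enum

class RiskTier(Enum):
    LOW = "low"        # e.g., scrolling, reading a page
    MEDIUM = "medium"  # e.g., filling out a form
    HIGH = "high"      # e.g., submitting a payment, deleting data

# Policy: which risk tiers pause the agent for human approval.
CONFIRMATION_POLICY = {
    RiskTier.LOW: False,
    RiskTier.MEDIUM: False,
    RiskTier.HIGH: True,
}

def requires_confirmation(action_risk: RiskTier) -> bool:
    """Return True if the agent should pause for human sign-off."""
    return CONFIRMATION_POLICY[action_risk]

print(requires_confirmation(RiskTier.HIGH))  # True
print(requires_confirmation(RiskTier.LOW))   # False
```

In practice the real policy would be expressed in developer messages or tool configuration rather than application code, but the core design decision is the same: decide upfront which action classes are reversible enough to automate.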
Coding Capabilities
GPT-5.4 combines the coding strengths of GPT-5.3-Codex with broader knowledge-work and computer-use capabilities. It matches or outperforms GPT-5.3-Codex on SWE-Bench Pro while offering lower latency across reasoning effort levels.
| Benchmark | GPT-5.2 | GPT-5.3-Codex | GPT-5.4 |
|---|---|---|---|
| SWE-Bench Pro (Public) | 55.6% | 56.8% | 57.7% |
| Terminal-Bench 2.0 | 62.2% | 77.3% | 75.1% |
In Codex, the /fast mode delivers up to 1.5x faster token velocity with GPT-5.4. It uses the same model and intelligence, just with faster output. Developers can access the same speeds via the API using priority processing.
OpenAI released an experimental Codex skill called "Playwright (Interactive)" that demonstrates computer use and coding working in tandem:
- Visually debug web and Electron apps during development
- Test an app it is building, as it is building it
- Excels at complex frontend tasks with more aesthetic and functional results
OpenAI demonstrated this with a theme park simulation game built from a single prompt, using Playwright Interactive for browser playtesting and image generation for isometric assets. The simulation includes tile-based path placement, ride construction, guest pathfinding, and live park metrics. For teams building web applications, this combination of visual debugging and code generation points toward a future where AI agents can iterate on frontend work with minimal human intervention.
Tool Search and Agentic Capabilities
GPT-5.4 introduces tool search in the API, a structural change to how models work with external tools. Previously, all tool definitions were included in the prompt upfront. For systems with many tools, this could add thousands or tens of thousands of tokens to every request, increasing cost, slowing responses, and crowding the context with information the model might never use.
Before tool search:
- All tool definitions in every prompt
- Thousands of tokens consumed upfront
- Cache invalidated by tool list changes
- Context crowded with unused definitions

With tool search:
- Lightweight tool list with search capability
- Full definitions loaded on demand
- Cache preserved across requests
- 47% token reduction, same accuracy
On 250 tasks from Scale's MCP Atlas benchmark with all 36 MCP servers enabled, tool search reduced total token usage by 47% while achieving the same accuracy. For MCP servers that may contain tens of thousands of tokens of tool definitions, the efficiency gains are substantial. This is particularly relevant for enterprise deployments where agents need access to large connector ecosystems.
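To get a feel for where the savings come from, here is a rough estimator of total token usage with and without tool search. All of the numbers (index size, load fraction, per-request token counts) are illustrative assumptions, not OpenAI figures:

```python
def tokens_with_upfront_tools(requests: int, prompt_tokens: int,
                              tool_def_tokens: int) -> int:
    """Every request carries the full tool definitions."""
    return requests * (prompt_tokens + tool_def_tokens)

def tokens_with_tool_search(requests: int, prompt_tokens: int,
                            tool_def_tokens: int, index_tokens: int,
                            load_fraction: float) -> int:
    """Each request carries a lightweight searchable index plus only
    the definitions the model actually loads on demand."""
    loaded = int(tool_def_tokens * load_fraction)
    return requests * (prompt_tokens + index_tokens + loaded)

# Assumed numbers: many MCP servers' definitions ~ 20,000 tokens,
# a searchable index ~ 1,000 tokens, ~10% of definitions loaded per task.
upfront = tokens_with_upfront_tools(250, 2_000, 20_000)
with_search = tokens_with_tool_search(250, 2_000, 20_000, 1_000, 0.10)
savings = 1 - with_search / upfront
print(f"{savings:.0%}")  # 77% under these assumptions
```

The realized saving depends heavily on how many definitions a task actually needs; OpenAI's reported 47% on MCP Atlas implies a higher effective load fraction than this sketch assumes.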
Agentic Tool Calling
| Benchmark | GPT-5.2 | GPT-5.4 |
|---|---|---|
| Toolathlon | 45.7% | 54.6% |
| MCP Atlas | 60.6% | 67.2% |
| BrowseComp | 65.8% | 82.7% |
| BrowseComp (Pro) | 77.9% | 89.3% |
| τ2-bench Telecom (no reasoning) | 57.2% | 64.3% |
GPT-5.4 achieves higher accuracy in fewer turns on Toolathlon, a benchmark testing how well AI agents use real-world tools and APIs to complete multi-step tasks like reading emails, grading assignments, and recording results in spreadsheets. The BrowseComp improvement is especially striking: a 17-point jump over GPT-5.2, with GPT-5.4 Pro setting a new state of the art at 89.3%.
Steerability and Mid-Response Guidance
In ChatGPT, GPT-5.4 Thinking introduces a preamble feature that outlines the model's planned approach before executing on complex queries. Users can add instructions or adjust direction mid-response, steering the model toward the exact outcome they want without starting over or requiring multiple additional turns.
- Upfront plan preamble. Model outlines its approach before executing, similar to how Codex works
- Mid-response course correction. Adjust instructions while the model is still working
- Longer thinking with better context. Model maintains stronger awareness of earlier steps across complex workflows
This feature is available now on chatgpt.com and the Android app, with iOS support coming soon. For professionals using ChatGPT for complex document creation, research synthesis, or multi-step analysis, the ability to steer mid-response means fewer wasted iterations and faster convergence on the desired output.
Pricing and Availability
GPT-5.4 is priced higher per token than GPT-5.2 to reflect its improved capabilities, while its greater token efficiency helps reduce the total number of tokens required for many tasks. Batch and Flex pricing are available at half the standard API rate, while Priority processing is available at twice the standard rate.
| API Model | Input | Cached Input | Output |
|---|---|---|---|
| gpt-5.2 | $1.75/M | $0.175/M | $14/M |
| gpt-5.4 | $2.50/M | $0.25/M | $15/M |
| gpt-5.2-pro | $21/M | - | $168/M |
| gpt-5.4-pro | $30/M | - | $180/M |
- ChatGPT: Plus, Team, and Pro users (GPT-5.4 Thinking replaces GPT-5.2 Thinking)
- GPT-5.4 Pro: Available to Pro and Enterprise plans in ChatGPT and API
- Enterprise/Edu: Enable early access via admin settings
- GPT-5.2 retirement: Available in Legacy Models for 3 months, retires June 5, 2026
- Codex: 1M context experimental (2x rate beyond 272K standard window)
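The 2x billing beyond the 272K standard window works out to a simple piecewise calculation. This sketch assumes billing scales linearly with tokens in each band, which the announcement implies but does not spell out:

```python
STANDARD_WINDOW = 272_000  # tokens billed at the standard rate
OVERFLOW_MULTIPLIER = 2.0  # rate multiplier beyond the standard window

def billed_input_units(input_tokens: int) -> float:
    """Effective billable token count for a Codex request, given
    2x billing beyond the 272K standard window."""
    standard = min(input_tokens, STANDARD_WINDOW)
    overflow = max(input_tokens - STANDARD_WINDOW, 0)
    return standard + OVERFLOW_MULTIPLIER * overflow

# A 500K-token prompt bills as 272K + 2 * 228K = 728K token-equivalents.
print(billed_input_units(500_000))  # 728000.0
```

The practical takeaway: prompts that fit under 272K tokens see no surcharge, so long-context workloads worth pushing past that line should carry enough value to justify double-rate tokens.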
The per-token input price increase from $1.75 to $2.50 (a 43% bump) is partially offset by GPT-5.4's improved token efficiency. OpenAI states it uses significantly fewer reasoning tokens than GPT-5.2 to solve equivalent problems. For workloads that were previously expensive due to long reasoning chains, the total cost may be comparable or lower. Developers should benchmark their specific use cases to determine net cost impact.
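A quick way to compare per-request costs using the pricing table above. The 30% reduction in output tokens is a hypothetical efficiency figure chosen for illustration, not an OpenAI number:

```python
# Prices per million tokens, from the pricing table above.
PRICES = {
    "gpt-5.2": {"input": 1.75, "cached": 0.175, "output": 14.0},
    "gpt-5.4": {"input": 2.50, "cached": 0.25, "output": 15.0},
}

def request_cost(model: str, input_tok: int, cached_tok: int,
                 output_tok: int) -> float:
    """Dollar cost of one request at standard API rates."""
    p = PRICES[model]
    return (input_tok * p["input"] + cached_tok * p["cached"]
            + output_tok * p["output"]) / 1_000_000

# Hypothetical task: 10K input tokens, with GPT-5.4's token efficiency
# cutting reasoning output from 20K to 14K tokens (an assumed 30% saving).
old = request_cost("gpt-5.2", 10_000, 0, 20_000)
new = request_cost("gpt-5.4", 10_000, 0, 14_000)
print(f"gpt-5.2: ${old:.4f}  gpt-5.4: ${new:.4f}")
```

Under this assumption the GPT-5.4 request is cheaper overall despite the higher per-token rates, because output tokens dominate the bill for reasoning-heavy workloads. Whether your workloads see a similar output reduction is exactly what per-use-case benchmarking should establish.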
Full Benchmark Comparison
The table below consolidates GPT-5.4's performance across academic, reasoning, and long-context benchmarks. All evaluations were run with reasoning effort set to xhigh unless noted otherwise.
| Evaluation | GPT-5.2 | GPT-5.4 | GPT-5.4 Pro |
|---|---|---|---|
| GPQA Diamond | 92.4% | 92.8% | 94.4% |
| Humanity's Last Exam (with tools) | 45.5% | 52.1% | 58.7% |
| Frontier Science Research | 25.2% | 33.0% | 36.7% |
| FrontierMath Tier 1-3 | 40.7% | 47.6% | 50.0% |
| FrontierMath Tier 4 | 18.8% | 27.1% | 38.0% |
| ARC-AGI-1 (Verified) | 86.2% | 93.7% | 94.5% |
| ARC-AGI-2 (Verified) | 52.9% | 73.3% | 83.3% |
The abstract reasoning improvements are substantial. ARC-AGI-2 jumps from 52.9% to 73.3% (GPT-5.4 Pro reaches 83.3%), demonstrating that GPT-5.4 represents a genuine reasoning advance, not just a tool-use wrapper around GPT-5.3-Codex. FrontierMath Tier 4, the hardest tier of mathematical reasoning, improves from 18.8% to 27.1%, with Pro reaching 38.0%.
Long Context Performance
GPT-5.4 is the first OpenAI model to support context lengths beyond 256K tokens. On Graphwalks BFS at the 0-128K range, it achieves 93.0% accuracy, comparable to GPT-5.2's 94.0%. At the new 256K-1M range that only GPT-5.4 supports, it achieves 21.4% on Graphwalks BFS and 32.4% on Graphwalks parents, reflecting the challenges of retrieval at extreme context lengths. On the MRCR needle-retrieval benchmark, performance remains strong through 128K (86.0%) and declines at the 512K-1M range (36.6%), which is typical of current-generation models operating at these scales.
Build with the Latest AI Models
Our team integrates cutting-edge AI models into production applications, from GPT-5.4 agentic workflows to custom multi-model architectures optimized for cost, speed, and quality.