GPT-5.4: Computer Use, Tool Search, Benchmarks, Pricing
OpenAI releases GPT-5.4 with native computer use, 1M context, and tool search reducing tokens by 47%. Complete benchmarks, pricing, and developer guide.
Key Takeaways
OpenAI released GPT-5.4 on March 5, 2026, calling it their most capable and efficient frontier model for professional work. Available across ChatGPT, the API, and Codex, GPT-5.4 brings together reasoning, coding, computer use, and agentic workflows into a single model that marks a significant step beyond GPT-5.2 and the specialized GPT-5.3-Codex.
The headline capabilities include native computer use that surpasses human performance, a tool search feature that cuts token costs by nearly half for complex agent workflows, and a 1 million token context window in Codex. For developers and businesses building AI-powered applications, GPT-5.4 represents the strongest general-purpose model OpenAI has shipped. This guide covers the benchmarks, pricing, architecture changes, and migration considerations in detail. If you followed our GPT-5.4 preview analysis, this is the full release breakdown.
What Is GPT-5.4
GPT-5.4 is OpenAI's new flagship reasoning model, designed for professional work that requires sustained tool use, multi-step execution, and high-quality outputs across documents, code, and software environments. It replaces GPT-5.2 Thinking in ChatGPT and unifies the coding strengths of GPT-5.3-Codex with broader knowledge work and computer-use capabilities.
- Native computer use via Playwright code and screenshot-based mouse/keyboard commands
- 1M token context window in Codex (272K standard, 2x billing beyond)
- Tool search for efficient large-scale tool ecosystems and MCP servers
- Most token-efficient reasoning model yet (fewer tokens than GPT-5.2 for equivalent tasks)
- Mid-response steerability with upfront plan preambles in ChatGPT
- GPT-5.4 Pro variant for maximum performance on complex tasks
The naming jump from GPT-5.3-Codex to GPT-5.4 reflects the scope of improvement. OpenAI explicitly states this is their first mainline reasoning model that incorporates frontier coding capabilities while being deployed across all surfaces simultaneously. In the API, it is available as gpt-5.4 with a Pro variant at gpt-5.4-pro.
Knowledge Work Performance
GPT-5.4 sets a new state of the art on GDPval, OpenAI's benchmark that tests agents' abilities to produce well-specified knowledge work across 44 occupations from the top 9 industries contributing to U.S. GDP. Tasks include real work products such as sales presentations, accounting spreadsheets, urgent care schedules, manufacturing diagrams, and short videos.
| Benchmark | GPT-5.2 | GPT-5.4 | Change |
|---|---|---|---|
| GDPval (wins or ties) | 70.9% | 83.0% | +12.1 pts |
| IB Modeling Tasks (Internal) | 68.4% | 87.3% | +18.9 pts |
| Presentation Preference | baseline | 68.0% preferred | - |
| False Claims Reduction | baseline | 33% fewer | - |
| OfficeQA | 63.1% | 68.1% | +5.0 pts |
OpenAI put particular focus on improving spreadsheets, presentations, and documents. On an internal benchmark of spreadsheet modeling tasks a junior investment banking analyst might handle, GPT-5.4 achieved 87.3% compared to 68.4% for GPT-5.2. Human raters preferred GPT-5.4 presentations 68% of the time due to stronger aesthetics, greater visual variety, and more effective use of image generation.
GPT-5.4 is also the most factual model OpenAI has released. On a set of de-identified prompts where users flagged factual errors, individual claims are 33% less likely to be false and full responses are 18% less likely to contain any errors compared to GPT-5.2. For businesses that rely on accurate AI-generated content, this is a meaningful improvement in production reliability.
Computer Use and Vision
GPT-5.4 is the first general-purpose model OpenAI has released with native computer-use capabilities. It can write code to operate computers via libraries like Playwright and issue mouse and keyboard commands in response to screenshots. This dual approach makes it the strongest model currently available for developers building agents that complete real tasks across websites and software systems.
| Benchmark | GPT-5.2 | GPT-5.4 | Human Baseline |
|---|---|---|---|
| OSWorld-Verified (desktop) | 47.3% | 75.0% | 72.4% |
| WebArena-Verified (browser) | 65.4% | 67.3% | - |
| Online-Mind2Web (screenshots) | - | 92.8% | - |
| MMMU Pro (no tools) | 79.5% | 81.2% | - |
| OmniDocBench (avg error) | 0.140 | 0.109 | - |
The OSWorld result is particularly notable: GPT-5.4's 75.0% on desktop navigation tasks involving screenshots and keyboard/mouse actions surpasses the human baseline of 72.4%. The jump from GPT-5.2's 47.3% represents a generational improvement in agentic desktop capabilities.
Image input limits have also expanded across detail settings:
- Original detail: full fidelity up to 10.24M total pixels or a 6000px max dimension
- High detail: now supports up to 2.56M total pixels or a 2048px max dimension
- Early testing shows strong gains in localization, image understanding, and click accuracy
Developers can access computer-use capabilities through the updated computer tool in the API. Behavior is steerable via developer messages, meaning agents can be tuned for specific use cases. Developers can also configure custom confirmation policies that specify different levels of risk tolerance for automated actions.
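OpenAI has not published the exact shape of these confirmation policies, but the idea can be sketched as a simple risk-tier lookup. Everything here is illustrative: `RiskTier`, `CONFIRMATION_POLICY`, and `requires_confirmation` are hypothetical names, not part of the OpenAI API.

```python
from enum import Enum

class RiskTier(Enum):
    LOW = "low"        # e.g., scrolling, reading a page
    MEDIUM = "medium"  # e.g., filling out a form
    HIGH = "high"      # e.g., submitting a payment, deleting data

# Policy: which risk tiers pause the agent for human approval.
CONFIRMATION_POLICY = {
    RiskTier.LOW: False,
    RiskTier.MEDIUM: False,
    RiskTier.HIGH: True,
}

def requires_confirmation(action_risk: RiskTier) -> bool:
    """Return True if the agent should pause for human sign-off."""
    return CONFIRMATION_POLICY[action_risk]

print(requires_confirmation(RiskTier.HIGH))  # True
print(requires_confirmation(RiskTier.LOW))   # False
```

In practice the real policy would be expressed in developer messages or tool configuration rather than application code, but the core design decision is the same: decide upfront which action classes are reversible enough to automate.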
Coding Capabilities
GPT-5.4 combines the coding strengths of GPT-5.3-Codex with broader knowledge-work and computer-use capabilities. It matches or outperforms GPT-5.3-Codex on SWE-Bench Pro while offering lower latency across reasoning effort levels.
| Benchmark | GPT-5.2 | GPT-5.3-Codex | GPT-5.4 |
|---|---|---|---|
| SWE-Bench Pro (Public) | 55.6% | 56.8% | 57.7% |
| Terminal-Bench 2.0 | 62.2% | 77.3% | 75.1% |
In Codex, the /fast mode delivers up to 1.5x faster token velocity with GPT-5.4. It uses the same model and intelligence, just with faster output. Developers can access the same speeds via the API using priority processing.
OpenAI released an experimental Codex skill called "Playwright (Interactive)" that demonstrates computer use and coding working in tandem:
- Visually debug web and Electron apps during development
- Test an app it is building, as it is building it
- Excels at complex frontend tasks with more aesthetic and functional results
OpenAI demonstrated this with a theme park simulation game built from a single prompt, using Playwright Interactive for browser playtesting and image generation for isometric assets. The simulation includes tile-based path placement, ride construction, guest pathfinding, and live park metrics. For teams building web applications, this combination of visual debugging and code generation points toward a future where AI agents can iterate on frontend work with minimal human intervention.
Tool Search and Agentic Capabilities
GPT-5.4 introduces tool search in the API, a structural change to how models work with external tools. Previously, all tool definitions were included in the prompt upfront. For systems with many tools, this could add thousands or tens of thousands of tokens to every request, increasing cost, slowing responses, and crowding the context with information the model might never use.
Before tool search:
- All tool definitions in every prompt
- Thousands of tokens consumed upfront
- Cache invalidated by tool list changes
- Context crowded with unused definitions

With tool search:
- Lightweight tool list with search capability
- Full definitions loaded on demand
- Cache preserved across requests
- 47% token reduction, same accuracy
On 250 tasks from Scale's MCP Atlas benchmark with all 36 MCP servers enabled, tool search reduced total token usage by 47% while achieving the same accuracy. For MCP servers that may contain tens of thousands of tokens of tool definitions, the efficiency gains are substantial. This is particularly relevant for enterprise deployments where agents need access to large connector ecosystems.
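To get a feel for where the savings come from, here is a rough estimator of total token usage with and without tool search. All of the numbers (index size, load fraction, per-request token counts) are illustrative assumptions, not OpenAI figures:

```python
def tokens_with_upfront_tools(requests: int, prompt_tokens: int,
                              tool_def_tokens: int) -> int:
    """Every request carries the full tool definitions."""
    return requests * (prompt_tokens + tool_def_tokens)

def tokens_with_tool_search(requests: int, prompt_tokens: int,
                            tool_def_tokens: int, index_tokens: int,
                            load_fraction: float) -> int:
    """Each request carries a lightweight searchable index plus only
    the definitions the model actually loads on demand."""
    loaded = int(tool_def_tokens * load_fraction)
    return requests * (prompt_tokens + index_tokens + loaded)

# Assumed numbers: many MCP servers' definitions ~ 20,000 tokens,
# a searchable index ~ 1,000 tokens, ~10% of definitions loaded per task.
upfront = tokens_with_upfront_tools(250, 2_000, 20_000)
with_search = tokens_with_tool_search(250, 2_000, 20_000, 1_000, 0.10)
savings = 1 - with_search / upfront
print(f"{savings:.0%}")  # 77% under these assumptions
```

The realized saving depends heavily on how many definitions a task actually needs; OpenAI's reported 47% on MCP Atlas implies a higher effective load fraction than this sketch assumes.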
Agentic Tool Calling
| Benchmark | GPT-5.2 | GPT-5.4 |
|---|---|---|
| Toolathlon | 45.7% | 54.6% |
| MCP Atlas | 60.6% | 67.2% |
| BrowseComp | 65.8% | 82.7% |
| BrowseComp (Pro) | 77.9% | 89.3% |
| τ2-bench Telecom (no reasoning) | 57.2% | 64.3% |
GPT-5.4 achieves higher accuracy in fewer turns on Toolathlon, a benchmark testing how well AI agents use real-world tools and APIs to complete multi-step tasks like reading emails, grading assignments, and recording results in spreadsheets. The BrowseComp improvement is especially striking: a 17-point jump over GPT-5.2, with GPT-5.4 Pro setting a new state of the art at 89.3%.
Steerability and Mid-Response Guidance
In ChatGPT, GPT-5.4 Thinking introduces a preamble feature that outlines the model's planned approach before executing on complex queries. Users can add instructions or adjust direction mid-response, steering the model toward the exact outcome they want without starting over or requiring multiple additional turns.
- Upfront plan preamble. Model outlines its approach before executing, similar to how Codex works
- Mid-response course correction. Adjust instructions while the model is still working
- Longer thinking with better context. Model maintains stronger awareness of earlier steps across complex workflows
This feature is available now on chatgpt.com and the Android app, with iOS support coming soon. For professionals using ChatGPT for complex document creation, research synthesis, or multi-step analysis, the ability to steer mid-response means fewer wasted iterations and faster convergence on the desired output.
Pricing and Availability
GPT-5.4 is priced higher per token than GPT-5.2 to reflect its improved capabilities, while its greater token efficiency helps reduce the total number of tokens required for many tasks. Batch and Flex pricing are available at half the standard API rate, while Priority processing is available at twice the standard rate.
| API Model | Input | Cached Input | Output |
|---|---|---|---|
| gpt-5.2 | $1.75/M | $0.175/M | $14/M |
| gpt-5.4 | $2.50/M | $0.25/M | $15/M |
| gpt-5.2-pro | $21/M | - | $168/M |
| gpt-5.4-pro | $30/M | - | $180/M |
- ChatGPT: Plus, Team, and Pro users (GPT-5.4 Thinking replaces GPT-5.2 Thinking)
- GPT-5.4 Pro: Available to Pro and Enterprise plans in ChatGPT and API
- Enterprise/Edu: Enable early access via admin settings
- GPT-5.2 retirement: Available in Legacy Models for 3 months, retires June 5, 2026
- Codex: 1M context experimental (2x rate beyond 272K standard window)
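The 2x billing beyond the 272K standard window works out to a simple piecewise calculation. This sketch assumes billing scales linearly with tokens in each band, which the announcement implies but does not spell out:

```python
STANDARD_WINDOW = 272_000  # tokens billed at the standard rate
OVERFLOW_MULTIPLIER = 2.0  # rate multiplier beyond the standard window

def billed_input_units(input_tokens: int) -> float:
    """Effective billable token count for a Codex request, given
    2x billing beyond the 272K standard window."""
    standard = min(input_tokens, STANDARD_WINDOW)
    overflow = max(input_tokens - STANDARD_WINDOW, 0)
    return standard + OVERFLOW_MULTIPLIER * overflow

# A 500K-token prompt bills as 272K + 2 * 228K = 728K token-equivalents.
print(billed_input_units(500_000))  # 728000.0
```

The practical takeaway: prompts that fit under 272K tokens see no surcharge, so long-context workloads worth pushing past that line should carry enough value to justify double-rate tokens.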
The per-token input price increase from $1.75 to $2.50 (a 43% bump) is partially offset by GPT-5.4's improved token efficiency. OpenAI states it uses significantly fewer reasoning tokens than GPT-5.2 to solve equivalent problems. For workloads that were previously expensive due to long reasoning chains, the total cost may be comparable or lower. Developers should benchmark their specific use cases to determine net cost impact.
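A quick way to compare per-request costs using the pricing table above. The 30% reduction in output tokens is a hypothetical efficiency figure chosen for illustration, not an OpenAI number:

```python
# Prices per million tokens, from the pricing table above.
PRICES = {
    "gpt-5.2": {"input": 1.75, "cached": 0.175, "output": 14.0},
    "gpt-5.4": {"input": 2.50, "cached": 0.25, "output": 15.0},
}

def request_cost(model: str, input_tok: int, cached_tok: int,
                 output_tok: int) -> float:
    """Dollar cost of one request at standard API rates."""
    p = PRICES[model]
    return (input_tok * p["input"] + cached_tok * p["cached"]
            + output_tok * p["output"]) / 1_000_000

# Hypothetical task: 10K input tokens, with GPT-5.4's token efficiency
# cutting reasoning output from 20K to 14K tokens (an assumed 30% saving).
old = request_cost("gpt-5.2", 10_000, 0, 20_000)
new = request_cost("gpt-5.4", 10_000, 0, 14_000)
print(f"gpt-5.2: ${old:.4f}  gpt-5.4: ${new:.4f}")
```

Under this assumption the GPT-5.4 request is cheaper overall despite the higher per-token rates, because output tokens dominate the bill for reasoning-heavy workloads. Whether your workloads see a similar output reduction is exactly what per-use-case benchmarking should establish.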
Full Benchmark Comparison
The table below consolidates GPT-5.4's performance across academic, reasoning, and long-context benchmarks. All evaluations were run with reasoning effort set to xhigh unless noted otherwise.
| Evaluation | GPT-5.2 | GPT-5.4 | GPT-5.4 Pro |
|---|---|---|---|
| GPQA Diamond | 92.4% | 92.8% | 94.4% |
| Humanity's Last Exam (with tools) | 45.5% | 52.1% | 58.7% |
| Frontier Science Research | 25.2% | 33.0% | 36.7% |
| FrontierMath Tier 1-3 | 40.7% | 47.6% | 50.0% |
| FrontierMath Tier 4 | 18.8% | 27.1% | 38.0% |
| ARC-AGI-1 (Verified) | 86.2% | 93.7% | 94.5% |
| ARC-AGI-2 (Verified) | 52.9% | 73.3% | 83.3% |
The abstract reasoning improvements are substantial. ARC-AGI-2 jumps from 52.9% to 73.3% (GPT-5.4 Pro reaches 83.3%), demonstrating that GPT-5.4 represents a genuine reasoning advance, not just a tool-use wrapper around GPT-5.3-Codex. FrontierMath Tier 4, the hardest tier of mathematical reasoning, improves from 18.8% to 27.1%, with Pro reaching 38.0%.
Long Context Performance
GPT-5.4 is the first OpenAI model to support context lengths beyond 256K tokens. On Graphwalks BFS at the 0-128K range, it achieves 93.0% accuracy, comparable to GPT-5.2's 94.0%. At the new 256K-1M range that only GPT-5.4 supports, it achieves 21.4% on Graphwalks BFS and 32.4% on Graphwalks parents, reflecting the challenges of retrieval at extreme context lengths. On the MRCR needle-retrieval benchmark, performance remains strong through 128K (86.0%) and declines at the 512K-1M range (36.6%), which is typical of current-generation models operating at these scales.
Build with the Latest AI Models
Our team integrates cutting-edge AI models into production applications, from GPT-5.4 agentic workflows to custom multi-model architectures optimized for cost, speed, and quality.