GPT-5.3-Codex-Spark: 1,000 Tok/s Real-Time Coding
GPT-5.3-Codex-Spark delivers 1,000+ tokens/sec on Cerebras hardware with 77.3% Terminal-Bench. Benchmarks, speed-accuracy tradeoffs, and developer guide.
- Tokens per second: 1,000+
- Terminal-Bench 2.0: 77.3%
- Faster than GPT-5.3-Codex: 15x
- Context window: 128K tokens
Key Takeaways
OpenAI released GPT-5.3-Codex-Spark on February 12, 2026 — their first model designed specifically for real-time coding. This is not just a faster version of an existing model. Codex-Spark is fundamentally different: at 1,000+ tokens per second, it enables interactive pair-programming rather than the request-and-wait workflows that define current AI coding tools.
It is also the first OpenAI model deployed on Cerebras hardware instead of NVIDIA GPUs — a strategic shift that signals OpenAI's commitment to diversifying its inference infrastructure. The Cerebras Wafer-Scale Engine 3, with 4 trillion transistors across a single 46,255 mm² wafer, delivers the raw throughput needed to make real-time coding inference practical at scale.
What Is GPT-5.3-Codex-Spark?
GPT-5.3-Codex-Spark is a smaller, speed-optimized version of GPT-5.3-Codex built for near-instant response in real-time software development workflows. Where the full Codex model excels at autonomous, multi-step coding tasks, Spark is designed for the back-and-forth rhythm of interactive development — inline suggestions, rapid edits, plan revisions, and contextual Q&A that happens in milliseconds rather than seconds.
The model runs exclusively on Cerebras Wafer-Scale Engine 3 hardware, making it the first OpenAI production model to deploy outside NVIDIA's GPU ecosystem. This hardware choice is not just a business decision — the WSE-3's architecture, with 900,000 cores on a single wafer, is specifically optimized for the kind of high-throughput, low-latency inference that real-time coding demands.
- Speed: 1,000+ tokens per second, 15x faster than GPT-5.3-Codex, with time-to-first-token reduced by 50%.
- Accuracy: 77.3% on Terminal-Bench 2.0, matching the full GPT-5.3-Codex model on coding benchmarks.
- Use cases: inline suggestions, precise edits, plan revisions, contextual Q&A, and autocomplete-style code generation.
Benchmark Performance
The key story with Codex-Spark is not that it beats every model on every benchmark — it is that it matches GPT-5.3-Codex accuracy while running 15x faster. On Terminal-Bench 2.0, both models score 77.3%. On SWE-Bench Pro, Spark delivers strong performance that approaches the full Codex model. The speed advantage makes Spark practical for interactive workflows where waiting several seconds per response breaks the development flow.
| Benchmark | Codex-Spark | GPT-5.3-Codex | GPT-5.2-Codex |
|---|---|---|---|
| Terminal-Bench 2.0 | 77.3% | 77.3% | 68.1% |
| SWE-Bench Pro | 72.8% | 75.1% | 65.4% |
| Tokens/Second | 1,000+ | ~67 | ~85 |
| Time-to-First-Token | ~120ms | ~240ms | ~200ms |
| Context Window | 128K | 128K | 128K |
The 15x speed advantage is transformative for interactive coding. At 67 tokens per second, GPT-5.3-Codex takes roughly 3 seconds to generate a 200-token response. At 1,000+ tokens per second, Codex-Spark delivers the same response in under 200 milliseconds — fast enough to feel like autocomplete rather than a chat exchange.
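As a rough illustration, the quoted throughput and time-to-first-token figures translate into end-to-end response times as shown in this back-of-the-envelope TypeScript sketch. It uses only the numbers cited in the table above; nothing here is a measured value.

```typescript
// Estimate end-to-end response latency: time-to-first-token plus
// generation time at a given throughput. All figures are the quoted
// numbers from the benchmark table above, not measurements.
function responseLatencyMs(
  tokens: number,
  tokensPerSecond: number,
  ttftMs: number
): number {
  return ttftMs + (tokens / tokensPerSecond) * 1000;
}

const responseTokens = 200;
console.log(responseLatencyMs(responseTokens, 67, 240));   // GPT-5.3-Codex: ~3,225 ms
console.log(responseLatencyMs(responseTokens, 1000, 120)); // Codex-Spark:   ~320 ms
```

Generation alone accounts for roughly 200 ms of the Spark total; the quoted ~120 ms time-to-first-token makes up the rest.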
The Cerebras Hardware Story
Codex-Spark is the first milestone in the OpenAI-Cerebras partnership announced in January 2026. The model runs on Cerebras's third-generation Wafer-Scale Engine — the largest chip ever built for AI computation. The WSE-3 is not a GPU; it is a single wafer-scale processor designed from the ground up for neural network inference and training.
- 46,255 mm² die area (single wafer)
- 4 trillion transistors
- 900,000 AI-optimized cores
- 125 petaflops peak performance
- 19x more transistors than B200
- On-chip memory eliminates data movement bottlenecks
- Single-chip design avoids multi-GPU communication overhead
- Optimized for high-throughput inference workloads
The strategic significance extends beyond raw performance. By deploying a production model on Cerebras hardware, OpenAI is actively diversifying away from its near-total dependency on NVIDIA GPUs. This is a calculated move to reduce supply chain risk and negotiate better terms with hardware partners. For the broader AI industry, it validates Cerebras's wafer-scale approach as a viable alternative for inference workloads.
The partnership follows OpenAI's pattern of vertical integration — designing custom inference infrastructure rather than relying solely on commodity GPU clusters. As AI model inference costs become the dominant expense for production deployments, hardware specialization offers a path to significantly lower cost-per-token.
Speed vs Accuracy Tradeoffs
Codex-Spark is not a universal replacement for GPT-5.3-Codex. The two models serve different interaction patterns. Understanding when to use each model is critical for development teams integrating Codex into their workflows.
Codex-Spark is the better fit for:

- Precise single-file edits and refactors
- Revising plans and implementation approaches
- Contextual Q&A about codebases
- Autocomplete-style code suggestions

The full GPT-5.3-Codex remains the better choice for:

- Complex multi-file refactors
- Long-running agentic coding loops
- Deep debugging across codebases
- Architecture-level code generation
Latency Optimizations
Beyond raw token throughput, OpenAI introduced several infrastructure-level optimizations alongside Codex-Spark that reduce end-to-end latency for all Codex models:
| Optimization | Improvement | What It Reduces |
|---|---|---|
| WebSocket Optimizations | 80% reduction | Per-roundtrip overhead |
| Responses API Improvements | 30% reduction | Per-token overhead |
| Time-to-First-Token | 50% reduction | Initial response latency |
These optimizations compound with the hardware-level speed improvements. The net result is that a typical Codex-Spark interaction — send prompt, receive edited code — completes in under 300 milliseconds for most tasks. This is fast enough to integrate into keystroke-level IDE interactions without disrupting the developer's typing flow.
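To make the keystroke-level claim concrete, here is a minimal TypeScript sketch of how an editor integration might debounce keystrokes, cancel stale requests, and stream a completion. The endpoint URL and request shape are placeholders, since the research preview exposes Spark through the Codex apps rather than a public API.

```typescript
// Sketch: keystroke-level inline completions with debouncing and cancellation.
// The endpoint URL and request body below are hypothetical placeholders.
const COMPLETION_URL = "https://example.internal/codex-spark/complete"; // hypothetical

let inflight: AbortController | null = null;
let debounceTimer: ReturnType<typeof setTimeout> | undefined;

export function onKeystroke(prefix: string, onToken: (chunk: string) => void): void {
  // A burst of keystrokes should only pay for the last request.
  inflight?.abort();
  clearTimeout(debounceTimer);

  // A short debounce still fits inside a ~300 ms budget once time-to-first-token
  // (~120 ms) and generation at 1,000+ tok/s are accounted for.
  debounceTimer = setTimeout(() => {
    requestCompletion(prefix, onToken).catch(() => {
      /* aborted or failed; ignore in this sketch */
    });
  }, 75);
}

async function requestCompletion(prefix: string, onToken: (chunk: string) => void): Promise<void> {
  inflight = new AbortController();
  const started = performance.now();

  const res = await fetch(COMPLETION_URL, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ prompt: prefix, max_tokens: 64 }), // hypothetical request shape
    signal: inflight.signal,
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let firstChunk = true;

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    if (firstChunk) {
      console.debug(`time to first chunk: ${(performance.now() - started).toFixed(0)} ms`);
      firstChunk = false;
    }
    onToken(decoder.decode(value, { stream: true }));
  }
}
```

Cancelling the in-flight request on every new keystroke keeps the editor responsive even when the user types faster than responses arrive.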
Where Spark Fits in the Codex Ecosystem
With Codex-Spark, OpenAI now offers three distinct coding models, each optimized for different development workflows. Understanding the ecosystem helps teams select the right model for each task type.
| Model | Best For | Speed | Complexity |
|---|---|---|---|
| Codex-Spark | Interactive pair-programming | 1,000+ tok/s | Single-file edits |
| GPT-5.3-Codex | Autonomous coding agents | ~67 tok/s | Multi-file refactors |
| GPT-5.2-Codex | Cost-effective fallback | ~85 tok/s | Standard tasks |
The practical pattern for most development teams is to use Codex-Spark as the default for IDE integrations — inline completions, quick edits, and conversational coding — while routing complex multi-file tasks to the full GPT-5.3-Codex model. GPT-5.2-Codex serves as a cost-effective fallback for high-volume, lower-complexity tasks like code review suggestions and documentation generation.
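In practice, that routing can be as simple as a small heuristic. The sketch below is illustrative only: the signals, thresholds, and model identifier strings are assumptions, not part of any official Codex SDK.

```typescript
// Sketch: a routing heuristic for choosing a Codex model per task.
// Signals, thresholds, and identifier strings are illustrative assumptions.
type CodexModel = "codex-spark" | "gpt-5.3-codex" | "gpt-5.2-codex";

interface TaskSignals {
  filesTouched: number; // estimated number of files the change spans
  interactive: boolean; // is a developer waiting on the result right now?
  highVolume: boolean;  // batch work such as review comments or doc generation
}

function pickModel(task: TaskSignals): CodexModel {
  if (task.highVolume && !task.interactive) return "gpt-5.2-codex"; // cost-effective batch work
  if (task.filesTouched > 1) return "gpt-5.3-codex";                // multi-file reasoning
  return "codex-spark";                                             // default: fast interactive edits
}

// Example: a quick single-file fix while the developer waits.
console.log(pickModel({ filesTouched: 1, interactive: true, highVolume: false })); // "codex-spark"
```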
Availability and Access
Codex-Spark launched on February 12, 2026 as a research preview available exclusively to ChatGPT Pro subscribers. Access is provided through three interfaces, all designed for developer workflows rather than general chat.
- Codex App: web-based coding environment
- Codex CLI: terminal-based coding assistant
- VS Code: extension for inline coding
Current Limitations
- Text-only — No vision or image understanding capabilities. Code and text inputs only.
- 128K context window — Adequate for most single-file tasks but smaller than the 1M token windows offered by some competitors.
- Research preview — Rate limits apply, and availability may be constrained during peak usage as Cerebras infrastructure scales.
- ChatGPT Pro only — No API access, no free tier, no enterprise deployment options during the preview period.
- Cerebras-only infrastructure — Currently limited to Cerebras WSE-3 hardware, which constrains geographic availability and total capacity.
OpenAI has not announced a general availability date or API access timeline. The company is working with Cerebras to expand datacenter capacity, but the wafer-scale manufacturing process limits how quickly supply can scale.
Developer Guide
Getting started with Codex-Spark requires a ChatGPT Pro subscription and one of three access methods. Here is how to set up each integration and how to choose the right model for each type of task.
Getting Started
1. VS Code Extension
Install the OpenAI Codex extension from the VS Code marketplace. Sign in with your ChatGPT Pro account. Spark appears as the default model for inline completions and quick edits. The full Codex model handles multi-file tasks automatically.
2. Codex CLI
Install via `npm install -g @openai/codex`. Authenticate with your Pro account. The CLI automatically selects Spark for interactive mode and full Codex for autonomous tasks based on the command pattern.
3. Codex Web App
Access at codex.openai.com with your ChatGPT Pro account. Toggle between Spark and full Codex models in the model selector. The web app supports file uploads, project context, and collaborative editing sessions.
Choosing the Right Model
| Task Type | Recommended Model | Why |
|---|---|---|
| Inline completions | Codex-Spark | Near-instant response needed |
| Bug fix in single file | Codex-Spark | Fast iteration on edits |
| Multi-file feature implementation | GPT-5.3-Codex | Needs deeper reasoning |
| Large-scale refactor | GPT-5.3-Codex | Complex dependency analysis |
| Code review comments | GPT-5.2-Codex | Cost-effective for volume |
The ideal setup for most teams is to configure Codex-Spark as the default for VS Code inline completions and quick-edit panels, with the full GPT-5.3-Codex model available on-demand for complex tasks. This mirrors how many teams already use different AI models for different purposes — fast models for interactive work, powerful models for autonomous operations.
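One way to encode that setup is a small team-level defaults map. The keys below are illustrative and do not correspond to settings exposed by the Codex extension or CLI; they simply make the division of labor explicit.

```typescript
// Illustrative team defaults mirroring the recommendation above.
// None of these keys are real Codex extension or CLI settings.
const codexDefaults = {
  inlineCompletions: "codex-spark",  // keystroke-level suggestions and quick edits
  onDemandAgent: "gpt-5.3-codex",    // multi-file features and large refactors
  batchFallback: "gpt-5.2-codex",    // high-volume review comments and docs
} as const;

type CodexRole = keyof typeof codexDefaults;

export function modelFor(role: CodexRole): string {
  return codexDefaults[role];
}
```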
Conclusion
GPT-5.3-Codex-Spark marks a meaningful shift in how AI coding tools are built and deployed. By achieving GPT-5.3-Codex-level accuracy at 15x the speed on Cerebras hardware, OpenAI has created the first AI coding model that genuinely operates at the speed of thought. The 1,000+ tokens per second throughput enables interaction patterns — inline completions, rapid iteration, conversational debugging — that were previously impossible with frontier-class models.
The Cerebras partnership adds strategic depth beyond raw performance. OpenAI's willingness to deploy production models on non-NVIDIA hardware signals a maturing inference infrastructure strategy that prioritizes workload-specific optimization over commodity GPU scaling. For developers and engineering teams, the practical takeaway is clear: real-time AI pair-programming is no longer a research concept — it is a production capability.
Ready to Accelerate Your Development?
Whether you're integrating AI coding tools, building custom development workflows, or modernizing your engineering stack, our team can help you leverage the latest AI models for measurable productivity gains.