GPT-5.3-Codex-Spark: 1,000 Tok/s Real-Time Coding
GPT-5.3-Codex-Spark delivers 1,000+ tokens/sec on Cerebras hardware with 77.3% Terminal-Bench. Benchmarks, speed-accuracy tradeoffs, and developer guide.
- Tokens per second: 1,000+
- Terminal-Bench 2.0: 77.3%
- Faster than GPT-5.3-Codex: 15x
- Context window: 128K tokens
Key Takeaways
OpenAI released GPT-5.3-Codex-Spark on February 12, 2026 — their first model designed specifically for real-time coding. This is not just a faster version of an existing model. Codex-Spark is fundamentally different: at 1,000+ tokens per second, it enables interactive pair-programming rather than the request-and-wait workflows that define current AI coding tools.
It is also the first OpenAI model deployed on Cerebras hardware instead of NVIDIA GPUs — a strategic shift that signals OpenAI's commitment to diversifying its inference infrastructure. The Cerebras Wafer-Scale Engine 3, with 4 trillion transistors across a single 46,255 mm² wafer, delivers the raw throughput needed to make real-time coding inference practical at scale.
What Is GPT-5.3-Codex-Spark?
GPT-5.3-Codex-Spark is a smaller, speed-optimized version of GPT-5.3-Codex built for near-instant response in real-time software development workflows. Where the full Codex model excels at autonomous, multi-step coding tasks, Spark is designed for the back-and-forth rhythm of interactive development — inline suggestions, rapid edits, plan revisions, and contextual Q&A that happens in milliseconds rather than seconds.
The model runs exclusively on Cerebras Wafer-Scale Engine 3 hardware, making it the first OpenAI production model to deploy outside NVIDIA's GPU ecosystem. This hardware choice is not just a business decision — the WSE-3's architecture, with 900,000 cores on a single wafer, is specifically optimized for the kind of high-throughput, low-latency inference that real-time coding demands.
- Speed: 1,000+ tokens per second, 15x faster than GPT-5.3-Codex, with time-to-first-token reduced by 50%.
- Accuracy: 77.3% on Terminal-Bench 2.0, matching the full GPT-5.3-Codex model on coding benchmarks.
- Use cases: inline suggestions, precise edits, plan revisions, contextual Q&A, and autocomplete-style code generation.
Benchmark Performance
The key story with Codex-Spark is not that it beats every model on every benchmark — it is that it matches GPT-5.3-Codex accuracy while running 15x faster. On Terminal-Bench 2.0, both models score 77.3%. On SWE-Bench Pro, Spark delivers strong performance that approaches the full Codex model. The speed advantage makes Spark practical for interactive workflows where waiting several seconds per response breaks the development flow.
| Benchmark | Codex-Spark | GPT-5.3-Codex | GPT-5.2-Codex |
|---|---|---|---|
| Terminal-Bench 2.0 | 77.3% | 77.3% | 68.1% |
| SWE-Bench Pro | 72.8% | 75.1% | 65.4% |
| Tokens/Second | 1,000+ | ~67 | ~85 |
| Time-to-First-Token | ~120ms | ~240ms | ~200ms |
| Context Window | 128K | 128K | 128K |
The 15x speed advantage is transformative for interactive coding. At 67 tokens per second, GPT-5.3-Codex takes roughly 3 seconds to generate a 200-token response. At 1,000+ tokens per second, Codex-Spark delivers the same response in under 200 milliseconds — fast enough to feel like autocomplete rather than a chat exchange.
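As a rough illustration, the quoted throughput and time-to-first-token figures translate into end-to-end response times as shown in this back-of-the-envelope TypeScript sketch. It uses only the numbers cited in the table above; nothing here is a measured value.

```typescript
// Estimate end-to-end response latency: time-to-first-token plus
// generation time at a given throughput. All figures are the quoted
// numbers from the benchmark table above, not measurements.
function responseLatencyMs(
  tokens: number,
  tokensPerSecond: number,
  ttftMs: number
): number {
  return ttftMs + (tokens / tokensPerSecond) * 1000;
}

const responseTokens = 200;
console.log(responseLatencyMs(responseTokens, 67, 240));   // GPT-5.3-Codex: ~3,225 ms
console.log(responseLatencyMs(responseTokens, 1000, 120)); // Codex-Spark:   ~320 ms
```

Generation alone accounts for roughly 200 ms of the Spark total; the quoted ~120 ms time-to-first-token makes up the rest.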
The Cerebras Hardware Story
Codex-Spark is the first milestone in the OpenAI-Cerebras partnership announced in January 2026. The model runs on Cerebras's third-generation Wafer-Scale Engine — the largest chip ever built for AI computation. The WSE-3 is not a GPU; it is a single wafer-scale processor designed from the ground up for neural network inference and training.
- 46,255 mm² die area (single wafer)
- 4 trillion transistors
- 900,000 AI-optimized cores
- 125 petaflops peak performance
- 19x more transistors than B200
- On-chip memory eliminates data movement bottlenecks
- Single-chip design avoids multi-GPU communication overhead
- Optimized for high-throughput inference workloads
The strategic significance extends beyond raw performance. By deploying a production model on Cerebras hardware, OpenAI is actively diversifying away from its near-total dependency on NVIDIA GPUs. This is a calculated move to reduce supply chain risk and negotiate better terms with hardware partners. For the broader AI industry, it validates Cerebras's wafer-scale approach as a viable alternative for inference workloads.
The partnership follows OpenAI's pattern of vertical integration — designing custom inference infrastructure rather than relying solely on commodity GPU clusters. As AI model inference costs become the dominant expense for production deployments, hardware specialization offers a path to significantly lower cost-per-token.
Speed vs Accuracy Tradeoffs
Codex-Spark is not a universal replacement for GPT-5.3-Codex. The two models serve different interaction patterns. Understanding when to use each model is critical for development teams integrating Codex into their workflows.
Codex-Spark is the better fit for:

- Precise single-file edits and refactors
- Revising plans and implementation approaches
- Contextual Q&A about codebases
- Autocomplete-style code suggestions

The full GPT-5.3-Codex remains the better choice for:

- Complex multi-file refactors
- Long-running agentic coding loops
- Deep debugging across codebases
- Architecture-level code generation
Latency Optimizations
Beyond raw token throughput, OpenAI introduced several infrastructure-level optimizations alongside Codex-Spark that reduce end-to-end latency for all Codex models:
| Optimization | Improvement | What It Reduces |
|---|---|---|
| WebSocket Optimizations | 80% reduction | Per-roundtrip overhead |
| Responses API Improvements | 30% reduction | Per-token overhead |
| Time-to-First-Token | 50% reduction | Initial response latency |
These optimizations compound with the hardware-level speed improvements. The net result is that a typical Codex-Spark interaction — send prompt, receive edited code — completes in under 300 milliseconds for most tasks. This is fast enough to integrate into keystroke-level IDE interactions without disrupting the developer's typing flow.
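To make the keystroke-level claim concrete, here is a minimal TypeScript sketch of how an editor integration might debounce keystrokes, cancel stale requests, and stream a completion. The endpoint URL and request shape are placeholders, since the research preview exposes Spark through the Codex apps rather than a public API.

```typescript
// Sketch: keystroke-level inline completions with debouncing and cancellation.
// The endpoint URL and request body below are hypothetical placeholders.
const COMPLETION_URL = "https://example.internal/codex-spark/complete"; // hypothetical

let inflight: AbortController | null = null;
let debounceTimer: ReturnType<typeof setTimeout> | undefined;

export function onKeystroke(prefix: string, onToken: (chunk: string) => void): void {
  // A burst of keystrokes should only pay for the last request.
  inflight?.abort();
  clearTimeout(debounceTimer);

  // A short debounce still fits inside a ~300 ms budget once time-to-first-token
  // (~120 ms) and generation at 1,000+ tok/s are accounted for.
  debounceTimer = setTimeout(() => {
    requestCompletion(prefix, onToken).catch(() => {
      /* aborted or failed; ignore in this sketch */
    });
  }, 75);
}

async function requestCompletion(prefix: string, onToken: (chunk: string) => void): Promise<void> {
  inflight = new AbortController();
  const started = performance.now();

  const res = await fetch(COMPLETION_URL, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ prompt: prefix, max_tokens: 64 }), // hypothetical request shape
    signal: inflight.signal,
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let firstChunk = true;

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    if (firstChunk) {
      console.debug(`time to first chunk: ${(performance.now() - started).toFixed(0)} ms`);
      firstChunk = false;
    }
    onToken(decoder.decode(value, { stream: true }));
  }
}
```

Cancelling the in-flight request on every new keystroke keeps the editor responsive even when the user types faster than responses arrive.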
Where Spark Fits in the Codex Ecosystem
With Codex-Spark, OpenAI now offers three distinct coding models, each optimized for different development workflows. Understanding the ecosystem helps teams select the right model for each task type.
| Model | Best For | Speed | Complexity |
|---|---|---|---|
| Codex-Spark | Interactive pair-programming | 1,000+ tok/s | Single-file edits |
| GPT-5.3-Codex | Autonomous coding agents | ~67 tok/s | Multi-file refactors |
| GPT-5.2-Codex | Cost-effective fallback | ~85 tok/s | Standard tasks |
The practical pattern for most development teams is to use Codex-Spark as the default for IDE integrations — inline completions, quick edits, and conversational coding — while routing complex multi-file tasks to the full GPT-5.3-Codex model. GPT-5.2-Codex serves as a cost-effective fallback for high-volume, lower-complexity tasks like code review suggestions and documentation generation.
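In practice, that routing can be as simple as a small heuristic. The sketch below is illustrative only: the signals, thresholds, and model identifier strings are assumptions, not part of any official Codex SDK.

```typescript
// Sketch: a routing heuristic for choosing a Codex model per task.
// Signals, thresholds, and identifier strings are illustrative assumptions.
type CodexModel = "codex-spark" | "gpt-5.3-codex" | "gpt-5.2-codex";

interface TaskSignals {
  filesTouched: number; // estimated number of files the change spans
  interactive: boolean; // is a developer waiting on the result right now?
  highVolume: boolean;  // batch work such as review comments or doc generation
}

function pickModel(task: TaskSignals): CodexModel {
  if (task.highVolume && !task.interactive) return "gpt-5.2-codex"; // cost-effective batch work
  if (task.filesTouched > 1) return "gpt-5.3-codex";                // multi-file reasoning
  return "codex-spark";                                             // default: fast interactive edits
}

// Example: a quick single-file fix while the developer waits.
console.log(pickModel({ filesTouched: 1, interactive: true, highVolume: false })); // "codex-spark"
```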
Availability and Access
Codex-Spark launched on February 12, 2026 as a research preview available exclusively to ChatGPT Pro subscribers. Access is provided through three interfaces, all designed for developer workflows rather than general chat.
- Codex App: web-based coding environment
- Codex CLI: terminal-based coding assistant
- VS Code: extension for inline coding
Current Limitations
- Text-only — No vision or image understanding capabilities. Code and text inputs only.
- 128K context window — Adequate for most single-file tasks but smaller than the 1M token windows offered by some competitors.
- Research preview — Rate limits apply, and availability may be constrained during peak usage as Cerebras infrastructure scales.
- ChatGPT Pro only — No API access, no free tier, no enterprise deployment options during the preview period.
- Cerebras-only infrastructure — Currently limited to Cerebras WSE-3 hardware, which constrains geographic availability and total capacity.
OpenAI has not announced a general availability date or API access timeline. The company is working with Cerebras to expand datacenter capacity, but the wafer-scale manufacturing process limits how quickly supply can scale.
Developer Guide
Getting started with Codex-Spark requires a ChatGPT Pro subscription and one of three access methods. Here is how to set up each integration and how to choose the right model for each type of task.
Getting Started
1. VS Code Extension
Install the OpenAI Codex extension from the VS Code marketplace. Sign in with your ChatGPT Pro account. Spark appears as the default model for inline completions and quick edits. The full Codex model handles multi-file tasks automatically.
2. Codex CLI
Install via `npm install -g @openai/codex`. Authenticate with your Pro account. The CLI automatically selects Spark for interactive mode and full Codex for autonomous tasks based on the command pattern.
3. Codex Web App
Access at codex.openai.com with your ChatGPT Pro account. Toggle between Spark and full Codex models in the model selector. The web app supports file uploads, project context, and collaborative editing sessions.
Choosing the Right Model
| Task Type | Recommended Model | Why |
|---|---|---|
| Inline completions | Codex-Spark | Near-instant response needed |
| Bug fix in single file | Codex-Spark | Fast iteration on edits |
| Multi-file feature implementation | GPT-5.3-Codex | Needs deeper reasoning |
| Large-scale refactor | GPT-5.3-Codex | Complex dependency analysis |
| Code review comments | GPT-5.2-Codex | Cost-effective for volume |
The ideal setup for most teams is to configure Codex-Spark as the default for VS Code inline completions and quick-edit panels, with the full GPT-5.3-Codex model available on-demand for complex tasks. This mirrors how many teams already use different AI models for different purposes — fast models for interactive work, powerful models for autonomous operations.
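One way to encode that setup is a small team-level defaults map. The keys below are illustrative and do not correspond to settings exposed by the Codex extension or CLI; they simply make the division of labor explicit.

```typescript
// Illustrative team defaults mirroring the recommendation above.
// None of these keys are real Codex extension or CLI settings.
const codexDefaults = {
  inlineCompletions: "codex-spark",  // keystroke-level suggestions and quick edits
  onDemandAgent: "gpt-5.3-codex",    // multi-file features and large refactors
  batchFallback: "gpt-5.2-codex",    // high-volume review comments and docs
} as const;

type CodexRole = keyof typeof codexDefaults;

export function modelFor(role: CodexRole): string {
  return codexDefaults[role];
}
```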
Conclusion
GPT-5.3-Codex-Spark marks a meaningful shift in how AI coding tools are built and deployed. By achieving GPT-5.3-Codex-level accuracy at 15x the speed on Cerebras hardware, OpenAI has created the first AI coding model that genuinely operates at the speed of thought. The 1,000+ tokens per second throughput enables interaction patterns — inline completions, rapid iteration, conversational debugging — that were previously impossible with frontier-class models.
The Cerebras partnership adds strategic depth beyond raw performance. OpenAI's willingness to deploy production models on non-NVIDIA hardware signals a maturing inference infrastructure strategy that prioritizes workload-specific optimization over commodity GPU scaling. For developers and engineering teams, the practical takeaway is clear: real-time AI pair-programming is no longer a research concept — it is a production capability.
Ready to Accelerate Your Development?
Whether you're integrating AI coding tools, building custom development workflows, or modernizing your engineering stack, our team can help you leverage the latest AI models for measurable productivity gains.