
GPT-5.3-Codex-Spark: 1,000 Tok/s Real-Time Coding

GPT-5.3-Codex-Spark delivers 1,000+ tokens/sec on Cerebras hardware with 77.3% on Terminal-Bench 2.0. Benchmarks, speed-accuracy tradeoffs, and a developer guide.

Digital Applied Team
February 18, 2026
8 min read
  • 1,000+ tokens per second
  • 77.3% on Terminal-Bench 2.0
  • 15x faster than GPT-5.3-Codex
  • 128K-token context window

Key Takeaways

First real-time coding model: GPT-5.3-Codex-Spark delivers 1,000+ tokens per second, designed for interactive coding where responsiveness matters as much as intelligence.
Cerebras hardware partnership: First OpenAI model deployed outside NVIDIA infrastructure, running on Cerebras Wafer-Scale Engine 3 with 4 trillion transistors.
Strong coding performance: 77.3% on Terminal-Bench 2.0, matching GPT-5.3-Codex accuracy while running 15x faster. Time-to-first-token cut in half.
Research preview status: Available to ChatGPT Pro users through Codex app, CLI, and VS Code extension. Text-only with 128K context window.
80% lower latency overhead: WebSocket optimizations and Responses API improvements cut per-roundtrip overhead by 80% and per-token overhead by 30%.

OpenAI released GPT-5.3-Codex-Spark on February 12, 2026 — their first model designed specifically for real-time coding. This is not just a faster version of an existing model. Codex-Spark is fundamentally different: at 1,000+ tokens per second, it enables interactive pair-programming rather than the request-and-wait workflows that define current AI coding tools.

It is also the first OpenAI model deployed on Cerebras hardware instead of NVIDIA GPUs — a strategic shift that signals OpenAI's commitment to diversifying its inference infrastructure. The Cerebras Wafer-Scale Engine 3, with 4 trillion transistors across a single 46,225 mm² wafer, delivers the raw throughput needed to make real-time coding inference practical at scale.

What Is GPT-5.3-Codex-Spark?

GPT-5.3-Codex-Spark is a smaller, speed-optimized version of GPT-5.3-Codex built for near-instant response in real-time software development workflows. Where the full Codex model excels at autonomous, multi-step coding tasks, Spark is designed for the back-and-forth rhythm of interactive development — inline suggestions, rapid edits, plan revisions, and contextual Q&A that happens in milliseconds rather than seconds.

The model runs exclusively on Cerebras Wafer-Scale Engine 3 hardware, making it the first OpenAI production model to deploy outside NVIDIA's GPU ecosystem. This hardware choice is not just a business decision — the WSE-3's architecture, with 900,000 cores on a single wafer, is specifically optimized for the kind of high-throughput, low-latency inference that real-time coding demands.

  • Speed: 1,000+ tokens per second — 15x faster than GPT-5.3-Codex, with time-to-first-token reduced by 50%.
  • Accuracy: 77.3% on Terminal-Bench 2.0, matching the full GPT-5.3-Codex model on coding benchmarks.
  • Use cases: Inline suggestions, precise edits, plan revisions, contextual Q&A, and autocomplete-style code generation.

Benchmark Performance

The key story with Codex-Spark is not that it beats every model on every benchmark — it is that it matches GPT-5.3-Codex accuracy while running 15x faster. On Terminal-Bench 2.0, both models score 77.3%. On SWE-Bench Pro, Spark delivers strong performance that approaches the full Codex model. The speed advantage makes Spark practical for interactive workflows where waiting several seconds per response breaks the development flow.

Benchmark           | Codex-Spark | GPT-5.3-Codex | GPT-5.2-Codex
--------------------|-------------|---------------|--------------
Terminal-Bench 2.0  | 77.3%       | 77.3%         | 68.1%
SWE-Bench Pro       | 72.8%       | 75.1%         | 65.4%
Tokens/second       | 1,000+      | ~67           | ~85
Time-to-first-token | ~120 ms     | ~240 ms       | ~200 ms
Context window      | 128K        | 128K          | 128K

The 15x speed advantage is transformative for interactive coding. At 67 tokens per second, GPT-5.3-Codex takes roughly 3 seconds to generate a 200-token response. At 1,000+ tokens per second, Codex-Spark generates those same 200 tokens in under 200 milliseconds of decode time — fast enough to feel like autocomplete rather than a chat exchange.
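
To make the arithmetic concrete, here is a minimal TypeScript sketch that estimates end-to-end response time from the throughput and time-to-first-token figures in the table above. The figures are the article's; the helper itself is illustrative.

```typescript
// Estimate end-to-end latency for a streamed response:
// total = time-to-first-token + decode time (output tokens / throughput).
// Throughput and TTFT figures come from the benchmark table above.

interface ModelProfile {
  name: string;
  ttftMs: number;        // time-to-first-token, in milliseconds
  tokensPerSec: number;  // sustained decode throughput
}

const spark: ModelProfile = { name: "Codex-Spark", ttftMs: 120, tokensPerSec: 1000 };
const codex: ModelProfile = { name: "GPT-5.3-Codex", ttftMs: 240, tokensPerSec: 67 };

function responseTimeMs(model: ModelProfile, outputTokens: number): number {
  return model.ttftMs + (outputTokens / model.tokensPerSec) * 1000;
}

// A typical 200-token inline edit:
console.log(responseTimeMs(spark, 200)); // 320 ms: feels like autocomplete
console.log(responseTimeMs(codex, 200)); // ~3225 ms: a visible pause
```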

The Cerebras Hardware Story

Codex-Spark is the first milestone in the OpenAI-Cerebras partnership announced in January 2026. The model runs on Cerebras's third-generation Wafer-Scale Engine — the largest chip ever built for AI computation. The WSE-3 is not a GPU; it is a single wafer-scale processor designed from the ground up for neural network inference and training.

WSE-3 Specifications
The largest AI chip ever manufactured
  • 46,225 mm² die area (single wafer)
  • 4 trillion transistors
  • 900,000 AI-optimized cores
  • 125 petaflops peak performance
vs NVIDIA B200
How Cerebras compares to NVIDIA's latest
  • 19x more transistors than B200
  • On-chip memory eliminates data movement bottlenecks
  • Single-chip design avoids multi-GPU communication overhead
  • Optimized for high-throughput inference workloads

The strategic significance extends beyond raw performance. By deploying a production model on Cerebras hardware, OpenAI is actively diversifying away from its near-total dependency on NVIDIA GPUs. This is a calculated move to reduce supply chain risk and negotiate better terms with hardware partners. For the broader AI industry, it validates Cerebras's wafer-scale approach as a viable alternative for inference workloads.

The partnership follows OpenAI's pattern of vertical integration — designing custom inference infrastructure rather than relying solely on commodity GPU clusters. As AI model inference costs become the dominant expense for production deployments, hardware specialization offers a path to significantly lower cost-per-token.

Speed vs Accuracy Tradeoffs

Codex-Spark is not a universal replacement for GPT-5.3-Codex. The two models serve different interaction patterns. Understanding when to use each model is critical for development teams integrating Codex into their workflows.

Spark Excels At
Interactive, real-time coding tasks
  • Precise single-file edits and refactors
  • Revising plans and implementation approaches
  • Contextual Q&A about codebases
  • Autocomplete-style code suggestions
Full Codex Better For
Complex, autonomous coding tasks
  • Complex multi-file refactors
  • Long-running agentic coding loops
  • Deep debugging across codebases
  • Architecture-level code generation

Latency Optimizations

Beyond raw token throughput, OpenAI introduced several infrastructure-level optimizations alongside Codex-Spark that reduce end-to-end latency for all Codex models:

Optimization               | Improvement   | Impact
---------------------------|---------------|-------------------------
WebSocket optimizations    | 80% reduction | Per-roundtrip overhead
Responses API improvements | 30% reduction | Per-token overhead
Time-to-first-token        | 50% reduction | Initial response latency

These optimizations compound with the hardware-level speed improvements. The net result is that a typical Codex-Spark interaction — send prompt, receive edited code — completes in under 300 milliseconds for most tasks. This is fast enough to integrate into keystroke-level IDE interactions without disrupting the developer's typing flow.
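
The wire protocol behind these numbers is not public, but the measurement pattern is easy to reproduce against any token-streaming endpoint. Below is a minimal sketch assuming a hypothetical WebSocket endpoint and message shape (placeholders, not the documented Codex API), using the WebSocket client built into Node 22+ and modern browsers.

```typescript
// Measure time-to-first-token over a streaming WebSocket connection.
// ENDPOINT and the request/response shapes are hypothetical placeholders;
// the article does not document Codex-Spark's wire protocol.

const ENDPOINT = "wss://example.invalid/codex-spark"; // placeholder URL

const ws = new WebSocket(ENDPOINT); // global WebSocket: Node 22+, browsers

ws.addEventListener("open", () => {
  let sentAt = 0;
  let firstTokenSeen = false;

  ws.addEventListener("message", (event) => {
    if (!firstTokenSeen) {
      firstTokenSeen = true;
      console.log(`TTFT: ${(performance.now() - sentAt).toFixed(0)} ms`);
    }
    // ...accumulate streamed tokens from event.data here...
  });

  sentAt = performance.now();
  ws.send(JSON.stringify({ prompt: "rename this function", stream: true }));
});
```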

Where Spark Fits in the Codex Ecosystem

With Codex-Spark, OpenAI now offers three distinct coding models, each optimized for different development workflows. Understanding the ecosystem helps teams select the right model for each task type.

Model         | Best For                     | Speed        | Complexity
--------------|------------------------------|--------------|----------------------
Codex-Spark   | Interactive pair-programming | 1,000+ tok/s | Single-file edits
GPT-5.3-Codex | Autonomous coding agents     | ~67 tok/s    | Multi-file refactors
GPT-5.2-Codex | Cost-effective fallback      | ~85 tok/s    | Standard tasks

The practical pattern for most development teams is to use Codex-Spark as the default for IDE integrations — inline completions, quick edits, and conversational coding — while routing complex multi-file tasks to the full GPT-5.3-Codex model. GPT-5.2-Codex serves as a cost-effective fallback for high-volume, lower-complexity tasks like code review suggestions and documentation generation.
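
That routing pattern reduces to a small function. In the sketch below, the task fields and model identifiers are illustrative assumptions, since the article does not specify API model names.

```typescript
// Minimal three-tier routing sketch following the pattern described above.
// Task fields and model identifiers are illustrative assumptions; the
// article does not specify API model names.

type CodexModel = "codex-spark" | "gpt-5.3-codex" | "gpt-5.2-codex";

interface CodingTask {
  filesTouched: number;  // estimated scope of the change
  interactive: boolean;  // is a developer waiting on the response?
  highVolume: boolean;   // batch work such as review comments or docs
}

function routeModel(task: CodingTask): CodexModel {
  if (task.highVolume) return "gpt-5.2-codex";  // cost-effective fallback
  if (task.interactive && task.filesTouched <= 1) {
    return "codex-spark";                       // real-time, single-file work
  }
  return "gpt-5.3-codex";                       // multi-file or autonomous tasks
}

// Inline completion routes to Spark; a 12-file refactor routes to full Codex.
routeModel({ filesTouched: 1, interactive: true, highVolume: false });   // "codex-spark"
routeModel({ filesTouched: 12, interactive: false, highVolume: false }); // "gpt-5.3-codex"
```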

Availability and Access

Codex-Spark launched on February 12, 2026 as a research preview available exclusively to ChatGPT Pro subscribers. Access is provided through three interfaces, all designed for developer workflows rather than general chat.

  • Codex App: Web-based coding environment
  • Codex CLI: Terminal-based coding assistant
  • VS Code: Extension for inline coding

Current Limitations

  • Text-only — No vision or image understanding capabilities. Code and text inputs only.
  • 128K context window — Adequate for most single-file tasks but smaller than the 1M-token windows offered by some competitors; a rough fit-check sketch follows this list.
  • Research preview — Rate limits apply, and availability may be constrained during peak usage as Cerebras infrastructure scales.
  • ChatGPT Pro only — No API access, no free tier, no enterprise deployment options during the preview period.
  • Cerebras-only infrastructure — Currently limited to Cerebras WSE-3 hardware, which constrains geographic availability and total capacity.
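
For integrations that need to decide up front whether a file will fit in the 128K window, a rough pre-flight check suffices. The sketch below uses the common four-characters-per-token rule of thumb, an approximation rather than an official tokenizer figure.

```typescript
// Rough pre-flight check: will a file plus prompt fit in the 128K-token
// context window? The 4-characters-per-token ratio is a rule of thumb,
// not an official tokenizer figure.

const CONTEXT_WINDOW_TOKENS = 128_000;

function roughTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

function fitsInContext(fileText: string, promptOverheadTokens = 2_000): boolean {
  return roughTokenCount(fileText) + promptOverheadTokens <= CONTEXT_WINDOW_TOKENS;
}

// Example: a 300 KB source file (~75K tokens) fits; a 600 KB file does not.
```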

OpenAI has not announced a general availability date or API access timeline. The company is working with Cerebras to expand datacenter capacity, but the wafer-scale manufacturing process limits how quickly supply can scale.

Developer Guide

Getting started with Codex-Spark requires a ChatGPT Pro subscription and one of three access methods. Here is how to set up each integration and how to choose the right model for each task type.

Getting Started

1. VS Code Extension

Install the OpenAI Codex extension from the VS Code marketplace. Sign in with your ChatGPT Pro account. Spark appears as the default model for inline completions and quick edits. The full Codex model handles multi-file tasks automatically.

2. Codex CLI

Install via npm install -g @openai/codex. Authenticate with your Pro account. The CLI automatically selects Spark for interactive mode and full Codex for autonomous tasks based on the command pattern.

3. Codex Web App

Access at codex.openai.com with your ChatGPT Pro account. Toggle between Spark and full Codex models in the model selector. The web app supports file uploads, project context, and collaborative editing sessions.

Choosing the Right Model

Task Type                         | Recommended Model | Why
----------------------------------|-------------------|------------------------------
Inline completions                | Codex-Spark       | Near-instant response needed
Bug fix in single file            | Codex-Spark       | Fast iteration on edits
Multi-file feature implementation | GPT-5.3-Codex     | Needs deeper reasoning
Large-scale refactor              | GPT-5.3-Codex     | Complex dependency analysis
Code review comments              | GPT-5.2-Codex     | Cost-effective for volume

The ideal setup for most teams is to configure Codex-Spark as the default for VS Code inline completions and quick-edit panels, with the full GPT-5.3-Codex model available on-demand for complex tasks. This mirrors how many teams already use different AI models for different purposes — fast models for interactive work, powerful models for autonomous operations.
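
One way to wire up that default-plus-escalation setup is sketched below; generateEdit is a hypothetical helper standing in for whatever client an integration actually uses.

```typescript
// Sketch of the "Spark by default, escalate on demand" pattern described
// above. generateEdit is a hypothetical helper standing in for whatever
// client an integration actually uses; the escalation logic is the point.

interface EditResult {
  diff: string;
  filesChanged: number;
}

declare function generateEdit(model: string, prompt: string): Promise<EditResult>;

async function editWithEscalation(prompt: string): Promise<EditResult> {
  // Fast path: let Spark attempt the edit first.
  const quick = await generateEdit("codex-spark", prompt);

  // If the edit spilled across multiple files, redo it with the full
  // model, which the article recommends for cross-file dependencies.
  if (quick.filesChanged > 1) {
    return generateEdit("gpt-5.3-codex", prompt);
  }
  return quick;
}
```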

Conclusion

GPT-5.3-Codex-Spark marks a meaningful shift in how AI coding tools are built and deployed. By achieving GPT-5.3-Codex-level accuracy at 15x the speed on Cerebras hardware, OpenAI has created the first AI coding model that genuinely operates at the speed of thought. The 1,000+ tokens per second throughput enables interaction patterns — inline completions, rapid iteration, conversational debugging — that were previously impossible with frontier-class models.

The Cerebras partnership adds strategic depth beyond raw performance. OpenAI's willingness to deploy production models on non-NVIDIA hardware signals a maturing inference infrastructure strategy that prioritizes workload-specific optimization over commodity GPU scaling. For developers and engineering teams, the practical takeaway is clear: real-time AI pair-programming is no longer a research concept — it is a production capability.

Ready to Accelerate Your Development?

Whether you're integrating AI coding tools, building custom development workflows, or modernizing your engineering stack, our team can help you leverage the latest AI models for measurable productivity gains.

