StepFun released Step 3.7 Flash, an open-weight agentic vision model built on a 198B-parameter sparse Mixture-of-Experts (MoE) architecture that activates only about 11B parameters per token. It pairs a 196B language backbone with a 1.8B vision encoder, ships under Apache 2.0, and is aimed squarely at high-frequency production agentic workflows rather than leaderboard supremacy.
The coverage that greeted the release led with a benchmark number — 56.3% on SWE-Bench Pro. That is a respectable result for an open-weight model, but it is not the story. The story is the arithmetic underneath it: a model that runs roughly 11B parameters of compute per token, prices output at $1.15 per million tokens, and (per StepFun's own scaffold) runs an agentic task for about $0.19 where a frontier model runs the same task for closer to $1.76. For teams running thousands of agent loops a month, that ratio is the headline.
This guide separates what each benchmark actually tells you, walks through the sparse-MoE math that makes the cost case real, and lays out where Step 3.7 Flash fits — and where it does not — in a multi-vendor production stack. Most performance figures here are vendor-stated at release; we label them as such and lean on the cleaner independent signals where they exist.
- 01A 198B sparse MoE that runs like a small model.Step 3.7 Flash holds 198B total parameters (196B language backbone plus a 1.8B vision encoder) but activates only ~11B per token via top-8-of-288 expert routing. That is roughly 5.6% of the weights doing compute on any given token.
- 02Lead with SWE-Bench Pro, not Verified.Vendor-stated 56.3% on SWE-Bench Pro put it 15th of 35 models and 2nd among non-Anthropic models at release. The higher Verified score (76.5%) sits on a benchmark frontier labs have flagged for contamination — Pro, with 1,865 multi-language tasks, is the cleaner signal.
- 03The cost case is the real product.At $0.20/M input and $1.15/M output, with an 80% cache-hit discount, Step 3.7 Flash is priced for volume. StepFun's internal 'Advisor Mode' comparison reports ~$0.19/task versus ~$1.76 for a larger model — a vendor-benchmarked figure, not an independent audit.
- 04First multimodal model in the Flash line.Step 3.5 Flash (February 2026) was text-only. Step 3.7 Flash adds a dedicated 1.8B Vision Transformer that injects image representations into the language context — parsing charts, PDFs, UI wireframes, and app GUIs without a separate vision API call.
- 05Open weights, locally deployable, Apache 2.0.Weights ship in BF16, FP8, NVFP4, and GGUF formats. The IQ4_XS GGUF (~105GB) fits a single 128GB unified-memory machine such as a Mac Studio Ultra or DGX Spark — a notable deployment fact for a model in this capability tier.
01 — What ShippedA multimodal Flash model, open under Apache 2.0.
Step 3.7 Flash listed on OpenRouter on May 28, 2026, with the StepFun blog and Hugging Face commit history pointing to May 29 — so we treat it as a late-May 2026 release. It is the third entry in StepFun's Step-3 family: Step-3 (July 2025), the text-only Step-3.5-Flash (February 2026), and now Step-3.7-Flash, the first multimodal model in the Flash line.
StepFun is one of China's "AI Six Tigers," founded in 2023 and led by CEO Jiang Daxin, previously a Global VP and Chief Scientist at Microsoft. The company reportedly closed a pre-IPO round of roughly $2.5B around May 2026 at an approximately $10B post-money valuation, with a Hong Kong listing in preparation — context that explains the cadence and the open-weight strategy.
196B MoE
A sparse Mixture-of-Experts transformer: 45 language layers, 42 of them routed MoE with 288 experts each and top-8 activation per token. Apache 2.0, open weights on Hugging Face.
1.8B ViT
A dedicated Vision Transformer injects image representations into the language context at inference — no separate vision-model call. Driven via an OpenAI-compatible messages API with image content parts.
Note the naming: the Hugging Face repository uses the hyphenated form Step-3.7-Flash, while StepFun's prose uses "Step 3.7 Flash." They refer to the same model. One thing this release is not: any kind of Step 4. StepFun has only announced training on a next-generation model — there is no shipped Step 4 to evaluate, and we treat any claim otherwise as unconfirmed.
02 — Sparse MoEWhy 11B active is not an 11B model.
The defining number for Step 3.7 Flash is the gap between its total and active parameter counts. The language backbone is 196B parameters; the full model, including the vision encoder, is 198B (about 201B counting projectors). Yet only ~11B parameters activate per token — roughly 5.6% of the weights.
The mechanism is fine-grained expert routing. Of 45 language layers, 42 are routed MoE layers, each holding 288 experts. For every token, a learned router selects the top 8 experts to run. That is 8 of 288 experts firing per layer — the rest of the weights sit idle for that token. Multiply across layers and the result is a model that stores the knowledge of a 198B network but pays the per-token compute bill of something far smaller.
Step 3.7 Flash · stored weights vs per-token active compute
Source: StepFun blog, Hugging Face model card, NVIDIA developer blogThis is the whole efficiency thesis in one sentence: you carry the memory cost of a large model but pay the throughput and per-token cost of a small one. It is why StepFun positions the model for "high-frequency production workloads" — agent loops that call a model hundreds or thousands of times per task, where per-token cost compounds fast. On independent measurement, Artificial Analysis recorded 415.9 tokens/second of output throughput (a top ranking among the models it evaluated) against a vendor-stated figure of up to 400 tokens/second.
03 — MultimodalA vision encoder that turns interfaces into actions.
The vision capability is the genuine new thing here. A dedicated 1.8B Vision Transformer encodes images and injects their representations into the language backbone's context at inference time, so there is no separate vision-model API call to orchestrate. In practice the model parses dense visual interfaces — UI wireframes, application GUIs, data charts, receipts, document tables — and maps them to structured code or data through templates such as chart-to-data, receipt-to-table, and GUI-action workflows.
StepFun also reports an emergent behavior worth noting carefully: in testing, the model combined visual and non-visual tools without being explicitly trained to. It independently called Python utilities — crop, zoom, bounding-box — alongside text tools inside multi-step visual-search workflows. StepFun describes this as an "emergent ability." It is a vendor observation rather than an independently reproduced benchmark, so we read it as promising signal rather than a guaranteed property.
SimpleVQA (Search)
A vendor-stated top ranking on visual question answering with search. Pairs the vision encoder with tool use to resolve questions that need both image understanding and external lookup.
HR-Bench 4K
Vendor-stated result on high-resolution image perception — the kind of fine-grained reading a model needs to extract numbers from a chart or fields from a scanned document.
Android Daily
Vendor-stated score on an Android GUI-action benchmark. Reflects the model's intended use as a vision-driven agent that reads a screen and decides the next interaction.
04 — BenchmarksWhich number you should actually believe.
Step 3.7 Flash arrived with a wall of benchmark scores. For a production decision, the question is not "how high?" but "how trustworthy?" The table below is our own scorecard: for each agentic benchmark relevant to this model, what it measures, how exposed it is to contamination or vendor control, and the reported score. Treat it as a guide to weighting, not a ranking.
| Benchmark | What it measures | Trust risk | Score |
|---|---|---|---|
| SWE-Bench Pro | 1,865 tasks, 41 repos, multi-language (Python/Go/TS/JS), human-augmented specs | Low — contamination-mitigated, independent leaderboard | 56.3% · 15th of 35 |
| SWE-Bench Verified | 500 Python-only tasks | Elevated — frontier labs flag training-data contamination | 76.5% (vendor) |
| Terminal-Bench 2.1 | 89 sandboxed terminal tasks, pass@1, time-limited | Vendor-stated — not yet seen on the public leaderboard | 59.5% (vendor) |
| ClawEval-1.1 | Tool-use / agentic capability eval | Vendor-stated, corroborated by a third-party review | 67.1% (vendor) |
| Toolathlon | Multi-tool orchestration tasks | Vendor-stated | 49.5% (vendor) |
Lead with SWE-Bench Pro. At a vendor-stated 56.3% it placed 15th on a 35-model leaderboard and 2nd among non-Anthropic models at release. Pro spans 1,865 tasks across 41 repositories in four languages, averaging 107 lines and four files changed per task, with a three-stage human-augmentation process and contamination mitigation. That breadth is exactly why it is harder to game than Verified.
Do not lead with SWE-Bench Verified. The vendor-stated 76.5% looks more impressive, but Verified is 500 Python-only tasks, and frontier labs — OpenAI among them — have flagged training-data contamination there to the point of discontinuing the score. A higher number on a more contaminated benchmark is the wrong signal to optimize for. The Terminal-Bench 2.1 figure of 59.5% is likewise vendor-stated and had not appeared on the public leaderboard at the time of writing; treat it as promising rather than confirmed.
05 — The Cost CaseThe math that makes this a production decision.
Here is the part most coverage buried. StepFun published an "Advisor Mode" comparison in which Step 3.7 Flash runs an agentic task for about $0.19 against roughly $1.76 for a larger frontier model in the same configuration — a stated 9.3× cost reduction at, per StepFun, comparable coding capability. Advisor Mode uses Step 3.7 Flash as the primary executor for iterative tool calls, escalating to a bigger model only at hard planning inflection points or after repeated failures.
Read those two numbers with care: both come from StepFun's own scaffold, against a model StepFun chose, with no independent cost audit. The directional claim — that a sparse-MoE executor priced at $0.20/M input and $1.15/M output is dramatically cheaper to loop than a frontier model — is well supported by the pricing alone. The exact 9.3× multiple is not something we would underwrite without running the comparison on your own tasks.
| Advisor Mode (vendor-stated) | Step 3.7 Flash | Larger frontier model |
|---|---|---|
| Cost per agentic task | ~$0.19 | ~$1.76 |
| Input $/M (cache miss) | $0.20 | Higher |
| Input $/M (cache hit) | $0.04 | — |
| Output $/M tokens | $1.15 | Higher |
| Role in the loop | Primary executor | Escalation only |
The practical exercise is a breakeven calculation on your own volume. A team running 1,000 agentic tasks a month is talking about a small absolute difference; a team running 100,000 is choosing between a four-figure and a five-figure monthly bill for the same work. That is where an open-weight, cheap-to-loop executor stops being a curiosity and becomes a line-item decision. If you want that modeled against your actual workloads, our AI and digital transformation engagements start with exactly this kind of cost-and-capability eval. For the wider pricing picture, our Q2 2026 LLM API pricing index shows where Step 3.7 Flash's $0.20/M input sits in the current landscape.
"This design achieves a coding level comparable to Claude Opus 4.6 at 97% for an average cost of only $0.19 per task versus approximately $1.76 for the larger model."— Communeify technical analysis (citing StepFun's internal benchmarks)
06 — Reasoning EffortThree effort levels, set per call.
Step 3.7 Flash exposes three selectable reasoning levels — low, medium, and high — controlled by the output_config.effort parameter on the API. That lets a caller trade inference speed and cost against analytical depth on a per-request basis, which matters a great deal for a model whose entire pitch is high-frequency, cost-sensitive agent loops.
Low effort
Simple, latency-sensitive tasks where speed and cost dominate. The right default for high-volume routine steps inside an agent loop.
Medium effort
General reasoning default. Balances depth and cost for typical tool-use and code tasks — the level most production loops will spend the bulk of their calls on.
High effort
Complex math, planning, and code analysis. Step 3.7 Flash generates substantially more reasoning tokens at this level — verbose chain-of-thought is the cost of the extra depth.
A useful signal on consistency: StepFun reports that Step 3.7 Flash narrowed its performance variance across different agentic scaffolds from a wide 43–73% spread on the prior 3.5 Flash to a tighter 64.5–71.5%. For production teams, predictable behavior across harnesses often matters more than a peak score on one — a model that performs evenly whether it runs under Claude Code, Cline, Roo Code, or Kilo Code is easier to operate. An independent DGX Spark review reported a 100% tool-call success rate across its multi-step test scenarios, which is consistent with that framing, though it is a single hands-on result rather than a broad benchmark.
07 — Run It TodayCloud, open weights, or on your desk.
Three deployment paths are available, and the Apache 2.0 license means the on-prem route carries no usage restrictions. Pick the surface that matches the workload.
platform.stepfun.aihuggingface.co/stepfun-ai| Surface | Exposure | Best for |
|---|---|---|
platform.stepfun.ai | Hosted API · OpenAI-compatible | Production integration with image content parts and the effort parameter. Cache-hit pricing rewards repeated context. Also live on OpenRouter and NVIDIA NIM. |
huggingface.co/stepfun-ai | Open weights · Apache 2.0 | On-prem and fine-tuning. BF16, FP8, NVFP4, and GGUF formats. Runs under vLLM, SGLang, Transformers v5+, llama.cpp, and TensorRT-LLM; NVIDIA offers day-0 NeMo fine-tuning. |
| Local GGUF (single box) | IQ4_XS quant · ~105GB | Fits 128GB of unified memory — a Mac Studio Ultra or DGX Spark. Full BF16 is ~394GB and needs multi-GPU. Local quantized throughput is far lower than cloud (a review saw ~27 tok/s). |
One caveat on throughput: the headline 400–415 tokens/second figures are cloud inference on NVIDIA hardware. Local quantized deployment is a different regime — an independent DGX Spark review measured roughly 27 tokens/second on quantized weights. The local story is about sovereignty, control, and zero per-token cost, not raw speed. The model is also compatible with the common agentic coding harnesses (Claude Code, Cline, Roo Code, Kilo Code), so slotting it in as a cheaper executor rarely means rebuilding your tooling. If you are mapping the broader harness landscape, see our Q2 2026 agentic coding platform matrix.
08 — Where It FitsThe decision tree for production teams.
Step 3.7 Flash does not change the frontier picture. It changes the economics of a few specific workload classes — and for those, it is worth a serious evaluation against your current stack.
Cost-sensitive tool-use at scale
Thousands of agentic tasks a month where per-token cost compounds. The sparse-MoE economics and cache-hit pricing are the whole point. Benchmark Advisor-Mode-style executor patterns on your own tasks first.
GUI, chart, and document parsing
Workflows that read screens, charts, receipts, or wireframes and act on them. The integrated 1.8B vision encoder removes a separate vision call. Validate the emergent tool-use claims on your data.
Apache 2.0, locally deployable
Compliance-bound or air-gapped deployments. The IQ4_XS GGUF fits a single 128GB box, and the permissive license carries no usage limits — but plan for local throughput far below cloud rates.
Frontier-tier capability
The most demanding code generation and reasoning still belongs to the frontier tier (Claude Opus 4.8 led the SWE-Bench Pro board). Keep a frontier model for the hard cases and escalate to it.
The cleanest production pattern is the one StepFun's own Advisor Mode gestures at: route the high-frequency, lower-stakes executor calls to Step 3.7 Flash and reserve a frontier model for planning inflection points and failures. That is multi-vendor routing by task class, and it is where an open-weight, cheap-to-loop model earns its place — not by beating the frontier, but by handling the 80% of calls that never needed the frontier in the first place. It sits naturally alongside the wider open-weight agentic wave that includes models like Kimi K2.5 and the open-source agent-swarm movement, and its 256K window is worth reading against the broader 2026 long-context comparison.
09 — ConclusionA cost-efficiency play, honestly framed.
The benchmark is the headline everyone reads; the cost is the one that changes decisions.
Step 3.7 Flash is a well-engineered sparse MoE that does one thing unusually well: it runs agentic loops cheaply. A 198B model that activates ~11B per token, prices output at $1.15/M, ships under Apache 2.0, and adds a native vision encoder is a genuinely useful tool for high-frequency production work — provided you read its numbers correctly.
And reading them correctly means leading with SWE-Bench Pro (56.3%, the cleaner signal) over the contamination-flagged Verified score, treating the Terminal-Bench 2.1 and Advisor-Mode cost figures as vendor-stated until an independent harness confirms them, and never mistaking ~11B active compute for an 11B memory footprint. The directional case — frontier-adjacent coding capability at a small fraction of the cost — is strong. The precise multiples are not yet independently audited.
The broader signal is the one that matters most. The open-weight field is no longer trying only to match the frontier on raw capability; it is competing on cost-per-task for the enormous volume of agentic work that never needed a frontier model. Step 3.7 Flash is a clear, sober entry on that side of the line — and the right response is not to switch on a headline, but to run the breakeven and the eval on the workloads you actually care about.