StepFun released Step 3.7 Flash, an open-weight agentic vision model built on a 198B-parameter sparse Mixture-of-Experts (MoE) architecture that activates only about 11B parameters per token. It pairs a 196B language backbone with a 1.8B vision encoder, ships under Apache 2.0, and is aimed squarely at high-frequency production agentic workflows rather than leaderboard supremacy.

The coverage that greeted the release led with a benchmark number — 56.3% on SWE-Bench Pro. That is a respectable result for an open-weight model, but it is not the story. The story is the arithmetic underneath it: a model that runs roughly 11B parameters of compute per token, prices output at $1.15 per million tokens, and (per StepFun's own scaffold) runs an agentic task for about $0.19 where a frontier model runs the same task for closer to $1.76. For teams running thousands of agent loops a month, that ratio is the headline.

This guide separates what each benchmark actually tells you, walks through the sparse-MoE math that makes the cost case real, and lays out where Step 3.7 Flash fits — and where it does not — in a multi-vendor production stack. Most performance figures here are vendor-stated at release; we label them as such and lean on the cleaner independent signals where they exist.

Key takeaways

01
A 198B sparse MoE that runs like a small model.Step 3.7 Flash holds 198B total parameters (196B language backbone plus a 1.8B vision encoder) but activates only ~11B per token via top-8-of-288 expert routing. That is roughly 5.6% of the weights doing compute on any given token.
02
Lead with SWE-Bench Pro, not Verified.Vendor-stated 56.3% on SWE-Bench Pro put it 15th of 35 models and 2nd among non-Anthropic models at release. The higher Verified score (76.5%) sits on a benchmark frontier labs have flagged for contamination — Pro, with 1,865 multi-language tasks, is the cleaner signal.
03
The cost case is the real product.At $0.20/M input and $1.15/M output, with an 80% cache-hit discount, Step 3.7 Flash is priced for volume. StepFun's internal 'Advisor Mode' comparison reports ~$0.19/task versus ~$1.76 for a larger model — a vendor-benchmarked figure, not an independent audit.
04
First multimodal model in the Flash line.Step 3.5 Flash (February 2026) was text-only. Step 3.7 Flash adds a dedicated 1.8B Vision Transformer that injects image representations into the language context — parsing charts, PDFs, UI wireframes, and app GUIs without a separate vision API call.
05
Open weights, locally deployable, Apache 2.0.Weights ship in BF16, FP8, NVFP4, and GGUF formats. The IQ4_XS GGUF (~105GB) fits a single 128GB unified-memory machine such as a Mac Studio Ultra or DGX Spark — a notable deployment fact for a model in this capability tier.

01 — What ShippedA multimodal Flash model, open under Apache 2.0.

Step 3.7 Flash listed on OpenRouter on May 28, 2026, with the StepFun blog and Hugging Face commit history pointing to May 29 — so we treat it as a late-May 2026 release. It is the third entry in StepFun's Step-3 family: Step-3 (July 2025), the text-only Step-3.5-Flash (February 2026), and now Step-3.7-Flash, the first multimodal model in the Flash line.

StepFun is one of China's "AI Six Tigers," founded in 2023 and led by CEO Jiang Daxin, previously a Global VP and Chief Scientist at Microsoft. The company reportedly closed a pre-IPO round of roughly $2.5B around May 2026 at an approximately $10B post-money valuation, with a Hong Kong listing in preparation — context that explains the cadence and the open-weight strategy.

Language backbone

196B MoE

198B total · ~11B active · 256K context

A sparse Mixture-of-Experts transformer: 45 language layers, 42 of them routed MoE with 288 experts each and top-8 activation per token. Apache 2.0, open weights on Hugging Face.

huggingface.co/stepfun-ai/Step-3.7-Flash

Vision encoder

1.8B ViT

native image, chart, PDF, GUI input

A dedicated Vision Transformer injects image representations into the language context at inference — no separate vision-model call. Driven via an OpenAI-compatible messages API with image content parts.

platform.stepfun.ai

Release snapshot

Step 3.7 Flash shipped in late May 2026 under an Apache 2.0 license — fully open for commercial use. Weights are published on Hugging Face in BF16, FP8, NVFP4, and GGUF formats; the model is live on the StepFun platform, OpenRouter, and NVIDIA NIM, with DeepInfra, Fireworks AI, and Modal listed as coming soon. Pricing on launch surfaces: $0.20/M input (cache miss), $0.04/M cache hit, and $1.15/M output tokens.

Note the naming: the Hugging Face repository uses the hyphenated form Step-3.7-Flash, while StepFun's prose uses "Step 3.7 Flash." They refer to the same model. One thing this release is not: any kind of Step 4. StepFun has only announced training on a next-generation model — there is no shipped Step 4 to evaluate, and we treat any claim otherwise as unconfirmed.

02 — Sparse MoEWhy 11B active is not an 11B model.

The defining number for Step 3.7 Flash is the gap between its total and active parameter counts. The language backbone is 196B parameters; the full model, including the vision encoder, is 198B (about 201B counting projectors). Yet only ~11B parameters activate per token — roughly 5.6% of the weights.

The mechanism is fine-grained expert routing. Of 45 language layers, 42 are routed MoE layers, each holding 288 experts. For every token, a learned router selects the top 8 experts to run. That is 8 of 288 experts firing per layer — the rest of the weights sit idle for that token. Multiply across layers and the result is a model that stores the knowledge of a 198B network but pays the per-token compute bill of something far smaller.

Step 3.7 Flash · stored weights vs per-token active compute

Source: StepFun blog, Hugging Face model card, NVIDIA developer blog

Total parameters stored196B language backbone + 1.8B vision encoder

198B

Active parameters per tokentop-8 of 288 experts × 42 routed layers

~11B

Active share of totalthe per-token compute footprint

~5.6%

The footgun to avoid

A model that activates ~11B parameters per token is not "like running an 11B model." The 11B figure is inference compute, not memory: all 198B parameters still have to live on the hardware. Plan memory for 198B (or its quantized equivalent) and latency for ~11B. Conflating the two is the most common mistake teams make when sizing a sparse MoE deployment.

This is the whole efficiency thesis in one sentence: you carry the memory cost of a large model but pay the throughput and per-token cost of a small one. It is why StepFun positions the model for "high-frequency production workloads" — agent loops that call a model hundreds or thousands of times per task, where per-token cost compounds fast. On independent measurement, Artificial Analysis recorded 415.9 tokens/second of output throughput (a top ranking among the models it evaluated) against a vendor-stated figure of up to 400 tokens/second.

03 — MultimodalA vision encoder that turns interfaces into actions.

The vision capability is the genuine new thing here. A dedicated 1.8B Vision Transformer encodes images and injects their representations into the language backbone's context at inference time, so there is no separate vision-model API call to orchestrate. In practice the model parses dense visual interfaces — UI wireframes, application GUIs, data charts, receipts, document tables — and maps them to structured code or data through templates such as chart-to-data, receipt-to-table, and GUI-action workflows.

StepFun also reports an emergent behavior worth noting carefully: in testing, the model combined visual and non-visual tools without being explicitly trained to. It independently called Python utilities — crop, zoom, bounding-box — alongside text tools inside multi-step visual-search workflows. StepFun describes this as an "emergent ability." It is a vendor observation rather than an independently reproduced benchmark, so we read it as promising signal rather than a guaranteed property.

Visual search

SimpleVQA (Search)

79.2%

A vendor-stated top ranking on visual question answering with search. Pairs the vision encoder with tool use to resolve questions that need both image understanding and external lookup.

vendor-stated

High-res perception

HR-Bench 4K

89.1%

Vendor-stated result on high-resolution image perception — the kind of fine-grained reading a model needs to extract numbers from a chart or fields from a scanned document.

vendor-stated

GUI agents

Android Daily

61.9%

Vendor-stated score on an Android GUI-action benchmark. Reflects the model's intended use as a vision-driven agent that reads a screen and decides the next interaction.

vendor-stated

04 — BenchmarksWhich number you should actually believe.

Step 3.7 Flash arrived with a wall of benchmark scores. For a production decision, the question is not "how high?" but "how trustworthy?" The table below is our own scorecard: for each agentic benchmark relevant to this model, what it measures, how exposed it is to contamination or vendor control, and the reported score. Treat it as a guide to weighting, not a ranking.

Benchmark	What it measures	Trust risk	Score
SWE-Bench Pro	1,865 tasks, 41 repos, multi-language (Python/Go/TS/JS), human-augmented specs	Low — contamination-mitigated, independent leaderboard	56.3% · 15th of 35
SWE-Bench Verified	500 Python-only tasks	Elevated — frontier labs flag training-data contamination	76.5% (vendor)
Terminal-Bench 2.1	89 sandboxed terminal tasks, pass@1, time-limited	Vendor-stated — not yet seen on the public leaderboard	59.5% (vendor)
ClawEval-1.1	Tool-use / agentic capability eval	Vendor-stated, corroborated by a third-party review	67.1% (vendor)
Toolathlon	Multi-tool orchestration tasks	Vendor-stated	49.5% (vendor)

Lead with SWE-Bench Pro. At a vendor-stated 56.3% it placed 15th on a 35-model leaderboard and 2nd among non-Anthropic models at release. Pro spans 1,865 tasks across 41 repositories in four languages, averaging 107 lines and four files changed per task, with a three-stage human-augmentation process and contamination mitigation. That breadth is exactly why it is harder to game than Verified.

Do not lead with SWE-Bench Verified. The vendor-stated 76.5% looks more impressive, but Verified is 500 Python-only tasks, and frontier labs — OpenAI among them — have flagged training-data contamination there to the point of discontinuing the score. A higher number on a more contaminated benchmark is the wrong signal to optimize for. The Terminal-Bench 2.1 figure of 59.5% is likewise vendor-stated and had not appeared on the public leaderboard at the time of writing; treat it as promising rather than confirmed.

Where it sits versus frontier

Step 3.7 Flash is not competing with the frontier coding tier. On the same SWE-Bench Pro leaderboard, Anthropic models led the field — Claude Opus 4.8 around 69%, with a preview model higher still. Step 3.7 Flash is filling a different niche: a large share of frontier coding capability at a small fraction of the cost, open-weight and locally deployable. Read it as a cost-efficiency play, not a frontier challenger.

05 — The Cost CaseThe math that makes this a production decision.

Here is the part most coverage buried. StepFun published an "Advisor Mode" comparison in which Step 3.7 Flash runs an agentic task for about $0.19 against roughly $1.76 for a larger frontier model in the same configuration — a stated 9.3× cost reduction at, per StepFun, comparable coding capability. Advisor Mode uses Step 3.7 Flash as the primary executor for iterative tool calls, escalating to a bigger model only at hard planning inflection points or after repeated failures.

Read those two numbers with care: both come from StepFun's own scaffold, against a model StepFun chose, with no independent cost audit. The directional claim — that a sparse-MoE executor priced at $0.20/M input and $1.15/M output is dramatically cheaper to loop than a frontier model — is well supported by the pricing alone. The exact 9.3× multiple is not something we would underwrite without running the comparison on your own tasks.

Advisor Mode (vendor-stated)	Step 3.7 Flash	Larger frontier model
Cost per agentic task	~$0.19	~$1.76
Input $/M (cache miss)	$0.20	Higher
Input $/M (cache hit)	$0.04	—
Output $/M tokens	$1.15	Higher
Role in the loop	Primary executor	Escalation only

The practical exercise is a breakeven calculation on your own volume. A team running 1,000 agentic tasks a month is talking about a small absolute difference; a team running 100,000 is choosing between a four-figure and a five-figure monthly bill for the same work. That is where an open-weight, cheap-to-loop executor stops being a curiosity and becomes a line-item decision. If you want that modeled against your actual workloads, our AI and digital transformation engagements start with exactly this kind of cost-and-capability eval. For the wider pricing picture, our Q2 2026 LLM API pricing index shows where Step 3.7 Flash's $0.20/M input sits in the current landscape.

"This design achieves a coding level comparable to Claude Opus 4.6 at 97% for an average cost of only $0.19 per task versus approximately $1.76 for the larger model."— Communeify technical analysis (citing StepFun's internal benchmarks)

06 — Reasoning EffortThree effort levels, set per call.

Step 3.7 Flash exposes three selectable reasoning levels — low, medium, and high — controlled by the output_config.effort parameter on the API. That lets a caller trade inference speed and cost against analytical depth on a per-request basis, which matters a great deal for a model whose entire pitch is high-frequency, cost-sensitive agent loops.

Fast

Low effort

effort: low

Simple, latency-sensitive tasks where speed and cost dominate. The right default for high-volume routine steps inside an agent loop.

Lowest cost · lowest latency

Default

Medium effort

effort: medium

General reasoning default. Balances depth and cost for typical tool-use and code tasks — the level most production loops will spend the bulk of their calls on.

Balanced default

Deep

High effort

effort: high

Complex math, planning, and code analysis. Step 3.7 Flash generates substantially more reasoning tokens at this level — verbose chain-of-thought is the cost of the extra depth.

Max depth · more tokens

A useful signal on consistency: StepFun reports that Step 3.7 Flash narrowed its performance variance across different agentic scaffolds from a wide 43–73% spread on the prior 3.5 Flash to a tighter 64.5–71.5%. For production teams, predictable behavior across harnesses often matters more than a peak score on one — a model that performs evenly whether it runs under Claude Code, Cline, Roo Code, or Kilo Code is easier to operate. An independent DGX Spark review reported a 100% tool-call success rate across its multi-step test scenarios, which is consistent with that framing, though it is a single hands-on result rather than a broad benchmark.

07 — Run It TodayCloud, open weights, or on your desk.

Three deployment paths are available, and the Apache 2.0 license means the on-prem route carries no usage restrictions. Pick the surface that matches the workload.

Surface

platform.stepfun.ai

Exposure

Hosted API · OpenAI-compatible

Best for

Production integration with image content parts and the effort parameter. Cache-hit pricing rewards repeated context. Also live on OpenRouter and NVIDIA NIM.

Surface

huggingface.co/stepfun-ai

Exposure

Open weights · Apache 2.0

Best for

On-prem and fine-tuning. BF16, FP8, NVFP4, and GGUF formats. Runs under vLLM, SGLang, Transformers v5+, llama.cpp, and TensorRT-LLM; NVIDIA offers day-0 NeMo fine-tuning.

Surface

Local GGUF (single box)

Exposure

IQ4_XS quant · ~105GB

Best for

Fits 128GB of unified memory — a Mac Studio Ultra or DGX Spark. Full BF16 is ~394GB and needs multi-GPU. Local quantized throughput is far lower than cloud (a review saw ~27 tok/s).

Surface	Exposure	Best for
`platform.stepfun.ai`	Hosted API · OpenAI-compatible	Production integration with image content parts and the effort parameter. Cache-hit pricing rewards repeated context. Also live on OpenRouter and NVIDIA NIM.
`huggingface.co/stepfun-ai`	Open weights · Apache 2.0	On-prem and fine-tuning. BF16, FP8, NVFP4, and GGUF formats. Runs under vLLM, SGLang, Transformers v5+, llama.cpp, and TensorRT-LLM; NVIDIA offers day-0 NeMo fine-tuning.
Local GGUF (single box)	IQ4_XS quant · ~105GB	Fits 128GB of unified memory — a Mac Studio Ultra or DGX Spark. Full BF16 is ~394GB and needs multi-GPU. Local quantized throughput is far lower than cloud (a review saw ~27 tok/s).

One caveat on throughput: the headline 400–415 tokens/second figures are cloud inference on NVIDIA hardware. Local quantized deployment is a different regime — an independent DGX Spark review measured roughly 27 tokens/second on quantized weights. The local story is about sovereignty, control, and zero per-token cost, not raw speed. The model is also compatible with the common agentic coding harnesses (Claude Code, Cline, Roo Code, Kilo Code), so slotting it in as a cheaper executor rarely means rebuilding your tooling. If you are mapping the broader harness landscape, see our Q2 2026 agentic coding platform matrix.

08 — Where It FitsThe decision tree for production teams.

Step 3.7 Flash does not change the frontier picture. It changes the economics of a few specific workload classes — and for those, it is worth a serious evaluation against your current stack.

High-volume agent loops

Cost-sensitive tool-use at scale

Thousands of agentic tasks a month where per-token cost compounds. The sparse-MoE economics and cache-hit pricing are the whole point. Benchmark Advisor-Mode-style executor patterns on your own tasks first.

Pick Step 3.7 Flash

Vision-driven agents

GUI, chart, and document parsing

Workflows that read screens, charts, receipts, or wireframes and act on them. The integrated 1.8B vision encoder removes a separate vision call. Validate the emergent tool-use claims on your data.

Strong candidate

Sovereignty / on-prem

Apache 2.0, locally deployable

Compliance-bound or air-gapped deployments. The IQ4_XS GGUF fits a single 128GB box, and the permissive license carries no usage limits — but plan for local throughput far below cloud rates.

Pick open weights

Hardest coding & reasoning

Frontier-tier capability

The most demanding code generation and reasoning still belongs to the frontier tier (Claude Opus 4.8 led the SWE-Bench Pro board). Keep a frontier model for the hard cases and escalate to it.

Stay with frontier

The cleanest production pattern is the one StepFun's own Advisor Mode gestures at: route the high-frequency, lower-stakes executor calls to Step 3.7 Flash and reserve a frontier model for planning inflection points and failures. That is multi-vendor routing by task class, and it is where an open-weight, cheap-to-loop model earns its place — not by beating the frontier, but by handling the 80% of calls that never needed the frontier in the first place. It sits naturally alongside the wider open-weight agentic wave that includes models like Kimi K2.5 and the open-source agent-swarm movement, and its 256K window is worth reading against the broader 2026 long-context comparison.

09 — ConclusionA cost-efficiency play, honestly framed.

The shape of open agentic models, May 2026

The benchmark is the headline everyone reads; the cost is the one that changes decisions.

Step 3.7 Flash is a well-engineered sparse MoE that does one thing unusually well: it runs agentic loops cheaply. A 198B model that activates ~11B per token, prices output at $1.15/M, ships under Apache 2.0, and adds a native vision encoder is a genuinely useful tool for high-frequency production work — provided you read its numbers correctly.

And reading them correctly means leading with SWE-Bench Pro (56.3%, the cleaner signal) over the contamination-flagged Verified score, treating the Terminal-Bench 2.1 and Advisor-Mode cost figures as vendor-stated until an independent harness confirms them, and never mistaking ~11B active compute for an 11B memory footprint. The directional case — frontier-adjacent coding capability at a small fraction of the cost — is strong. The precise multiples are not yet independently audited.

The broader signal is the one that matters most. The open-weight field is no longer trying only to match the frontier on raw capability; it is competing on cost-per-task for the enormous volume of agentic work that never needed a frontier model. Step 3.7 Flash is a clear, sober entry on that side of the line — and the right response is not to switch on a headline, but to run the breakeven and the eval on the workloads you actually care about. That same Apache 2.0 wave now reaches well beyond code and vision, down to releases like Mistral's open-weight text-to-speech model.

StepFun Step 3.7 Flash: 196B MoE, Agentic Vision

01 — What ShippedA multimodal Flash model, open under Apache 2.0.

196B MoE

1.8B ViT

02 — Sparse MoEWhy 11B active is not an 11B model.

Step 3.7 Flash · stored weights vs per-token active compute

03 — MultimodalA vision encoder that turns interfaces into actions.

SimpleVQA (Search)

HR-Bench 4K

Android Daily

04 — BenchmarksWhich number you should actually believe.

05 — The Cost CaseThe math that makes this a production decision.

06 — Reasoning EffortThree effort levels, set per call.

Low effort

Medium effort

High effort

07 — Run It TodayCloud, open weights, or on your desk.

08 — Where It FitsThe decision tree for production teams.

Cost-sensitive tool-use at scale

GUI, chart, and document parsing

Apache 2.0, locally deployable

Frontier-tier capability

09 — ConclusionA cost-efficiency play, honestly framed.

The benchmark is the headline everyone reads; the cost is the one that changes decisions.

Cheap executors plus frontier escalation make agentic scale genuinely affordable.

Open-weight model engagements

The questions we get every week.

Continue exploring frontier releases.

Inkling: Murati’s Open-Weight Bet Lands on Hugging Face

MiniMax M3 vs Opus 4.8 vs GPT-5.5: Coding Showdown

MiniMax M3 Release: 1M-Context Agentic Frontier Model

Tencent's Hunyuan Hy3: Open-Weight Reasoning Arrives