AI DevelopmentNew Release10 min readPublished May 30, 2026

Open-weight agentic vision MoE · 256K context · ~11B active of 198B

StepFun Step 3.7 Flash: 196B MoE, Agentic Vision

StepFun shipped Step 3.7 Flash in late May 2026 — a 198B-parameter sparse Mixture-of-Experts model that activates roughly 11B parameters per token, pairs a 1.8B vision encoder with a 196B language backbone, and ships under Apache 2.0. The headline is not the benchmark; it's the cost of running agentic loops.

DA
Digital Applied Team
Senior strategists · Published May 30, 2026
PublishedMay 30, 2026
Read time10 min
Sources10 primary + secondary
SWE-Bench Pro
56.3%
vendor-stated · 15th of 35
2nd non-Anthropic
Active params / token
~11B
of 198B total
5.6% active
Context window
256K
input and output
License
Apache 2.0
open commercial use

StepFun released Step 3.7 Flash, an open-weight agentic vision model built on a 198B-parameter sparse Mixture-of-Experts (MoE) architecture that activates only about 11B parameters per token. It pairs a 196B language backbone with a 1.8B vision encoder, ships under Apache 2.0, and is aimed squarely at high-frequency production agentic workflows rather than leaderboard supremacy.

The coverage that greeted the release led with a benchmark number — 56.3% on SWE-Bench Pro. That is a respectable result for an open-weight model, but it is not the story. The story is the arithmetic underneath it: a model that runs roughly 11B parameters of compute per token, prices output at $1.15 per million tokens, and (per StepFun's own scaffold) runs an agentic task for about $0.19 where a frontier model runs the same task for closer to $1.76. For teams running thousands of agent loops a month, that ratio is the headline.

This guide separates what each benchmark actually tells you, walks through the sparse-MoE math that makes the cost case real, and lays out where Step 3.7 Flash fits — and where it does not — in a multi-vendor production stack. Most performance figures here are vendor-stated at release; we label them as such and lean on the cleaner independent signals where they exist.

Key takeaways
  1. 01
    A 198B sparse MoE that runs like a small model.Step 3.7 Flash holds 198B total parameters (196B language backbone plus a 1.8B vision encoder) but activates only ~11B per token via top-8-of-288 expert routing. That is roughly 5.6% of the weights doing compute on any given token.
  2. 02
    Lead with SWE-Bench Pro, not Verified.Vendor-stated 56.3% on SWE-Bench Pro put it 15th of 35 models and 2nd among non-Anthropic models at release. The higher Verified score (76.5%) sits on a benchmark frontier labs have flagged for contamination — Pro, with 1,865 multi-language tasks, is the cleaner signal.
  3. 03
    The cost case is the real product.At $0.20/M input and $1.15/M output, with an 80% cache-hit discount, Step 3.7 Flash is priced for volume. StepFun's internal 'Advisor Mode' comparison reports ~$0.19/task versus ~$1.76 for a larger model — a vendor-benchmarked figure, not an independent audit.
  4. 04
    First multimodal model in the Flash line.Step 3.5 Flash (February 2026) was text-only. Step 3.7 Flash adds a dedicated 1.8B Vision Transformer that injects image representations into the language context — parsing charts, PDFs, UI wireframes, and app GUIs without a separate vision API call.
  5. 05
    Open weights, locally deployable, Apache 2.0.Weights ship in BF16, FP8, NVFP4, and GGUF formats. The IQ4_XS GGUF (~105GB) fits a single 128GB unified-memory machine such as a Mac Studio Ultra or DGX Spark — a notable deployment fact for a model in this capability tier.

01What ShippedA multimodal Flash model, open under Apache 2.0.

Step 3.7 Flash listed on OpenRouter on May 28, 2026, with the StepFun blog and Hugging Face commit history pointing to May 29 — so we treat it as a late-May 2026 release. It is the third entry in StepFun's Step-3 family: Step-3 (July 2025), the text-only Step-3.5-Flash (February 2026), and now Step-3.7-Flash, the first multimodal model in the Flash line.

StepFun is one of China's "AI Six Tigers," founded in 2023 and led by CEO Jiang Daxin, previously a Global VP and Chief Scientist at Microsoft. The company reportedly closed a pre-IPO round of roughly $2.5B around May 2026 at an approximately $10B post-money valuation, with a Hong Kong listing in preparation — context that explains the cadence and the open-weight strategy.

Language backbone
196B MoE
198B total · ~11B active · 256K context

A sparse Mixture-of-Experts transformer: 45 language layers, 42 of them routed MoE with 288 experts each and top-8 activation per token. Apache 2.0, open weights on Hugging Face.

huggingface.co/stepfun-ai/Step-3.7-Flash
Vision encoder
1.8B ViT
native image, chart, PDF, GUI input

A dedicated Vision Transformer injects image representations into the language context at inference — no separate vision-model call. Driven via an OpenAI-compatible messages API with image content parts.

platform.stepfun.ai
Release snapshot
Step 3.7 Flash shipped in late May 2026 under an Apache 2.0 license — fully open for commercial use. Weights are published on Hugging Face in BF16, FP8, NVFP4, and GGUF formats; the model is live on the StepFun platform, OpenRouter, and NVIDIA NIM, with DeepInfra, Fireworks AI, and Modal listed as coming soon. Pricing on launch surfaces: $0.20/M input (cache miss), $0.04/M cache hit, and $1.15/M output tokens.

Note the naming: the Hugging Face repository uses the hyphenated form Step-3.7-Flash, while StepFun's prose uses "Step 3.7 Flash." They refer to the same model. One thing this release is not: any kind of Step 4. StepFun has only announced training on a next-generation model — there is no shipped Step 4 to evaluate, and we treat any claim otherwise as unconfirmed.

02Sparse MoEWhy 11B active is not an 11B model.

The defining number for Step 3.7 Flash is the gap between its total and active parameter counts. The language backbone is 196B parameters; the full model, including the vision encoder, is 198B (about 201B counting projectors). Yet only ~11B parameters activate per token — roughly 5.6% of the weights.

The mechanism is fine-grained expert routing. Of 45 language layers, 42 are routed MoE layers, each holding 288 experts. For every token, a learned router selects the top 8 experts to run. That is 8 of 288 experts firing per layer — the rest of the weights sit idle for that token. Multiply across layers and the result is a model that stores the knowledge of a 198B network but pays the per-token compute bill of something far smaller.

Step 3.7 Flash · stored weights vs per-token active compute

Source: StepFun blog, Hugging Face model card, NVIDIA developer blog
Total parameters stored196B language backbone + 1.8B vision encoder
198B
Active parameters per tokentop-8 of 288 experts × 42 routed layers
~11B
Active share of totalthe per-token compute footprint
~5.6%
The footgun to avoid
A model that activates ~11B parameters per token is not"like running an 11B model." The 11B figure is inference compute, not memory: all 198B parameters still have to live on the hardware. Plan memory for 198B (or its quantized equivalent) and latency for ~11B. Conflating the two is the most common mistake teams make when sizing a sparse MoE deployment.

This is the whole efficiency thesis in one sentence: you carry the memory cost of a large model but pay the throughput and per-token cost of a small one. It is why StepFun positions the model for "high-frequency production workloads" — agent loops that call a model hundreds or thousands of times per task, where per-token cost compounds fast. On independent measurement, Artificial Analysis recorded 415.9 tokens/second of output throughput (a top ranking among the models it evaluated) against a vendor-stated figure of up to 400 tokens/second.

03MultimodalA vision encoder that turns interfaces into actions.

The vision capability is the genuine new thing here. A dedicated 1.8B Vision Transformer encodes images and injects their representations into the language backbone's context at inference time, so there is no separate vision-model API call to orchestrate. In practice the model parses dense visual interfaces — UI wireframes, application GUIs, data charts, receipts, document tables — and maps them to structured code or data through templates such as chart-to-data, receipt-to-table, and GUI-action workflows.

StepFun also reports an emergent behavior worth noting carefully: in testing, the model combined visual and non-visual tools without being explicitly trained to. It independently called Python utilities — crop, zoom, bounding-box — alongside text tools inside multi-step visual-search workflows. StepFun describes this as an "emergent ability." It is a vendor observation rather than an independently reproduced benchmark, so we read it as promising signal rather than a guaranteed property.

Visual search
SimpleVQA (Search)
79.2%

A vendor-stated top ranking on visual question answering with search. Pairs the vision encoder with tool use to resolve questions that need both image understanding and external lookup.

vendor-stated
High-res perception
HR-Bench 4K
89.1%

Vendor-stated result on high-resolution image perception — the kind of fine-grained reading a model needs to extract numbers from a chart or fields from a scanned document.

vendor-stated
GUI agents
Android Daily
61.9%

Vendor-stated score on an Android GUI-action benchmark. Reflects the model's intended use as a vision-driven agent that reads a screen and decides the next interaction.

vendor-stated

04BenchmarksWhich number you should actually believe.

Step 3.7 Flash arrived with a wall of benchmark scores. For a production decision, the question is not "how high?" but "how trustworthy?" The table below is our own scorecard: for each agentic benchmark relevant to this model, what it measures, how exposed it is to contamination or vendor control, and the reported score. Treat it as a guide to weighting, not a ranking.

BenchmarkWhat it measuresTrust riskScore
SWE-Bench Pro1,865 tasks, 41 repos, multi-language (Python/Go/TS/JS), human-augmented specsLow — contamination-mitigated, independent leaderboard56.3% · 15th of 35
SWE-Bench Verified500 Python-only tasksElevated — frontier labs flag training-data contamination76.5% (vendor)
Terminal-Bench 2.189 sandboxed terminal tasks, pass@1, time-limitedVendor-stated — not yet seen on the public leaderboard59.5% (vendor)
ClawEval-1.1Tool-use / agentic capability evalVendor-stated, corroborated by a third-party review67.1% (vendor)
ToolathlonMulti-tool orchestration tasksVendor-stated49.5% (vendor)

Lead with SWE-Bench Pro. At a vendor-stated 56.3% it placed 15th on a 35-model leaderboard and 2nd among non-Anthropic models at release. Pro spans 1,865 tasks across 41 repositories in four languages, averaging 107 lines and four files changed per task, with a three-stage human-augmentation process and contamination mitigation. That breadth is exactly why it is harder to game than Verified.

Do not lead with SWE-Bench Verified. The vendor-stated 76.5% looks more impressive, but Verified is 500 Python-only tasks, and frontier labs — OpenAI among them — have flagged training-data contamination there to the point of discontinuing the score. A higher number on a more contaminated benchmark is the wrong signal to optimize for. The Terminal-Bench 2.1 figure of 59.5% is likewise vendor-stated and had not appeared on the public leaderboard at the time of writing; treat it as promising rather than confirmed.

Where it sits versus frontier
Step 3.7 Flash is not competing with the frontier coding tier. On the same SWE-Bench Pro leaderboard, Anthropic models led the field — Claude Opus 4.8 around 69%, with a preview model higher still. Step 3.7 Flash is filling a different niche: a large share of frontier coding capability at a small fraction of the cost, open-weight and locally deployable. Read it as a cost-efficiency play, not a frontier challenger.

05The Cost CaseThe math that makes this a production decision.

Here is the part most coverage buried. StepFun published an "Advisor Mode" comparison in which Step 3.7 Flash runs an agentic task for about $0.19 against roughly $1.76 for a larger frontier model in the same configuration — a stated 9.3× cost reduction at, per StepFun, comparable coding capability. Advisor Mode uses Step 3.7 Flash as the primary executor for iterative tool calls, escalating to a bigger model only at hard planning inflection points or after repeated failures.

Read those two numbers with care: both come from StepFun's own scaffold, against a model StepFun chose, with no independent cost audit. The directional claim — that a sparse-MoE executor priced at $0.20/M input and $1.15/M output is dramatically cheaper to loop than a frontier model — is well supported by the pricing alone. The exact 9.3× multiple is not something we would underwrite without running the comparison on your own tasks.

Advisor Mode (vendor-stated)Step 3.7 FlashLarger frontier model
Cost per agentic task~$0.19~$1.76
Input $/M (cache miss)$0.20Higher
Input $/M (cache hit)$0.04
Output $/M tokens$1.15Higher
Role in the loopPrimary executorEscalation only

The practical exercise is a breakeven calculation on your own volume. A team running 1,000 agentic tasks a month is talking about a small absolute difference; a team running 100,000 is choosing between a four-figure and a five-figure monthly bill for the same work. That is where an open-weight, cheap-to-loop executor stops being a curiosity and becomes a line-item decision. If you want that modeled against your actual workloads, our AI and digital transformation engagements start with exactly this kind of cost-and-capability eval. For the wider pricing picture, our Q2 2026 LLM API pricing index shows where Step 3.7 Flash's $0.20/M input sits in the current landscape.

"This design achieves a coding level comparable to Claude Opus 4.6 at 97% for an average cost of only $0.19 per task versus approximately $1.76 for the larger model."— Communeify technical analysis (citing StepFun's internal benchmarks)

06Reasoning EffortThree effort levels, set per call.

Step 3.7 Flash exposes three selectable reasoning levels — low, medium, and high — controlled by the output_config.effort parameter on the API. That lets a caller trade inference speed and cost against analytical depth on a per-request basis, which matters a great deal for a model whose entire pitch is high-frequency, cost-sensitive agent loops.

Fast
Low effort
effort: low

Simple, latency-sensitive tasks where speed and cost dominate. The right default for high-volume routine steps inside an agent loop.

Lowest cost · lowest latency
Default
Medium effort
effort: medium

General reasoning default. Balances depth and cost for typical tool-use and code tasks — the level most production loops will spend the bulk of their calls on.

Balanced default
Deep
High effort
effort: high

Complex math, planning, and code analysis. Step 3.7 Flash generates substantially more reasoning tokens at this level — verbose chain-of-thought is the cost of the extra depth.

Max depth · more tokens

A useful signal on consistency: StepFun reports that Step 3.7 Flash narrowed its performance variance across different agentic scaffolds from a wide 43–73% spread on the prior 3.5 Flash to a tighter 64.5–71.5%. For production teams, predictable behavior across harnesses often matters more than a peak score on one — a model that performs evenly whether it runs under Claude Code, Cline, Roo Code, or Kilo Code is easier to operate. An independent DGX Spark review reported a 100% tool-call success rate across its multi-step test scenarios, which is consistent with that framing, though it is a single hands-on result rather than a broad benchmark.

07Run It TodayCloud, open weights, or on your desk.

Three deployment paths are available, and the Apache 2.0 license means the on-prem route carries no usage restrictions. Pick the surface that matches the workload.

Surface
platform.stepfun.ai
Exposure
Hosted API · OpenAI-compatible
Best for
Production integration with image content parts and the effort parameter. Cache-hit pricing rewards repeated context. Also live on OpenRouter and NVIDIA NIM.
Surface
huggingface.co/stepfun-ai
Exposure
Open weights · Apache 2.0
Best for
On-prem and fine-tuning. BF16, FP8, NVFP4, and GGUF formats. Runs under vLLM, SGLang, Transformers v5+, llama.cpp, and TensorRT-LLM; NVIDIA offers day-0 NeMo fine-tuning.
Surface
Local GGUF (single box)
Exposure
IQ4_XS quant · ~105GB
Best for
Fits 128GB of unified memory — a Mac Studio Ultra or DGX Spark. Full BF16 is ~394GB and needs multi-GPU. Local quantized throughput is far lower than cloud (a review saw ~27 tok/s).

One caveat on throughput: the headline 400–415 tokens/second figures are cloud inference on NVIDIA hardware. Local quantized deployment is a different regime — an independent DGX Spark review measured roughly 27 tokens/second on quantized weights. The local story is about sovereignty, control, and zero per-token cost, not raw speed. The model is also compatible with the common agentic coding harnesses (Claude Code, Cline, Roo Code, Kilo Code), so slotting it in as a cheaper executor rarely means rebuilding your tooling. If you are mapping the broader harness landscape, see our Q2 2026 agentic coding platform matrix.

08Where It FitsThe decision tree for production teams.

Step 3.7 Flash does not change the frontier picture. It changes the economics of a few specific workload classes — and for those, it is worth a serious evaluation against your current stack.

High-volume agent loops
Cost-sensitive tool-use at scale

Thousands of agentic tasks a month where per-token cost compounds. The sparse-MoE economics and cache-hit pricing are the whole point. Benchmark Advisor-Mode-style executor patterns on your own tasks first.

Pick Step 3.7 Flash
Vision-driven agents
GUI, chart, and document parsing

Workflows that read screens, charts, receipts, or wireframes and act on them. The integrated 1.8B vision encoder removes a separate vision call. Validate the emergent tool-use claims on your data.

Strong candidate
Sovereignty / on-prem
Apache 2.0, locally deployable

Compliance-bound or air-gapped deployments. The IQ4_XS GGUF fits a single 128GB box, and the permissive license carries no usage limits — but plan for local throughput far below cloud rates.

Pick open weights
Hardest coding & reasoning
Frontier-tier capability

The most demanding code generation and reasoning still belongs to the frontier tier (Claude Opus 4.8 led the SWE-Bench Pro board). Keep a frontier model for the hard cases and escalate to it.

Stay with frontier

The cleanest production pattern is the one StepFun's own Advisor Mode gestures at: route the high-frequency, lower-stakes executor calls to Step 3.7 Flash and reserve a frontier model for planning inflection points and failures. That is multi-vendor routing by task class, and it is where an open-weight, cheap-to-loop model earns its place — not by beating the frontier, but by handling the 80% of calls that never needed the frontier in the first place. It sits naturally alongside the wider open-weight agentic wave that includes models like Kimi K2.5 and the open-source agent-swarm movement, and its 256K window is worth reading against the broader 2026 long-context comparison.

09ConclusionA cost-efficiency play, honestly framed.

The shape of open agentic models, May 2026

The benchmark is the headline everyone reads; the cost is the one that changes decisions.

Step 3.7 Flash is a well-engineered sparse MoE that does one thing unusually well: it runs agentic loops cheaply. A 198B model that activates ~11B per token, prices output at $1.15/M, ships under Apache 2.0, and adds a native vision encoder is a genuinely useful tool for high-frequency production work — provided you read its numbers correctly.

And reading them correctly means leading with SWE-Bench Pro (56.3%, the cleaner signal) over the contamination-flagged Verified score, treating the Terminal-Bench 2.1 and Advisor-Mode cost figures as vendor-stated until an independent harness confirms them, and never mistaking ~11B active compute for an 11B memory footprint. The directional case — frontier-adjacent coding capability at a small fraction of the cost — is strong. The precise multiples are not yet independently audited.

The broader signal is the one that matters most. The open-weight field is no longer trying only to match the frontier on raw capability; it is competing on cost-per-task for the enormous volume of agentic work that never needed a frontier model. Step 3.7 Flash is a clear, sober entry on that side of the line — and the right response is not to switch on a headline, but to run the breakeven and the eval on the workloads you actually care about.

Run open-weight agentic models in production

Cheap executors plus frontier escalation make agentic scale genuinely affordable.

Our team helps businesses evaluate, benchmark, and operate open-weight models like Step 3.7 Flash — building multi-vendor routing that puts cheap executors on high-frequency loops and reserves the frontier for the hard cases.

Free consultationExpert guidanceTailored solutions
What we work on

Open-weight model engagements

  • Cost-and-capability evals on your own agentic workloads
  • Multi-vendor routing — cheap executor + frontier escalation
  • Vision-agent pipelines — GUI, chart, and document parsing
  • On-prem & sovereign deployment of Apache-2.0 models
  • Benchmark trust audits — which number to actually believe
FAQ · Step 3.7 Flash guide

The questions we get every week.

Step 3.7 Flash is an open-weight agentic vision model from StepFun, released in late May 2026 (listed on OpenRouter May 28, with the StepFun blog and Hugging Face history pointing to May 29). It is a 198B-parameter sparse Mixture-of-Experts model — a 196B language backbone plus a 1.8B vision encoder — that activates only about 11B parameters per token. It ships under an Apache 2.0 license with weights on Hugging Face, and is available through the StepFun platform, OpenRouter, and NVIDIA NIM. It is the third model in StepFun's Step-3 family and the first multimodal entry in the Flash line.