Qwen 3.7 Plus is Alibaba's new low-cost multimodal agent model, generally available as of June 1, 2026. It bolts vision and video understanding onto the Qwen 3.7 text backbone, runs a one-million-token context window, and lists at roughly one-sixth the per-token price of the text-only Qwen 3.7 Max — a combination that positions it as a default for budget-sensitive agent pipelines.
The model first surfaced as Qwen3.7-Plus-Preview on the public LM Arena leaderboard around May 14, 2026, giving developers roughly 18 days of live inference signal before the commercial endpoint dropped the "-Preview" suffix at GA. That preview-then-commercialise cadence is becoming the standard release shape for Chinese AI labs, and it is part of what makes this launch worth a close read rather than a headline skim.
This guide covers what actually shipped, where the pricing sits in the multimodal budget tier, the GUI-grounding cost math that makes it interesting for automation teams, how to read its vendor-run benchmarks honestly, the independent signals that do exist, and the strategic pivot hiding in the "proprietary" label. Every benchmark table below carries an explicit vendor-stated caveat where it applies.
- 01Vision and video, on the Max text backbone.Qwen 3.7 Plus accepts text, image, and video inputs and outputs text. It is positioned as a perception-and-reasoning agent, not a generative vision model — it reads screens and scenes but does not create images or video.
- 02Priced as a budget agent — roughly 6× under Max.List pricing is $0.40 input / $1.60 output per 1M tokens, versus Qwen 3.7 Max at $2.50 / $7.50. Cached-input pricing is reported in a $0.04–$0.08/1M range depending on source — verify on Model Studio before modelling spend.
- 03GUI grounding is the standout, on a vendor harness.Qwen reports 79.0 on ScreenSpot Pro, ahead of GPT-5.4 (67.4) and Claude Opus 4.6 (49.5) in its own table. That score is vendor-stated and was run with thinking disabled — treat it as a directional signal, not a settled fact.
- 04It ships proprietary and API-only.No open-weight checkpoints were published on Hugging Face at launch — a real departure from Alibaba's open-source Qwen lineage. Reports of a Q3 open-weight variant are third-party speculation and unconfirmed by Alibaba.
- 05Independent signal is thin but positive.The only third-party reads are Artificial Analysis (Intelligence Index #53 of 164, but slow at ~52.9 tokens/sec) and LM Arena rankings. Everything else in the announcement is Qwen evaluating Qwen.
01 — What ShippedA multimodal agent that perceives but does not generate.
Qwen 3.7 Plus is the multimodal sibling of Qwen 3.7 Max. Where Max is a text-only flagship, Plus extends the same agentic backbone with vision and video understanding while keeping the coding, tool-use, and productivity strengths of the text model. Crucially, it is a perception model, not a generative one: it accepts text, images, and video as input and returns text only. It can read a screen, ground a click target, or answer a question about a frame of video — but it will not draw you a picture.
The model is exposed as the API endpoint qwen3.7-plus on Alibaba Cloud Model Studio (DashScope), reachable through OpenAI-compatible chat-completions and responses APIs across Beijing, Singapore, and US-Virginia regional endpoints. Alibaba's own framing positions it as a single agent that blends GUI and command-line interaction inside one loop. For the text-only counterpart, see our deep dive on the text-only Qwen 3.7 Max flagship.
Qwen 3.7 Plus
Vision and video understanding on the Qwen 3.7 agent backbone. Reads screens, grounds GUI click targets, navigates mobile apps, and writes code from visual references. $0.40 / $1.60 per 1M tokens.
Qwen 3.7 Max
The premium text-only baseline. Scores 60.6 on SWE-Bench Pro (vendor-stated) versus Plus at roughly 57.6 — about a 3-point quality gap on pure-text software engineering. $2.50 / $7.50 per 1M tokens.
Today we introduce Qwen3.7-Plus, a multimodal agent model that unifies vision and language into a single, versatile agent foundation.— Qwen Team, Alibaba (official launch post)
02 — PricingThe real headline is price, not raw capability.
The story of Qwen 3.7 Plus is economic, not heroic. At a list price of $0.40 per 1M input tokens and $1.60 per 1M output tokens, it costs roughly 6× less on input and 5× less on output than Qwen 3.7 Max ($2.50 / $7.50). VentureBeat positioned it as priced "just above" MiniMax M3's limited-time discount pricing, and the independent OpenRouter and Artificial Analysis listings corroborate the input/output figures. Cached-input pricing is the one number to treat with caution: OpenRouter and Artificial Analysis show $0.08/1M, while VentureBeat cites $0.04/1M from the Model Studio pricing page. Until the live page settles, model it as a $0.04–$0.08 range rather than a fixed rate.
Input price per 1M tokens · multimodal budget tier
Sources: Model Studio, OpenRouter, Artificial Analysis, VentureBeatFor agent workloads, input price dominates total cost because tool loops, screenshots, and long context windows pour tokens through the prompt side far faster than the model emits them. A model that reads a full screenshot, a system prompt, and a running scratchpad on every step is paying input tokens dozens of times per task. Cutting that side from $2.50 to $0.40 per million is the difference between an agent that is economical to run at scale and one that is a demo. This is the same FinOps logic we lay out in our AI inference cost optimization playbook.
| Model | In $/1M | Out $/1M | Context | Modalities | Open weights |
|---|---|---|---|---|---|
| Qwen 3.7 Plus | $0.40 | $1.60 | 1M | Text · image · video in | No (API-only) |
| MiniMax M3 | $0.30 | $1.20 | 1M | Text | Yes |
| DeepSeek V4 Flash | $0.14 | $0.28 | 1M | Text | Yes |
| Qwen 3.7 Max | $2.50 | $7.50 | 1M | Text | No (API-only) |
Sources: VentureBeat pricing snapshot (Jun 2026), OpenRouter, Artificial Analysis, Digital Applied fact-pack §1.2. Pricing is list rate; promotional discounts may apply. Qwen 3.7 Plus is the only multimodal entry in this tier.
03 — Grounding EconomicsGUI grounding points per dollar.
The single most interesting number Qwen reports is its ScreenSpot Pro score — a GUI-grounding benchmark that measures whether a model can name the exact pixel coordinates of a click target on a screenshot. Qwen claims 79.0, ahead of GPT-5.4 at 67.4 and Claude Opus 4.6 at 49.5 in its own table. That number is vendor-stated and was measured with thinking disabled, so it should anchor a hypothesis, not a purchase order. But it points at the real reason this model matters for robotic-process-automation and browser-automation teams: it pairs a high grounding score with a low price.
So we built a metric the vendor announcements and the aggregator sites do not publish. Take each model's ScreenSpot Pro score and divide it by its input price per million tokens — call it grounding points per dollar of input. It is a crude proxy, but it answers the operative question for automation teams: which model gives you the most click accuracy per dollar? On vendor-stated grounding scores against list input pricing, the gap is not close.
| Model | ScreenSpot Pro | Input $/1M | Points / $1 input | Efficiency rank |
|---|---|---|---|---|
| Qwen 3.7 Plus | 79.0 | $0.40 | ~198 | 1st |
| Qwen 3.6 Plus | 68.2 | n/p | — | — |
| Gemini 3.1 Pro | 68.1 | n/p | — | — |
| GPT-5.4 (xhigh) | 67.4 | n/p | — | — |
| Claude Opus 4.6 Max | 49.5 | n/p | — | — |
ScreenSpot Pro scores from the qwen.ai benchmark table (VENDOR-STATED, run with thinking disabled). "Points / $1" divides the grounding score by the per-million input price; we only compute it for Qwen 3.7 Plus, whose list price is published. Cross-vendor pricing for competitor GUI modes was not published comparably (n/p), so we deliberately leave those cells blank rather than compute against mismatched price points.
04 — Vendor BenchmarksWhere it leads, where it trails — all vendor-run.
The launch post is dense with benchmark tables, and nearly all of them were produced by Qwen evaluating Qwen — frequently on internal harnesses where the lab is both the test subject and the operator. We reproduce a representative selection below, with the model leading on agentic and long-context retrieval tasks and conceding ground on pure-text software engineering. Orange bars mark Qwen 3.7 Plus where it leads its own comparison; blue bars mark where a frontier competitor leads.
Qwen 3.7 Plus vs frontier · selected vendor benchmarks
Source: qwen.ai benchmark tables — VENDOR-STATEDThe pattern is coherent. Qwen 3.7 Plus is strongest exactly where its positioning claims — agentic GUI and mobile tasks, terminal work, and long-context retrieval — and it gives up roughly 3 points of SWE-Bench Pro to its own Max sibling on pure-text software engineering. That is the expected trade for a cheaper, multimodal variant: you buy perception and price, you pay a little on text-only coding depth. On frontier-grade STEM the gap nearly closes, with a vendor-stated GPQA Diamond of 90.3 sitting within a point of Opus 4.6 Max and DeepSeek V4-Pro Max.
MRCR-v2 128k
Highest in the vendor text table, ahead of Qwen 3.6 Plus (85.9), Opus 4.6 Max (84.0), and DeepSeek V4-Pro Max (74.4). The number that justifies the 1M-token window — though it is still a Qwen-run evaluation.
GPQA Diamond
Within a point of Opus 4.6 Max (91.3) and DeepSeek V4-Pro Max (90.1). Frontier-grade graduate STEM performance despite the budget price point. Vendor-run, but a flattering result if it holds independently.
SWE-Bench Pro
About 3 points behind Qwen 3.7 Max (60.6), and below Kimi K2.6 Thinking (59.5) and Opus 4.6 Max (57.3). The clearest measurable quality gap — Plus trades pure-text coding depth for multimodality and price.
05 — Independent SignalsThe two reads that are not vendor-run.
Strip away the vendor tables and only two independent signals remain, and both are worth more than the rest combined precisely because Qwen did not produce them. Artificial Analysis placed Qwen 3.7 Plus at #53 of 164 models on its Intelligence Index— its phrasing is "well above average" rather than frontier — with an output speed of about 52.9 tokens per second, which ranks roughly #101 of 164 and is notably slow, and a time-to-first-token near 2.3 seconds. The same evaluation flagged unusually high verbosity: the model emitted about 110M output tokens during testing against a 29M median, which matters because output tokens are the expensive side of the bill.
The second signal is LM Arena, where coverage placed the model around #15 on text, #12 on coding, and #16 on vision — enough for one outlet to rank Alibaba as the #5 lab in vision research at launch. Arena rankings fluctuate daily and these were reported in secondary coverage, so treat them as a snapshot rather than a standing. The honest synthesis: independent data confirms Qwen 3.7 Plus is a capable mid-pack model that is cheap and slightly slow — not the across-the-board leader the vendor tables imply.
It is among the cheaper powerful AI models available now, coming in price-wise just above Chinese rival MiniMax M3's limited-time discount pricing.— Carl Franzen, VentureBeat
06 — StrategyThe most consequential line is proprietary.
Most coverage files "proprietary, API-only" as a footnote. We think it is the headline. Alibaba built the Qwen moat on permissively licensed open weights — US enterprises including Airbnb reportedly adopted earlier open-weight Qwen models, and that distribution was a strategic asset. Qwen 3.7 Plus ships with no open-weight checkpoints on Hugging Face, and Qwen 3.7 Max is closed as well. That is a deliberate pivot toward a closed-frontier posture, and it changes the calculus for any team that chose Qwen specifically because it could self-host.
Be precise about what is and is not known here. Alibaba has not confirmed open weights for the 3.7 family, and it has not denied them either. Reports of a Q3 2026 open-weight variant are third-party speculation, not an announced roadmap— do not plan around them. For the broader arc of the lab's release strategy, our retrospective on Alibaba's open-weight model history traces how a once open-by-default lab arrived here, and our framing of the wider closed vs open-weight AI trade-offs lays out what a closed Qwen costs sovereignty-bound buyers.
What sets Qwen3.7-Plus apart is its ability to operate as a multimodal interactive hybrid agent — it perceives real-world scenes, reads screens and operates GUIs, and writes code from visual references.— Qwen Team, Alibaba (official launch post)
07 — Agentic Features1M context, preserved thinking, cross-framework.
Three engineering choices make Qwen 3.7 Plus a genuine agent platform rather than a chat model with vision bolted on. First, the 1,000,000-token context window with up to 65,536 tokens of output and an internal chain-of-thought budget reported up to 256K tokens — enough headroom to hold a large codebase, a screenshot history, and a running plan in a single loop. Second, the preserve_thinking API parameter, which retains <think>blocks across conversation turns so a long-horizon task does not reset the model's reasoning chain on every tool call.
That last feature is quietly significant because it shows the whole frontier converging on the same idea. Anthropic ships Extended Thinking that returns reasoning blocks, OpenAI passes encrypted reasoning state back across turns, and Alibaba now exposes preserve_thinking. Multi-turn reasoning-state preservation has become table stakes for serious agent models — and Qwen 3.7 Plus offers it at the cheapest price of the three. Third, the model is positioned to generalize across agent scaffolds — Qwen cites Claude Code, OpenClaw, and Qwen Code among the frameworks it targets, though that cross-framework claim is vendor-stated and worth verifying against your own scaffold.
1M-token window
Holds large codebases, screenshot histories, and running plans in one loop. Internal chain-of-thought budget reported up to 256K tokens (vendor-stated). MRCR-v2 128k retrieval of 91.7 backs the long-context claim.
preserve_thinking
Carries reasoning chains across tool calls so long-horizon agent tasks do not reset every turn. The same converged capability Anthropic and OpenAI ship — offered here at the lowest price of the three.
Cross-framework
Reachable via OpenAI-compatible chat and responses APIs across Beijing, Singapore, and US-Virginia. Qwen says it generalizes across Claude Code, OpenClaw, and Qwen Code — a vendor-stated claim to validate on your stack.
08 — DecisionWhen Qwen 3.7 Plus is the right default.
The model is not a universal answer. It is a sharp fit for a specific set of workloads and a poor one for others. Here is how we would route it across the four cases that come up most in client engagements.
Screen reading & GUI grounding
High vendor-stated ScreenSpot Pro at budget pricing makes this the headline use case for RPA and browser-automation agents. Validate the grounding score on your own screenshots first — vendor harness, thinking disabled — then exploit the cheap input loop.
Budget document agents
1M context plus a vendor-stated 91.7 MRCR-v2 128k make it a strong, cheap option for high-volume long-document Q&A — provided you do not need self-hosting, since it ships API-only with no open weights.
Heavy software engineering
On SWE-Bench Pro it trails its own Max sibling by ~3 points and sits below Opus 4.6 Max and Kimi K2.6 Thinking. For text-only coding depth, route to a stronger coder; reserve Plus for tasks where vision or price is the deciding factor.
Self-hosted / air-gapped
Qwen 3.7 Plus is proprietary and API-only — no weights to deploy on-prem. If self-hosting is a hard requirement, an open-weight rival like MiniMax M3 or DeepSeek V4 fits better today; do not bet on an unconfirmed open Qwen variant.
Our standing recommendation is to treat Qwen 3.7 Plus as a routing target, not a default. Send screen-reading, GUI-grounding, and cheap-bulk long-context work its way; keep a stronger text coder for heavy software engineering; and reach for an open-weight model when sovereignty or self-hosting is non-negotiable. If you are deciding between budget multimodal options head-to-head, our coverage of MiniMax M3 is the natural companion read. And if you want this benchmarked against your own corpus and pipelines rather than a vendor table, our AI digital transformation engagements start with exactly that comparative eval.
09 — ConclusionA budget multimodal agent — with caveats attached.
The real story is price and a quiet pivot to closed weights.
Qwen 3.7 Plus is the most interesting budget multimodal model of the moment, but for reasons the headlines mostly miss. It is not that the benchmark numbers are extraordinary — most of them are vendor-run, and the one independent read places it well above average rather than frontier. It is that the model pairs genuine vision-and-video agent capability with a price roughly 6× under its own flagship, on the side of the ledger — input tokens — where agents spend the most.
The honest framing is the useful one. On vendor harnesses it leads on GUI grounding, mobile navigation, and long-context retrieval; it trails on pure-text software engineering; and the only third-party data confirms a capable, cheap, slightly slow mid-pack model. Treat the grounding-per-dollar advantage as a hypothesis to validate on your own screenshots, not a settled win — and budget for the verbosity tax on the output side.
The strategic signal sits underneath all of it: a lab that built its reputation on open weights has shipped a closed, API-only frontier model — and there is no confirmed open-weight variant on the way. For teams that chose Qwen to self-host, that is the line in this release that actually changes the plan. Run your own evals, price the whole loop including output, and decide per-workload rather than per-headline.