AI DevelopmentNew Release11 min readPublished June 1, 2026

Multimodal budget agent · 1M context · ~6× cheaper than Qwen 3.7 Max

Qwen 3.7 Plus: Alibaba's Low-Cost Multimodal Agent Model GA

Qwen3.7-Plus is generally available as of June 1, 2026, after roughly 18 days of preview signal on the public arena. It adds vision and video understanding to Alibaba's agent backbone at about one-sixth the per-token price of Qwen 3.7 Max — and, notably, ships proprietary and API-only, breaking the lab's open-weight habit.

DA
Digital Applied Team
Senior strategists · Published Jun 1, 2026
PublishedJun 1, 2026
Read time11 min
Sources8 cited
Input price
$0.40/1M
tokens · output $1.60/1M
~6× under Max
ScreenSpot Pro
79.0
GUI grounding · vendor-stated
leads the table
Context window
1M
tokens · 65K max output
Intelligence Index
#53/164
Artificial Analysis · independent

Qwen 3.7 Plus is Alibaba's new low-cost multimodal agent model, generally available as of June 1, 2026. It bolts vision and video understanding onto the Qwen 3.7 text backbone, runs a one-million-token context window, and lists at roughly one-sixth the per-token price of the text-only Qwen 3.7 Max — a combination that positions it as a default for budget-sensitive agent pipelines.

The model first surfaced as Qwen3.7-Plus-Preview on the public LM Arena leaderboard around May 14, 2026, giving developers roughly 18 days of live inference signal before the commercial endpoint dropped the "-Preview" suffix at GA. That preview-then-commercialise cadence is becoming the standard release shape for Chinese AI labs, and it is part of what makes this launch worth a close read rather than a headline skim.

This guide covers what actually shipped, where the pricing sits in the multimodal budget tier, the GUI-grounding cost math that makes it interesting for automation teams, how to read its vendor-run benchmarks honestly, the independent signals that do exist, and the strategic pivot hiding in the "proprietary" label. Every benchmark table below carries an explicit vendor-stated caveat where it applies.

Key takeaways
  1. 01
    Vision and video, on the Max text backbone.Qwen 3.7 Plus accepts text, image, and video inputs and outputs text. It is positioned as a perception-and-reasoning agent, not a generative vision model — it reads screens and scenes but does not create images or video.
  2. 02
    Priced as a budget agent — roughly 6× under Max.List pricing is $0.40 input / $1.60 output per 1M tokens, versus Qwen 3.7 Max at $2.50 / $7.50. Cached-input pricing is reported in a $0.04–$0.08/1M range depending on source — verify on Model Studio before modelling spend.
  3. 03
    GUI grounding is the standout, on a vendor harness.Qwen reports 79.0 on ScreenSpot Pro, ahead of GPT-5.4 (67.4) and Claude Opus 4.6 (49.5) in its own table. That score is vendor-stated and was run with thinking disabled — treat it as a directional signal, not a settled fact.
  4. 04
    It ships proprietary and API-only.No open-weight checkpoints were published on Hugging Face at launch — a real departure from Alibaba's open-source Qwen lineage. Reports of a Q3 open-weight variant are third-party speculation and unconfirmed by Alibaba.
  5. 05
    Independent signal is thin but positive.The only third-party reads are Artificial Analysis (Intelligence Index #53 of 164, but slow at ~52.9 tokens/sec) and LM Arena rankings. Everything else in the announcement is Qwen evaluating Qwen.

01What ShippedA multimodal agent that perceives but does not generate.

Qwen 3.7 Plus is the multimodal sibling of Qwen 3.7 Max. Where Max is a text-only flagship, Plus extends the same agentic backbone with vision and video understanding while keeping the coding, tool-use, and productivity strengths of the text model. Crucially, it is a perception model, not a generative one: it accepts text, images, and video as input and returns text only. It can read a screen, ground a click target, or answer a question about a frame of video — but it will not draw you a picture.

The model is exposed as the API endpoint qwen3.7-plus on Alibaba Cloud Model Studio (DashScope), reachable through OpenAI-compatible chat-completions and responses APIs across Beijing, Singapore, and US-Virginia regional endpoints. Alibaba's own framing positions it as a single agent that blends GUI and command-line interaction inside one loop. For the text-only counterpart, see our deep dive on the text-only Qwen 3.7 Max flagship.

Multimodal agent
Qwen 3.7 Plus
text + image + video in · text out

Vision and video understanding on the Qwen 3.7 agent backbone. Reads screens, grounds GUI click targets, navigates mobile apps, and writes code from visual references. $0.40 / $1.60 per 1M tokens.

endpoint: qwen3.7-plus
Text flagship
Qwen 3.7 Max
text in · text out

The premium text-only baseline. Scores 60.6 on SWE-Bench Pro (vendor-stated) versus Plus at roughly 57.6 — about a 3-point quality gap on pure-text software engineering. $2.50 / $7.50 per 1M tokens.

~6× the input price of Plus
Release snapshot
Qwen 3.7 Plus reached general availability June 1, 2026 on Alibaba Cloud Model Studio, after a Qwen3.7-Plus-Preview appeared on the public LM Arena leaderboard around May 14. The qwen.ai blog post carries an earlier internal date of May 21, but third-party coverage uniformly places the GA in the first days of June — so the safe read is preview from mid-May, GA on June 1. The model is proprietary and API-only; no open weights shipped with it.
Today we introduce Qwen3.7-Plus, a multimodal agent model that unifies vision and language into a single, versatile agent foundation.— Qwen Team, Alibaba (official launch post)

02PricingThe real headline is price, not raw capability.

The story of Qwen 3.7 Plus is economic, not heroic. At a list price of $0.40 per 1M input tokens and $1.60 per 1M output tokens, it costs roughly 6× less on input and 5× less on output than Qwen 3.7 Max ($2.50 / $7.50). VentureBeat positioned it as priced "just above" MiniMax M3's limited-time discount pricing, and the independent OpenRouter and Artificial Analysis listings corroborate the input/output figures. Cached-input pricing is the one number to treat with caution: OpenRouter and Artificial Analysis show $0.08/1M, while VentureBeat cites $0.04/1M from the Model Studio pricing page. Until the live page settles, model it as a $0.04–$0.08 range rather than a fixed rate.

Input price per 1M tokens · multimodal budget tier

Sources: Model Studio, OpenRouter, Artificial Analysis, VentureBeat
Qwen 3.7 Maxtext-only flagship · premium baseline
$2.50/1M
Qwen 3.7 Plusmultimodal · input price
$0.40/1M
MiniMax M3open-weight budget rival · list input
$0.30/1M
Qwen 3.7 Plus (cached)cached-input range · verify on Model Studio
$0.04–0.08

For agent workloads, input price dominates total cost because tool loops, screenshots, and long context windows pour tokens through the prompt side far faster than the model emits them. A model that reads a full screenshot, a system prompt, and a running scratchpad on every step is paying input tokens dozens of times per task. Cutting that side from $2.50 to $0.40 per million is the difference between an agent that is economical to run at scale and one that is a demo. This is the same FinOps logic we lay out in our AI inference cost optimization playbook.

Multimodal budget tier — June 2026 (list pricing; benchmarks vendor-stated)
ModelIn $/1MOut $/1MContextModalitiesOpen weights
Qwen 3.7 Plus$0.40$1.601MText · image · video inNo (API-only)
MiniMax M3$0.30$1.201MTextYes
DeepSeek V4 Flash$0.14$0.281MTextYes
Qwen 3.7 Max$2.50$7.501MTextNo (API-only)

Sources: VentureBeat pricing snapshot (Jun 2026), OpenRouter, Artificial Analysis, Digital Applied fact-pack §1.2. Pricing is list rate; promotional discounts may apply. Qwen 3.7 Plus is the only multimodal entry in this tier.

03Grounding EconomicsGUI grounding points per dollar.

The single most interesting number Qwen reports is its ScreenSpot Pro score — a GUI-grounding benchmark that measures whether a model can name the exact pixel coordinates of a click target on a screenshot. Qwen claims 79.0, ahead of GPT-5.4 at 67.4 and Claude Opus 4.6 at 49.5 in its own table. That number is vendor-stated and was measured with thinking disabled, so it should anchor a hypothesis, not a purchase order. But it points at the real reason this model matters for robotic-process-automation and browser-automation teams: it pairs a high grounding score with a low price.

So we built a metric the vendor announcements and the aggregator sites do not publish. Take each model's ScreenSpot Pro score and divide it by its input price per million tokens — call it grounding points per dollar of input. It is a crude proxy, but it answers the operative question for automation teams: which model gives you the most click accuracy per dollar? On vendor-stated grounding scores against list input pricing, the gap is not close.

GUI grounding cost-effectiveness — derived metric (Digital Applied)
ModelScreenSpot ProInput $/1MPoints / $1 inputEfficiency rank
Qwen 3.7 Plus79.0$0.40~1981st
Qwen 3.6 Plus68.2n/p
Gemini 3.1 Pro68.1n/p
GPT-5.4 (xhigh)67.4n/p
Claude Opus 4.6 Max49.5n/p

ScreenSpot Pro scores from the qwen.ai benchmark table (VENDOR-STATED, run with thinking disabled). "Points / $1" divides the grounding score by the per-million input price; we only compute it for Qwen 3.7 Plus, whose list price is published. Cross-vendor pricing for competitor GUI modes was not published comparably (n/p), so we deliberately leave those cells blank rather than compute against mismatched price points.

Read the asymmetry
ScreenSpot Pro and OSWorld were evaluated with thinking disabledon Qwen's own harness. Comparing those numbers against models that may have been run with reasoning enabled is not apples-to-apples. The grounding-per-dollar story is directionally compelling — a budget model that reads screens well — but it rests on vendor-run scores. Validate on your own screenshots before you wire it into an RPA pipeline.

04Vendor BenchmarksWhere it leads, where it trails — all vendor-run.

The launch post is dense with benchmark tables, and nearly all of them were produced by Qwen evaluating Qwen — frequently on internal harnesses where the lab is both the test subject and the operator. We reproduce a representative selection below, with the model leading on agentic and long-context retrieval tasks and conceding ground on pure-text software engineering. Orange bars mark Qwen 3.7 Plus where it leads its own comparison; blue bars mark where a frontier competitor leads.

Qwen 3.7 Plus vs frontier · selected vendor benchmarks

Source: qwen.ai benchmark tables — VENDOR-STATED
AndroidWorldmobile navigation · vs Gemini 3.1 Pro 70.7
81.0
Plus leads
ScreenSpot ProGUI grounding · vs GPT-5.4 67.4
79.0
Plus leads
Terminal-Bench 2.0Terminus harness · vs DeepSeek V4-Pro 67.9
70.3
Plus leads
MRCR-v2 128klong-context retrieval · vs Opus 4.6 Max 84.0
91.7
Plus leads
OSWorld-Verifieddesktop GUI · vs GPT-5.4 75.0
73.3
GPT-5.4
SWE-Bench Verifiedvs Opus 4.6 Max 80.8 · DeepSeek V4-Pro 80.6
77.7
Opus 4.6
SWE-Bench Provs Qwen 3.7 Max 60.6 · text SWE gap
~57.6
Qwen 3.7 Max
Qwen 3.7 Plus leadsCompetitor leads

The pattern is coherent. Qwen 3.7 Plus is strongest exactly where its positioning claims — agentic GUI and mobile tasks, terminal work, and long-context retrieval — and it gives up roughly 3 points of SWE-Bench Pro to its own Max sibling on pure-text software engineering. That is the expected trade for a cheaper, multimodal variant: you buy perception and price, you pay a little on text-only coding depth. On frontier-grade STEM the gap nearly closes, with a vendor-stated GPQA Diamond of 90.3 sitting within a point of Opus 4.6 Max and DeepSeek V4-Pro Max.

Long-context retrieval
MRCR-v2 128k
91.7

Highest in the vendor text table, ahead of Qwen 3.6 Plus (85.9), Opus 4.6 Max (84.0), and DeepSeek V4-Pro Max (74.4). The number that justifies the 1M-token window — though it is still a Qwen-run evaluation.

vendor-stated
STEM reasoning
GPQA Diamond
90.3

Within a point of Opus 4.6 Max (91.3) and DeepSeek V4-Pro Max (90.1). Frontier-grade graduate STEM performance despite the budget price point. Vendor-run, but a flattering result if it holds independently.

vendor-stated
Text software engineering
SWE-Bench Pro
57.6

About 3 points behind Qwen 3.7 Max (60.6), and below Kimi K2.6 Thinking (59.5) and Opus 4.6 Max (57.3). The clearest measurable quality gap — Plus trades pure-text coding depth for multimodality and price.

the trade-off

05Independent SignalsThe two reads that are not vendor-run.

Strip away the vendor tables and only two independent signals remain, and both are worth more than the rest combined precisely because Qwen did not produce them. Artificial Analysis placed Qwen 3.7 Plus at #53 of 164 models on its Intelligence Index— its phrasing is "well above average" rather than frontier — with an output speed of about 52.9 tokens per second, which ranks roughly #101 of 164 and is notably slow, and a time-to-first-token near 2.3 seconds. The same evaluation flagged unusually high verbosity: the model emitted about 110M output tokens during testing against a 29M median, which matters because output tokens are the expensive side of the bill.

The second signal is LM Arena, where coverage placed the model around #15 on text, #12 on coding, and #16 on vision — enough for one outlet to rank Alibaba as the #5 lab in vision research at launch. Arena rankings fluctuate daily and these were reported in secondary coverage, so treat them as a snapshot rather than a standing. The honest synthesis: independent data confirms Qwen 3.7 Plus is a capable mid-pack model that is cheap and slightly slow — not the across-the-board leader the vendor tables imply.

It is among the cheaper powerful AI models available now, coming in price-wise just above Chinese rival MiniMax M3's limited-time discount pricing.— Carl Franzen, VentureBeat
The verbosity tax
Artificial Analysis measured Qwen 3.7 Plus generating roughly 110M output tokens in evaluation against a 29M median. Output is the costly side at $1.60/1M, and the model is comparatively slow at ~52.9 tokens/sec. For chatty agent loops, that verbosity can quietly erode the cheap-input advantage — budget for it, and consider tightening response constraints in production.

06StrategyThe most consequential line is proprietary.

Most coverage files "proprietary, API-only" as a footnote. We think it is the headline. Alibaba built the Qwen moat on permissively licensed open weights — US enterprises including Airbnb reportedly adopted earlier open-weight Qwen models, and that distribution was a strategic asset. Qwen 3.7 Plus ships with no open-weight checkpoints on Hugging Face, and Qwen 3.7 Max is closed as well. That is a deliberate pivot toward a closed-frontier posture, and it changes the calculus for any team that chose Qwen specifically because it could self-host.

Be precise about what is and is not known here. Alibaba has not confirmed open weights for the 3.7 family, and it has not denied them either. Reports of a Q3 2026 open-weight variant are third-party speculation, not an announced roadmap— do not plan around them. For the broader arc of the lab's release strategy, our retrospective on Alibaba's open-weight model history traces how a once open-by-default lab arrived here, and our framing of the wider closed vs open-weight AI trade-offs lays out what a closed Qwen costs sovereignty-bound buyers.

What sets Qwen3.7-Plus apart is its ability to operate as a multimodal interactive hybrid agent — it perceives real-world scenes, reads screens and operates GUIs, and writes code from visual references.— Qwen Team, Alibaba (official launch post)

07Agentic Features1M context, preserved thinking, cross-framework.

Three engineering choices make Qwen 3.7 Plus a genuine agent platform rather than a chat model with vision bolted on. First, the 1,000,000-token context window with up to 65,536 tokens of output and an internal chain-of-thought budget reported up to 256K tokens — enough headroom to hold a large codebase, a screenshot history, and a running plan in a single loop. Second, the preserve_thinking API parameter, which retains <think>blocks across conversation turns so a long-horizon task does not reset the model's reasoning chain on every tool call.

That last feature is quietly significant because it shows the whole frontier converging on the same idea. Anthropic ships Extended Thinking that returns reasoning blocks, OpenAI passes encrypted reasoning state back across turns, and Alibaba now exposes preserve_thinking. Multi-turn reasoning-state preservation has become table stakes for serious agent models — and Qwen 3.7 Plus offers it at the cheapest price of the three. Third, the model is positioned to generalize across agent scaffolds — Qwen cites Claude Code, OpenClaw, and Qwen Code among the frameworks it targets, though that cross-framework claim is vendor-stated and worth verifying against your own scaffold.

Context
1M-token window
1,000,000 in · 65,536 max out

Holds large codebases, screenshot histories, and running plans in one loop. Internal chain-of-thought budget reported up to 256K tokens (vendor-stated). MRCR-v2 128k retrieval of 91.7 backs the long-context claim.

1M context
Reasoning state
preserve_thinking
<think> blocks retained across turns

Carries reasoning chains across tool calls so long-horizon agent tasks do not reset every turn. The same converged capability Anthropic and OpenAI ship — offered here at the lowest price of the three.

API parameter
Scaffolds
Cross-framework
OpenAI-compatible · multi-region

Reachable via OpenAI-compatible chat and responses APIs across Beijing, Singapore, and US-Virginia. Qwen says it generalizes across Claude Code, OpenClaw, and Qwen Code — a vendor-stated claim to validate on your stack.

endpoint: qwen3.7-plus

08DecisionWhen Qwen 3.7 Plus is the right default.

The model is not a universal answer. It is a sharp fit for a specific set of workloads and a poor one for others. Here is how we would route it across the four cases that come up most in client engagements.

Visual automation
Screen reading & GUI grounding

High vendor-stated ScreenSpot Pro at budget pricing makes this the headline use case for RPA and browser-automation agents. Validate the grounding score on your own screenshots first — vendor harness, thinking disabled — then exploit the cheap input loop.

Pick Qwen 3.7 Plus
Long-context retrieval
Budget document agents

1M context plus a vendor-stated 91.7 MRCR-v2 128k make it a strong, cheap option for high-volume long-document Q&A — provided you do not need self-hosting, since it ships API-only with no open weights.

Strong candidate
Pure-text coding
Heavy software engineering

On SWE-Bench Pro it trails its own Max sibling by ~3 points and sits below Opus 4.6 Max and Kimi K2.6 Thinking. For text-only coding depth, route to a stronger coder; reserve Plus for tasks where vision or price is the deciding factor.

Route elsewhere
Sovereign deployment
Self-hosted / air-gapped

Qwen 3.7 Plus is proprietary and API-only — no weights to deploy on-prem. If self-hosting is a hard requirement, an open-weight rival like MiniMax M3 or DeepSeek V4 fits better today; do not bet on an unconfirmed open Qwen variant.

Pick open weights

Our standing recommendation is to treat Qwen 3.7 Plus as a routing target, not a default. Send screen-reading, GUI-grounding, and cheap-bulk long-context work its way; keep a stronger text coder for heavy software engineering; and reach for an open-weight model when sovereignty or self-hosting is non-negotiable. If you are deciding between budget multimodal options head-to-head, our coverage of MiniMax M3 is the natural companion read. And if you want this benchmarked against your own corpus and pipelines rather than a vendor table, our AI digital transformation engagements start with exactly that comparative eval.

09ConclusionA budget multimodal agent — with caveats attached.

The shape of the budget agent tier, June 2026

The real story is price and a quiet pivot to closed weights.

Qwen 3.7 Plus is the most interesting budget multimodal model of the moment, but for reasons the headlines mostly miss. It is not that the benchmark numbers are extraordinary — most of them are vendor-run, and the one independent read places it well above average rather than frontier. It is that the model pairs genuine vision-and-video agent capability with a price roughly 6× under its own flagship, on the side of the ledger — input tokens — where agents spend the most.

The honest framing is the useful one. On vendor harnesses it leads on GUI grounding, mobile navigation, and long-context retrieval; it trails on pure-text software engineering; and the only third-party data confirms a capable, cheap, slightly slow mid-pack model. Treat the grounding-per-dollar advantage as a hypothesis to validate on your own screenshots, not a settled win — and budget for the verbosity tax on the output side.

The strategic signal sits underneath all of it: a lab that built its reputation on open weights has shipped a closed, API-only frontier model — and there is no confirmed open-weight variant on the way. For teams that chose Qwen to self-host, that is the line in this release that actually changes the plan. Run your own evals, price the whole loop including output, and decide per-workload rather than per-headline.

Put budget multimodal agents to work

Cheap input plus screen reading make visual automation economically viable.

Our team helps businesses evaluate, benchmark, and operate frontier AI models — open and closed — for visual automation, long-context retrieval, and agentic pipelines, delivered in days not quarters.

Free consultationExpert guidanceTailored solutions
What we work on

Multimodal agent engagements

  • GUI-grounding & RPA agents benchmarked on your screens
  • Long-context document agents — cost-optimized pipelines
  • Multi-vendor routing — Qwen / GPT-5.5 / Opus / open weights
  • Inference FinOps — pricing the whole agent loop
  • Open vs closed model strategy for sovereignty-bound teams
FAQ · Qwen 3.7 Plus guide

The questions we get every week.

Qwen 3.7 Plus is Alibaba's multimodal agent model — it adds vision and video understanding to the Qwen 3.7 text backbone while keeping its coding, tool-use, and productivity strengths. It accepts text, image, and video as input and returns text only. It reached general availability on June 1, 2026, after a Qwen3.7-Plus-Preview appeared on the public LM Arena leaderboard around May 14 — roughly 18 days of preview signal before the commercial endpoint dropped the '-Preview' suffix. The qwen.ai blog carries an earlier internal date of May 21, but third-party coverage uniformly places the GA in the first days of June. It is served as the API endpoint qwen3.7-plus on Alibaba Cloud Model Studio.