Qwen 3.5 Medium Models: Benchmarks, Pricing, and Complete Guide
Alibaba's Qwen team just dropped four medium-sized models that rewrite the cost-performance equation. The 35B-A3B activates only 3B parameters yet surpasses the previous 235B flagship — here's what that means for your AI stack.
What Is the Qwen 3.5 Medium Series?
On February 24, 2026, Alibaba's Qwen team released the Qwen 3.5 medium model series — four models that sit between the flagship Qwen3.5-397B-A17B (released February 16) and smaller distilled variants. The series targets the production sweet spot: models compact enough for private infrastructure while maintaining frontier-level reasoning.
This marks a pivotal shift in the open-source AI landscape. Until now, achieving frontier performance required massive models with hundreds of billions of parameters. The Qwen 3.5 medium series demonstrates that careful architectural innovation — specifically the hybrid Gated DeltaNet plus Mixture-of-Experts design — can deliver equivalent or superior results at a fraction of the compute cost.
The 35B-A3B activates only 8.6% of its total parameters per forward pass, routing each token through specialized expert subnetworks. This means GPT-5-mini-class reasoning at a fraction of the inference cost.
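The routing idea can be sketched in a few lines. This is a toy illustration of top-k expert routing with softmax gating, not the actual Qwen router (the real model routes inside each MoE FFN layer with learned, load-balanced gates); all dimensions and weights here are made up:

```python
import numpy as np

def moe_route(token, experts, router_w, top_k=2):
    """Route one token through its top-k experts, weighted by gate scores.

    token: (d,) hidden state; experts: list of (d, d) matrices standing in
    for expert FFNs; router_w: (n_experts, d) router projection.
    """
    logits = router_w @ token                 # one score per expert
    top = np.argsort(logits)[-top_k:]         # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                      # softmax over selected experts only
    # Only the chosen experts run a forward pass -- the rest stay idle,
    # which is why active parameters are a small fraction of the total.
    return sum(g * (experts[i] @ token) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
out = moe_route(rng.normal(size=d), experts, router_w)
print(out.shape)  # → (16,)
```

Only 2 of the 8 toy experts touch each token, yet the output has the same shape as a dense layer's; scale the counts up and you get the 3B-active-of-35B-total pattern.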
All four models ship under Apache 2.0 on Hugging Face, ModelScope, Ollama, and GitHub. No gating, no usage restrictions, no license negotiations. Fine-tune, deploy, and sell without constraints.
The Complete Model Lineup
The series includes three Mixture-of-Experts models and one dense model, each optimized for different deployment scenarios.
| Model | Total Params | Active Params | Architecture | Context |
|---|---|---|---|---|
| Qwen3.5-Flash | ~35B | ~3B | MoE (hosted) | 1M tokens |
| Qwen3.5-35B-A3B | 35B | 3B | MoE + Hybrid Attn | 262K tokens |
| Qwen3.5-122B-A10B | 122B | 10B | MoE + Hybrid Attn | 262K tokens |
| Qwen3.5-27B | 27B | 27B (all) | Dense | 262K tokens |
Qwen3.5-Flash: the hosted version of the 35B-A3B, served through Alibaba Cloud's Model Studio. It includes a 1M token context window, native function calling, and built-in tool support. Ideal for production agentic workflows where cost efficiency matters.
Qwen3.5-35B-A3B: the open-weight star of the series. It routes each token through 3B of its 35B total parameters and runs on 8GB+ VRAM GPUs with GGUF quantization. Surpasses the previous 235B flagship across most benchmarks.
Qwen3.5-122B-A10B: the largest medium model, activating 10B of its 122B total parameters. It leads the lineup on agentic benchmarks including BFCL-V4 (72.2), BrowseComp (63.8), and Terminal-Bench 2 (49.4), and fits on NVIDIA DGX Spark hardware.
Qwen3.5-27B: the only dense (non-MoE) model in the series. All 27B parameters activate on every forward pass, giving it the highest per-token reasoning density. Ties GPT-5 mini on SWE-bench Verified at 72.4.
Architecture Deep Dive: Gated DeltaNet Meets MoE
The Qwen 3.5 medium models introduce a hybrid attention architecture that combines two innovations: Gated Delta Networks for linear attention and traditional full attention blocks. This is not a minor tweak — it fundamentally changes how the models process long sequences.
How Hybrid Attention Works
Standard transformer attention scales quadratically with sequence length. Double the context window and you quadruple the compute. The Qwen 3.5 hybrid approach alternates between two attention mechanisms in a 3:1 ratio:
Gated DeltaNet Layers (3 of every 4 blocks)
These use linear attention that scales near-linearly with sequence length. The mechanism combines Mamba2's gated decay with a delta rule for updating hidden states. Each layer compresses the input sequence into a fixed-size state, enabling efficient processing of long contexts without the quadratic memory overhead.
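A heavily simplified sketch of one gated delta-rule step follows. It captures the shape of the mechanism (decayed fixed-size state, delta-rule correction), but omits the real model's learned gates, multi-head structure, and normalization; all dimensions are illustrative:

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a (simplified) gated delta rule.

    S: (d_v, d_k) fixed-size state; alpha in (0, 1) is a Mamba2-style decay
    gate; beta scales the delta-rule correction toward the new key/value pair.
    """
    prediction = S @ k                # what the state currently retrieves for k
    S = alpha * S + beta * np.outer(v - prediction, k)  # overwrite, don't just add
    o = S @ q                         # read out with the query
    return S, o

rng = np.random.default_rng(1)
d_k, d_v, T = 8, 8, 100
S = np.zeros((d_v, d_k))
for _ in range(T):                    # state size stays constant regardless of T
    q, k, v = (rng.normal(size=d_k) for _ in range(3))
    k = k / np.linalg.norm(k)         # unit-norm keys keep the update stable
    S, o = gated_delta_step(S, q, k, v, alpha=0.95, beta=0.5)
print(S.shape, o.shape)  # → (8, 8) (8,)
```

The point to notice: after 100 tokens the state is still an 8x8 matrix. A KV cache would have grown to 100 entries; this state never grows, which is the source of the near-linear scaling.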
Full Attention Layers (1 of every 4 blocks)
Standard quadratic attention is interspersed every fourth block to maintain fine-grained token-to-token reasoning. These layers preserve the model's ability to attend precisely to any position in the sequence — critical for tasks like code generation and complex reasoning chains.
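The 3:1 interleave is just a repeating pattern over the block stack. A minimal sketch (48 blocks is an assumed depth for illustration, not the published config):

```python
# Blocks 1-3 of every group of four use Gated DeltaNet; block 4 uses full attention.
n_blocks = 48
layers = [
    "gated_deltanet" if (i + 1) % 4 else "full_attention"
    for i in range(n_blocks)
]

print(layers[:4])  # → ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention']
print(layers.count("full_attention"))  # → 12, i.e. one full-attention block per group of four
```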
Why This Matters for Production
The 3:1 linear-to-full ratio enables the hosted Flash variant to handle 1M token contexts without prohibitive compute costs. Processing a 500K token document costs roughly 3-4x a 50K document, not 100x.
Linear attention layers compress context into fixed-size states, dramatically reducing KV-cache memory. This is why the 35B-A3B can run on consumer GPUs — the MoE routing plus efficient attention keeps active memory requirements low.
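Back-of-envelope arithmetic shows why this matters. The dimensions below (48 layers, 8 KV heads, head_dim 128, fp16) are assumed for illustration, not Qwen's actual config:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per=2):
    """KV-cache size for standard attention: one key and one value vector
    per token, per layer, per KV head."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per

# Full 262K context, every layer quadratic (assumed dims, fp16):
full = kv_cache_bytes(262_144, 48, 8, 128)
# With a 3:1 hybrid, only 1 in 4 layers keeps a growing KV cache; the
# linear layers hold a fixed-size state that doesn't scale with context.
hybrid = kv_cache_bytes(262_144, 48 // 4, 8, 128)
print(f"{full / 2**30:.1f} GiB vs {hybrid / 2**30:.1f} GiB")  # → 48.0 GiB vs 12.0 GiB
```

Under these assumptions the hybrid stack cuts KV-cache memory by 4x before any quantization, which is the headroom that lets MoE routing plus efficient attention fit on consumer GPUs.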
Less compute per token means faster generation. In benchmarks, Qwen3.5-Plus delivered responses in 1/6th the time of Claude Sonnet 4.6 while maintaining competitive quality, directly enabled by the hybrid architecture.
Complete Benchmark Breakdown
Here is how the four Qwen 3.5 medium models stack up against GPT-5 mini, Claude Sonnet 4.5, and the previous-generation Qwen3-235B-A22B across key benchmark categories.
Knowledge and Reasoning
| Benchmark | 122B-A10B | 27B | 35B-A3B | GPT-5 mini | Claude Sonnet 4.5 |
|---|---|---|---|---|---|
| MMLU-Pro | 86.7 | 86.1 | 85.3 | 83.7 | 80.8 |
| GPQA Diamond | 86.6 | 85.5 | 84.2 | 82.8 | 80.1 |
| HMMT Feb 2025 | 91.4 | 92.0 | 89.0 | 89.2 | 90.0 |
| MMMLU | 86.7 | 85.9 | 85.2 | 86.2 | 78.2 |
| MMMU-Pro | 76.9 | 67.3 | 68.4 | 67.3 | 75.0 |
Coding and Software Engineering
| Benchmark | 122B-A10B | 27B | 35B-A3B | GPT-5 mini | Claude Sonnet 4.5 |
|---|---|---|---|---|---|
| SWE-bench Verified | 72.0 | 72.4 | 69.2 | 72.0 | 62.0 |
| Terminal-Bench 2 | 49.4 | 41.6 | 40.5 | 31.9 | 18.7 |
| LiveCodeBench v6 | 78.9 | 80.7 | 74.6 | 80.5 | 82.7 |
| CodeForces | 2100 | 1899 | 2028 | 2160 | 2157 |
Agentic Tasks
| Benchmark | 122B-A10B | 27B | 35B-A3B | GPT-5 mini | Claude Sonnet 4.5 |
|---|---|---|---|---|---|
| BFCL-V4 (Tool Use) | 72.2 | 68.5 | 67.3 | 55.5 | 54.8 |
| BrowseComp (Search) | 63.8 | 61.0 | 61.0 | 48.1 | 41.1 |
| ERQA (Embodied) | 62.0 | 60.5 | 64.7 | 52.5 | 54.0 |
Where Qwen 3.5 Leads — and Where It Doesn't
- Agentic tool use: 72.2 BFCL-V4 vs 55.5 (GPT-5 mini) — 30% advantage
- Terminal coding: 49.4 Terminal-Bench 2 vs 31.9 (GPT-5 mini) — 55% advantage
- Multilingual knowledge: 86.7 MMMLU tops all competitors including GPT-5 mini (86.2)
- Document understanding: 89.8 OmniDocBench v1.5 leads across the board
- Cost efficiency: $0.10/M input tokens is 10-50x cheaper than proprietary alternatives
- Competitive programming: GPT-5 mini (2160) and Claude Sonnet 4.5 (2157) still lead on CodeForces
- Visual reasoning (MMMU-Pro): Claude Sonnet 4.5 (75.0) beats both the 27B (67.3) and the 35B-A3B (68.4), trailing only the 122B-A10B (76.9)
- Full-stack benchmarks: FullStackBench en is a close race, with the 122B-A10B (62.6) narrowly ahead of Claude Sonnet 4.5 (58.9)
- Instruction following (IFEval): GPT-5 mini scores 93.9, ahead of 122B-A10B at 93.4
The pattern is clear: Qwen 3.5 medium models dominate agentic and multi-step reasoning tasks while remaining highly competitive on pure coding and knowledge benchmarks. For teams building AI agents, autonomous workflows, or tool-calling systems, these models represent the strongest open-source option available today.
Pricing and Value Analysis
The Qwen 3.5 medium models offer two cost structures: free self-hosting via open weights, or hosted API pricing through Alibaba Cloud.
Hosted API Pricing (Alibaba Cloud Model Studio)
| Model | Input / 1M Tokens | Output / 1M Tokens | Context Window |
|---|---|---|---|
| Qwen3.5-Flash | $0.10 | $0.40 | 1M tokens |
| Qwen3.5-Plus | $1.20 | — | 1M tokens |
Cost Comparison with Competitors
~13x cheaper
Qwen3.5-Flash at $0.10/M input vs Claude Sonnet 4.6 at $1.30/M input, with competitive quality on agentic tasks.
Free
All open-weight models are Apache 2.0. Run locally on your own GPUs with zero per-token costs — only infrastructure expenses.
Open weights
GPT-5 mini is API-only with no self-hosting option. Qwen 3.5 gives you the same SWE-bench performance with full control over deployment.
How to Access and Get Started
Qwen 3.5 medium models are available through every major distribution channel. Here is how to get each one running.
Option 1: Ollama (Fastest Local Setup)
# Install the 35B-A3B (runs on 8GB+ VRAM)
ollama run qwen3.5:35b-a3b
# Or the 27B dense model
ollama run qwen3.5:27b
# Or the 122B-A10B (needs more VRAM)
ollama run qwen3.5:122b-a10b
Option 2: Hugging Face Transformers
# Python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3.5-35B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Generate a response using the model's chat template
messages = [{"role": "user", "content": "Hello"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
Option 3: Alibaba Cloud API (OpenAI-Compatible)
# cURL — OpenAI-compatible endpoint
curl https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-flash",
"messages": [{"role": "user", "content": "Hello"}]
}'
Additional Platforms
vLLM / SGLang
For production serving with optimized throughput. Both frameworks support the MoE architecture natively for efficient expert routing.
llama.cpp (GGUF)
CPU and GPU inference via quantized GGUF files from Unsloth. Available on Hugging Face at unsloth/Qwen3.5-35B-A3B-GGUF.
OpenRouter
Third-party API access at competitive rates. Model ID: qwen/qwen3.5-plus-02-15 and other variants.
ModelScope
Alternative download source for regions where Hugging Face access is limited. Full model weights and documentation available.
Who Should Use Which Model
Qwen3.5-Flash: best for teams running thousands of API calls per day where cost is the primary constraint. The 1M token context and $0.10/M input pricing make it ideal for:
- RAG pipelines processing large document sets
- Customer support chatbots with long conversation histories
- Multi-step agent workflows with tool calling
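At the list prices above, a quick estimator makes the economics concrete. The traffic numbers in the example are hypothetical:

```python
def monthly_cost_usd(requests_per_day, in_tokens, out_tokens,
                     in_price=0.10, out_price=0.40, days=30):
    """Estimated monthly spend at Qwen3.5-Flash list prices ($ per 1M tokens)."""
    per_request = in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price
    return requests_per_day * per_request * days

# Hypothetical load: 10,000 calls/day, 4K input tokens and 500 output tokens each.
print(round(monthly_cost_usd(10_000, 4_000, 500), 2))  # → 180.0
```

Roughly $180/month for 10K daily calls at this prompt size; the same traffic at $1.30/M input pricing would run about 13x higher on the input side alone.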
Qwen3.5-35B-A3B: best for developers who need on-device or private-cloud AI without sending data to external APIs. Ideal for:
- Privacy-sensitive enterprise deployments
- Fine-tuning for domain-specific tasks
- Laptop or single-GPU inference via Ollama
Qwen3.5-122B-A10B: best for teams building autonomous AI agents that need strong tool use, web browsing, and multi-step reasoning. Ideal for:
- Complex agent orchestration with function calling
- Autonomous web research and data extraction
- Long-horizon planning and execution tasks
Qwen3.5-27B: best for software engineering tasks where per-token reasoning density matters more than token efficiency. Ideal for:
- Code generation, review, and debugging
- Precise instruction following (IFEval, IFBench)
- Tasks requiring consistent output quality without MoE routing variance
Need Help Integrating Open-Source AI?
Whether you are deploying Qwen 3.5, building agentic workflows, or evaluating which model fits your production stack — our team can help you ship faster.
Related Articles
Continue exploring with these related guides