Qwen 3.5 Medium Models: Benchmarks, Pricing, and Complete Guide
Alibaba's Qwen team just dropped four medium-sized models that rewrite the cost-performance equation. The 35B-A3B activates only 3B parameters yet surpasses the previous 235B flagship — here's what that means for your AI stack.
What Is the Qwen 3.5 Medium Series?
On February 24, 2026, Alibaba's Qwen team released the Qwen 3.5 medium model series — four models that sit between the flagship Qwen3.5-397B-A17B (released February 16) and smaller distilled variants. The series targets the production sweet spot: models compact enough for private infrastructure while maintaining frontier-level reasoning.
This marks a pivotal shift in the open-source AI landscape. Until now, achieving frontier performance required massive models with hundreds of billions of parameters. The Qwen 3.5 medium series demonstrates that careful architectural innovation — specifically the hybrid Gated DeltaNet plus Mixture-of-Experts design — can deliver equivalent or superior results at a fraction of the compute cost.
The 35B-A3B activates only 8.6% of its total parameters per forward pass, routing each token through specialized expert subnetworks. This means GPT-5-mini-class reasoning at a fraction of the inference cost.
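The routing idea can be sketched in a few lines. This is a toy illustration of top-k expert routing with softmax gating, not the actual Qwen router (the real model routes inside each MoE FFN layer with learned, load-balanced gates); all dimensions and weights here are made up:

```python
import numpy as np

def moe_route(token, experts, router_w, top_k=2):
    """Route one token through its top-k experts, weighted by gate scores.

    token: (d,) hidden state; experts: list of (d, d) matrices standing in
    for expert FFNs; router_w: (n_experts, d) router projection.
    """
    logits = router_w @ token                 # one score per expert
    top = np.argsort(logits)[-top_k:]         # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                      # softmax over selected experts only
    # Only the chosen experts run a forward pass -- the rest stay idle,
    # which is why active parameters are a small fraction of the total.
    return sum(g * (experts[i] @ token) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
out = moe_route(rng.normal(size=d), experts, router_w)
print(out.shape)  # → (16,)
```

Only 2 of the 8 toy experts touch each token, yet the output has the same shape as a dense layer's; scale the counts up and you get the 3B-active-of-35B-total pattern.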
All four models ship under Apache 2.0 on Hugging Face, ModelScope, Ollama, and GitHub. No gating, no usage restrictions, no license negotiations. Fine-tune, deploy, and sell without constraints.
The Complete Model Lineup
The series includes three Mixture-of-Experts models and one dense model, each optimized for different deployment scenarios.
| Model | Total Params | Active Params | Architecture | Context |
|---|---|---|---|---|
| Qwen3.5-Flash | ~35B | ~3B | MoE (hosted) | 1M tokens |
| Qwen3.5-35B-A3B | 35B | 3B | MoE + Hybrid Attn | 262K tokens |
| Qwen3.5-122B-A10B | 122B | 10B | MoE + Hybrid Attn | 262K tokens |
| Qwen3.5-27B | 27B | 27B (all) | Dense | 262K tokens |
Qwen3.5-Flash: the hosted version of the 35B-A3B, served through Alibaba Cloud's Model Studio. It includes a 1M token context window, native function calling, and built-in tool support. Ideal for production agentic workflows where cost efficiency matters.
Qwen3.5-35B-A3B: the open-weight star of the series. It routes each token through 3B of its 35B total parameters and runs on 8GB+ VRAM GPUs with GGUF quantization. Surpasses the previous 235B flagship across most benchmarks.
Qwen3.5-122B-A10B: the largest medium model, activating 10B of its 122B total parameters. It leads the lineup on agentic benchmarks including BFCL-V4 (72.2), BrowseComp (63.8), and Terminal-Bench 2 (49.4), and fits on NVIDIA DGX Spark hardware.
Qwen3.5-27B: the only dense (non-MoE) model in the series. All 27B parameters activate on every forward pass, giving it the highest per-token reasoning density. Ties GPT-5 mini on SWE-bench Verified at 72.4.
Architecture Deep Dive: Gated DeltaNet Meets MoE
The Qwen 3.5 medium models introduce a hybrid attention architecture that combines two innovations: Gated Delta Networks for linear attention and traditional full attention blocks. This is not a minor tweak — it fundamentally changes how the models process long sequences.
How Hybrid Attention Works
Standard transformer attention scales quadratically with sequence length. Double the context window and you quadruple the compute. The Qwen 3.5 hybrid approach alternates between two attention mechanisms in a 3:1 ratio:
Gated DeltaNet Layers (3 of every 4 blocks)
These use linear attention that scales near-linearly with sequence length. The mechanism combines Mamba2's gated decay with a delta rule for updating hidden states. Each layer compresses the input sequence into a fixed-size state, enabling efficient processing of long contexts without the quadratic memory overhead.
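A heavily simplified sketch of one gated delta-rule step follows. It captures the shape of the mechanism (decayed fixed-size state, delta-rule correction), but omits the real model's learned gates, multi-head structure, and normalization; all dimensions are illustrative:

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a (simplified) gated delta rule.

    S: (d_v, d_k) fixed-size state; alpha in (0, 1) is a Mamba2-style decay
    gate; beta scales the delta-rule correction toward the new key/value pair.
    """
    prediction = S @ k                # what the state currently retrieves for k
    S = alpha * S + beta * np.outer(v - prediction, k)  # overwrite, don't just add
    o = S @ q                         # read out with the query
    return S, o

rng = np.random.default_rng(1)
d_k, d_v, T = 8, 8, 100
S = np.zeros((d_v, d_k))
for _ in range(T):                    # state size stays constant regardless of T
    q, k, v = (rng.normal(size=d_k) for _ in range(3))
    k = k / np.linalg.norm(k)         # unit-norm keys keep the update stable
    S, o = gated_delta_step(S, q, k, v, alpha=0.95, beta=0.5)
print(S.shape, o.shape)  # → (8, 8) (8,)
```

The point to notice: after 100 tokens the state is still an 8x8 matrix. A KV cache would have grown to 100 entries; this state never grows, which is the source of the near-linear scaling.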
Full Attention Layers (1 of every 4 blocks)
Standard quadratic attention is interspersed every fourth block to maintain fine-grained token-to-token reasoning. These layers preserve the model's ability to attend precisely to any position in the sequence — critical for tasks like code generation and complex reasoning chains.
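The 3:1 interleave is just a repeating pattern over the block stack. A minimal sketch (48 blocks is an assumed depth for illustration, not the published config):

```python
# Blocks 1-3 of every group of four use Gated DeltaNet; block 4 uses full attention.
n_blocks = 48
layers = [
    "gated_deltanet" if (i + 1) % 4 else "full_attention"
    for i in range(n_blocks)
]

print(layers[:4])  # → ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention']
print(layers.count("full_attention"))  # → 12, i.e. one full-attention block per group of four
```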
Why This Matters for Production
The 3:1 linear-to-full ratio enables the hosted Flash variant to handle 1M token contexts without prohibitive compute costs. Processing a 500K token document costs roughly 3-4x a 50K document, not 100x.
Linear attention layers compress context into fixed-size states, dramatically reducing KV-cache memory. This is why the 35B-A3B can run on consumer GPUs — the MoE routing plus efficient attention keeps active memory requirements low.
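Back-of-envelope arithmetic shows why this matters. The dimensions below (48 layers, 8 KV heads, head_dim 128, fp16) are assumed for illustration, not Qwen's actual config:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per=2):
    """KV-cache size for standard attention: one key and one value vector
    per token, per layer, per KV head."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per

# Full 262K context, every layer quadratic (assumed dims, fp16):
full = kv_cache_bytes(262_144, 48, 8, 128)
# With a 3:1 hybrid, only 1 in 4 layers keeps a growing KV cache; the
# linear layers hold a fixed-size state that doesn't scale with context.
hybrid = kv_cache_bytes(262_144, 48 // 4, 8, 128)
print(f"{full / 2**30:.1f} GiB vs {hybrid / 2**30:.1f} GiB")  # → 48.0 GiB vs 12.0 GiB
```

Under these assumptions the hybrid stack cuts KV-cache memory by 4x before any quantization, which is the headroom that lets MoE routing plus efficient attention fit on consumer GPUs.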
Less compute per token means faster generation. In benchmarks, Qwen3.5-Plus delivered responses in 1/6th the time of Claude Sonnet 4.6 while maintaining competitive quality, directly enabled by the hybrid architecture.
Complete Benchmark Breakdown
Here is how the four Qwen 3.5 medium models stack up against GPT-5 mini, Claude Sonnet 4.5, and the previous-generation Qwen3-235B-A22B across key benchmark categories.
Knowledge and Reasoning
| Benchmark | 122B-A10B | 27B | 35B-A3B | GPT-5 mini | Claude Sonnet 4.5 |
|---|---|---|---|---|---|
| MMLU-Pro | 86.7 | 86.1 | 85.3 | 83.7 | 80.8 |
| GPQA Diamond | 86.6 | 85.5 | 84.2 | 82.8 | 80.1 |
| HMMT Feb 2025 | 91.4 | 92.0 | 89.0 | 89.2 | 90.0 |
| MMMLU | 86.7 | 85.9 | 85.2 | 86.2 | 78.2 |
| MMMU-Pro | 76.9 | 67.3 | 68.4 | 67.3 | 75.0 |
Coding and Software Engineering
| Benchmark | 122B-A10B | 27B | 35B-A3B | GPT-5 mini | Claude Sonnet 4.5 |
|---|---|---|---|---|---|
| SWE-bench Verified | 72.0 | 72.4 | 69.2 | 72.0 | 62.0 |
| Terminal-Bench 2 | 49.4 | 41.6 | 40.5 | 31.9 | 18.7 |
| LiveCodeBench v6 | 78.9 | 80.7 | 74.6 | 80.5 | 82.7 |
| CodeForces | 2100 | 1899 | 2028 | 2160 | 2157 |
Agentic Tasks
| Benchmark | 122B-A10B | 27B | 35B-A3B | GPT-5 mini | Claude Sonnet 4.5 |
|---|---|---|---|---|---|
| BFCL-V4 (Tool Use) | 72.2 | 68.5 | 67.3 | 55.5 | 54.8 |
| BrowseComp (Search) | 63.8 | 61.0 | 61.0 | 48.1 | 41.1 |
| ERQA (Embodied) | 62.0 | 60.5 | 64.7 | 52.5 | 54.0 |
Where Qwen 3.5 Leads — and Where It Doesn't
- Agentic tool use: 72.2 BFCL-V4 vs 55.5 (GPT-5 mini) — 30% advantage
- Terminal coding: 49.4 Terminal-Bench 2 vs 31.9 (GPT-5 mini) — 55% advantage
- Multilingual knowledge: 86.7 MMMLU tops all competitors including GPT-5 mini (86.2)
- Document understanding: 89.8 OmniDocBench v1.5 leads across the board
- Cost efficiency: $0.10/M input tokens is 10-50x cheaper than proprietary alternatives
- Competitive programming: GPT-5 mini (2160) and Claude Sonnet 4.5 (2157) still lead on CodeForces
- Visual reasoning (MMMU-Pro): Claude Sonnet 4.5 (75.0) beats both the 27B (67.3) and the 35B-A3B (68.4), trailing only the 122B-A10B (76.9)
- Full-stack benchmarks: FullStackBench en is a close race, with the 122B-A10B (62.6) narrowly ahead of Claude Sonnet 4.5 (58.9)
- Instruction following (IFEval): GPT-5 mini scores 93.9, ahead of 122B-A10B at 93.4
The pattern is clear: Qwen 3.5 medium models dominate agentic and multi-step reasoning tasks while remaining highly competitive on pure coding and knowledge benchmarks. For teams building AI agents, autonomous workflows, or tool-calling systems, these models represent the strongest open-source option available today.
Pricing and Value Analysis
The Qwen 3.5 medium models offer two cost structures: free self-hosting via open weights, or hosted API pricing through Alibaba Cloud.
Hosted API Pricing (Alibaba Cloud Model Studio)
| Model | Input / 1M Tokens | Output / 1M Tokens | Context Window |
|---|---|---|---|
| Qwen3.5-Flash | $0.10 | $0.40 | 1M tokens |
| Qwen3.5-Plus | $1.20 | — | 1M tokens |
Cost Comparison with Competitors
~13x cheaper
Qwen3.5-Flash at $0.10/M input vs Claude Sonnet 4.6 at $1.30/M input, with competitive quality on agentic tasks.
Free
All open-weight models are Apache 2.0. Run locally on your own GPUs with zero per-token costs — only infrastructure expenses.
Open weights
GPT-5 mini is API-only with no self-hosting option. Qwen 3.5 gives you the same SWE-bench performance with full control over deployment.
How to Access and Get Started
Qwen 3.5 medium models are available through every major distribution channel. Here is how to get each one running.
Option 1: Ollama (Fastest Local Setup)
# Install the 35B-A3B (runs on 8GB+ VRAM)
ollama run qwen3.5:35b-a3b
# Or the 27B dense model
ollama run qwen3.5:27b
# Or the 122B-A10B (needs more VRAM)
ollama run qwen3.5:122b-a10b
Option 2: Hugging Face Transformers
# Python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3.5-35B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Generate a response using the model's chat template
messages = [{"role": "user", "content": "Hello"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
Option 3: Alibaba Cloud API (OpenAI-Compatible)
# cURL — OpenAI-compatible endpoint
curl https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-flash",
"messages": [{"role": "user", "content": "Hello"}]
}'
Additional Platforms
vLLM / SGLang
For production serving with optimized throughput. Both frameworks support the MoE architecture natively for efficient expert routing.
llama.cpp (GGUF)
CPU and GPU inference via quantized GGUF files from Unsloth. Available on Hugging Face at unsloth/Qwen3.5-35B-A3B-GGUF.
OpenRouter
Third-party API access at competitive rates. Model ID: qwen/qwen3.5-plus-02-15 and other variants.
ModelScope
Alternative download source for regions where Hugging Face access is limited. Full model weights and documentation available.
Who Should Use Which Model
Qwen3.5-Flash: best for teams running thousands of API calls per day where cost is the primary constraint. The 1M token context and $0.10/M input pricing make it ideal for:
- RAG pipelines processing large document sets
- Customer support chatbots with long conversation histories
- Multi-step agent workflows with tool calling
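At the list prices above, a quick estimator makes the economics concrete. The traffic numbers in the example are hypothetical:

```python
def monthly_cost_usd(requests_per_day, in_tokens, out_tokens,
                     in_price=0.10, out_price=0.40, days=30):
    """Estimated monthly spend at Qwen3.5-Flash list prices ($ per 1M tokens)."""
    per_request = in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price
    return requests_per_day * per_request * days

# Hypothetical load: 10,000 calls/day, 4K input tokens and 500 output tokens each.
print(round(monthly_cost_usd(10_000, 4_000, 500), 2))  # → 180.0
```

Roughly $180/month for 10K daily calls at this prompt size; the same traffic at $1.30/M input pricing would run about 13x higher on the input side alone.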
Qwen3.5-35B-A3B: best for developers who need on-device or private-cloud AI without sending data to external APIs. Ideal for:
- Privacy-sensitive enterprise deployments
- Fine-tuning for domain-specific tasks
- Laptop or single-GPU inference via Ollama
Qwen3.5-122B-A10B: best for teams building autonomous AI agents that need strong tool use, web browsing, and multi-step reasoning. Ideal for:
- Complex agent orchestration with function calling
- Autonomous web research and data extraction
- Long-horizon planning and execution tasks
Qwen3.5-27B: best for software engineering tasks where per-token reasoning density matters more than token efficiency. Ideal for:
- Code generation, review, and debugging
- Precise instruction following (IFEval, IFBench)
- Tasks requiring consistent output quality without MoE routing variance
Need Help Integrating Open-Source AI?
Whether you are deploying Qwen 3.5, building agentic workflows, or evaluating which model fits your production stack — our team can help you ship faster.
Related Articles
Continue exploring with these related guides