Breaking Release · Open-Source AI · Mixture of Experts

Qwen 3.5 Medium Models: Benchmarks, Pricing, and Complete Guide

Alibaba's Qwen team just dropped four medium-sized models that rewrite the cost-performance equation. The 35B-A3B activates only 3B parameters yet surpasses the previous 235B flagship — here's what that means for your AI stack.

Digital Applied Team
February 25, 2026
10 min read
  • 3B active parameters (35B-A3B)
  • 72.4 SWE-bench Verified (27B)
  • $0.10 per 1M input tokens (Flash)
  • 1M token context

Key Takeaways

3B active params beat 22B predecessor: Qwen3.5-35B-A3B with only 3B active parameters surpasses the previous-generation Qwen3-235B-A22B, proving that better architecture and data quality outweigh raw scale.
27B dense model ties GPT-5 mini: Qwen3.5-27B achieves 72.4 on SWE-bench Verified, matching GPT-5 mini — a remarkable result for a fully open-weight 27B dense model.
Flash starts at $0.10 per million tokens: Qwen3.5-Flash delivers frontier-adjacent intelligence at $0.10/M input tokens — roughly 1/13th the cost of Claude Sonnet 4.6 for comparable tasks.
1M token context via hybrid attention: The Gated DeltaNet plus MoE architecture alternates linear and full attention in a 3:1 ratio, enabling 1M token context windows with near-linear compute scaling.
Fully open-source under Apache 2.0: All four models are available on Hugging Face, Ollama, and ModelScope under Apache 2.0, with support across vLLM, SGLang, and llama.cpp.
01

What Is the Qwen 3.5 Medium Series?

On February 24, 2026, Alibaba's Qwen team released the Qwen 3.5 medium model series — four models that sit between the flagship Qwen3.5-397B-A17B (released February 16) and smaller distilled variants. The series targets the production sweet spot: models compact enough for private infrastructure while maintaining frontier-level reasoning.

This marks a pivotal shift in the open-source AI landscape. Until now, achieving frontier performance required massive models with hundreds of billions of parameters. The Qwen 3.5 medium series demonstrates that careful architectural innovation — specifically the hybrid Gated DeltaNet plus Mixture-of-Experts design — can deliver equivalent or superior results at a fraction of the compute cost.

Efficiency Breakthrough

The 35B-A3B activates only 8.6% of its total parameters per forward pass, routing each token through specialized expert subnetworks. This means GPT-5-mini-class reasoning at a fraction of the inference cost.
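To make the sparse-activation idea concrete, here is a minimal top-k routing sketch. It is illustrative only: the expert count, top-k value, and random "router" stand in for Qwen's actual learned router and published config, which are not detailed here.

```python
import numpy as np

def route_tokens(token_states, num_experts=128, top_k=8, rng=None):
    """Toy top-k MoE router: each token is sent to its top_k highest-scoring
    experts, so only a small slice of the network runs per token.
    num_experts and top_k are illustrative, not Qwen's actual config."""
    rng = rng or np.random.default_rng(0)
    # Stand-in for a learned router: random projection to per-expert logits.
    router = rng.standard_normal((token_states.shape[-1], num_experts))
    logits = token_states @ router
    # Indices of the top_k experts chosen for each token.
    return np.argsort(logits, axis=-1)[:, -top_k:]

tokens = np.random.default_rng(1).standard_normal((4, 64))  # 4 tokens, dim 64
chosen = route_tokens(tokens)
print(chosen.shape)  # (4, 8): 8 of 128 experts run per token
print(8 / 128)       # 0.0625 -> only ~6% of experts active, akin to 3B of 35B
```

The per-token compute then scales with the active experts, not the total parameter count, which is what makes GPT-5-mini-class reasoning affordable at inference time.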

Fully Open Access

All four models ship under Apache 2.0 on Hugging Face, ModelScope, Ollama, and GitHub. No gating, no usage restrictions, no license negotiations. Fine-tune, deploy, and sell without constraints.

02

The Complete Model Lineup

The series includes three Mixture-of-Experts models and one dense model, each optimized for different deployment scenarios.

Model             | Total Params | Active Params | Architecture      | Context
Qwen3.5-Flash     | ~35B         | ~3B           | MoE (hosted)      | 1M tokens
Qwen3.5-35B-A3B   | 35B          | 3B            | MoE + Hybrid Attn | 262K tokens
Qwen3.5-122B-A10B | 122B         | 10B           | MoE + Hybrid Attn | 262K tokens
Qwen3.5-27B       | 27B          | 27B (all)     | Dense             | 262K tokens
Qwen3.5-Flash
Production API — lowest cost, longest context

The hosted version of 35B-A3B through Alibaba Cloud's Model Studio. Includes 1M token context window, native function calling, and built-in tool support. Ideal for production agentic workflows where cost efficiency matters.

Qwen3.5-35B-A3B
MoE — 3B active, consumer GPU friendly

The open-weight star of the series. Routes each token through 3B of its 35B total parameters. Runs on 8GB+ VRAM GPUs with GGUF quantization. Surpasses the previous 235B flagship across most benchmarks.

Qwen3.5-122B-A10B
MoE — 10B active, strongest agentic performance

The largest medium model activates 10B of its 122B total parameters. Leads the lineup on agentic benchmarks including BFCL-V4 (72.2), BrowseComp (63.8), and Terminal-Bench 2 (49.4). Fits on NVIDIA DGX Spark hardware.

Qwen3.5-27B
Dense — strongest coding and instruction following

The only dense (non-MoE) model in the series. All 27B parameters activate on every forward pass, giving it the highest per-token reasoning density. Ties GPT-5 mini on SWE-bench Verified at 72.4.

03

Architecture Deep Dive: Gated DeltaNet Meets MoE

The Qwen 3.5 medium models introduce a hybrid attention architecture that combines two innovations: Gated Delta Networks for linear attention and traditional full attention blocks. This is not a minor tweak — it fundamentally changes how the models process long sequences.

How Hybrid Attention Works

Standard transformer attention scales quadratically with sequence length. Double the context window and you quadruple the compute. The Qwen 3.5 hybrid approach alternates between two attention mechanisms in a 3:1 ratio:

Gated DeltaNet Layers (3 of every 4 blocks)

These use linear attention that scales near-linearly with sequence length. The mechanism combines Mamba2's gated decay with a delta rule for updating hidden states. Each layer compresses the input sequence into a fixed-size state, enabling efficient processing of long contexts without the quadratic memory overhead.

Full Attention Layers (1 of every 4 blocks)

Standard quadratic attention is interspersed every fourth block to maintain fine-grained token-to-token reasoning. These layers preserve the model's ability to attend precisely to any position in the sequence — critical for tasks like code generation and complex reasoning chains.
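The alternation described above can be sketched in a few lines. The 3:1 ratio comes from the article; the exact ordering of blocks within the stack is an assumption for illustration.

```python
def block_kind(i, full_every=4):
    """Hybrid stack sketch: every full_every-th block uses full (quadratic)
    attention; the rest use linear Gated DeltaNet-style attention."""
    return "full" if (i + 1) % full_every == 0 else "linear"

pattern = [block_kind(i) for i in range(12)]
print(pattern.count("linear"), pattern.count("full"))  # 9 3 -> the 3:1 ratio
print(pattern[:4])  # ['linear', 'linear', 'linear', 'full']
```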

Why This Matters for Production

Long-Context Efficiency

The 3:1 linear-to-full ratio enables the hosted Flash variant to handle 1M token contexts without prohibitive compute costs. Processing a 500K token document costs roughly 3-4x a 50K document, not 100x.

Lower Memory Footprint

Linear attention layers compress context into fixed-size states, dramatically reducing KV-cache memory. This is why the 35B-A3B can run on consumer GPUs — the MoE routing plus efficient attention keeps active memory requirements low.
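A back-of-the-envelope calculation shows why this matters. Only the full-attention layers keep a KV cache that grows with sequence length; the linear layers hold a fixed-size state. The layer count, head count, and head dimension below are hypothetical placeholders, not Qwen's published configuration.

```python
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim,
                bytes_per_elem=2, full_attn_fraction=1.0):
    """Rough size of the growing KV cache (full-attention layers only).
    The factor of 2 accounts for storing both keys and values."""
    growing_layers = n_layers * full_attn_fraction
    return 2 * growing_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Hypothetical dims (not Qwen's published config): 48 layers, 8 KV heads, dim 128.
full = kv_cache_gb(262_144, 48, 8, 128, full_attn_fraction=1.0)
hybrid = kv_cache_gb(262_144, 48, 8, 128, full_attn_fraction=0.25)
print(round(full / hybrid, 1))  # 4.0: the growing cache shrinks ~4x under 3:1
```

Combined with the fixed-size linear-layer states, that is a large share of why the 35B-A3B fits on consumer GPUs.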

Faster Inference Speed

Less compute per token means faster generation. In benchmarks, Qwen3.5-Plus delivered responses in 1/6th the time of Claude Sonnet 4.6 while maintaining competitive quality, directly enabled by the hybrid architecture.

04

Complete Benchmark Breakdown

Here is how the four Qwen 3.5 medium models stack up against GPT-5 mini, Claude Sonnet 4.5, and the previous-generation Qwen3-235B-A22B across key benchmark categories.

Knowledge and Reasoning

Benchmark     | 122B-A10B | 27B  | 35B-A3B | GPT-5 mini | Claude Sonnet 4.5
MMLU-Pro      | 86.7      | 86.1 | 85.3    | 83.7       | 80.8
GPQA Diamond  | 86.6      | 85.5 | 84.2    | 82.8       | 80.1
HMMT Feb 2025 | 91.4      | 92.0 | 89.0    | 89.2       | 90.0
MMMLU         | 86.7      | 85.9 | 85.2    | 86.2       | 78.2
MMMU-Pro      | 76.9      | 67.3 | 68.4    | 67.3       | 75.0

Coding and Software Engineering

Benchmark          | 122B-A10B | 27B  | 35B-A3B | GPT-5 mini | Claude Sonnet 4.5
SWE-bench Verified | 72.0      | 72.4 | 69.2    | 72.0       | 62.0
Terminal-Bench 2   | 49.4      | 41.6 | 40.5    | 31.9       | 18.7
LiveCodeBench v6   | 78.9      | 80.7 | 74.6    | 80.5       | 82.7
CodeForces         | 2100      | 1899 | 2028    | 2160       | 2157

Agentic Tasks

Benchmark           | 122B-A10B | 27B  | 35B-A3B | GPT-5 mini | Claude Sonnet 4.5
BFCL-V4 (Tool Use)  | 72.2      | 68.5 | 67.3    | 55.5       | 54.8
BrowseComp (Search) | 63.8      | 61.0 | 61.0    | 48.1       | 41.1
ERQA (Embodied)     | 62.0      | 60.5 | 64.7    | 52.5       | 54.0
05

Where Qwen 3.5 Leads — and Where It Doesn't

Where Qwen 3.5 Leads
  • Agentic tool use: 72.2 BFCL-V4 vs 55.5 (GPT-5 mini) — 30% advantage
  • Terminal coding: 49.4 Terminal-Bench 2 vs 31.9 (GPT-5 mini) — 55% advantage
  • Multilingual knowledge: 86.7 MMMLU tops all competitors including GPT-5 mini (86.2)
  • Document understanding: 89.8 OmniDocBench v1.5 leads across the board
  • Cost efficiency: $0.10/M input tokens is 10-50x cheaper than proprietary alternatives
Where Competitors Lead
  • Competitive programming: GPT-5 mini (2160) and Claude Sonnet 4.5 (2157) still lead on CodeForces
  • Visual reasoning (MMMU-Pro): Claude Sonnet 4.5 (75.0) beats both 27B (67.3) and 35B-A3B (68.4), trailing only 122B-A10B (76.9)
  • Full-stack benchmarks: Claude Sonnet 4.5 leads FullStackBench en (62.6 vs 58.9 for 122B-A10B), with Qwen narrowing the gap
  • Instruction following (IFEval): GPT-5 mini scores 93.9, ahead of 122B-A10B at 93.4

The pattern is clear: Qwen 3.5 medium models dominate agentic and multi-step reasoning tasks while remaining highly competitive on pure coding and knowledge benchmarks. For teams building AI agents, autonomous workflows, or tool-calling systems, these models represent the strongest open-source option available today.

06

Pricing and Value Analysis

The Qwen 3.5 medium models offer two cost structures: free self-hosting via open weights, or hosted API pricing through Alibaba Cloud.

Hosted API Pricing (Alibaba Cloud Model Studio)

Model         | Input / 1M Tokens | Output / 1M Tokens | Context Window
Qwen3.5-Flash | $0.10             | $0.40              | 1M tokens
Qwen3.5-Plus  | $1.20             | —                  | 1M tokens

Cost Comparison with Competitors

vs Claude Sonnet 4.6

~13x cheaper

Qwen3.5-Flash at $0.10/M input vs Claude Sonnet 4.6 at $1.30/M input, with competitive quality on agentic tasks.
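The ratio is simple to verify with the rates quoted above. The 500M-token monthly volume is an example workload, not a figure from the article.

```python
# Input-token cost comparison at the article's quoted rates.
flash_rate = 0.10   # $/1M input tokens, Qwen3.5-Flash
sonnet_rate = 1.30  # $/1M input tokens, Claude Sonnet 4.6 (as quoted above)

monthly_tokens = 500  # millions of input tokens per month (example workload)
flash_cost = flash_rate * monthly_tokens
sonnet_cost = sonnet_rate * monthly_tokens
print(flash_cost, sonnet_cost, sonnet_cost / flash_cost)  # 50.0 650.0 13.0
```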

vs Self-Hosting

Free

All open-weight models are Apache 2.0. Run locally on your own GPUs with zero per-token costs — only infrastructure expenses.

vs GPT-5 mini

Open weights

GPT-5 mini is API-only with no self-hosting option. Qwen 3.5 gives you the same SWE-bench performance with full control over deployment.

07

How to Access and Get Started

Qwen 3.5 medium models are available through every major distribution channel. Here is how to get each one running.

Option 1: Ollama (Fastest Local Setup)

# Install the 35B-A3B (runs on 8GB+ VRAM)
ollama run qwen3.5:35b-a3b

# Or the 27B dense model
ollama run qwen3.5:27b

# Or the 122B-A10B (needs more VRAM)
ollama run qwen3.5:122b-a10b

Option 2: Hugging Face Transformers

# Python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3.5-35B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

Option 3: Alibaba Cloud API (OpenAI-Compatible)

# cURL — OpenAI-compatible endpoint
curl https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-flash",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Additional Platforms

vLLM / SGLang

For production serving with optimized throughput. Both frameworks support the MoE architecture natively for efficient expert routing.

llama.cpp (GGUF)

CPU and GPU inference via quantized GGUF files from Unsloth. Available on Hugging Face at unsloth/Qwen3.5-35B-A3B-GGUF.

OpenRouter

Third-party API access at competitive rates. Model ID: qwen/qwen3.5-plus-02-15 and other variants.

ModelScope

Alternative download source for regions where Hugging Face access is limited. Full model weights and documentation available.

08

Who Should Use Which Model

Qwen3.5-Flash — High-Volume Production

Best for teams running thousands of API calls per day where cost is the primary constraint. The 1M token context and $0.10/M input pricing make it ideal for:

  • RAG pipelines processing large document sets
  • Customer support chatbots with long conversation histories
  • Multi-step agent workflows with tool calling
Qwen3.5-35B-A3B — Local / Edge Deployment

Best for developers who need on-device or private-cloud AI without sending data to external APIs. Ideal for:

  • Privacy-sensitive enterprise deployments
  • Fine-tuning for domain-specific tasks
  • Laptop or single-GPU inference via Ollama
Qwen3.5-122B-A10B — Agentic Workflows

Best for teams building autonomous AI agents that need strong tool use, web browsing, and multi-step reasoning. Ideal for:

  • Complex agent orchestration with function calling
  • Autonomous web research and data extraction
  • Long-horizon planning and execution tasks
Qwen3.5-27B — Coding and Instruction Following

Best for software engineering tasks where per-token reasoning density matters more than token efficiency. Ideal for:

  • Code generation, review, and debugging
  • Precise instruction following (IFEval, IFBench)
  • Tasks requiring consistent output quality without MoE routing variance

Need Help Integrating Open-Source AI?

Whether you are deploying Qwen 3.5, building agentic workflows, or evaluating which model fits your production stack — our team can help you ship faster.
