Qwen 3.5: 397B MoE Benchmarks, Pricing & Complete Guide
Qwen 3.5-397B scores 83.6 on LiveCodeBench v6 and 91.3 on AIME26 with 17B active MoE params. Benchmark comparisons with GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro, plus pricing details.
- Total Parameters: 397B
- Active per Token: 17B
- LiveCodeBench v6: 83.6
- Languages Supported: 201
Key Takeaways
Qwen 3.5 is Alibaba Cloud's latest flagship AI model family, released on February 16, 2026. Built around a sparse Mixture-of-Experts (MoE) architecture, the headline model — Qwen3.5-397B-A17B — packs 397 billion total parameters while activating only 17 billion per forward pass. This design reportedly delivers frontier-level reasoning, coding, and visual agentic performance at 60% lower cost and 8x higher throughput compared to Alibaba's previous generation.
The release comes at a competitive moment in AI development. With ByteDance's Doubao 2.0 serving 200 million users and DeepSeek preparing its next model, Alibaba positions Qwen 3.5 as a direct challenger to Western frontier models like GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro — claiming superiority across 80% of evaluated benchmark categories.
What Is Qwen 3.5?
The flagship model ships in two distinct variants targeting different deployment scenarios — an open-weight release under Apache 2.0 and a hosted Qwen 3.5-Plus service through Alibaba Cloud.
Qwen3.5-397B-A17B (open weight):
- Apache 2.0 license
- Self-hostable on 8xH100 GPUs
- Full commercial use rights
- Native multimodal (text + images)

Qwen 3.5-Plus (hosted):
- 1 million token context window
- Built-in adaptive tool use
- Alibaba Cloud Model Studio
- OpenAI SDK compatible API
Architecture & MoE Design
Qwen 3.5's architecture builds on the Qwen3-Next foundation with several significant upgrades. The sparse Mixture-of-Experts design routes each token through just 17 billion of the 397 billion total parameters, achieving a 95% reduction in activation memory compared to dense models of equivalent capability.
Sparse MoE with Hybrid Attention
The model uses a heterogeneous setup that separates vision and language processing pathways for efficiency. Key architectural features include hybrid linear attention combined with sparse expert routing, enabling parallel computation across expert groups. Alibaba also introduced a native FP8 training pipeline that reduces activation memory by approximately 50%.
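To make the routing idea concrete, here is a minimal sparse-MoE layer in PyTorch. This is a generic top-k expert-routing sketch, not Alibaba's implementation; the hidden size, expert count, and top_k value are placeholder assumptions.

```python
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    """Minimal top-k expert routing sketch (illustrative, not Qwen 3.5's code)."""

    def __init__(self, d_model=1024, n_experts=64, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores every expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.router(x)                         # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():             # dispatch tokens to experts
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out
```

Every token only touches `top_k` expert MLPs, which is why a 397B-parameter model can run with 17B parameters active per forward pass.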
| Specification | Qwen 3.5-397B | Qwen3-Max-Thinking |
|---|---|---|
| Total Parameters | 397B | 1T+ |
| Active per Token | 17B | Not disclosed |
| Vocabulary | 250K tokens | 152K tokens |
| Languages | 201 | 119 |
| Architecture | Sparse MoE + Hybrid Attention | Dense MoE |
| Training Pipeline | Native FP8 | BF16/FP16 |
Inference Optimizations
Alibaba reports several inference-level optimizations including speculative decoding, rollout replay, and multi-turn rollout locking. Combined, these techniques yield 8.6x faster decoding at 32K context and up to 19x at 256K context versus Qwen3-Max. On 8xH100 GPUs, the model reportedly achieves 45 tokens per second.
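As a rough illustration of one of these techniques, the sketch below shows the core accept/reject loop of greedy speculative decoding. The two callables are hypothetical stand-ins for draft- and target-model calls; this is the general method, not Alibaba's pipeline, which also layers in rollout replay and multi-turn rollout locking.

```python
def speculative_decode(verify_fn, draft_fn, prompt, k=4, max_new=32):
    # verify_fn(tokens, proposal) -> target model's greedy token at each of the
    #   len(proposal) positions, computed in one batched forward pass.
    # draft_fn(tokens) -> cheap draft model's greedy next token.
    # Both callables are hypothetical stand-ins for real model calls.
    seq = list(prompt)
    target_len = len(prompt) + max_new
    while len(seq) < target_len:
        proposal = []
        for _ in range(k):                      # draft model proposes k tokens
            proposal.append(draft_fn(seq + proposal))
        checked = verify_fn(seq, proposal)      # target verifies all k at once
        n = 0
        while n < k and proposal[n] == checked[n]:
            n += 1                              # accept longest agreeing prefix
        if n < k:
            seq += proposal[:n] + [checked[n]]  # plus the target's correction
        else:
            seq += proposal                     # all k accepted
    return seq[:target_len]
```

The speedup comes from replacing k sequential target-model forward passes with one verification pass whenever the draft model guesses well.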
Benchmark Performance
Qwen 3.5 delivers strong benchmark results across reasoning, coding, agentic, and multimodal categories. Alibaba claims the model outperforms GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro on 80% of evaluated benchmarks — though independent verification is still underway.
Reasoning & Mathematics
| Benchmark | Score | Category |
|---|---|---|
| AIME26 | 91.3 | Olympiad Mathematics |
| GPQA Diamond | 88.4 | Graduate-Level Reasoning |
| MMLU-Pro | 87.8 | Multilingual Knowledge |
| MMLU | 88.5 | General Knowledge |
| MathVista | 90.3 | Mathematical Reasoning |
Coding & Agentic
| Benchmark | Score | Category |
|---|---|---|
| LiveCodeBench v6 | 83.6 | Competitive Programming |
| SWE-bench Verified | 76.4 | Real Coding Workflows |
| Terminal-Bench 2 | 52.5 | Agentic Terminal Coding |
| BFCL v4 | 72.9 | Agentic Tool Use |
| BrowseComp | 78.6 | Agentic Search |
| IFBench | 76.5 | Instruction Following |
Multimodal Benchmarks
| Benchmark | Score |
|---|---|
| MMMU | 85.0 |
| MMMU-Pro | 79.0 |
| OmniDocBench v1.5 | 90.8 |
| MathVista | 90.3 |
| Video-MME | 87.5 |
| VITA-Bench | 49.7 |
| ERQA | 67.5 |
Multimodal & Visual Agentic Capabilities
One of Qwen 3.5's most significant advances is its native multimodal architecture. Unlike models that bolt vision capabilities onto a language backbone, Qwen 3.5 fuses text, image, and video tokens from the very first pretraining stage through early fusion. This enables seamless cross-modal reasoning rather than treating different modalities as separate pipelines.
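A minimal sketch of what early fusion means at the tensor level, with made-up shapes: per-modality encoders project into a shared embedding width, and the concatenated sequence is what the first transformer block sees, rather than vision features being merged through a separate adapter late in the stack.

```python
import torch

# Illustrative shapes only -- not Qwen 3.5's actual dimensions.
d_model = 1024
text_emb = torch.randn(1, 32, d_model)    # 32 text token embeddings
image_emb = torch.randn(1, 256, d_model)  # 256 vision patch embeddings

# Early fusion: one joint sequence from the very first layer onward.
fused = torch.cat([image_emb, text_emb], dim=1)
print(fused.shape)  # torch.Size([1, 288, 1024])
```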
Visual Processing Specifications
- Images: up to 1344x1344 resolution
- Video: 60-second clips sampled at 8 FPS
- UI analysis: screenshot element detection
Visual Agentic Task Execution
Alibaba highlights Qwen 3.5's "visual agentic capabilities" as a differentiator. Rather than simply describing what it sees, the model can independently perform actions across mobile and desktop applications — analyzing UI screenshots, detecting interactive elements, and executing multi-step workflows.
This positions Qwen 3.5 alongside emerging agentic frameworks where AI models move beyond conversational interfaces into autonomous task execution. The VITA-Bench score of 49.7 (agentic multimodal interaction) and BFCL v4 score of 72.9 (tool use) suggest the model can handle structured tool calls, though complex real-world workflows may still benefit from orchestration layers.
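Because Qwen 3.5-Plus exposes an OpenAI-compatible endpoint with parallel tool calls, structured calls can be requested with the standard function-calling format. The `click_element` tool below is hypothetical, purely to illustrate the shape of a UI-automation call:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-dashscope-api-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# Hypothetical UI-automation tool; the schema follows the standard
# OpenAI function-calling format that compatible endpoints accept.
tools = [{
    "type": "function",
    "function": {
        "name": "click_element",
        "description": "Click a UI element identified in a screenshot.",
        "parameters": {
            "type": "object",
            "properties": {"element_id": {"type": "string"}},
            "required": ["element_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen3.5-plus",
    messages=[{"role": "user", "content": "Open the settings menu."}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # structured call(s), if any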
Open Weight vs Qwen 3.5-Plus
Alibaba ships Qwen 3.5 in two distinct variants targeting different use cases. Understanding the trade-offs between self-hosted open-weight deployment and the managed Qwen 3.5-Plus service is essential for choosing the right option.
| Feature | Qwen3.5-397B-A17B (Open) | Qwen 3.5-Plus (Hosted) |
|---|---|---|
| License | Apache 2.0 | Proprietary (API access) |
| Context Window | Deployment-dependent | 1 million tokens |
| Tool Use | Manual integration | Built-in adaptive tool use |
| Platform | Self-hosted / HuggingFace | Alibaba Cloud Model Studio |
| Hardware Requirement | 8xH100 GPUs (recommended) | None (managed service) |
| Best For | Data sovereignty, fine-tuning | Rapid prototyping, long-context tasks |
The open-weight release follows Alibaba's established pattern of sharing competitive models with the community. For a deeper look at the full Qwen model family from 600M to 1T parameters, see our complete Qwen models guide.
Pricing & Cost Efficiency
Cost efficiency is a central selling point for Qwen 3.5. Alibaba reports approximately 60% lower running costs compared to the previous generation, combined with 8x higher throughput. For the hosted Qwen 3.5-Plus, processing 1 million tokens reportedly costs around $0.18.
- ~$0.18 per 1M tokens (Qwen 3.5-Plus)
- 60% cost reduction vs. the prior generation
- 10-60% token savings from the 250K vocabulary
Vocabulary-Driven Savings
The expanded 250K-token vocabulary (up from 152K in Qwen 3) directly reduces token counts for non-English text. Alibaba reports 10-60% token cost reductions for global applications, particularly benefiting languages that were previously under-represented in the tokenizer. With 201 languages and dialects supported (a 69% increase over Qwen 3), this represents meaningful savings for multilingual deployments.
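One way to sanity-check these savings on your own text is to compare token counts between the two tokenizers. In the sketch below, `Qwen/Qwen3-8B` is a real Qwen 3 checkpoint, but the Qwen 3.5 repo ID is an assumption; substitute the actual ID once the weights are published.

```python
from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
new_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-397B-A17B")  # assumed name

samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "thai": "สวัสดีครับ วันนี้อากาศดีมากเลยนะ",
    "swahili": "Habari ya asubuhi, karibu sana nyumbani kwetu.",
}
for lang, text in samples.items():
    before = len(old_tok(text)["input_ids"])
    after = len(new_tok(text)["input_ids"])
    print(f"{lang}: {before} -> {after} tokens ({1 - after / before:.0%} saved)")
```

The biggest deltas should show up in scripts that the older 152K vocabulary split into many byte-level fragments.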
Self-Hosting Economics
For organizations choosing the open-weight path, the 17B active parameter count means significantly lower GPU memory requirements compared to dense models of similar capability. Running on 8xH100 GPUs, the model reportedly achieves 45 tokens per second — making self-hosting viable for enterprises with existing GPU infrastructure.
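For reference, a minimal vLLM offline-inference sketch for such a deployment might look like the following; the repo ID is assumed, and actual Qwen 3.5 support depends on your vLLM version.

```python
from vllm import LLM, SamplingParams

# Assumed repo ID; tensor_parallel_size=8 matches the recommended 8xH100 node.
# Confirm your vLLM release supports the Qwen 3.5 architecture before serving.
llm = LLM(model="Qwen/Qwen3.5-397B-A17B", tensor_parallel_size=8)
params = SamplingParams(temperature=0.7, max_tokens=512)

outputs = llm.generate(["Summarize the trade-offs of sparse MoE serving."], params)
print(outputs[0].outputs[0].text)
```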
How to Access the Qwen 3.5 API
The Qwen 3.5-Plus API is available through Alibaba Cloud Model Studio with OpenAI SDK compatibility, making migration from existing OpenAI or Claude integrations straightforward. Here's a basic streaming example using the OpenAI Python SDK:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-dashscope-api-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen3.5-plus",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the MoE architecture in Qwen 3.5."},
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

TypeScript / Node.js
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
});

const response = await client.chat.completions.create({
  model: "qwen3.5-plus",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Analyze this codebase for security issues." },
  ],
});

console.log(response.choices[0].message.content);
```

Access Options
Qwen 3.5-Plus (hosted):
- Model Studio dashboard
- OpenAI SDK compatible endpoint
- Streaming and parallel tool calls
- Built-in web search integration
- 1M token context window

Open weight (self-hosted):
- HuggingFace model hub
- vLLM or TGI serving
- Full fine-tuning capability
- Data sovereignty compliance
- Custom context configuration
Conclusion
Qwen 3.5 represents a significant step in efficient AI architecture — delivering frontier-level performance with 95% fewer active parameters through sparse Mixture-of-Experts design. The benchmark numbers across reasoning, coding, and multimodal tasks position it as a serious contender alongside GPT-5.2 and Claude Opus 4.5, while the 60% cost reduction and 8x throughput improvement make it particularly compelling for cost-conscious deployments.
Whether you opt for the Apache 2.0 open-weight model for on-premise deployment or the hosted Qwen 3.5-Plus for its 1M-token context window, the choice between self-hosted control and managed convenience depends on your specific requirements. As independent benchmarks continue to verify Alibaba's claims, Qwen 3.5 is worth evaluating for any team looking at frontier AI capabilities without frontier pricing.
Ready to Integrate Agentic AI?
Whether you're evaluating Qwen 3.5, GPT-5.2, or Claude for production deployment, our team can help you navigate the rapidly evolving AI landscape and build solutions that deliver measurable results.