Qwen 3.5: 397B MoE Benchmarks, Pricing & Complete Guide
Qwen 3.5-397B scores 83.6 on LiveCodeBench v6 and 91.3 on AIME26 with 17B active MoE params. Benchmark comparisons with GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro, plus pricing details.
- Total Parameters: 397B
- Active per Token: 17B
- LiveCodeBench v6: 83.6
- Languages Supported: 201
Key Takeaways
Qwen 3.5 is Alibaba Cloud's latest flagship AI model family, released on February 16, 2026. Built around a sparse Mixture-of-Experts (MoE) architecture, the headline model — Qwen3.5-397B-A17B — packs 397 billion total parameters while activating only 17 billion per forward pass. This design reportedly delivers frontier-level reasoning, coding, and visual agentic performance at 60% lower cost and 8x higher throughput compared to Alibaba's previous generation.
The release comes at a competitive moment in AI development. With ByteDance's Doubao 2.0 serving 200 million users and DeepSeek preparing its next model, Alibaba positions Qwen 3.5 as a direct challenger to Western frontier models like GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro — claiming superiority across 80% of evaluated benchmark categories.
What Is Qwen 3.5?
The flagship model ships in two distinct variants targeting different deployment scenarios — an open-weight release under Apache 2.0 and a hosted Qwen 3.5-Plus service through Alibaba Cloud.
Qwen3.5-397B-A17B (open weight):
- Apache 2.0 license
- Self-hostable on 8xH100 GPUs
- Full commercial use rights
- Native multimodal (text + images)

Qwen 3.5-Plus (hosted):
- 1 million token context window
- Built-in adaptive tool use
- Alibaba Cloud Model Studio
- OpenAI SDK compatible API
Architecture & MoE Design
Qwen 3.5's architecture builds on the Qwen3-Next foundation with several significant upgrades. The sparse Mixture-of-Experts design routes each token through just 17 billion of the 397 billion total parameters, achieving a 95% reduction in activation memory compared to dense models of equivalent capability.
Sparse MoE with Hybrid Attention
The model uses a heterogeneous setup that separates vision and language processing pathways for efficiency. Key architectural features include hybrid linear attention combined with sparse expert routing, enabling parallel computation across expert groups. Alibaba also introduced a native FP8 training pipeline that reduces activation memory by approximately 50%.
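To make the routing idea concrete, here is a minimal sparse-MoE layer in PyTorch. This is a generic top-k expert-routing sketch, not Alibaba's implementation; the hidden size, expert count, and top_k value are placeholder assumptions.

```python
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    """Minimal top-k expert routing sketch (illustrative, not Qwen 3.5's code)."""

    def __init__(self, d_model=1024, n_experts=64, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores every expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.router(x)                         # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():             # dispatch tokens to experts
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out
```

Every token only touches `top_k` expert MLPs, which is why a 397B-parameter model can run with 17B parameters active per forward pass.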
| Specification | Qwen 3.5-397B | Qwen3-Max-Thinking |
|---|---|---|
| Total Parameters | 397B | 1T+ |
| Active per Token | 17B | Not disclosed |
| Vocabulary | 250K tokens | 152K tokens |
| Languages | 201 | 119 |
| Architecture | Sparse MoE + Hybrid Attention | Dense MoE |
| Training Pipeline | Native FP8 | BF16/FP16 |
Inference Optimizations
Alibaba reports several inference-level optimizations including speculative decoding, rollout replay, and multi-turn rollout locking. Combined, these techniques yield 8.6x faster decoding at 32K context and up to 19x at 256K context versus Qwen3-Max. On 8xH100 GPUs, the model reportedly achieves 45 tokens per second.
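As a rough illustration of one of these techniques, the sketch below shows the core accept/reject loop of greedy speculative decoding. The two callables are hypothetical stand-ins for draft- and target-model calls; this is the general method, not Alibaba's pipeline, which also layers in rollout replay and multi-turn rollout locking.

```python
def speculative_decode(verify_fn, draft_fn, prompt, k=4, max_new=32):
    # verify_fn(tokens, proposal) -> target model's greedy token at each of the
    #   len(proposal) positions, computed in one batched forward pass.
    # draft_fn(tokens) -> cheap draft model's greedy next token.
    # Both callables are hypothetical stand-ins for real model calls.
    seq = list(prompt)
    target_len = len(prompt) + max_new
    while len(seq) < target_len:
        proposal = []
        for _ in range(k):                      # draft model proposes k tokens
            proposal.append(draft_fn(seq + proposal))
        checked = verify_fn(seq, proposal)      # target verifies all k at once
        n = 0
        while n < k and proposal[n] == checked[n]:
            n += 1                              # accept longest agreeing prefix
        if n < k:
            seq += proposal[:n] + [checked[n]]  # plus the target's correction
        else:
            seq += proposal                     # all k accepted
    return seq[:target_len]
```

The speedup comes from replacing k sequential target-model forward passes with one verification pass whenever the draft model guesses well.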
Benchmark Performance
Qwen 3.5 delivers strong benchmark results across reasoning, coding, agentic, and multimodal categories. Alibaba claims the model outperforms GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro on 80% of evaluated benchmarks — though independent verification is still underway.
Reasoning & Mathematics
| Benchmark | Score | Category |
|---|---|---|
| AIME26 | 91.3 | Olympiad Mathematics |
| GPQA Diamond | 88.4 | Graduate-Level Reasoning |
| MMLU-Pro | 87.8 | Multilingual Knowledge |
| MMLU | 88.5 | General Knowledge |
| MathVista | 90.3 | Mathematical Reasoning |
Coding & Agentic
| Benchmark | Score | Category |
|---|---|---|
| LiveCodeBench v6 | 83.6 | Competitive Programming |
| SWE-bench Verified | 76.4 | Real Coding Workflows |
| Terminal-Bench 2 | 52.5 | Agentic Terminal Coding |
| BFCL v4 | 72.9 | Agentic Tool Use |
| BrowseComp | 78.6 | Agentic Search |
| IFBench | 76.5 | Instruction Following |
Multimodal Benchmarks
| Benchmark | Score |
|---|---|
| MMMU | 85.0 |
| MMMU-Pro | 79.0 |
| OmniDocBench v1.5 | 90.8 |
| MathVista | 90.3 |
| Video-MME | 87.5 |
| VITA-Bench | 49.7 |
| ERQA | 67.5 |
Multimodal & Visual Agentic Capabilities
One of Qwen 3.5's most significant advances is its native multimodal architecture. Unlike models that bolt vision capabilities onto a language backbone, Qwen 3.5 fuses text, image, and video tokens from the very first pretraining stage through early fusion. This enables seamless cross-modal reasoning rather than treating different modalities as separate pipelines.
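A minimal sketch of what early fusion means at the tensor level, with made-up shapes: per-modality encoders project into a shared embedding width, and the concatenated sequence is what the first transformer block sees, rather than vision features being merged through a separate adapter late in the stack.

```python
import torch

# Illustrative shapes only -- not Qwen 3.5's actual dimensions.
d_model = 1024
text_emb = torch.randn(1, 32, d_model)    # 32 text token embeddings
image_emb = torch.randn(1, 256, d_model)  # 256 vision patch embeddings

# Early fusion: one joint sequence from the very first layer onward.
fused = torch.cat([image_emb, text_emb], dim=1)
print(fused.shape)  # torch.Size([1, 288, 1024])
```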
Visual Processing Specifications
- Images: up to 1344x1344 resolution
- Video: 60-second clips sampled at 8 FPS
- UI analysis: screenshot element detection
Visual Agentic Task Execution
Alibaba highlights Qwen 3.5's "visual agentic capabilities" as a differentiator. Rather than simply describing what it sees, the model can independently perform actions across mobile and desktop applications — analyzing UI screenshots, detecting interactive elements, and executing multi-step workflows.
This positions Qwen 3.5 alongside emerging agentic frameworks where AI models move beyond conversational interfaces into autonomous task execution. The VITA-Bench score of 49.7 (agentic multimodal interaction) and BFCL v4 score of 72.9 (tool use) suggest the model can handle structured tool calls, though complex real-world workflows may still benefit from orchestration layers.
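Because Qwen 3.5-Plus exposes an OpenAI-compatible endpoint with parallel tool calls, structured calls can be requested with the standard function-calling format. The `click_element` tool below is hypothetical, purely to illustrate the shape of a UI-automation call:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-dashscope-api-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# Hypothetical UI-automation tool; the schema follows the standard
# OpenAI function-calling format that compatible endpoints accept.
tools = [{
    "type": "function",
    "function": {
        "name": "click_element",
        "description": "Click a UI element identified in a screenshot.",
        "parameters": {
            "type": "object",
            "properties": {"element_id": {"type": "string"}},
            "required": ["element_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen3.5-plus",
    messages=[{"role": "user", "content": "Open the settings menu."}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # structured call(s), if any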
Open Weight vs Qwen 3.5-Plus
Alibaba ships Qwen 3.5 in two distinct variants targeting different use cases. Understanding the trade-offs between self-hosted open-weight deployment and the managed Qwen 3.5-Plus service is essential for choosing the right option.
| Feature | Qwen3.5-397B-A17B (Open) | Qwen 3.5-Plus (Hosted) |
|---|---|---|
| License | Apache 2.0 | Proprietary (API access) |
| Context Window | Deployment-dependent | 1 million tokens |
| Tool Use | Manual integration | Built-in adaptive tool use |
| Platform | Self-hosted / HuggingFace | Alibaba Cloud Model Studio |
| Hardware Requirement | 8xH100 GPUs (recommended) | None (managed service) |
| Best For | Data sovereignty, fine-tuning | Rapid prototyping, long-context tasks |
The open-weight release follows Alibaba's established pattern of sharing competitive models with the community. For a deeper look at the full Qwen model family from 600M to 1T parameters, see our complete Qwen models guide.
Pricing & Cost Efficiency
Cost efficiency is a central selling point for Qwen 3.5. Alibaba reports approximately 60% lower running costs compared to the previous generation, combined with 8x higher throughput. For the hosted Qwen 3.5-Plus, processing 1 million tokens reportedly costs around $0.18.
- ~$0.18 per 1M tokens (Qwen 3.5-Plus)
- 60% cost reduction vs. the prior generation
- 10-60% token savings from the 250K vocabulary
Vocabulary-Driven Savings
The expanded 250K-token vocabulary (up from 152K in Qwen 3) directly reduces token counts for non-English text. Alibaba reports 10-60% token cost reductions for global applications, particularly benefiting languages that were previously under-represented in the tokenizer. With 201 languages and dialects supported (a 69% increase over Qwen 3), this represents meaningful savings for multilingual deployments.
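One way to sanity-check these savings on your own text is to compare token counts between the two tokenizers. In the sketch below, `Qwen/Qwen3-8B` is a real Qwen 3 checkpoint, but the Qwen 3.5 repo ID is an assumption; substitute the actual ID once the weights are published.

```python
from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
new_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-397B-A17B")  # assumed name

samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "thai": "สวัสดีครับ วันนี้อากาศดีมากเลยนะ",
    "swahili": "Habari ya asubuhi, karibu sana nyumbani kwetu.",
}
for lang, text in samples.items():
    before = len(old_tok(text)["input_ids"])
    after = len(new_tok(text)["input_ids"])
    print(f"{lang}: {before} -> {after} tokens ({1 - after / before:.0%} saved)")
```

The biggest deltas should show up in scripts that the older 152K vocabulary split into many byte-level fragments.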
Self-Hosting Economics
For organizations choosing the open-weight path, the 17B active parameter count means significantly lower GPU memory requirements compared to dense models of similar capability. Running on 8xH100 GPUs, the model reportedly achieves 45 tokens per second — making self-hosting viable for enterprises with existing GPU infrastructure.
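For reference, a minimal vLLM offline-inference sketch for such a deployment might look like the following; the repo ID is assumed, and actual Qwen 3.5 support depends on your vLLM version.

```python
from vllm import LLM, SamplingParams

# Assumed repo ID; tensor_parallel_size=8 matches the recommended 8xH100 node.
# Confirm your vLLM release supports the Qwen 3.5 architecture before serving.
llm = LLM(model="Qwen/Qwen3.5-397B-A17B", tensor_parallel_size=8)
params = SamplingParams(temperature=0.7, max_tokens=512)

outputs = llm.generate(["Summarize the trade-offs of sparse MoE serving."], params)
print(outputs[0].outputs[0].text)
```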
How to Access the Qwen 3.5 API
The Qwen 3.5-Plus API is available through Alibaba Cloud Model Studio with OpenAI SDK compatibility, making migration from existing OpenAI or Claude integrations straightforward. Here's a basic streaming example using the OpenAI Python SDK:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-dashscope-api-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen3.5-plus",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the MoE architecture in Qwen 3.5."},
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

TypeScript / Node.js
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
});

const response = await client.chat.completions.create({
  model: "qwen3.5-plus",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Analyze this codebase for security issues." },
  ],
});

console.log(response.choices[0].message.content);
```

Access Options
Qwen 3.5-Plus (hosted):
- Model Studio dashboard
- OpenAI SDK compatible endpoint
- Streaming and parallel tool calls
- Built-in web search integration
- 1M token context window

Open weight (self-hosted):
- HuggingFace model hub
- vLLM or TGI serving
- Full fine-tuning capability
- Data sovereignty compliance
- Custom context configuration
Conclusion
Qwen 3.5 represents a significant step in efficient AI architecture — delivering frontier-level performance with 95% fewer active parameters through sparse Mixture-of-Experts design. The benchmark numbers across reasoning, coding, and multimodal tasks position it as a serious contender alongside GPT-5.2 and Claude Opus 4.5, while the 60% cost reduction and 8x throughput improvement make it particularly compelling for cost-conscious deployments.
Whether you opt for the Apache 2.0 open-weight model for on-premise deployment or the hosted Qwen 3.5-Plus for its 1M-token context window, the choice between self-hosted control and managed convenience depends on your specific requirements. As independent benchmarks continue to verify Alibaba's claims, Qwen 3.5 is worth evaluating for any team looking at frontier AI capabilities without frontier pricing.
Ready to Integrate Agentic AI?
Whether you're evaluating Qwen 3.5, GPT-5.2, or Claude for production deployment, our team can help you navigate the rapidly evolving AI landscape and build solutions that deliver measurable results.