
Qwen 3.5: 397B MoE Benchmarks, Pricing & Complete Guide

Qwen 3.5-397B scores 83.6 on LiveCodeBench v6 and 91.3 on AIME26 with 17B active MoE parameters. Benchmark comparisons against GPT-5.2 and Claude Opus 4.5, plus pricing details.

Digital Applied Team
February 16, 2026
6 min read
• 397B: Total Parameters
• 17B: Active per Token
• 83.6: LiveCodeBench v6
• 201: Languages Supported

Key Takeaways

Sparse MoE efficiency: 397B total parameters with only 17B active per token, reducing activation memory by 95% while matching trillion-parameter performance.
Benchmark leader across categories: Scores 83.6 on LiveCodeBench v6, 91.3 on AIME26, and 88.4 on GPQA Diamond — reportedly outperforming GPT-5.2 and Claude Opus 4.5 on 80% of evaluated categories.
Visual agentic capabilities: Natively operates across mobile and desktop applications, analyzing UI screenshots, detecting elements, and executing multi-step tasks autonomously.
Native multimodal from pretraining: Early fusion architecture processes text, images (up to 1344x1344), and 60-second video clips from the first pretraining stage.
60% cheaper with 8x throughput: Delivers 8.6x–19x faster decoding than Qwen3-Max at roughly 60% lower cost, with 1M-token context for approximately $0.18.
Apache 2.0 open source: The 397B open-weight model ships under Apache 2.0, while the hosted Qwen 3.5-Plus variant offers 1M-token context via Alibaba Cloud.

Qwen 3.5 is Alibaba Cloud's latest flagship AI model family, released on February 16, 2026. Built around a sparse Mixture-of-Experts (MoE) architecture, the headline model — Qwen3.5-397B-A17B — packs 397 billion total parameters while activating only 17 billion per forward pass. This design reportedly delivers frontier-level reasoning, coding, and visual agentic performance at 60% lower cost and 8x higher throughput compared to Alibaba's previous generation.

The release comes at a competitive moment in AI development. With ByteDance's Doubao 2.0 serving 200 million users and DeepSeek preparing its next model, Alibaba positions Qwen 3.5 as a direct challenger to Western frontier models like GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro — claiming superiority across 80% of evaluated benchmark categories.

What Is Qwen 3.5?

The flagship model ships in two distinct variants targeting different deployment scenarios — an open-weight release under Apache 2.0 and a hosted Qwen 3.5-Plus service through Alibaba Cloud.

Open Weight (397B-A17B)
  • Apache 2.0 license
  • Self-hostable on 8xH100 GPUs
  • Full commercial use rights
  • Native multimodal (text + images)
Qwen 3.5-Plus (Hosted)
  • 1 million token context window
  • Built-in adaptive tool use
  • Alibaba Cloud Model Studio
  • OpenAI SDK compatible API

Architecture & MoE Design

Qwen 3.5's architecture builds on the Qwen3-Next foundation with several significant upgrades. The sparse Mixture-of-Experts design routes each token through just 17 billion of the 397 billion total parameters, achieving a 95% reduction in activation memory compared to dense models of equivalent capability.

Sparse MoE with Hybrid Attention

The model uses a heterogeneous setup that separates vision and language processing pathways for efficiency. Key architectural features include hybrid linear attention combined with sparse expert routing, enabling parallel computation across expert groups. Alibaba also introduced a native FP8 training pipeline that reduces activation memory by approximately 50%.
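The routing idea behind sparse MoE can be sketched in a few lines. This is a toy illustration of top-k expert selection, not Alibaba's implementation; the expert count, logits, and top-k value below are hypothetical:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token and renormalize their gate weights.

    router_logits: one score per expert, produced by a small linear router.
    Only the selected experts run their feed-forward networks; the rest stay
    idle, which is how a 397B-parameter MoE can activate only ~17B
    parameters per token.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}  # renormalized mixture weights

# 8 hypothetical experts; the router strongly prefers experts 2 and 5
gates = route_token([0.1, -0.3, 2.0, 0.0, -1.2, 1.5, 0.2, -0.5], top_k=2)
```

The selected experts' outputs are then combined using these gate weights, so compute scales with top_k rather than with the total expert count.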

| Specification | Qwen 3.5-397B | Qwen3-Max-Thinking |
| --- | --- | --- |
| Total Parameters | 397B | 1T+ |
| Active per Token | 17B | Not disclosed |
| Vocabulary | 250K tokens | 152K tokens |
| Languages | 201 | 119 |
| Architecture | Sparse MoE + Hybrid Attention | Dense MoE |
| Training Pipeline | Native FP8 | BF16/FP16 |

Inference Optimizations

Alibaba reports several inference-level optimizations including speculative decoding, rollout replay, and multi-turn rollout locking. Combined, these techniques yield 8.6x faster decoding at 32K context and up to 19x at 256K context versus Qwen3-Max. On 8xH100 GPUs, the model reportedly achieves 45 tokens per second.
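The speculative-decoding part of those gains can be illustrated with a toy verify step. The draft and target models below are stand-in functions, not Qwen components; real implementations verify the whole draft in one batched forward pass:

```python
def speculative_accept(draft_tokens, target_next_token):
    """Toy version of the speculative-decoding verify step.

    A cheap draft model proposes several tokens at once; the target model
    checks them and keeps the longest prefix it agrees with. Accepting
    multiple tokens per target step is what produces multi-x decoding
    speedups.
    """
    accepted = []
    for tok in draft_tokens:
        target_tok = target_next_token(accepted)
        if target_tok == tok:
            accepted.append(tok)
        else:
            accepted.append(target_tok)  # target's correction replaces the draft
            break
    return accepted

# Stand-in "target model": always continues the sequence 1, 2, 3, ...
target = lambda prefix: len(prefix) + 1
accepted = speculative_accept([1, 2, 9, 9], target)  # draft diverges at step 3
```

Here the draft's first two tokens are accepted for free and the third is corrected, so three tokens are emitted for a single round of target verification.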

Benchmark Performance

Qwen 3.5 delivers strong benchmark results across reasoning, coding, agentic, and multimodal categories. Alibaba claims the model outperforms GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro on 80% of evaluated benchmarks — though independent verification is still underway.

Reasoning & Mathematics

| Benchmark | Score | Category |
| --- | --- | --- |
| AIME26 | 91.3 | Olympiad Mathematics |
| GPQA Diamond | 88.4 | Graduate-Level Reasoning |
| MMLU-Pro | 87.8 | Multilingual Knowledge |
| MMLU | 88.5 | General Knowledge |
| MathVista | 90.3 | Mathematical Reasoning |

Coding & Agentic

| Benchmark | Score | Category |
| --- | --- | --- |
| LiveCodeBench v6 | 83.6 | Competitive Programming |
| SWE-bench Verified | 76.4 | Real Coding Workflows |
| Terminal-Bench 2 | 52.5 | Agentic Terminal Coding |
| BFCL v4 | 72.9 | Agentic Tool Use |
| BrowseComp | 78.6 | Agentic Search |
| IFBench | 76.5 | Instruction Following |

Multimodal Benchmarks

Vision & Document
  • MMMU: 85.0
  • MMMU-Pro: 79.0
  • OmniDocBench v1.5: 90.8
  • MathVista: 90.3
Video & Interaction
  • Video-MME: 87.5
  • VITA-Bench: 49.7
  • ERQA: 67.5

Multimodal & Visual Agentic Capabilities

One of Qwen 3.5's most significant advances is its native multimodal architecture. Unlike models that bolt vision capabilities onto a language backbone, Qwen 3.5 fuses text, image, and video tokens from the very first pretraining stage through early fusion. This enables seamless cross-modal reasoning rather than treating different modalities as separate pipelines.

Visual Processing Specifications

Images

Up to 1344x1344 resolution

Video

60-second clips at 8 FPS

UI Analysis

Screenshot element detection
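Because Qwen 3.5-Plus exposes an OpenAI-compatible endpoint, image inputs use the standard content-parts message shape. A minimal sketch assembling such a message as plain dicts (no request is sent; the screenshot URL is a placeholder):

```python
def build_vision_message(prompt, image_url):
    """Assemble an OpenAI-style multimodal user message.

    The content field becomes a list mixing text and image_url parts,
    which is the shape OpenAI-compatible endpoints expect for vision input.
    """
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_vision_message(
    "Identify the clickable elements in this screenshot.",
    "https://example.com/screenshot.png",  # placeholder URL
)
```

The resulting dict can be passed directly in the `messages` list of a `chat.completions.create` call like the ones shown later in this guide.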

Visual Agentic Task Execution

Alibaba highlights Qwen 3.5's "visual agentic capabilities" as a differentiator. Rather than simply describing what it sees, the model can independently perform actions across mobile and desktop applications — analyzing UI screenshots, detecting interactive elements, and executing multi-step workflows.

This positions Qwen 3.5 alongside emerging agentic frameworks where AI models move beyond conversational interfaces into autonomous task execution. The VITA-Bench score of 49.7 (agentic multimodal interaction) and BFCL v4 score of 72.9 (tool use) suggest the model can handle structured tool calls, though complex real-world workflows may still benefit from orchestration layers.
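On the client side, handling a structured tool call might look like the sketch below. The `click_element` tool, its schema, and the handler are hypothetical illustrations in the OpenAI function-calling format, not part of Alibaba's API:

```python
import json

# Hypothetical tool schema in the OpenAI function-calling format that
# OpenAI-compatible endpoints accept (tool name and fields are illustrative).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "click_element",
        "description": "Click a UI element identified in a screenshot.",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "integer"},
                "y": {"type": "integer"},
            },
            "required": ["x", "y"],
        },
    },
}]

HANDLERS = {"click_element": lambda x, y: f"clicked ({x}, {y})"}

def dispatch(tool_call):
    """Execute one tool call of the shape the chat completions API returns.

    The model emits the function name plus JSON-encoded arguments; the
    client decodes the arguments and runs the matching local handler.
    """
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    return HANDLERS[name](**args)

result = dispatch({"function": {"name": "click_element",
                                "arguments": '{"x": 120, "y": 48}'}})
```

In a real agent loop, the handler's return value would be sent back to the model as a `tool` message so it can decide the next step.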

Open Weight vs Qwen 3.5-Plus

Alibaba ships Qwen 3.5 in two distinct variants targeting different use cases. Understanding the trade-offs between self-hosted open-weight deployment and the managed Qwen 3.5-Plus service is essential for choosing the right option.

| Feature | Qwen3.5-397B-A17B (Open) | Qwen 3.5-Plus (Hosted) |
| --- | --- | --- |
| License | Apache 2.0 | Proprietary (API access) |
| Context Window | Deployment-dependent | 1 million tokens |
| Tool Use | Manual integration | Built-in adaptive tool use |
| Platform | Self-hosted / HuggingFace | Alibaba Cloud Model Studio |
| Hardware Requirement | 8xH100 GPUs (recommended) | None (managed service) |
| Best For | Data sovereignty, fine-tuning | Rapid prototyping, long-context tasks |

The open-weight release follows Alibaba's established pattern of sharing competitive models with the community. For a deeper look at the full Qwen model family from 600M to 1T parameters, see our complete Qwen models guide.

Pricing & Cost Efficiency

Cost efficiency is a central selling point for Qwen 3.5. Alibaba reports approximately 60% lower running costs compared to the previous generation, combined with 8x higher throughput. For the hosted Qwen 3.5-Plus, processing 1 million tokens reportedly costs around $0.18.

• ~$0.18: Per 1M tokens (Plus)
• 60%: Cost reduction vs prior generation
• 10-60%: Token savings (250K vocab)

Vocabulary-Driven Savings

The expanded 250K-token vocabulary (up from 152K in Qwen 3) directly reduces token counts for non-English text. Alibaba reports 10-60% token cost reductions for global applications, particularly benefiting languages that were previously under-represented in the tokenizer. With 201 languages and dialects supported (a 69% increase over Qwen 3), this represents meaningful savings for multilingual deployments.
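A back-of-envelope calculation shows what these savings mean in practice, using the article's reported $0.18-per-million figure. The 35% token reduction below is an assumed mid-range value within Alibaba's reported 10-60% band, not an official number:

```python
PRICE_PER_M_TOKENS = 0.18  # hosted Qwen 3.5-Plus, per the reported figure

def monthly_cost(tokens_per_month, token_reduction=0.0):
    """Estimated monthly spend after vocabulary-driven token savings.

    token_reduction is the fraction of tokens saved by the 250K vocabulary
    for a given language mix (reported range: 10-60%).
    """
    effective_tokens = tokens_per_month * (1 - token_reduction)
    return effective_tokens / 1_000_000 * PRICE_PER_M_TOKENS

baseline = monthly_cost(500_000_000)            # 500M tokens/month, no savings
with_savings = monthly_cost(500_000_000, 0.35)  # assumed 35% fewer tokens
```

Under these assumptions, a 500M-token monthly workload drops from $90 to about $58.50, and heavily non-English workloads at the top of the reported range would save more.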

Self-Hosting Economics

For organizations choosing the open-weight path, the 17B active parameter count means significantly lower GPU memory requirements compared to dense models of similar capability. Running on 8xH100 GPUs, the model reportedly achieves 45 tokens per second — making self-hosting viable for enterprises with existing GPU infrastructure.
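A quick capacity estimate follows from the reported throughput. This treats 45 tokens/second as a single-stream decode rate and ignores batching and prompt processing, so it is a rough floor for one node rather than a deployment plan:

```python
TOKENS_PER_SECOND = 45      # reported decode throughput on 8xH100
SECONDS_PER_DAY = 86_400

# Rough single-stream output for one 8xH100 node running continuously;
# batched serving of many concurrent streams would multiply this figure.
daily_tokens = TOKENS_PER_SECOND * SECONDS_PER_DAY
daily_tokens_m = daily_tokens / 1_000_000
```

That works out to roughly 3.9M tokens per day per node before batching, a useful baseline when comparing self-hosting against the hosted per-token price.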

How to Access the Qwen 3.5 API

The Qwen 3.5-Plus API is available through Alibaba Cloud Model Studio with OpenAI SDK compatibility, making migration from existing OpenAI or Claude integrations straightforward. Here's a basic example using the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(
    api_key="your-dashscope-api-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

response = client.chat.completions.create(
    model="qwen3.5-plus",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the MoE architecture in Qwen 3.5."}
    ],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

TypeScript / Node.js

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
});

const response = await client.chat.completions.create({
  model: "qwen3.5-plus",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Analyze this codebase for security issues." },
  ],
});

console.log(response.choices[0].message.content);

Access Options

Alibaba Cloud API
Hosted inference with full features
  • Model Studio dashboard
  • OpenAI SDK compatible endpoint
  • Streaming and parallel tool calls
  • Built-in web search integration
  • 1M token context window
Self-Hosted Deployment
Open weight with Apache 2.0
  • HuggingFace model hub
  • vLLM or TGI serving
  • Full fine-tuning capability
  • Data sovereignty compliance
  • Custom context configuration

Conclusion

Qwen 3.5 represents a significant step in efficient AI architecture — delivering frontier-level performance with 95% fewer active parameters through sparse Mixture-of-Experts design. The benchmark numbers across reasoning, coding, and multimodal tasks position it as a serious contender alongside GPT-5.2 and Claude Opus 4.5, while the 60% cost reduction and 8x throughput improvement make it particularly compelling for cost-conscious deployments.

Whether you opt for the Apache 2.0 open-weight model for on-premise deployment or the hosted Qwen 3.5-Plus for its 1M-token context window, the choice between self-hosted control and managed convenience depends on your specific requirements. As independent benchmarks continue to verify Alibaba's claims, Qwen 3.5 is worth evaluating for any team looking at frontier AI capabilities without frontier pricing.

Ready to Integrate Agentic AI?

Whether you're evaluating Qwen 3.5, GPT-5.2, or Claude for production deployment, our team can help you navigate the rapidly evolving AI landscape and build solutions that deliver measurable results.

