Mistral Small 4: 119B MoE for Reasoning, Vision, Coding
Mistral Small 4 packs 119B parameters (roughly 22B active per token via MoE) with a 128K context window under Apache 2.0. A unified model for reasoning, vision, and coding.
119B Total Parameters
~22B Active Params per Token
128K Context Window Tokens
30+ Supported Languages
Key Takeaways
The AI model landscape in 2026 is defined by a tension between scale and accessibility. Larger models deliver better results, but their computational requirements make them expensive to run and difficult to deploy privately. Mistral Small 4 resolves this tension with a Mixture of Experts architecture that packs 119 billion parameters into a model that runs at the cost of a 22 billion parameter dense system — delivering reasoning, vision, and coding quality that competes with much larger proprietary models.
Released by Mistral AI in early 2026, Mistral Small 4 represents a significant step forward from its predecessor, adding native multimodal vision input, an extended thinking mode for complex reasoning, and benchmark-leading coding performance. The Apache 2.0 license makes it one of the few frontier-capable models available for fully unrestricted commercial use and private deployment. For organizations building AI-powered digital products and workflows, this combination of capability and openness is rare in the current market.
This guide covers the architecture, benchmark performance, deployment options, and practical business applications of Mistral Small 4. It also situates the model within the broader open-weight frontier model landscape so you can make informed decisions about when it is the right choice versus proprietary alternatives or other open models like NVIDIA's Nemotron Super 120B.
What Is Mistral Small 4
Mistral Small 4 is the fourth generation of Mistral AI's Small model series, designed to occupy the middle tier of their model lineup between the lightweight Mistral 7B class models and the full-size Mistral Large series. The "Small" designation is somewhat misleading at 119 billion total parameters — it refers to the compute footprint relative to dense models of equivalent quality rather than absolute parameter count.
The model represents Mistral's most capable open-weight release to date, combining four capabilities that previously required separate specialized models: fast conversational response, extended chain-of-thought reasoning, multimodal vision understanding, and high-performance code generation. This consolidation into a single model simplifies architecture, reduces operational complexity, and lowers the total cost of building AI-powered applications.
Efficient MoE Inference
119B total parameters with approximately 22B activated per token. Delivers frontier-quality outputs at a fraction of the inference cost of a dense 119B model, enabling deployment on high-end consumer and prosumer hardware.
Native Vision Input
Processes images, documents, screenshots, charts, and diagrams natively alongside text. No separate vision model required — one API endpoint handles all modalities within a single 128K context window.
Apache 2.0 License
Fully open weights under Apache 2.0 permitting commercial use, fine-tuning, and private deployment without usage fees. One of very few frontier-class models with no restrictions on distribution or modification.
Mistral AI positions Small 4 as the practical choice for developers and businesses that need top-tier AI capabilities without the vendor lock-in, privacy trade-offs, or per-token costs of fully proprietary APIs. The model targets use cases that require reliable, consistent quality at scale — customer-facing applications, internal tools, and workflows where the volume of API calls makes per-token pricing economically significant.
MoE Architecture and 119B Parameters
The Mixture of Experts architecture is the key innovation that allows Mistral Small 4 to deliver its performance-to-compute ratio. In a standard dense transformer model, every parameter is used for every token processed. In an MoE model, the network is divided into many expert sub-networks, and a learned routing mechanism selects only a small subset of experts to process each token.
Mistral Small 4 contains 119 billion parameters distributed across its expert layers, but activates approximately 22 billion parameters per token during inference. This means the GPU memory bandwidth and arithmetic operations required for a forward pass are comparable to a 22B dense model, not a 119B one. The quality benefit comes from the full 119B parameters being available as a pool of specialized knowledge that the routing mechanism draws from based on the content of each token.
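As a toy illustration of the routing idea, the sketch below scores a pool of experts for one token and keeps only the top two. The expert count, top-k value, and softmax gating are generic MoE conventions used here for illustration, not Mistral's published configuration:

```python
import math

def route_token(gate_scores: list[float], top_k: int = 2) -> list[tuple[int, float]]:
    """Pick the top_k experts for one token and normalize their gate weights.

    gate_scores: raw router logits, one per expert.
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    # Softmax over the router logits (max-subtraction for numerical stability).
    m = max(gate_scores)
    exp = [math.exp(s - m) for s in gate_scores]
    total = sum(exp)
    probs = [e / total for e in exp]

    # Keep only the top_k experts; the rest are never computed for this token,
    # which is why active FLOPS stay far below the total parameter count.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# A token whose router logits favor experts 1 and 3:
print(route_token([0.1, 2.0, -1.0, 1.5]))
```

In a real MoE layer this selection happens independently at every layer and for every token, so different tokens in the same sequence can exercise entirely different expert subsets.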
The trade-off in MoE models is that while inference FLOPS are comparable to the active parameter count, you still need to load all 119 billion parameters into GPU memory — requiring approximately 240 GB of VRAM at 16-bit precision, or around 60–70 GB with 4-bit quantization. This means Mistral Small 4 runs comfortably on a single 8×H100 server in production, or on a high-end multi-GPU workstation with quantization for development use. The quality-per-inference-FLOP ratio remains exceptional compared to any dense model that could fit in the same memory.
Quantization note: With 4-bit quantization via llama.cpp or AWQ, Mistral Small 4 fits in approximately 60–70 GB of VRAM with modest quality degradation. This enables deployment on configurations like 2×RTX 4090 (48 GB) with partial CPU offloading, a single H100 (80 GB), or M2/M3 Ultra Macs with 192 GB of unified memory — covering a wide range of on-premises deployment scenarios.
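These figures follow from simple arithmetic: parameter count times bits per weight. The back-of-envelope sketch below shows the calculation (it ignores KV cache and activation memory, which add to real-world totals):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate GPU memory needed just to hold the model weights."""
    return n_params * bits_per_weight / 8 / 1e9

total_params = 119e9

print(f"16-bit (bf16): {weight_memory_gb(total_params, 16):.0f} GB")  # ~238 GB
print(f"4-bit quant:   {weight_memory_gb(total_params, 4):.0f} GB")   # ~60 GB
```

Real quantized checkpoints land slightly above the raw weight figure because some layers (embeddings, router weights) are typically kept at higher precision, which is why the practical range quoted above is 60–70 GB rather than exactly 60 GB.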
Reasoning Capabilities and Extended Thinking
One of the most significant additions in Mistral Small 4 relative to its predecessor is the extended thinking mode. This optional reasoning capability allows the model to generate and process intermediate chain-of-thought steps before producing its final response — the same fundamental technique used by dedicated reasoning models like OpenAI's o-series and DeepSeek R1.
The thinking mode is controlled at inference time through a budget parameter that specifies the maximum number of tokens the model may use for internal reasoning before answering. A higher budget enables more thorough exploration of the problem space, which is useful for complex mathematical derivations, multi-constraint optimization, and multi-step logical proofs. A lower budget, or disabling thinking entirely, provides fast direct generation for conversational tasks where extended deliberation adds no value. This is directly comparable to the model flexibility discussed in the GPT-5 standard and thinking variants guide.
Thinking Disabled
Ideal for conversational assistants, content generation, summarization, translation, and classification tasks. Low latency, low token cost. Appropriate for the vast majority of production workloads where the question does not require systematic multi-step reasoning.
Thinking Enabled
Activates chain-of-thought reasoning for complex mathematics, multi-step code debugging, strategic analysis, and logical inference. Higher token usage and latency, but substantially improved accuracy on problems that require systematic exploration before answering.
Mathematics and Science
Competition mathematics, physics problem solving, statistical analysis, and formal proofs benefit most from extended thinking. The model significantly closes the gap with dedicated reasoning models on MATH-500 and AIME benchmarks when thinking is enabled.
Instruction Following
Both modes show strong performance on IFEval and complex instruction benchmarks. Extended thinking particularly helps when instructions contain nested constraints or require the model to plan its output before writing it.
For application developers, the practical implication is that Mistral Small 4 can serve as a unified model for both fast interactive features and slower analytical tasks within the same application, simply by adjusting the thinking budget on a per-request basis. This eliminates the need to maintain two separate model integrations and the latency of routing between them at the application layer.
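One way to put this into practice is a small per-request policy that maps task type to a thinking budget before calling the model. The `thinking_budget` parameter name, the model identifier, and the budget values below are illustrative assumptions for the sketch, not Mistral's documented API:

```python
# Illustrative budgets per task type; tune these for your own workload.
THINKING_BUDGETS = {
    "chat": 0,           # fast direct generation, no deliberation
    "summarize": 0,
    "code_debug": 4096,  # multi-step reasoning pays off here
    "math": 8192,
}

def build_request(task_type: str, prompt: str) -> dict:
    """Assemble one inference request with a task-appropriate thinking budget."""
    budget = THINKING_BUDGETS.get(task_type, 0)  # default to fast mode
    return {
        "model": "mistral-small-4",        # hypothetical model identifier
        "messages": [{"role": "user", "content": prompt}],
        "thinking_budget": budget,         # assumed parameter name
    }

req = build_request("math", "Prove that the square root of 2 is irrational.")
print(req["thinking_budget"])  # 8192
```

The point of the pattern is that routing lives entirely in the request payload: one model integration serves both the fast interactive path and the slow analytical path.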
Vision and Multimodal Understanding
Mistral Small 4 is the first model in the Mistral Small series to include native vision input. Previous versions were text-only, requiring separate vision models and additional integration complexity for applications that needed to process images. The new multimodal capability is natively integrated — images are passed as part of the messages array in the standard API format alongside text, processed within the same 128K context window.
The vision encoder handles a broad range of input types including natural photographs, UI screenshots, scanned documents and PDFs, data visualizations, technical diagrams, handwritten notes, and mixed-content pages combining text and graphics. This breadth makes it suitable for diverse document processing workflows without needing to pre-classify input type or route to specialized models.
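A minimal sketch of a mixed image-and-text message, assuming an OpenAI-style content array with a base64 data URL; the exact field names should be checked against the provider's API reference:

```python
import base64
import json

def image_message(image_bytes: bytes, question: str) -> dict:
    """Build one user message combining an image and a text question.

    Assumes an OpenAI-style content array with a base64 data URL —
    verify field names against the provider's API documentation.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": question},
        ],
    }

msg = image_message(b"\x89PNG...", "What trend does this chart show?")
print(json.dumps(msg)[:80])
```

Because the image travels inside the ordinary messages array, the same endpoint, retry logic, and logging you use for text requests apply unchanged to multimodal ones.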
Document Intelligence
Extract structured data from invoices, contracts, forms, and reports. Answer questions about document content. Compare multiple document versions. Summarize lengthy PDF documents with complex layouts.
Data Visualization Analysis
Interpret charts, graphs, and dashboards. Extract underlying data points from visualizations. Identify trends and anomalies. Generate textual summaries of visual data suitable for accessibility or reporting.
UI and Screenshot Analysis
Generate code from UI mockups and screenshots. Perform automated visual QA by comparing expected and actual UI states. Extract text from UI elements. Describe user interface layouts for accessibility purposes.
Technical Diagram Comprehension
Understand architecture diagrams, flowcharts, entity relationship diagrams, circuit schematics, and engineering drawings. Answer questions about system design shown in visual form. Convert diagrams to text descriptions or code.
The combination of vision and extended thinking is particularly powerful for complex analytical tasks. An agent processing a multi-page financial report can ingest both the charts and text simultaneously, apply chain-of-thought reasoning across all inputs, and produce a coherent analysis that would otherwise require multiple model calls and output stitching at the application layer.
Coding Performance and Benchmarks
Code generation is one of the areas where Mistral Small 4 shows the most substantial improvement over its predecessor. The model was specifically trained on a curated high-quality code corpus spanning major programming languages, and its benchmark performance reflects this investment. On HumanEval, MBPP, and MultiPL-E, it scores competitively with models several times its active parameter size.
Code Generation
Strong performance on HumanEval and MBPP. Handles single-file functions through multi-file feature implementations. Follows docstring specifications accurately and generates idiomatic code in Python, TypeScript, Rust, Go, Java, and C++.
Debugging and Repository-Level Tasks
Extended thinking mode substantially improves performance on SWE-bench style repository-level tasks. The model can trace execution paths, identify root causes of bugs, and reason about side effects across multiple files.
Vision-to-Code
Generate UI code from screenshots and mockups. Convert design specifications in image format to React, Vue, or HTML/CSS. Analyze existing UI screenshots to produce components matching the visual design without separate tooling.
Testing and Code Review
Generates comprehensive unit and integration test suites. Performs code review with actionable, contextually relevant suggestions. Identifies security vulnerabilities, performance anti-patterns, and maintainability issues in submitted code.
For agentic coding workflows — where the model must plan a multi-step implementation, execute it across several files, and validate the result — Mistral Small 4 benefits significantly from its large context window. Fitting an entire repository or a complex codebase section in context enables the model to reason about dependencies and impacts in ways that are impossible with smaller windows. Compared to other open-weight coding-focused models like NVIDIA Nemotron Super 120B, Mistral Small 4 trades some pure coding benchmark performance for broader multimodal capability and a more permissive license.
128K Context Window and Multilingual Support
Mistral Small 4 supports a 128,000 token context window, shared across both text and image inputs. This is sufficient to hold approximately 100,000 words of text — equivalent to a full-length non-fiction book, a substantial codebase, or a lengthy document collection — alongside several high-resolution images. For most business applications, 128K context eliminates the need for chunking and retrieval-based approaches to handle long documents.
The multilingual capability spans more than 30 languages, with particularly strong performance across European languages and major Asian languages. Importantly, the multilingual training extends beyond conversational fluency — the model demonstrates strong reasoning and code generation quality in non-English languages, not just translation capability. This makes it genuinely suitable for global product deployments where the interface language is not English.
European (Tier 1)
English, French, German, Spanish, Italian, Portuguese, Dutch, Polish, Czech, Romanian, Swedish, Danish, Finnish, Norwegian
Asian and Middle Eastern
Chinese (Simplified and Traditional), Japanese, Korean, Arabic, Hindi, Vietnamese, Indonesian, Thai, Turkish
Reasoning and coding quality in non-English languages approaches English-language performance on most benchmarks, particularly for the Tier 1 European languages.
Context usage with images: Each image consumes a variable number of tokens depending on resolution, typically 500–2,000 tokens for standard-resolution images. For document-heavy workflows, count image tokens carefully against the 128K limit, especially when processing multi-page documents that include both scanned pages and their extracted text.
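A conservative pre-flight check along these lines can catch oversized requests before they are sent. The per-image cost below uses the high end of the 500–2,000 token range quoted above; actual costs vary with image resolution:

```python
CONTEXT_LIMIT = 128_000
TOKENS_PER_IMAGE_HIGH = 2_000  # pessimistic per-image estimate from the range above

def fits_in_context(text_tokens: int, n_images: int,
                    reserved_output: int = 4_000) -> bool:
    """Conservatively check whether a request fits in the 128K window.

    Uses the high end of the per-image estimate so the check fails early
    rather than letting the provider truncate input silently.
    """
    needed = text_tokens + n_images * TOKENS_PER_IMAGE_HIGH + reserved_output
    return needed <= CONTEXT_LIMIT

# 10 scanned pages plus extracted text fits; 50 pages does not:
print(fits_in_context(text_tokens=30_000, n_images=10))  # True
print(fits_in_context(text_tokens=30_000, n_images=50))  # False
```

Reserving headroom for the model's output, as the `reserved_output` parameter does, matters most when extended thinking is enabled, since reasoning tokens also consume the shared window.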
Deployment Options and Apache 2.0 License
The Apache 2.0 license is the most commercially permissive open source license in common use. For Mistral Small 4, this means businesses can download the model weights, fine-tune them on proprietary data, deploy them on private infrastructure, embed them in commercial products, and distribute modified versions — all without paying royalties or accepting usage restrictions. This stands in sharp contrast to models with custom community licenses that prohibit commercial use or require permission for deployments above certain user thresholds.
Hosted API
Available through Mistral's La Plateforme API with per-token pricing. Easiest starting point for prototyping and low-volume production use. Also available through Azure AI Foundry, AWS Bedrock, and Google Cloud Vertex AI for organizations already committed to major cloud platforms.
Self-Hosted Inference
Deploy on your own GPU infrastructure using vLLM, Text Generation Inference, or Ollama. Eliminates per-token costs at scale, keeps all data on-premises, and allows full custom configuration of inference parameters, batching, and caching strategies.
Fine-Tuning and Customization
Apache 2.0 permits fine-tuning on proprietary datasets. Use LoRA or QLoRA for parameter-efficient adaptation without retraining the full model. Fine-tuned versions can be deployed privately or distributed commercially, creating derivative product opportunities unavailable with restricted models.
Local Deployment
GGUF quantized versions enable local deployment on high-end workstations and Macs with large unified memory. Suitable for development environments, air-gapped deployments, and hardware with 64–192 GB of available memory depending on quantization level.
For regulated industries — healthcare, finance, legal, government — the ability to deploy entirely on-premises is often not just preferable but required. Mistral Small 4's combination of frontier capability and unrestricted private deployment makes it one of the few models that can realistically serve these markets. The Apache 2.0 license also simplifies legal review, as it is a well-understood standard license rather than a custom document requiring specialized legal analysis.
Comparing Mistral Small 4 to Alternatives
Choosing the right model involves weighing capability, cost, deployment flexibility, and license terms against your specific application requirements. Mistral Small 4 occupies a distinctive position in this landscape as a frontier-capable, fully open-weight model with no deployment restrictions.
vs. GPT-4o / Claude 3.5 Sonnet (Proprietary)
Proprietary models still lead on the most demanding benchmarks and have larger ecosystems of integrations. Mistral Small 4 offers comparable quality on most practical tasks with the advantage of no vendor lock-in, private deployment, and no per-token costs at scale. The gap has narrowed substantially.
vs. Llama 4 Scout / Llama 4 Maverick (Meta)
Llama 4 Scout is a lighter MoE model optimized for speed and low-cost inference. Maverick is a larger MoE whose community license imposes restrictions on very large-scale deployments. Mistral Small 4 competes closely on quality while maintaining its fully open Apache 2.0 license across all deployment scales.
vs. Qwen 2.5 72B (Alibaba)
Qwen 2.5 72B is a strong dense model with broad multilingual support, particularly for Chinese. Mistral Small 4's MoE architecture provides better quality per inference FLOP, and its European multilingual coverage is stronger. License terms are comparable.
vs. DeepSeek V3 / R1 (DeepSeek)
DeepSeek R1 specializes in reasoning and can outperform Mistral Small 4 in extended thinking mode on pure logic and math benchmarks. V3 leads on general capability at scale. For organizations with data sovereignty concerns, Mistral's European origin and Apache 2.0 license may be decisive.
The clearest competitive advantage for Mistral Small 4 is the combination of frontier-capable multimodal performance, extended thinking, and an unrestricted license — no other model in this quality tier currently offers all three simultaneously. For organizations where data sovereignty, private deployment, or the ability to fine-tune and redistribute matter, this combination is difficult to replicate with any competing offering.
Practical Use Cases and Business Applications
Understanding benchmark performance is useful, but the real question for most organizations is which specific workflows benefit most from deploying Mistral Small 4. The model's combination of capabilities maps cleanly to several high-value business applications that previously required either proprietary APIs or multiple specialized models.
Intelligent Document Processing
Process invoices, contracts, and forms with combined OCR and natural language understanding in a single pass. Extract structured data, validate against business rules, and generate workflow actions — all within one model call without stitching specialized OCR and NLP pipelines together.
Private Developer Tooling
Build code review, generation, and documentation tools on a self-hosted model to avoid sending proprietary code to external APIs. The extended thinking mode handles complex refactoring and architecture decisions that require systematic multi-step reasoning about code structure.
Multilingual Customer Support
Deploy a single model for customer-facing support across 30+ languages. The model's strong non-English reasoning quality means responses are genuinely helpful in the user's native language, not just grammatically fluent translations of English-language reasoning.
Research and Analysis Agents
Build agents that research topics, analyze reports, interpret data visualizations, and synthesize findings across large document collections. The 128K context and extended thinking mode make systematic analysis of complex multi-source inputs tractable within a single inference call.
For digital agencies and consultancies, the model opens a practical path to building AI-powered client deliverables and internal tools that can be fully branded and deployed on client infrastructure without ongoing API costs or data sharing concerns. The Apache 2.0 license means these tools can be packaged as products and sold or licensed to clients, unlike tools built on APIs with usage restrictions. Organizations looking to integrate AI capabilities into their broader digital transformation strategy will find Mistral Small 4 a uniquely flexible foundation.
Conclusion
Mistral Small 4 represents a meaningful advance in the open-weight model landscape. Combining 119B MoE parameters with native vision, extended thinking, a 128K context, and an Apache 2.0 license, it addresses the four most common limitations of previous open models: text-only input, lack of reasoning depth, constrained context, and restrictive licensing. The result is a model that serves as a credible foundation for a wide range of production AI applications without the privacy trade-offs or vendor dependencies of proprietary APIs.
For teams evaluating open-weight models in 2026, Mistral Small 4 belongs on the shortlist for any deployment scenario where full control over infrastructure, data residency, and fine-tuning are relevant constraints. Its compute efficiency through MoE means the operational costs of running it at scale are competitive with much smaller dense models, and the quality ceiling is high enough to handle the sophisticated tasks that justify that deployment effort.
Ready to Deploy Open-Weight AI?
Selecting and deploying the right AI model is one component of a broader digital transformation strategy. Our team helps businesses evaluate, fine-tune, and integrate open-weight models like Mistral Small 4 into production workflows that deliver measurable ROI.