Google Gemma 4: Apache 2.0 Open-Source Complete Guide
A complete guide to Google Gemma 4, covering all four variants from 2.3B to 31B parameters: Apache 2.0 licensing, 128K-256K context windows, multimodal input, and the #3 open-model ranking on the Arena leaderboard.
Key Takeaways
- GPQA Diamond (31B): 85.2%
- AIME 2026 (31B): 89.2%
- Arena Open Model Rank: #3
- Max Context Tokens: 256K
The Apache 2.0 License Shift: Why It Matters More Than Benchmarks
When Google released Gemma 1, 2, and 3, each came with a custom license that imposed meaningful restrictions on commercial use. The Gemma Terms of Use limited redistribution, required attribution in specific formats, and restricted use for applications exceeding certain monthly active user thresholds. For enterprises evaluating open-source AI, these restrictions created legal ambiguity that often steered procurement teams toward alternatives with cleaner licensing.
Gemma 4 changes this entirely. By releasing under the Apache 2.0 license, an OSI-approved open-source license, Google has removed every meaningful barrier to commercial adoption. Apache 2.0 grants irrevocable rights to use, reproduce, modify, and distribute the software in any form, including for commercial purposes, with no royalty requirements and no user limits. For organizations building AI-powered digital transformation initiatives, this eliminates the single largest non-technical risk in open-model deployment.
Under the previous Gemma Terms of Use:
- Redistribution restrictions on modified weights
- Specific attribution format requirements
- User threshold limitations for commercial use
- Ambiguous enterprise compliance requirements

Under Apache 2.0:
- Full commercial use with no user limits
- Unrestricted modification and redistribution
- Clear patent grant protections
- Enterprise-friendly compliance profile
The business implications extend beyond legal departments. Apache 2.0 enables three capabilities that were previously restricted: fine-tuning on proprietary data and distributing the resulting weights, embedding models in commercial products without per-user licensing calculations, and building derivative models that can be released under any compatible license. For agencies and enterprises building agentic AI systems with open-source foundations, this is a significant development in the model selection landscape.
All Four Variants Compared: From Edge to Enterprise
Gemma 4 ships in four distinct configurations, each targeting different deployment scenarios. The family spans two architectural approaches, Dense and Mixture-of-Experts (MoE), and introduces the concept of "effective parameters" for the smaller variants, reflecting their use of Per-Layer Embeddings (PLE) to maximize parameter efficiency.
| Specification | E2B | E4B | 26B MoE | 31B Dense |
|---|---|---|---|---|
| Parameters | 2.3B effective | 4.5B effective | 26B total / 4B active | 31B |
| Architecture | Dense + PLE | Dense + PLE | Mixture-of-Experts | Dense |
| Context Window | 128K tokens | 128K tokens | 256K tokens | 256K tokens |
| Modalities | Text, Image, Video, Audio | Text, Image, Video, Audio | Text, Image, Video | Text, Image, Video |
| VRAM (4-bit) | ~5 GB | ~5 GB | ~18 GB | ~20 GB |
| Target Deployment | Mobile / IoT | Edge / Desktop | Server (cost-efficient) | Server (max capability) |
| Arena Rank (Open) | N/A | N/A | #6 | #3 |
E2B and E4B: On-Device Intelligence
The "E" in E2B and E4B stands for "effective" parameters. These models use Per-Layer Embeddings (PLE), a technique that maximizes parameter efficiency for on-device deployments. At approximately 5GB of VRAM with 4-bit quantization, both variants can run on modern smartphones and lightweight edge hardware. Notably, E2B and E4B are the only variants that support audio input (up to 30 seconds), making them suitable for voice-driven mobile applications.
26B MoE: Cost-Efficient Server Deployment
The 26B Mixture-of-Experts variant contains 26 billion total parameters but activates only 4 billion per token. This architecture delivers performance close to the 31B Dense model at significantly lower inference cost. With 256K context and approximately 18GB VRAM at 4-bit quantization, it fits on a single consumer GPU while reportedly ranking #6 among open models on the Arena AI leaderboard. For organizations running high-throughput inference at scale, the MoE architecture offers the strongest cost-to-performance ratio in the Gemma 4 family.
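The cost advantage of the MoE variant follows directly from the active-parameter count. As a rough sketch, assuming per-token inference compute scales linearly with parameters touched (a simplification that ignores routing overhead):

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of an MoE model's weights activated per token."""
    return active_params_b / total_params_b

# Gemma 4 26B MoE: 26B total parameters, 4B active per token
frac = active_fraction(26, 4)

# Relative per-token compute vs. the 31B dense model, under the
# linear-scaling assumption above (dense models touch all weights)
relative_cost = 4 / 31

print(f"{frac:.0%} of weights active; ~{relative_cost:.0%} of dense-31B compute per token")
```

This is why the MoE variant can approach dense-31B quality while serving tokens at a fraction of the compute, though note that all 26B weights must still be resident in memory.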
31B Dense: Maximum Capability
The flagship 31B Dense model delivers the highest benchmark scores and reportedly holds the #3 ranking among open models on Arena AI. At approximately 20GB VRAM with 4-bit quantization, it remains accessible on high-end consumer hardware like the NVIDIA RTX 4090. Its 256K token context window supports complex document analysis, lengthy code generation, and multi-turn agentic workflows.
Benchmark Performance Analysis
Gemma 4's benchmark results position it as a strong contender across reasoning, mathematics, science, and code generation. The following analysis covers the reported scores for the two largest variants, which are the most relevant for server-side enterprise applications.
| Benchmark | Category | 31B Dense | 26B MoE |
|---|---|---|---|
| GPQA Diamond | Science Reasoning | 85.2% | 82.6% |
| AIME 2026 | Mathematics | 89.2% | 88.3% |
| LiveCodeBench v6 | Code Generation | 80.0% | 77.1% |
| Arena AI (Elo) | Human Preference | >1,440 | >1,440 |
What These Numbers Mean in Practice
An 89.2% score on AIME 2026 (without tool use) places Gemma 4 31B in the upper echelon of mathematical reasoning. For context, competition-level math problems at this difficulty are designed to challenge advanced students and many proprietary models. The GPQA Diamond benchmark, which tests graduate-level science reasoning, shows similarly strong results at 85.2%. These scores reportedly outperform many models with significantly more parameters.
The LiveCodeBench v6 score of 80.0% reflects practical code generation ability across real-world programming tasks. For teams evaluating AI coding assistants and development tools, this positions Gemma 4 as a viable self-hosted alternative to proprietary coding models, particularly where data privacy or licensing concerns preclude API-based solutions.
- 85.2% on GPQA Diamond demonstrates strong graduate-level reasoning across physics, chemistry, and biology domains.
- 89.2% on AIME 2026 without tool use, placing it among the strongest open models for mathematical reasoning.
- 80.0% on LiveCodeBench v6 validates strong code generation across practical, real-world programming tasks.
Multimodal and Agentic Capabilities
Gemma 4 is natively multimodal across the entire family. All four variants process images and video (up to 60 seconds at 1 FPS), with the smaller E2B and E4B models adding audio support (up to 30 seconds). The models support interleaved multimodal input, meaning text and images can be freely mixed in any order within a single prompt.
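To illustrate what interleaved input looks like in practice, here is a sketch of a mixed text-and-image request in the OpenAI-style message format that self-hosted servers commonly expose. The content-part schema shown is an assumption of that API style, not something Gemma-specific, and the image URLs are placeholders:

```python
import base64

def image_part(path: str) -> dict:
    """Encode a local image as a base64 data-URL content part
    (OpenAI-style multimodal message format)."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}

def interleaved_message(*parts) -> dict:
    """Build one user message whose content freely mixes
    text strings and image parts, preserving their order."""
    content = [{"type": "text", "text": p} if isinstance(p, str) else p
               for p in parts]
    return {"role": "user", "content": content}

msg = interleaved_message(
    "Compare the chart below",
    {"type": "image_url", "image_url": {"url": "https://example.com/q1.png"}},
    "with this one",
    {"type": "image_url", "image_url": {"url": "https://example.com/q2.png"}},
    "and summarize the revenue trend.",
)
```

Because order is preserved, the model sees the text and images exactly as a reader would: prose, figure, prose, figure.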
Visual Understanding
Gemma 4 handles variable-resolution inputs and reportedly excels at visual tasks including optical character recognition (OCR), chart interpretation, document analysis, and diagram understanding. For enterprise workflows involving document processing, invoice extraction, or visual quality assurance, the ability to run these capabilities on self-hosted infrastructure under Apache 2.0 opens deployment scenarios that were previously limited to proprietary vision APIs.
Agentic Function Calling
All Gemma 4 variants include native support for function calling, structured JSON output, and system instructions. According to Google's developer documentation, these capabilities enable building autonomous agents that interact with tools, APIs, and external services. The inclusion of constrained decoding ensures structured outputs remain valid and predictable, which is critical for production agent pipelines.
```shell
# Example: Gemma 4 function calling with Ollama
# Install the model
ollama pull gemma4:31b
```

```python
# Define tools in your application
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": "Search internal documents by query",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "max_results": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        }
    }
]

# The model natively understands tool schemas
# and generates structured function calls
```

For organizations exploring AI agent orchestration and workflow automation, Gemma 4's combination of Apache 2.0 licensing, native function calling, and strong reasoning benchmarks makes it a compelling candidate for self-hosted agent infrastructure. The ability to fine-tune on domain-specific tool schemas and deploy without licensing restrictions is particularly valuable for regulated industries.
- Variable-resolution image understanding
- Video processing up to 60 seconds (1 FPS)
- Audio input on E2B/E4B (up to 30 seconds)
- Interleaved text and image prompts
- OCR, chart, and document analysis
- Native function calling with tool schemas
- Structured JSON output generation
- Native system instruction support
- Constrained decoding for reliable outputs
- LiteRT-LM CLI tool calling support
Deployment and Hardware Guide
One of Gemma 4's practical advantages is its breadth of deployment options. From mobile phones to multi-GPU servers, the four-variant family covers most hardware configurations. Gemma 4 is available through Google AI Studio, Hugging Face, Ollama, and major cloud providers, with Day 0 optimization support from NVIDIA, AMD, and Arm.
Running Gemma 4 Locally With Ollama
```shell
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Gemma 4 variants
ollama pull gemma4:2b   # E2B - ~5GB VRAM
ollama pull gemma4:4b   # E4B - ~5GB VRAM
ollama pull gemma4:26b  # 26B MoE - ~18GB VRAM
ollama pull gemma4:31b  # 31B Dense - ~20GB VRAM

# Run with custom parameters
ollama run gemma4:31b --context-length 65536

# Expose OpenAI-compatible API
# Default: http://localhost:11434/v1/chat/completions
```

Production Deployment With vLLM
```shell
# Install vLLM with Gemma 4 support
pip install vllm

# Serve the 31B model with tensor parallelism
vllm serve google/gemma-4-31B-it \
    --tensor-parallel-size 2 \
    --max-model-len 65536 \
    --dtype bfloat16

# Or serve the MoE variant for cost efficiency
vllm serve google/gemma-4-26B-A4B-it \
    --max-model-len 65536 \
    --dtype bfloat16
```

Hardware Requirements by Deployment Tier
| Deployment Tier | Model | GPU (4-bit) | GPU (8-bit) |
|---|---|---|---|
| Mobile / IoT | E2B | ~5 GB | ~8 GB |
| Edge / Desktop | E4B | ~5 GB | ~15 GB |
| Single GPU Server | 26B MoE | ~18 GB (RTX 4090) | ~28 GB (A100 40GB) |
| Multi-GPU / Cloud | 31B Dense | ~20 GB (RTX 4090) | ~34 GB (A100 40GB) |
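The VRAM figures in the table follow roughly from bytes-per-parameter at each quantization level plus runtime overhead. A back-of-the-envelope estimator, where the ~25% overhead factor for KV cache, activations, and runtime is an assumption rather than a published figure:

```python
def estimate_vram_gb(params_b: float, bits: int, overhead: float = 1.25) -> float:
    """Rough VRAM estimate for serving a model.

    Weights occupy (params in billions) * (bits / 8) GB; the
    overhead factor approximates KV cache, activations, and runtime.
    """
    weight_gb = params_b * bits / 8
    return round(weight_gb * overhead, 1)

# 31B dense at 4-bit: ~19.4 GB, close to the ~20 GB in the table
print(estimate_vram_gb(31, 4))

# 26B MoE at 8-bit: all 26B weights must be resident even though
# only 4B are active per token, so MoE saves compute, not VRAM
print(estimate_vram_gb(26, 8))
```

The same formula explains why the 26B MoE and 31B dense variants land within a couple of gigabytes of each other at 4-bit despite very different inference costs.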
Cloud Platform Access
Gemma 4 is accessible through multiple cloud platforms at launch. Google AI Studio provides direct access to the 31B and 26B variants for experimentation. Google AI Edge Gallery supports the E2B and E4B variants for on-device testing. Hugging Face hosts all variants with inference endpoints and downloadable weights. Major cloud inference providers including AWS, Google Cloud, and Azure are expected to offer hosted Gemma 4 endpoints.
Competitive Landscape: Gemma 4 vs. Llama 4 vs. Qwen 3.5
The open model landscape in early 2026 is intensely competitive. Gemma 4 enters a field where Meta's Llama 4, Alibaba's Qwen 3.5 and 3.6 families, and emerging competitors from DeepSeek and Mistral all target overlapping use cases. Understanding the trade-offs helps inform model selection for production deployments.
| Factor | Gemma 4 31B | Llama 4 Scout | Qwen 3.5 32B |
|---|---|---|---|
| License | Apache 2.0 | Llama License (700M MAU limit) | Apache 2.0 |
| Parameters | 31B Dense | 109B total / 17B active | 32B Dense |
| Context | 256K | 10M | 262K |
| Multimodal | Text, Image, Video | Text, Image | Text, Image |
| Inference Speed | Fast (Dense 31B) | Slower (MoE routing overhead) | Fast (Dense 32B) |
| Arena Rank (Open) | #3 | Varies | Competitive |
When to Choose Each Model
Choose Gemma 4 when you need:
- Maximum intelligence per parameter
- Edge-to-server deployment
- Video processing
- Clean Apache 2.0 licensing for compliance

Choose Llama 4 Scout when you need:
- Extreme context length (10M tokens)
- Processing entire codebases at once
- Meta ecosystem integration
- (and remain under the 700M MAU threshold)

Choose Qwen 3.5 when you need:
- Mathematics-heavy workloads
- The widest range of model sizes
- Strong multilingual support
- Apache 2.0 with a broader ecosystem
For a deeper analysis of the competitive dynamics among frontier open models, see our open model comparison guide. The rapid pace of releases, including the 12 AI models released in a single week in March 2026, underscores the importance of evaluating models against your specific workload rather than relying solely on aggregate benchmarks.
Business Implications and Strategy
Gemma 4's release under Apache 2.0 has implications that extend well beyond model selection. It reflects a broader shift in how major technology companies approach open-source AI, and creates specific opportunities for organizations at different stages of AI adoption.
For Enterprises Evaluating Self-Hosted AI
The combination of Apache 2.0 licensing, strong benchmarks, and efficient hardware requirements makes Gemma 4 a strong candidate for organizations exploring alternatives to proprietary API dependencies. Running inference on-premises or in a private cloud eliminates per-token API costs, provides full data sovereignty, and removes rate limiting constraints. With the 26B MoE variant fitting on a single consumer GPU, the capital expenditure barrier is significantly lower than previous generations of capable open models.
For Startups and Product Teams
Apache 2.0 enables product teams to embed Gemma 4 directly into commercial products without licensing overhead. This is particularly relevant for SaaS platforms that integrate AI features, mobile applications requiring on-device intelligence (using E2B or E4B), and development tools that benefit from code generation capabilities. The absence of user-count restrictions under Apache 2.0 means licensing costs do not scale with product success.
For Marketing and Content Teams
Gemma 4's multimodal capabilities open practical applications in content production workflows. The ability to analyze images, process video, and generate structured outputs means teams can build custom tools for visual content analysis, competitor monitoring, and automated reporting. For agencies managing content marketing at scale, a self-hosted multimodal model that can be fine-tuned on brand guidelines represents a meaningful operational advantage.
Cost Comparison: API vs. Self-Hosted Gemma 4
Proprietary API (1M tokens/day)
- Input: ~$3-15/M tokens
- Output: ~$10-60/M tokens
- Monthly estimate: $300-1,800+
- Data sent to third-party servers
Self-Hosted Gemma 4 26B MoE
- GPU: RTX 4090 (~$1,600 one-time)
- Electricity: ~$15-30/month
- Unlimited tokens after hardware cost
- Full data sovereignty maintained
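Using the figures above, the break-even point for self-hosting is straightforward to estimate. This sketch takes the midpoints of the quoted ranges; real workloads, utilization, and ops costs will shift the result:

```python
def breakeven_months(gpu_cost: float, api_monthly: float,
                     power_monthly: float) -> float:
    """Months until a one-time GPU purchase is recovered by
    avoided API fees, net of electricity."""
    return gpu_cost / (api_monthly - power_monthly)

# Midpoints from the comparison above: ~$1,050/month in API fees,
# ~$22.50/month electricity, $1,600 for an RTX 4090
months = breakeven_months(1600, 1050, 22.5)
print(f"Break-even in ~{months:.1f} months")
```

Even at the low end of the API range (~$300/month), the hardware pays for itself within a year at this volume, which is why the calculus shifts quickly for sustained workloads.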
The Broader Open-Source AI Trend
Google's move to Apache 2.0 accelerates a trend where the strongest open models increasingly rival proprietary offerings. This has strategic implications for how organizations budget for AI infrastructure, negotiate with cloud providers, and build internal AI capabilities. As explored in our analysis of enterprise AI agent adoption trends, the availability of capable, permissively licensed models is one of the key enablers of the shift toward embedded AI across business applications.