Google Gemma 4: Apache 2.0 Open-Source Complete Guide
A complete guide to Google Gemma 4, covering all four variants from 2.3B to 31B parameters: Apache 2.0 licensing, 128K-256K context windows, multimodal input, and the #3 open-model ranking on the Arena leaderboard.
Key Takeaways
- GPQA Diamond (31B): 85.2%
- AIME 2026 (31B): 89.2%
- Arena Open Model Rank: #3
- Max Context Tokens: 256K
The Apache 2.0 License Shift: Why It Matters More Than Benchmarks
When Google released Gemma 1, 2, and 3, each came with a custom license that imposed meaningful restrictions on commercial use. The Gemma Terms of Use limited redistribution, required attribution in specific formats, and restricted use for applications exceeding certain monthly active user thresholds. For enterprises evaluating open-source AI, these restrictions created legal ambiguity that often steered procurement teams toward alternatives with cleaner licensing.
Gemma 4 changes this entirely. By releasing under the Apache 2.0 license, an OSI-approved open-source license, Google has removed every meaningful barrier to commercial adoption. Apache 2.0 grants irrevocable rights to use, reproduce, modify, and distribute the software in any form, including for commercial purposes, with no royalty requirements and no user limits. For organizations building AI-powered digital transformation initiatives, this eliminates the single largest non-technical risk in open-model deployment.
Under the previous Gemma Terms of Use:
- Redistribution restrictions on modified weights
- Specific attribution format requirements
- User threshold limitations for commercial use
- Ambiguous enterprise compliance requirements

Under Apache 2.0:
- Full commercial use with no user limits
- Unrestricted modification and redistribution
- Clear patent grant protections
- Enterprise-friendly compliance profile
The business implications extend beyond legal departments. Apache 2.0 enables three capabilities that were previously restricted: fine-tuning on proprietary data and distributing the resulting weights, embedding models in commercial products without per-user licensing calculations, and building derivative models that can be released under any compatible license. For agencies and enterprises building agentic AI systems with open-source foundations, this is a significant development in the model selection landscape.
All Four Variants Compared: From Edge to Enterprise
Gemma 4 ships in four distinct configurations, each targeting different deployment scenarios. The family spans two architectural approaches, Dense and Mixture-of-Experts (MoE), and introduces the concept of "effective parameters" for the smaller variants, reflecting their use of Per-Layer Embeddings (PLE) to maximize parameter efficiency.
| Specification | E2B | E4B | 26B MoE | 31B Dense |
|---|---|---|---|---|
| Parameters | 2.3B effective | 4.5B effective | 26B total / 4B active | 31B |
| Architecture | Dense + PLE | Dense + PLE | Mixture-of-Experts | Dense |
| Context Window | 128K tokens | 128K tokens | 256K tokens | 256K tokens |
| Modalities | Text, Image, Video, Audio | Text, Image, Video, Audio | Text, Image, Video | Text, Image, Video |
| VRAM (4-bit) | ~5 GB | ~5 GB | ~18 GB | ~20 GB |
| Target Deployment | Mobile / IoT | Edge / Desktop | Server (cost-efficient) | Server (max capability) |
| Arena Rank (Open) | N/A | N/A | #6 | #3 |
E2B and E4B: On-Device Intelligence
The "E" in E2B and E4B stands for "effective" parameters. These models use Per-Layer Embeddings (PLE), a technique that maximizes parameter efficiency for on-device deployments. At approximately 5GB of VRAM with 4-bit quantization, both variants can run on modern smartphones and lightweight edge hardware. Notably, E2B and E4B are the only variants that support audio input (up to 30 seconds), making them suitable for voice-driven mobile applications.
26B MoE: Cost-Efficient Server Deployment
The 26B Mixture-of-Experts variant contains 26 billion total parameters but activates only 4 billion per token. This architecture delivers performance close to the 31B Dense model at significantly lower inference cost. With 256K context and approximately 18GB VRAM at 4-bit quantization, it fits on a single consumer GPU while reportedly ranking #6 among open models on the Arena AI leaderboard. For organizations running high-throughput inference at scale, the MoE architecture offers the strongest cost-to-performance ratio in the Gemma 4 family.
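The cost advantage of the MoE variant follows directly from the active-parameter count. As a rough sketch, assuming per-token inference compute scales linearly with parameters touched (a simplification that ignores routing overhead):

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of an MoE model's weights activated per token."""
    return active_params_b / total_params_b

# Gemma 4 26B MoE: 26B total parameters, 4B active per token
frac = active_fraction(26, 4)

# Relative per-token compute vs. the 31B dense model, under the
# linear-scaling assumption above (dense models touch all weights)
relative_cost = 4 / 31

print(f"{frac:.0%} of weights active; ~{relative_cost:.0%} of dense-31B compute per token")
```

This is why the MoE variant can approach dense-31B quality while serving tokens at a fraction of the compute, though note that all 26B weights must still be resident in memory.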
31B Dense: Maximum Capability
The flagship 31B Dense model delivers the highest benchmark scores and reportedly holds the #3 ranking among open models on Arena AI. At approximately 20GB VRAM with 4-bit quantization, it remains accessible on high-end consumer hardware like the NVIDIA RTX 4090. Its 256K token context window supports complex document analysis, lengthy code generation, and multi-turn agentic workflows.
Benchmark Performance Analysis
Gemma 4's benchmark results position it as a strong contender across reasoning, mathematics, science, and code generation. The following analysis covers the reported scores for the two largest variants, which are the most relevant for server-side enterprise applications.
| Benchmark | Category | 31B Dense | 26B MoE |
|---|---|---|---|
| GPQA Diamond | Science Reasoning | 85.2% | 82.6% |
| AIME 2026 | Mathematics | 89.2% | 88.3% |
| LiveCodeBench v6 | Code Generation | 80.0% | 77.1% |
| Arena AI (Elo) | Human Preference | >1,440 | >1,440 |
What These Numbers Mean in Practice
An 89.2% score on AIME 2026 (without tool use) places Gemma 4 31B in the upper echelon of mathematical reasoning. For context, competition-level math problems at this difficulty are designed to challenge advanced students and many proprietary models. The GPQA Diamond benchmark, which tests graduate-level science reasoning, shows similarly strong results at 85.2%. These scores reportedly outperform many models with significantly more parameters.
The LiveCodeBench v6 score of 80.0% reflects practical code generation ability across real-world programming tasks. For teams evaluating AI coding assistants and development tools, this positions Gemma 4 as a viable self-hosted alternative to proprietary coding models, particularly where data privacy or licensing concerns preclude API-based solutions.
- 85.2% on GPQA Diamond demonstrates strong graduate-level reasoning across physics, chemistry, and biology domains.
- 89.2% on AIME 2026 without tool use, placing it among the strongest open models for mathematical reasoning.
- 80.0% on LiveCodeBench v6 validates strong code generation across practical, real-world programming tasks.
Multimodal and Agentic Capabilities
Gemma 4 is natively multimodal across the entire family. All four variants process images and video (up to 60 seconds at 1 FPS), with the smaller E2B and E4B models adding audio support (up to 30 seconds). The models support interleaved multimodal input, meaning text and images can be freely mixed in any order within a single prompt.
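To illustrate what interleaved input looks like in practice, here is a sketch of a mixed text-and-image request in the OpenAI-style message format that self-hosted servers commonly expose. The content-part schema shown is an assumption of that API style, not something Gemma-specific, and the image URLs are placeholders:

```python
import base64

def image_part(path: str) -> dict:
    """Encode a local image as a base64 data-URL content part
    (OpenAI-style multimodal message format)."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}

def interleaved_message(*parts) -> dict:
    """Build one user message whose content freely mixes
    text strings and image parts, preserving their order."""
    content = [{"type": "text", "text": p} if isinstance(p, str) else p
               for p in parts]
    return {"role": "user", "content": content}

msg = interleaved_message(
    "Compare the chart below",
    {"type": "image_url", "image_url": {"url": "https://example.com/q1.png"}},
    "with this one",
    {"type": "image_url", "image_url": {"url": "https://example.com/q2.png"}},
    "and summarize the revenue trend.",
)
```

Because order is preserved, the model sees the text and images exactly as a reader would: prose, figure, prose, figure.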
Visual Understanding
Gemma 4 handles variable-resolution inputs and reportedly excels at visual tasks including optical character recognition (OCR), chart interpretation, document analysis, and diagram understanding. For enterprise workflows involving document processing, invoice extraction, or visual quality assurance, the ability to run these capabilities on self-hosted infrastructure under Apache 2.0 opens deployment scenarios that were previously limited to proprietary vision APIs.
Agentic Function Calling
All Gemma 4 variants include native support for function calling, structured JSON output, and system instructions. According to Google's developer documentation, these capabilities enable building autonomous agents that interact with tools, APIs, and external services. The inclusion of constrained decoding ensures structured outputs remain valid and predictable, which is critical for production agent pipelines.
```shell
# Example: Gemma 4 function calling with Ollama
# Install the model
ollama pull gemma4:31b
```

```python
# Define tools in your application
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": "Search internal documents by query",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "max_results": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        }
    }
]

# The model natively understands tool schemas
# and generates structured function calls
```

For organizations exploring AI agent orchestration and workflow automation, Gemma 4's combination of Apache 2.0 licensing, native function calling, and strong reasoning benchmarks makes it a compelling candidate for self-hosted agent infrastructure. The ability to fine-tune on domain-specific tool schemas and deploy without licensing restrictions is particularly valuable for regulated industries.
- Variable-resolution image understanding
- Video processing up to 60 seconds (1 FPS)
- Audio input on E2B/E4B (up to 30 seconds)
- Interleaved text and image prompts
- OCR, chart, and document analysis
- Native function calling with tool schemas
- Structured JSON output generation
- Native system instruction support
- Constrained decoding for reliable outputs
- LiteRT-LM CLI tool calling support
Deployment and Hardware Guide
One of Gemma 4's practical advantages is its breadth of deployment options. From mobile phones to multi-GPU servers, the four-variant family covers most hardware configurations. Gemma 4 is available through Google AI Studio, Hugging Face, Ollama, and major cloud providers, with Day 0 optimization support from NVIDIA, AMD, and Arm.
Running Gemma 4 Locally With Ollama
```shell
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Gemma 4 variants
ollama pull gemma4:2b   # E2B - ~5GB VRAM
ollama pull gemma4:4b   # E4B - ~5GB VRAM
ollama pull gemma4:26b  # 26B MoE - ~18GB VRAM
ollama pull gemma4:31b  # 31B Dense - ~20GB VRAM

# Run with custom parameters
ollama run gemma4:31b --context-length 65536

# Expose OpenAI-compatible API
# Default: http://localhost:11434/v1/chat/completions
```

Production Deployment With vLLM
```shell
# Install vLLM with Gemma 4 support
pip install vllm

# Serve the 31B model with tensor parallelism
vllm serve google/gemma-4-31B-it \
    --tensor-parallel-size 2 \
    --max-model-len 65536 \
    --dtype bfloat16

# Or serve the MoE variant for cost efficiency
vllm serve google/gemma-4-26B-A4B-it \
    --max-model-len 65536 \
    --dtype bfloat16
```

Hardware Requirements by Deployment Tier
| Deployment Tier | Model | GPU (4-bit) | GPU (8-bit) |
|---|---|---|---|
| Mobile / IoT | E2B | ~5 GB | ~8 GB |
| Edge / Desktop | E4B | ~5 GB | ~15 GB |
| Single GPU Server | 26B MoE | ~18 GB (RTX 4090) | ~28 GB (A100 40GB) |
| Multi-GPU / Cloud | 31B Dense | ~20 GB (RTX 4090) | ~34 GB (A100 40GB) |
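The VRAM figures in the table follow roughly from bytes-per-parameter at each quantization level plus runtime overhead. A back-of-the-envelope estimator, where the ~25% overhead factor for KV cache, activations, and runtime is an assumption rather than a published figure:

```python
def estimate_vram_gb(params_b: float, bits: int, overhead: float = 1.25) -> float:
    """Rough VRAM estimate for serving a model.

    Weights occupy (params in billions) * (bits / 8) GB; the
    overhead factor approximates KV cache, activations, and runtime.
    """
    weight_gb = params_b * bits / 8
    return round(weight_gb * overhead, 1)

# 31B dense at 4-bit: ~19.4 GB, close to the ~20 GB in the table
print(estimate_vram_gb(31, 4))

# 26B MoE at 8-bit: all 26B weights must be resident even though
# only 4B are active per token, so MoE saves compute, not VRAM
print(estimate_vram_gb(26, 8))
```

The same formula explains why the 26B MoE and 31B dense variants land within a couple of gigabytes of each other at 4-bit despite very different inference costs.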
Cloud Platform Access
Gemma 4 is accessible through multiple cloud platforms at launch. Google AI Studio provides direct access to the 31B and 26B variants for experimentation. Google AI Edge Gallery supports the E2B and E4B variants for on-device testing. Hugging Face hosts all variants with inference endpoints and downloadable weights. Major cloud inference providers including AWS, Google Cloud, and Azure are expected to offer hosted Gemma 4 endpoints.
Competitive Landscape: Gemma 4 vs. Llama 4 vs. Qwen 3.5
The open model landscape in early 2026 is intensely competitive. Gemma 4 enters a field where Meta's Llama 4, Alibaba's Qwen 3.5 and 3.6 families, and emerging competitors from DeepSeek and Mistral all target overlapping use cases. Understanding the trade-offs helps inform model selection for production deployments.
| Factor | Gemma 4 31B | Llama 4 Scout | Qwen 3.5 32B |
|---|---|---|---|
| License | Apache 2.0 | Llama License (700M MAU limit) | Apache 2.0 |
| Parameters | 31B Dense | 109B total / 17B active | 32B Dense |
| Context | 256K | 10M | 262K |
| Multimodal | Text, Image, Video | Text, Image | Text, Image |
| Inference Speed | Fast (Dense 31B) | Slower (MoE routing overhead) | Fast (Dense 32B) |
| Arena Rank (Open) | #3 | Varies | Competitive |
When to Choose Each Model
Choose Gemma 4 when you need:
- Maximum intelligence per parameter
- Edge-to-server deployment
- Video processing
- Clean Apache 2.0 licensing for compliance

Choose Llama 4 Scout when you need:
- Extreme context length (10M tokens)
- Processing entire codebases at once
- Meta ecosystem integration
- (and remain under the 700M MAU threshold)

Choose Qwen 3.5 when you need:
- Mathematics-heavy workloads
- The widest range of model sizes
- Strong multilingual support
- Apache 2.0 with a broader ecosystem
For a deeper analysis of the competitive dynamics among frontier open models, see our open model comparison guide. The rapid pace of releases, including the 12 AI models released in a single week in March 2026, underscores the importance of evaluating models against your specific workload rather than relying solely on aggregate benchmarks.
Business Implications and Strategy
Gemma 4's release under Apache 2.0 has implications that extend well beyond model selection. It reflects a broader shift in how major technology companies approach open-source AI, and creates specific opportunities for organizations at different stages of AI adoption.
For Enterprises Evaluating Self-Hosted AI
The combination of Apache 2.0 licensing, strong benchmarks, and efficient hardware requirements makes Gemma 4 a strong candidate for organizations exploring alternatives to proprietary API dependencies. Running inference on-premises or in a private cloud eliminates per-token API costs, provides full data sovereignty, and removes rate limiting constraints. With the 26B MoE variant fitting on a single consumer GPU, the capital expenditure barrier is significantly lower than previous generations of capable open models.
For Startups and Product Teams
Apache 2.0 enables product teams to embed Gemma 4 directly into commercial products without licensing overhead. This is particularly relevant for SaaS platforms that integrate AI features, mobile applications requiring on-device intelligence (using E2B or E4B), and development tools that benefit from code generation capabilities. The absence of user-count restrictions under Apache 2.0 means licensing costs do not scale with product success.
For Marketing and Content Teams
Gemma 4's multimodal capabilities open practical applications in content production workflows. The ability to analyze images, process video, and generate structured outputs means teams can build custom tools for visual content analysis, competitor monitoring, and automated reporting. For agencies managing content marketing at scale, a self-hosted multimodal model that can be fine-tuned on brand guidelines represents a meaningful operational advantage.
Cost Comparison: API vs. Self-Hosted Gemma 4
Proprietary API (1M tokens/day)
- Input: ~$3-15/M tokens
- Output: ~$10-60/M tokens
- Monthly estimate: $300-1,800+
- Data sent to third-party servers
Self-Hosted Gemma 4 26B MoE
- GPU: RTX 4090 (~$1,600 one-time)
- Electricity: ~$15-30/month
- Unlimited tokens after hardware cost
- Full data sovereignty maintained
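Using the figures above, the break-even point for self-hosting is straightforward to estimate. This sketch takes the midpoints of the quoted ranges; real workloads, utilization, and ops costs will shift the result:

```python
def breakeven_months(gpu_cost: float, api_monthly: float,
                     power_monthly: float) -> float:
    """Months until a one-time GPU purchase is recovered by
    avoided API fees, net of electricity."""
    return gpu_cost / (api_monthly - power_monthly)

# Midpoints from the comparison above: ~$1,050/month in API fees,
# ~$22.50/month electricity, $1,600 for an RTX 4090
months = breakeven_months(1600, 1050, 22.5)
print(f"Break-even in ~{months:.1f} months")
```

Even at the low end of the API range (~$300/month), the hardware pays for itself within a year at this volume, which is why the calculus shifts quickly for sustained workloads.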
The Broader Open-Source AI Trend
Google's move to Apache 2.0 accelerates a trend where the strongest open models increasingly rival proprietary offerings. This has strategic implications for how organizations budget for AI infrastructure, negotiate with cloud providers, and build internal AI capabilities. As explored in our analysis of enterprise AI agent adoption trends, the availability of capable, permissively licensed models is one of the key enablers of the shift toward embedded AI across business applications.