Frontier-model energy use stopped scaling linearly with capability in mid-2025 — the architectural compression of mixture-of-experts and aggressive KV-cache optimization broke the previous trend line. But absolute consumption is still climbing 38% year-over-year because deployment volume is outpacing per-query efficiency by a wider margin than ever.
This 2026 quarterly compiles per-query energy across the four leading frontier models, the training-vs-inference split (which crossed over in 2025 — inference now dominates), water-use data from datacenter operators, and the trend analysis from the Stanford AI Index 2026 environmental section.
The numbers are best-available estimates derived from public architecture data, ML.Energy per-FLOP measurements, and reasonable assumptions about data-center PUE and grid mix. They're not first-party figures from the model providers (none publish per-query energy yet) but they are the best quantitative view the field has compiled.
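As a concreteness check, the arithmetic behind such an estimate is straightforward. The sketch below follows the same recipe: FLOPs per query from architecture data, an effective joules-per-FLOP figure in the spirit of the ML.Energy measurements, and a PUE multiplier. Every constant in it is an illustrative assumption, not a provider figure.

```python
# Back-of-envelope per-query energy, assembled the way the report describes:
# architecture data -> FLOPs, a measured J/FLOP figure, then a PUE multiplier.
# Every constant here is an illustrative assumption, not a provider figure.

def wh_per_query(active_params_b: float,
                 tokens_per_query: int,
                 joules_per_flop: float = 5e-12,  # assumed effective J/FLOP at the accelerator
                 pue: float = 1.2) -> float:      # assumed data-center PUE
    # Rough transformer inference cost: ~2 FLOPs per active parameter per token.
    flops = 2 * active_params_b * 1e9 * tokens_per_query
    joules = flops * joules_per_flop * pue  # facility energy = device energy x PUE
    return joules / 3600.0                  # 1 Wh = 3600 J

# Hypothetical MoE model with ~60B active params on a ~1,200-token chat:
print(f"{wh_per_query(60, 1200):.2f} Wh")  # ~0.24 Wh; idle, batching, and serving
                                           # overheads push real deployments toward
                                           # the 0.6-0.9 Wh band reported below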
- 01 — Per-query energy is no longer scaling linearly with capability — MoE and KV optimization broke the trend. GPT-4 (mid-2024 estimate): ~0.95 Wh per chat query. GPT-5.5 (2026): ~0.84 Wh. Capability roughly doubled; energy dropped 12%. The MoE architectural shift is the primary driver. The headline-grabbing 'AI is using more energy than ever' framing is true on absolute volume, false on per-query efficiency.
- 02 — Inference now dominates lifecycle energy: passed training in 2025, currently 63% of total. In 2023-2024 the energy conversation was dominated by training cost. In 2025, inference deployment volume scaled faster than training (3-5× per year vs 1.5×) and the lifecycle split flipped. By 2026, ~63% of total frontier-model lifecycle energy is inference, ~37% is training — a complete inversion in two years.
- 03 — Long-context calls are 10-20× more energy-intensive than short-context chat. An 800K-token context call on Opus 4.7 averages ~14.1 Wh per query, vs ~0.7 Wh for a typical short chat. Per-token cost rises with context length because attention scales superlinearly. Long-context workflows have outsized per-call sustainability impact; cache discipline reduces it dramatically.
- 04 — Reasoning-mode queries (Gemini 3 Deep Think, GPT-5.5 Pro Thinking) draw 5-12× more energy than standard inference. Deep Think and Thinking modes generate long internal reasoning traces before outputting an answer. The compute (and energy) for those traces is real. A single Gemini 3 Deep Think reasoning query averages ~6.2 Wh; a non-reasoning Gemini 3 query averages ~0.6 Wh. Apply reasoning mode selectively, for tasks that demand it.
- 05 — Datacenter water use scales 1.8L per kWh on average; 4.3L in air-cooled hot regions. Water consumption (mostly evaporative cooling) is the often-overlooked second sustainability axis. Average data-center water use across the major hyperscalers is 1.8L per kWh of compute energy. In air-cooled hot regions (Phoenix, Singapore), it climbs to 4.3L. For high-water-stress regions, this is becoming a real siting constraint.
01 — Headline Numbers
The 2026 headline numbers.
Three numbers anchor the 2026 sustainability picture: total data-center electricity attributable to AI, year-over-year growth in that number, and per-query energy at the frontier. All three are larger than the 2024 baselines, but the per-query trend has broken the linear-with-capability scaling that earlier eras assumed.
Total AI datacenter electricity
~210 TWh in 2026 (Stanford AI Index estimate)
Up from ~150 TWh in 2024. Roughly 0.7% of global electricity. Growth driven by deployment volume — not by per-query intensity.
210 TWh · +38% YoY
Per-query energy decoupled from capability
GPT-4 → GPT-5.5: capability 2×, per-query energy −12%
MoE compression and KV optimization broke the linear scaling. The capability-vs-energy frontier shifted in 2025 and held in 2026. Each generation now does more for less per call.
Architecture matters
Inference passed training in 2025
Lifecycle energy split: ~63% inference / ~37% training
Total frontier-model lifecycle energy is now dominated by inference, not training. The flip happened mid-2025 as deployment volume grew faster than new model training. Implication: per-query optimization is now where sustainability gains live.
63% inference share
02 — Per Query
Energy per query, by model and call type.
Per-query energy varies by model architecture, query type, and context length — by 20× or more across the realistic spread. The numbers below are estimates for typical call profiles, normalized to data-center kWh including PUE overhead. They represent the operational cost of a single user-facing call.
Per-query energy · 7 frontier-model call profiles
Source: ML.Energy + Stanford AI Index 2026 + internal estimate · Apr 2026
The reasonable read: standard chat queries land in the 0.6-0.9 Wh band — essentially equivalent to running a 60W laptop for 45-60 seconds. Reasoning queries scale 6-10× higher because of the internal reasoning traces. Long-context queries scale 15-25× higher because of attention's superlinear cost. Pick the mode that fits the task; reasoning and long-context are genuinely energy-intensive and shouldn't be the default.
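The laptop equivalence is plain unit arithmetic. A short sketch making it explicit, using the profile estimates above (the 60W reference load is just for intuition):

```python
# Translate the per-query estimates into everyday equivalents.
# Profile values are the report's estimates; 60 W is just a reference laptop load.
profiles_wh = {
    "standard chat (short context)": 0.75,  # midpoint of the 0.6-0.9 Wh band
    "reasoning mode (Deep Think)":   6.2,
    "long context (800K tokens)":    14.1,
}

baseline = profiles_wh["standard chat (short context)"]
for name, wh in profiles_wh.items():
    seconds = wh / 60 * 3600  # Wh -> seconds of a 60 W load
    print(f"{name}: {wh} Wh ≈ {seconds:.0f} s of laptop, {wh / baseline:.0f}× baseline")
# -> 45 s (1×), 372 s (8×), 846 s (19×): consistent with the 6-10× and 15-25× bands
```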
03 — Water
Water use, cooling, and the regional question.
The water-use story is more regional than the energy story. Average data-center water consumption is 1.8L per kWh of compute energy, dominated by evaporative cooling for heat rejection. In hot, water-stressed regions (Phoenix, Singapore, parts of Texas) it climbs to 3-5L per kWh. In cool, water-abundant regions (Iceland, Pacific Northwest, parts of Scandinavia) it drops below 0.5L. Siting matters.
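Per-query water cost follows directly from these intensities. A minimal sketch, using the report's figures and assuming a typical ~0.75 Wh short-chat query:

```python
# Per-query water footprint = energy per query x regional water intensity.
# Intensities are the report's figures; the query energy is an assumed short chat.
water_l_per_kwh = {
    "global average":             1.8,
    "hot air-cooled (Phoenix)":   4.3,
    "cool closed-loop (Iceland)": 0.5,  # report says "below 0.5L"; used as an upper bound
}

query_kwh = 0.75 / 1000  # ~0.75 Wh short chat, in kWh
for region, intensity in water_l_per_kwh.items():
    print(f"{region}: {query_kwh * intensity * 1000:.2f} mL per query")
# -> 1.35 mL, 3.23 mL, 0.38 mL: small per call, material at deployment volume
```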
Datacenter water use · global average
Across major hyperscalers and colocation operators, average water consumption is 1.8L per kWh. Dominant share is evaporative cooling for heat rejection. Closed-loop cooling reduces this dramatically but adds capex.
Global average
Phoenix, Texas, Singapore
Air-cooled regions in hot climates. The thermodynamics of evaporative cooling are unfavorable; water consumption rises to 3-5L per kWh, roughly two to three times the global average. Regional water-stress concerns and local regulation are now real siting constraints.
High-stress siting
Iceland, PNW, Scandinavia
Cool climates with abundant water. Closed-loop cooling more effective; less evaporation needed for heat rejection. AI siting in these regions has accelerated meaningfully in 2025-2026 as the water question becomes a corporate-procurement concern.
Low-impact siting
"For high-water-stress regions, datacenter water use is becoming the binding sustainability constraint — bigger than electricity by 2027 in some siting decisions."
— Internal stakeholder briefing, May 2026
04 — Lifecycle Split
Training vs inference — the flip happened in 2025.
The 2023-2024 narrative was that training dominated AI's energy footprint. That was true at the time — large frontier models took 50-200 GWh to train, and inference deployment volumes were orders of magnitude lower than they are now. By 2025, the math flipped: deployment volume scaled 3-5× per year, training spend scaled ~1.5× per year, and inference passed training as a share of lifecycle energy.
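A back-of-envelope sketch reproduces the timeline below. Assuming inference energy grows 3× per year (the low end of the stated range) and training ~1.5× per year from a 30/70 split in 2024, the arithmetic lands on the published splits:

```python
# Reproduce the lifecycle-split timeline from the stated growth rates.
# Assumed: inference energy x3/yr (low end of the 3-5x range), training x1.5/yr,
# starting from the report's 2024 split of ~30% inference / ~70% training.
inference, training = 30.0, 70.0  # arbitrary energy units, 2024 baseline

for year in (2024, 2025, 2026, 2027):
    share = inference / (inference + training)
    print(f"{year}: inference = {share:.0%} of lifecycle energy")
    inference *= 3.0
    training *= 1.5
# -> 30%, 46%, 63%, 77%: crossover during 2025, ~63% in 2026,
#    and the projected 75-80% band by 2027
```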
2023-2024 — training dominates
Frontier model training runs at 50-200 GWh per major model. Inference deployment volume was 100M-1B queries/day for the largest deployments. Lifecycle split: roughly 70% training / 30% inference. Public sustainability discussion focused on training cost.
Training-dominant era
2025 — the flip
Deployment volume grew 3-5× through ChatGPT mass adoption, Copilot rollout, AI features in major SaaS, and the agentic-AI wave. Training spend grew ~1.5×. By Q3 2025, inference passed training in lifecycle share for the first time.
Crossover year
2026 — inference dominates
Estimated lifecycle split: ~63% inference / ~37% training. Per-query optimization (KV cache, MoE compression, FP8 quantization) is now the primary lever for sustainability gains. Training-cost reduction is still important but less load-bearing on absolute energy.
Inference-dominant era
2027-2028 — projection
If deployment volume continues growing at 2-3× per year and training spend at 1.3-1.5×, the inference share will climb to 75-80% of lifecycle energy by end of 2027. Per-query optimization becomes overwhelmingly the dominant sustainability lever.
Forecast
05 — Trend
The trend line — decoupled per-query, growing absolute.
Two trend lines tell the 2024-2026 story. Per-query energy is slightly down (-15% over two years, despite ~2× capability improvement) thanks to MoE and KV optimization. Absolute energy is up significantly (+38% YoY) because deployment volume is growing faster than per-query efficiency. Both numbers are simultaneously true; the framing depends on which one matters for the question being asked.
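The decomposition is simple: absolute energy is deployment volume times per-query energy. An illustrative reconciliation of the headline numbers (the implied ~1.5× per year is a fleet-wide aggregate across all AI workloads; individual frontier deployments grew much faster):

```python
# absolute energy = deployment volume x per-query energy
# Report numbers: +38% YoY absolute, -15% per-query over two years.
absolute_yoy = 1.38
per_query_two_years = 0.85

per_query_yoy = per_query_two_years ** 0.5         # ~0.92, assuming a steady decline
implied_volume_yoy = absolute_yoy / per_query_yoy  # volume growth reconciling both trends

print(f"per-query trend: {per_query_yoy - 1:+.0%}/yr")                        # ~ -8%/yr
print(f"implied fleet-wide volume growth: {implied_volume_yoy - 1:+.0%}/yr")  # ~ +50%/yr
```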
"AI is using more energy than ever — and each query is using less. Both true. Pick the right number for the question."— Stanford AI Index 2026, environmental section
06 — Practical Actions
What teams can actually do.
- Use reasoning mode selectively. Reasoning traces (Deep Think, GPT-5.5 Pro Thinking) cost 5-12× the energy of standard inference. Default to non-reasoning; opt into reasoning for tasks that genuinely require it (math, complex agent planning, ambiguous reasoning over data).
- Cache aggressively at long context. An 800K cached call costs ~10% of an uncached one. For long-context workloads, cache topology is also the biggest sustainability lever the deployment owns. The cost gain and the energy gain point the same direction.
- Trim output amplification. A 200K-token input that elicits 10K output tokens uses dramatically more energy than the same input eliciting 500 output tokens. Set explicit max_tokens budgets and output-shape system prompts to reduce per-call output by 60-80% on long-context workloads (see the sketch after this list).
- Pick smaller models when capability allows. A 70B-class open-weight model running at FP8 uses dramatically less energy per query than a 1.6T MoE model. For workloads where capability isn't the bottleneck — summarization, classification, routing — use smaller models. The energy savings compound at deployment volume.
- Prefer providers with low-carbon grid mix and closed-loop cooling. Pacific Northwest, Iceland, Quebec, Scandinavia. The grid-mix difference between coal-heavy and renewable-heavy regions is 5-10× on emissions per query. Where data-residency allows, pick low-carbon regions.
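Several of these actions can be encoded as a default request-shaping policy. A minimal sketch, where the model names, parameter keys, and cache flag are all hypothetical placeholders for whatever the provider's SDK actually exposes:

```python
# A default request-shaping policy encoding the actions above.
# Model names, parameter keys, and the cache flag are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class EnergyBudgetPolicy:
    max_output_tokens: int = 512       # explicit output budget (trim amplification)
    small_model: str = "open-70b-fp8"  # hypothetical right-sized model
    large_model: str = "frontier-moe"  # hypothetical frontier model

    def shape(self, task: str, needs_reasoning: bool) -> dict:
        """Build request params that default to the cheapest adequate mode."""
        simple = task in {"summarize", "classify", "route"}
        return {
            "model": self.small_model if simple else self.large_model,
            "reasoning": needs_reasoning and not simple,  # opt in, never default
            "max_tokens": self.max_output_tokens,         # pair with an output-shape prompt
            "use_prompt_cache": True,                     # reuse long-context prefixes
        }

policy = EnergyBudgetPolicy()
print(policy.shape("classify", needs_reasoning=False))
# -> small FP8 model, reasoning off, output capped: the low-energy default
```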
07 — Conclusion
The sustainability frontier moved.
Per-query energy is decoupling from capability. Absolute consumption is still climbing. Both matter.
By April 2026 the AI sustainability picture is more nuanced than it has ever been. The good news: architectural compression (MoE, MLA, FP8) has decoupled per-query energy from capability improvement — each generation does more for less per call. The harder news: deployment volume is outpacing per-query efficiency by enough that absolute energy is still climbing 38% YoY. Both numbers are simultaneously true, and the framing depends on the question being asked.
The sustainability lever has moved with the lifecycle split. In 2023-2024, training-cost optimization was where the marginal sustainability gains lived. In 2026, inference per-query optimization is the dominant lever — KV-cache discipline, output trim, model-right-sizing, low-carbon-grid siting. These are practical, deployable choices that map onto teams' normal cost-and-quality decisions.
For agency and product teams, the practical takeaway is to measure per-query energy as a first-class metric alongside cost-per-answer and latency. The same operational discipline that produces the cheapest deployment usually produces the most sustainable one — and the alignment between cost and sustainability is one of the field's most important structural truths.