Frontier-model energy use stopped scaling linearly with capability in mid-2025 — the architectural compression of mixture-of-experts and aggressive KV-cache optimization broke the previous trend line. But absolute consumption is still climbing 38% year-over-year because deployment volume is outpacing per-query efficiency by a wider margin than ever.
This 2026 quarterly compiles per-query energy across the four leading frontier models, the training-vs-inference split (which crossed over in 2025 — inference now dominates), water-use data from datacenter operators, and the trend analysis from the Stanford AI Index 2026 environmental section.
The numbers are best-available estimates derived from public architecture data, ML.Energy per-FLOP measurements, and reasonable assumptions about data-center PUE and grid mix. They're not first-party figures from the model providers (none publish per-query energy yet) but they are the best quantitative view the field has compiled.
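As a concreteness check, the arithmetic behind such an estimate is straightforward. The sketch below follows the same recipe: FLOPs per query from architecture data, an effective joules-per-FLOP figure in the spirit of the ML.Energy measurements, and a PUE multiplier. Every constant in it is an illustrative assumption, not a provider figure.

```python
# Back-of-envelope per-query energy, assembled the way the report describes:
# architecture data -> FLOPs, a measured J/FLOP figure, then a PUE multiplier.
# Every constant here is an illustrative assumption, not a provider figure.

def wh_per_query(active_params_b: float,
                 tokens_per_query: int,
                 joules_per_flop: float = 5e-12,  # assumed effective J/FLOP at the accelerator
                 pue: float = 1.2) -> float:      # assumed data-center PUE
    # Rough transformer inference cost: ~2 FLOPs per active parameter per token.
    flops = 2 * active_params_b * 1e9 * tokens_per_query
    joules = flops * joules_per_flop * pue  # facility energy = device energy x PUE
    return joules / 3600.0                  # 1 Wh = 3600 J

# Hypothetical MoE model with ~60B active params on a ~1,200-token chat:
print(f"{wh_per_query(60, 1200):.2f} Wh")  # ~0.24 Wh; idle, batching, and serving
                                           # overheads push real deployments toward
                                           # the 0.6-0.9 Wh band reported below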
- 01 — Per-query energy is no longer scaling linearly with capability — MoE and KV optimization broke the trend. GPT-4 (mid-2024 estimate): ~0.95 Wh per chat query. GPT-5.5 (2026): ~0.84 Wh. Capability roughly doubled; energy dropped 12%. The MoE architectural shift is the primary driver. The headline-grabbing 'AI is using more energy than ever' framing is true on absolute volume, false on per-query efficiency.
- 02 — Inference now dominates lifecycle energy: passed training in 2025, currently 63% of total. In 2023-2024 the energy conversation was dominated by training cost. In 2025, inference deployment volume scaled faster than training (3-5× per year vs 1.5×) and the lifecycle split flipped. By 2026, ~63% of total frontier-model lifecycle energy is inference, ~37% is training — a complete inversion in two years.
- 03 — Long-context calls are 10-20× more energy-intensive than short-context chat. An 800K-token context call on Opus 4.7 averages ~14.1 Wh per query, vs ~0.7 Wh for a typical short chat. Per-token cost rises with context length because attention scales superlinearly. Long-context workflows have outsized per-call sustainability impact; cache discipline reduces it dramatically.
- 04 — Reasoning-mode queries (Gemini 3 Deep Think, GPT-5.5 Pro Thinking) draw 5-12× more energy than standard inference. Deep Think and Thinking modes generate long internal reasoning traces before outputting an answer. The compute (and energy) for those traces is real. A single Gemini 3 Deep Think reasoning query averages ~6.2 Wh; a non-reasoning Gemini 3 query averages ~0.6 Wh. Apply reasoning mode selectively, for tasks that demand it.
- 05 — Datacenter water use scales 1.8L per kWh on average; 4.3L in air-cooled hot regions. Water consumption (mostly evaporative cooling) is the often-overlooked second sustainability axis. Average data-center water use across the major hyperscalers is 1.8L per kWh of compute energy. In air-cooled hot regions (Phoenix, Singapore), it climbs to 4.3L. For high-water-stress regions, this is becoming a real siting constraint.
01 — Headline Numbers
The 2026 headline numbers.
Three numbers anchor the 2026 sustainability picture: total data-center electricity attributable to AI, year-over-year growth in that number, and per-query energy at the frontier. All three are larger than the 2024 baselines, but the per-query trend has broken the linear-with-capability scaling that earlier eras assumed.
Total AI datacenter electricity
~210 TWh in 2026 (Stanford AI Index estimate)
Up from ~150 TWh in 2024. Roughly 0.7% of global electricity. Growth driven by deployment volume — not by per-query intensity.
210 TWh · +38% YoY
Per-query energy decoupled from capability
GPT-4 → GPT-5.5: capability 2×, per-query energy −12%
MoE compression and KV optimization broke the linear scaling. The capability-vs-energy frontier shifted in 2025 and held in 2026. Each generation now does more for less per call.
Architecture matters
Inference passed training in 2025
Lifecycle energy split: ~63% inference / ~37% training
Total frontier-model lifecycle energy is now dominated by inference, not training. The flip happened mid-2025 as deployment volume grew faster than new model training. Implication: per-query optimization is now where sustainability gains live.
63% inference share
02 — Per Query
Energy per query, by model and call type.
Per-query energy varies by model architecture, query type, and context length — by 20× or more across the realistic spread. The numbers below are estimates for typical call profiles, normalized to data-center kWh including PUE overhead. They represent the operational cost of a single user-facing call.
Per-query energy · 7 frontier-model call profiles
Source: ML.Energy + Stanford AI Index 2026 + internal estimate · Apr 2026
The reasonable read: standard chat queries land in the 0.6-0.9 Wh band — essentially equivalent to running a 60W laptop for 45-60 seconds. Reasoning queries scale 6-10× higher because of the internal reasoning traces. Long-context queries scale 15-25× higher because of attention's superlinear cost. Pick the mode that fits the task; reasoning and long-context are genuinely energy-intensive and shouldn't be the default.
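The laptop equivalence is plain unit arithmetic. A short sketch making it explicit, using the profile estimates above (the 60W reference load is just for intuition):

```python
# Translate the per-query estimates into everyday equivalents.
# Profile values are the report's estimates; 60 W is just a reference laptop load.
profiles_wh = {
    "standard chat (short context)": 0.75,  # midpoint of the 0.6-0.9 Wh band
    "reasoning mode (Deep Think)":   6.2,
    "long context (800K tokens)":    14.1,
}

baseline = profiles_wh["standard chat (short context)"]
for name, wh in profiles_wh.items():
    seconds = wh / 60 * 3600  # Wh -> seconds of a 60 W load
    print(f"{name}: {wh} Wh ≈ {seconds:.0f} s of laptop, {wh / baseline:.0f}× baseline")
# -> 45 s (1×), 372 s (8×), 846 s (19×): consistent with the 6-10× and 15-25× bands
```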
03 — Water
Water use, cooling, and the regional question.
The water-use story is more regional than the energy story. Average data-center water consumption is 1.8L per kWh of compute energy, dominated by evaporative cooling for heat rejection. In hot, water-stressed regions (Phoenix, Singapore, parts of Texas) it climbs to 3-5L per kWh. In cool, water-abundant regions (Iceland, Pacific Northwest, parts of Scandinavia) it drops below 0.5L. Siting matters.
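Per-query water cost follows directly from these intensities. A minimal sketch, using the report's figures and assuming a typical ~0.75 Wh short-chat query:

```python
# Per-query water footprint = energy per query x regional water intensity.
# Intensities are the report's figures; the query energy is an assumed short chat.
water_l_per_kwh = {
    "global average":             1.8,
    "hot air-cooled (Phoenix)":   4.3,
    "cool closed-loop (Iceland)": 0.5,  # report says "below 0.5L"; used as an upper bound
}

query_kwh = 0.75 / 1000  # ~0.75 Wh short chat, in kWh
for region, intensity in water_l_per_kwh.items():
    print(f"{region}: {query_kwh * intensity * 1000:.2f} mL per query")
# -> 1.35 mL, 3.23 mL, 0.38 mL: small per call, material at deployment volume
```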
Datacenter water use · global average
Across major hyperscalers and colocation operators, average water consumption is 1.8L per kWh. Dominant share is evaporative cooling for heat rejection. Closed-loop cooling reduces this dramatically but adds capex.
Global average
Phoenix, Texas, Singapore
Air-cooled regions in hot climates. The thermodynamics of evaporative cooling are unfavorable; water consumption rises to 3-5L per kWh, roughly two to three times the global average. Regional water-stress concerns and local regulation are now real siting constraints.
High-stress siting
Iceland, PNW, Scandinavia
Cool climates with abundant water. Closed-loop cooling more effective; less evaporation needed for heat rejection. AI siting in these regions has accelerated meaningfully in 2025-2026 as the water question becomes a corporate-procurement concern.
Low-impact siting
"For high-water-stress regions, datacenter water use is becoming the binding sustainability constraint — bigger than electricity by 2027 in some siting decisions."
— Internal stakeholder briefing, May 2026
04 — Lifecycle Split
Training vs inference — the flip happened in 2025.
The 2023-2024 narrative was that training dominated AI's energy footprint. That was true at the time — large frontier models took 50-200 GWh to train, and inference deployment volumes were orders of magnitude lower than they are now. By 2025, the math flipped: deployment volume scaled 3-5× per year, training spend scaled ~1.5× per year, and inference passed training as a share of lifecycle energy.
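A back-of-envelope sketch reproduces the timeline below. Assuming inference energy grows 3× per year (the low end of the stated range) and training ~1.5× per year from a 30/70 split in 2024, the arithmetic lands on the published splits:

```python
# Reproduce the lifecycle-split timeline from the stated growth rates.
# Assumed: inference energy x3/yr (low end of the 3-5x range), training x1.5/yr,
# starting from the report's 2024 split of ~30% inference / ~70% training.
inference, training = 30.0, 70.0  # arbitrary energy units, 2024 baseline

for year in (2024, 2025, 2026, 2027):
    share = inference / (inference + training)
    print(f"{year}: inference = {share:.0%} of lifecycle energy")
    inference *= 3.0
    training *= 1.5
# -> 30%, 46%, 63%, 77%: crossover during 2025, ~63% in 2026,
#    and the projected 75-80% band by 2027
```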
2023-2024 — training dominates
Frontier model training runs at 50-200 GWh per major model. Inference deployment volume was 100M-1B queries/day for the largest deployments. Lifecycle split: roughly 70% training / 30% inference. Public sustainability discussion focused on training cost.
Training-dominant era
2025 — the flip
Deployment volume grew 3-5× through ChatGPT mass adoption, Copilot rollout, AI features in major SaaS, and the agentic-AI wave. Training spend grew ~1.5×. By Q3 2025, inference passed training in lifecycle share for the first time.
Crossover year
2026 — inference dominates
Estimated lifecycle split: ~63% inference / ~37% training. Per-query optimization (KV cache, MoE compression, FP8 quantization) is now the primary lever for sustainability gains. Training-cost reduction is still important but less load-bearing on absolute energy.
Inference-dominant era
2027-2028 — projection
If deployment volume continues growing at 2-3× per year and training spend at 1.3-1.5×, the inference share will climb to 75-80% of lifecycle energy by end of 2027. Per-query optimization becomes overwhelmingly the dominant sustainability lever.
Forecast
05 — Trend
The trend line — decoupled per-query, growing absolute.
Two trend lines tell the 2024-2026 story. Per-query energy is slightly down (-15% over two years, despite ~2× capability improvement) thanks to MoE and KV optimization. Absolute energy is up significantly (+38% YoY) because deployment volume is growing faster than per-query efficiency. Both numbers are simultaneously true; the framing depends on which one matters for the question being asked.
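The decomposition is simple: absolute energy is deployment volume times per-query energy. An illustrative reconciliation of the headline numbers (the implied ~1.5× per year is a fleet-wide aggregate across all AI workloads; individual frontier deployments grew much faster):

```python
# absolute energy = deployment volume x per-query energy
# Report numbers: +38% YoY absolute, -15% per-query over two years.
absolute_yoy = 1.38
per_query_two_years = 0.85

per_query_yoy = per_query_two_years ** 0.5         # ~0.92, assuming a steady decline
implied_volume_yoy = absolute_yoy / per_query_yoy  # volume growth reconciling both trends

print(f"per-query trend: {per_query_yoy - 1:+.0%}/yr")                        # ~ -8%/yr
print(f"implied fleet-wide volume growth: {implied_volume_yoy - 1:+.0%}/yr")  # ~ +50%/yr
```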
"AI is using more energy than ever — and each query is using less. Both true. Pick the right number for the question."— Stanford AI Index 2026, environmental section
06 — Practical Actions
What teams can actually do.
- Use reasoning mode selectively. Reasoning traces (Deep Think, GPT-5.5 Pro Thinking) cost 5-12× the energy of standard inference. Default to non-reasoning; opt into reasoning for tasks that genuinely require it (math, complex agent planning, ambiguous reasoning over data).
- Cache aggressively at long context. An 800K cached call costs ~10% of an uncached one. For long-context workloads, cache topology is also the biggest sustainability lever the deployment owns. The cost gain and the energy gain point the same direction.
- Trim output amplification. A 200K-token input that elicits 10K output tokens uses dramatically more energy than the same input eliciting 500 output tokens. Set explicit max_tokens budgets and output-shape system prompts to reduce per-call output by 60-80% on long-context workloads (see the sketch after this list).
- Pick smaller models when capability allows. A 70B-class open-weight model running at FP8 uses dramatically less energy per query than a 1.6T MoE model. For workloads where capability isn't the bottleneck — summarization, classification, routing — use smaller models. The energy savings compound at deployment volume.
- Prefer providers with low-carbon grid mix and closed-loop cooling. Pacific Northwest, Iceland, Quebec, Scandinavia. The grid-mix difference between coal-heavy and renewable-heavy regions is 5-10× on emissions per query. Where data-residency allows, pick low-carbon regions.
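Several of these actions can be encoded as a default request-shaping policy. A minimal sketch, where the model names, parameter keys, and cache flag are all hypothetical placeholders for whatever the provider's SDK actually exposes:

```python
# A default request-shaping policy encoding the actions above.
# Model names, parameter keys, and the cache flag are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class EnergyBudgetPolicy:
    max_output_tokens: int = 512       # explicit output budget (trim amplification)
    small_model: str = "open-70b-fp8"  # hypothetical right-sized model
    large_model: str = "frontier-moe"  # hypothetical frontier model

    def shape(self, task: str, needs_reasoning: bool) -> dict:
        """Build request params that default to the cheapest adequate mode."""
        simple = task in {"summarize", "classify", "route"}
        return {
            "model": self.small_model if simple else self.large_model,
            "reasoning": needs_reasoning and not simple,  # opt in, never default
            "max_tokens": self.max_output_tokens,         # pair with an output-shape prompt
            "use_prompt_cache": True,                     # reuse long-context prefixes
        }

policy = EnergyBudgetPolicy()
print(policy.shape("classify", needs_reasoning=False))
# -> small FP8 model, reasoning off, output capped: the low-energy default
```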
07 — Conclusion
The sustainability frontier moved.
Per-query energy is decoupling from capability. Absolute consumption is still climbing. Both matter.
By April 2026 the AI sustainability picture is more nuanced than it has ever been. The good news: architectural compression (MoE, MLA, FP8) has decoupled per-query energy from capability improvement — each generation does more for less per call. The harder news: deployment volume is outpacing per-query efficiency by enough that absolute energy is still climbing 38% YoY. Both numbers are simultaneously true, and the framing depends on the question being asked.
The sustainability lever has moved with the lifecycle split. In 2023-2024, training-cost optimization was where the marginal sustainability gains lived. In 2026, inference per-query optimization is the dominant lever — KV-cache discipline, output trim, model-right-sizing, low-carbon-grid siting. These are practical, deployable choices that map onto teams' normal cost-and-quality decisions.
For agency and product teams, the practical takeaway is to measure per-query energy as a first-class metric alongside cost-per-answer and latency. The same operational discipline that produces the cheapest deployment usually produces the most sustainable one — and the alignment between cost and sustainability is one of the field's most important structural truths.