
Quarterly view · 4 frontier models · per-query energy + water · training-vs-inference split data

AI Model Sustainability 2026: Energy Use Data

Frontier-model energy use stopped scaling linearly with capability in 2025 — but absolute consumption is still climbing 38% YoY because deployment volume is outpacing per-query efficiency. A single 2026 GPT-5.5 query averages 0.84 Wh; a Gemini 3 Deep Think reasoning trace averages 6.2 Wh; a Claude Opus 4.7 long-context call hits 14.1 Wh.

Digital Applied Team · Senior strategists
Published Apr 24, 2026 · 5 min read
Sources: Stanford AI Index 2026 · ML.Energy · IEA · HF AI Energy
GPT-5.5 · per query: 0.84 Wh (average chat)
Opus 4.7 · 800K context: 14.1 Wh per long-context call (16× chat baseline)
Datacenter water: 1.8 L per kWh (average)
Inference share of lifecycle energy: 63% (passed training in 2025)

Frontier-model energy use stopped scaling linearly with capability in mid-2025 — the architectural compression of mixture-of-experts and aggressive KV-cache optimization broke the previous trend line. But absolute consumption is still climbing 38% year-over-year because deployment volume is outpacing per-query efficiency by a wider margin than ever.

This 2026 quarterly compiles per-query energy across the four leading frontier models, the training-vs-inference split (which crossed over in 2025 — inference now dominates), water-use data from datacenter operators, and the trend analysis from the Stanford AI Index 2026 environmental section.

The numbers are best-available estimates derived from public architecture data, ML.Energy per-FLOP measurements, and reasonable assumptions about data-center PUE and grid mix. They're not first-party figures from the model providers (none publish per-query energy yet) but they are the best quantitative view the field has compiled.
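To make that methodology concrete, here is a minimal sketch of the bottom-up estimate described above. Every constant is an illustrative assumption (active-parameter count, effective joules per FLOP, PUE), chosen only to land in the right order of magnitude; none is a provider-published figure.

```python
# Bottom-up per-query energy estimate, following the methodology above.
# All constants are illustrative assumptions, not provider-published figures.

ACTIVE_PARAMS = 220e9       # assumed active (MoE-routed) parameters per token
TOKENS = 800                # ~500 input + ~300 output: the standard-chat profile
FLOPS_PER_TOKEN = 2 * ACTIVE_PARAMS  # ~2 FLOPs per active parameter per token
JOULES_PER_FLOP = 7.5e-12   # assumed effective efficiency, folding in utilization,
                            # memory, and networking overheads
PUE = 1.15                  # hyperscaler-average PUE (see FAQ)

energy_joules = TOKENS * FLOPS_PER_TOKEN * JOULES_PER_FLOP * PUE
energy_wh = energy_joules / 3600
print(f"~{energy_wh:.2f} Wh per standard chat query")   # ~0.84 Wh
```

Real deployments vary 2-3× in either direction on grid mix, batch size, and implementation details, so treat this as a relative-comparison tool, not an audit.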

Key takeaways
  1. Per-query energy is no longer scaling linearly with capability — MoE and KV optimization broke the trend. GPT-4 (mid-2024 estimate): ~0.95 Wh per chat query. GPT-5.5 (2026): ~0.84 Wh. Capability roughly doubled; energy dropped 12%. The MoE architectural shift is the primary driver. The headline-grabbing 'AI is using more energy than ever' framing is true on absolute volume, false on per-query efficiency.
  2. Inference now dominates lifecycle energy — it passed training in 2025 and currently stands at 63% of the total. In 2023-2024 the energy conversation was dominated by training cost. In 2025, inference deployment volume scaled faster than training (3-5× per year vs ~1.5×) and the lifecycle split flipped. By 2026, ~63% of total frontier-model lifecycle energy is inference and ~37% is training — a complete inversion in two years.
  3. Long-context calls are 10-20× more energy-intensive than short-context chat. An 800K-token context call on Opus 4.7 averages ~14.1 Wh per query, vs ~0.7 Wh for a typical short chat. Energy grows faster than linearly with context length because attention scales superlinearly. Long-context workflows have outsized per-call sustainability impact; cache discipline reduces it dramatically.
  4. Reasoning-mode queries (Gemini 3 Deep Think, GPT-5.5 Pro Thinking) draw 5-12× more energy than standard inference. Deep Think and Thinking modes generate long internal reasoning traces before outputting an answer, and the compute (and energy) for those traces is real. A single Gemini 3 Deep Think reasoning query averages ~6.2 Wh; a non-reasoning Gemini 3 query averages ~0.6 Wh. Apply reasoning mode selectively — for tasks that demand it.
  5. Datacenter water use averages 1.8L per kWh; 4.3L in hot, evaporation-dependent regions. Water consumption (mostly evaporative cooling) is the often-overlooked second sustainability axis. Average data-center water use across the major hyperscalers is 1.8L per kWh of compute energy. In hot regions that rely on evaporative cooling (Phoenix, Singapore), it climbs to 4.3L. For high-water-stress regions, this is becoming a real siting constraint.

01 · Headline Numbers: The 2026 headline numbers.

Three numbers anchor the 2026 sustainability picture: total data-center electricity attributable to AI, year-over-year growth in that number, and per-query energy at the frontier. All three are larger than the 2024 baselines, but the per-query trend has broken the linear-with-capability scaling that earlier eras assumed.

Number 1 · Total AI datacenter electricity
~210 TWh in 2026 (Stanford AI Index estimate)

Up from ~150 TWh in 2024 — roughly 0.7% of global electricity. Growth is driven by deployment volume, not by per-query intensity.

Number 2 · Per-query energy decoupled from capability
GPT-4 → GPT-5.5: capability 2×, per-query energy −12%

MoE compression and KV optimization broke the linear scaling. The capability-vs-energy frontier shifted in 2025 and held in 2026. Each generation now does more for less per call.

Number 3 · Inference passed training in 2025
Lifecycle energy split: ~63% inference / ~37% training

Total frontier-model lifecycle energy is now dominated by inference, not training. The flip happened mid-2025 as deployment volume grew faster than new-model training. Implication: per-query optimization is now where the sustainability gains live.

02 · Per Query: Energy per query, by model and call type.

Per-query energy varies by model architecture, query type, and context length — by 20× or more across the realistic spread. The numbers below are estimates for typical call profiles, normalized to data-center kWh including PUE overhead. They represent the operational cost of a single user-facing call.

Per-query energy · 7 frontier-model call profiles
Source: ML.Energy + Stanford AI Index 2026 + internal estimate · Apr 2026

  • Standard chat · GPT-5.5 (~500 input tokens, ~300 output): 0.84 Wh
  • Standard chat · Gemini 3 (~500 input, ~300 output): 0.61 Wh (lowest standard profile)
  • Standard chat · Claude Opus 4.7 (~500 input, ~300 output): 0.78 Wh
  • Reasoning · GPT-5.5 Pro Thinking (long reasoning trace, 5-15K thinking tokens): 5.1 Wh
  • Reasoning · Gemini 3 Deep Think (long reasoning trace, 8-20K thinking tokens): 6.2 Wh
  • Long-context · Opus 4.7, 800K input (full long-document reasoning call): 14.1 Wh (highest realistic profile)
  • Long-context · Opus 4.7, 1M input (full-window long-context reasoning): 18.3 Wh

The reasonable read: standard chat queries land in the 0.6-0.9 Wh band — essentially equivalent to running a 60W laptop for 35-55 seconds. Reasoning queries scale 6-10× higher because of the internal reasoning traces. Long-context queries scale 15-25× higher because of attention's superlinear cost. Pick the mode that fits the task; reasoning and long-context are genuinely energy-intensive and shouldn't be the default.
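For intuition, the watt-hour figures convert cleanly into "seconds of a 60W laptop." A minimal sketch applying the conversion to the table's endpoints (the Wh inputs are this report's estimates; the conversion itself is exact arithmetic):

```python
# Convert per-query Wh into equivalent runtime of a 60 W laptop.
LAPTOP_WATTS = 60

def laptop_seconds(wh: float) -> float:
    return wh * 3600 / LAPTOP_WATTS   # Wh -> joules -> seconds at 60 W

for label, wh in [("Gemini 3 chat", 0.61), ("GPT-5.5 chat", 0.84),
                  ("Gemini 3 Deep Think", 6.2), ("Opus 4.7 · 800K", 14.1)]:
    print(f"{label}: {wh} Wh ≈ {laptop_seconds(wh):.0f} s of laptop runtime")
# chat: ~37-50 s · reasoning: ~6 min · 800K long-context: ~14 min
```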

All-call averages vs per-call profiles
Most public "AI energy per query" numbers don't distinguish reasoning calls, long-context calls, and standard chat. They average across an unweighted query mix. The 2-3× difference between "average across all calls" and "a typical chat call" matters when comparing models. Always check whether the published number is for standard chat, an all-call average, or worst-case reasoning.

03 · Water: Water use, cooling, and the regional question.

The water-use story is more regional than the energy story. Average data-center water consumption is 1.8L per kWh of compute energy, dominated by evaporative cooling for heat rejection. In hot, water-stressed regions (Phoenix, Singapore, parts of Texas) it climbs to 3-5L per kWh. In cool, water-abundant regions (Iceland, Pacific Northwest, parts of Scandinavia) it drops below 0.5L. Siting matters.
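Per-query water use follows directly from per-query energy and the regional L/kWh intensity. A minimal sketch using this section's figures (the intensities and per-query Wh are this report's estimates):

```python
# Per-query water use = per-query energy (kWh) x regional water intensity (L/kWh).
WATER_L_PER_KWH = {"cool (Iceland, PNW)": 0.4,
                   "global average": 1.8,
                   "hot (Phoenix, Singapore)": 4.3}

def water_ml_per_query(query_wh: float, l_per_kwh: float) -> float:
    return query_wh / 1000 * l_per_kwh * 1000   # Wh -> kWh, then L -> mL

for region, intensity in WATER_L_PER_KWH.items():
    chat = water_ml_per_query(0.84, intensity)       # standard chat
    long_ctx = water_ml_per_query(14.1, intensity)   # 800K long-context call
    print(f"{region}: chat {chat:.1f} mL · long-context {long_ctx:.1f} mL")
# Even the heaviest call is ~61 mL in hot regions; deployment volume is what adds up.
```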

Global average · 1.8L/kWh
Across major hyperscalers and colocation operators, average water consumption is 1.8L per kWh, with the dominant share going to evaporative cooling for heat rejection. Closed-loop cooling reduces this dramatically but adds capex.

Hot regions · 4.3L/kWh (Phoenix, Texas, Singapore)
Hot climates where heat rejection leans on evaporative cooling. The thermodynamics are unfavorable, and water consumption rises to 3-5× the average. Regional water-stress concerns and local regulation are now real siting constraints.

Cool regions · 0.4L/kWh (Iceland, PNW, Scandinavia)
Cool climates with abundant water, where closed-loop cooling is more effective and less evaporation is needed for heat rejection. AI siting in these regions accelerated meaningfully in 2025-2026 as the water question became a corporate-procurement concern.
"For high-water-stress regions, datacenter water use is becoming the binding sustainability constraint — bigger than electricity by 2027 in some siting decisions."— Internal stakeholder briefing, May 2026

04 · Lifecycle Split: Training vs inference — the flip happened in 2025.

The 2023-2024 narrative was that training dominated AI's energy footprint. That was true at the time — large frontier models took 50-200 GWh to train, and inference deployment volumes were orders of magnitude lower than they are now. By 2025, the math flipped: deployment volume scaled 3-5× per year, training spend scaled ~1.5× per year, and inference passed training as a share of lifecycle energy.

2023-2024 — training dominates
Frontier-model training runs at 50-200 GWh per major model. Inference deployment volume was 100M-1B queries/day for the largest deployments. Lifecycle split: roughly 70% training / 30% inference. Public sustainability discussion focused on training cost.

2025 — the flip
Deployment volume grew 3-5× through ChatGPT mass adoption, the Copilot rollout, AI features in major SaaS, and the agentic-AI wave. Training spend grew ~1.5×. By Q3 2025, inference passed training in lifecycle share for the first time.

2026 — inference dominates
Estimated lifecycle split: ~63% inference / ~37% training. Per-query optimization (KV cache, MoE compression, FP8 quantization) is now the primary lever for sustainability gains. Training-cost reduction is still important but less load-bearing on absolute energy.

2027-2028 — projection
If deployment volume continues growing at 2-3× per year and training spend at 1.3-1.5×, the inference share will climb to 75-80% of lifecycle energy by end of 2027. Per-query optimization becomes overwhelmingly the dominant sustainability lever.
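A quick back-of-envelope check of that projection, using the growth ranges stated above (the inputs are this report's figures; the midpoint pairing is an added interpolation):

```python
# Project the inference share of lifecycle energy one year out from the
# 2026 split (~63/37), across the growth ranges stated above.
INFERENCE_2026, TRAINING_2026 = 0.63, 0.37

for g_inf, g_train in [(2.0, 1.5), (2.5, 1.4), (3.0, 1.3)]:
    inf = INFERENCE_2026 * g_inf
    tr = TRAINING_2026 * g_train
    share = inf / (inf + tr)
    print(f"inference x{g_inf}, training x{g_train}: {share:.0%} inference")
# -> 69% / 75% / 80%: the midpoint case lands on the 75-80% band projected above.
```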

05 · Trend: The trend line — decoupled per-query, growing absolute.

Two trend lines tell the 2024-2026 story. Per-query energy is slightly down (-15% over two years, despite ~2× capability improvement) thanks to MoE and KV optimization. Absolute energy is up significantly (+38% YoY) because deployment volume is growing faster than per-query efficiency. Both numbers are simultaneously true; the framing depends on which one matters for the question being asked.

"AI is using more energy than ever — and each query is using less. Both true. Pick the right number for the question."— Stanford AI Index 2026, environmental section

06 · Practical Actions: What teams can actually do.

  • Use reasoning mode selectively. Reasoning traces (Deep Think, GPT-5.5 Pro Thinking) cost 5-12× the energy of standard inference. Default to non-reasoning; opt into reasoning for tasks that genuinely require it (math, complex agent planning, ambiguous reasoning over data).
  • Cache aggressively at long context. An 800K cached call costs ~10% of an uncached one. For long-context workloads, cache topology is also the biggest sustainability lever the deployment owns. The cost gain and the energy gain point in the same direction.
  • Trim output amplification. A 200K-token input that elicits 10K tokens of output uses dramatically more energy than the same input eliciting 500 output tokens. Set explicit max_tokens budgets and output-shape system prompts to reduce per-call output by 60-80% on long-context workloads.
  • Pick smaller models when capability allows. A 70B-class open-weight model running at FP8 uses dramatically less energy per query than a 1.6T MoE model. For workloads where capability isn't the bottleneck — summarization, classification, routing — use smaller models. The energy savings compound at deployment volume. (A routing sketch applying these rules follows this list.)
  • Prefer providers with low-carbon grid mix and closed-loop cooling. Pacific Northwest, Iceland, Quebec, Scandinavia. The grid-mix difference between coal-heavy and renewable-heavy regions is 5-10× on emissions per query. Where data residency allows, pick low-carbon regions.
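As referenced in the list, here is a minimal sketch of energy-aware call routing. The model names, task taxonomy, and per-call Wh table are illustrative assumptions built from this report's estimates, not a real provider API:

```python
# Energy-aware model routing: pick the cheapest mode that fits the task.
# Wh figures are this report's estimates; model and task names are illustrative.

PER_CALL_WH = {
    "small-open-70b": 0.15,          # assumed figure for a 70B FP8 open-weight model
    "frontier-chat": 0.84,
    "frontier-reasoning": 5.1,
    "frontier-long-context": 14.1,
}
CACHED_LONG_CONTEXT_FACTOR = 0.1     # cached 800K call ~10% of uncached (see above)

def route(task: str, context_tokens: int = 0, cached: bool = False) -> tuple[str, float]:
    """Return (model, estimated Wh) for a task under energy-aware defaults."""
    if context_tokens > 200_000:
        wh = PER_CALL_WH["frontier-long-context"]
        if cached:
            wh *= CACHED_LONG_CONTEXT_FACTOR
        return "frontier-long-context", wh
    if task in ("summarization", "classification", "routing"):
        return "small-open-70b", PER_CALL_WH["small-open-70b"]
    if task in ("math", "agent-planning"):   # genuinely needs reasoning traces
        return "frontier-reasoning", PER_CALL_WH["frontier-reasoning"]
    return "frontier-chat", PER_CALL_WH["frontier-chat"]

print(route("classification"))                            # ('small-open-70b', 0.15)
print(route("qa", context_tokens=800_000, cached=True))   # ~1.4 Wh, vs 14.1 uncached
```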

07 · Conclusion: The sustainability frontier moved.

2026 sustainability picture, April 2026

Per-query is decoupling from capability. Absolute is still climbing. Both matter.

By April 2026 the AI sustainability picture is more nuanced than it has ever been. The good news: architectural compression (MoE, MLA, FP8) has decoupled per-query energy from capability improvement — each generation does more for less per call. The harder news: deployment volume is outpacing per-query efficiency by enough that absolute energy is still climbing 38% YoY. Both numbers are simultaneously true, and the framing depends on the question being asked.

The sustainability lever has moved with the lifecycle split. In 2023-2024, training-cost optimization was where the marginal sustainability gains lived. In 2026, inference per-query optimization is the dominant lever — KV-cache discipline, output trim, model-right-sizing, low-carbon-grid siting. These are practical, deployable choices that map onto teams' normal cost-and-quality decisions.

For agency and product teams, the practical takeaway is to measure per-query energy as a first-class metric alongside cost-per-answer and latency. The same operational discipline that produces the cheapest deployment usually produces the most sustainable one — and the alignment between cost and sustainability is one of the field's most important structural truths.
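A sketch of what "per-query energy as a first-class metric" can look like in practice: a per-call record logged next to cost and latency. The field names and schema are illustrative, not a standard; the Wh value would come from an estimator like the one in the methodology section:

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class CallRecord:
    """One user-facing model call, with energy logged alongside cost and latency."""
    model: str
    latency_ms: float
    cost_usd: float
    energy_wh: float    # estimated, e.g. from the per-profile table above

def log_call(model: str, started: float, cost_usd: float, energy_wh: float) -> None:
    rec = CallRecord(model, (time.monotonic() - started) * 1000, cost_usd, energy_wh)
    print(json.dumps(asdict(rec)))   # ship to your metrics pipeline instead

t0 = time.monotonic()
# ... make the model call here ...
log_call("frontier-chat", t0, cost_usd=0.0042, energy_wh=0.84)
```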

Sustainable AI deployment

Move past headline numbers. Measure per query.

We help engineering and sustainability teams measure and reduce AI deployment energy footprint — covering per-query measurement, cache and output discipline, model right-sizing, and low-carbon-grid siting strategy.

Free consultation · Expert guidance · Tailored solutions
What we work on

AI sustainability engagements

  • Per-query energy measurement and reporting
  • Cache topology + output-trim for energy reduction
  • Model right-sizing per workload
  • Low-carbon-region siting and provider selection
  • ESG-aligned AI procurement frameworks
FAQ · AI energy use in 2026

The questions we get every week.

How accurate are these per-query energy numbers?

They're best-available estimates, not first-party-published figures. None of the major model providers publish per-query energy as a default metric. The numbers are derived from public model-architecture data (active parameters, attention design), ML.Energy lab measurements of per-FLOP energy on H100-class GPUs, and reasonable assumptions about data-center PUE (~1.15 hyperscaler average) and grid mix. Real numbers can vary 2-3× in either direction depending on regional electricity mix, exact data-center efficiency, batch size, and model-implementation specifics. Use the numbers as order-of-magnitude estimates for relative comparison, not as audit-grade absolute values.