AI Development · New Release · 10 min read · Published June 16, 2026

The open model that landed at the frontier’s edge.

GLM-5.2 benchmarks: an open model at the frontier’s edge

Three days after GLM-5.2 shipped to the GLM Coding Plan without a single benchmark, the proof has landed: a full scorecard, MIT-licensed open weights, the standalone API, and the model card are all live. The verdict is striking for an open model: #2 on Arena’s Code Arena Frontend board behind only Fable 5, ahead of Claude Opus 4.7 and 4.8 in thinking mode, and within a point of Opus 4.8 on the FrontierSWE agentic coding benchmark, all at the same low GLM-5.1 pricing. Here is what the numbers actually show, and where the frontier still leads.

Digital Applied Team
Senior strategists · Published June 16, 2026
Sources: 7
Code Arena: Frontend · #2 · 1,595 Elo, behind only Fable 5 · Best open model
Terminal-Bench 2.1 · 81.0 · GLM-5.1 scored 63.5 · vs Opus 4.8: 85.0
Open weights · MIT · 753B MoE on Hugging Face · Live now
API price · $1.40/M in · $4.40 out · $0.26 cached · Same as GLM-5.1

GLM-5.2 benchmarks are finally here. On June 16, 2026 — three days after Z.ai pushed the model to its GLM Coding Plan with nothing but adjectives — the company published a full scorecard, opened the weights under the MIT License, and turned on the standalone API and chatbot. The headline a benchmark-less launch could only gesture at is now measured: GLM-5.2 is the strongest open-weight coding model available, and on several agentic benchmarks it trades blows with Claude Opus 4.8.

The single most quotable result comes from outside Z.ai. On Arena.ai’s Code Arena Frontend leaderboard — a human-preference board, not a self-report — GLM-5.2 (Max) ranks #2, behind only Anthropic’s Fable 5 and ahead of Claude Opus 4.7 and Opus 4.8 in thinking mode. An MIT-licensed model you can download today is beating two of the closed frontier’s flagships on frontend coding, as judged by developers.

This post reads the numbers honestly: the Code Arena placement, the agentic and long-horizon coding benchmarks, the full cross-vendor table, the open-weight architecture, the economics, and a clear verdict on where GLM-5.2 wins and where Opus 4.8 still leads. For how the launch itself was sequenced, see our GLM-5.2 launch-day coverage; for the model it improves on, our GLM-5.1 benchmark analysis.

Key takeaways
  1. #2 on Code Arena Frontend, above Opus 4.7 and 4.8. On Arena.ai's human-preference Code Arena Frontend board, GLM-5.2 (Max) scores 1,595 Elo for second place, behind only Fable 5 (1,654, which Arena notes is not currently being sampled) and +29 over Opus 4.7 Thinking (1,566), with Opus 4.8 Thinking at 1,561. It is the top-ranked open model on that board by a wide margin.
  2. It trades blows with Opus 4.8 on agentic coding. GLM-5.2 hits 81.0 on Terminal-Bench 2.1 (vs GLM-5.1's 63.5 and Opus 4.8's 85.0) and 74.4 on FrontierSWE, within ~1% of Opus 4.8's 75.1 and ahead of GPT-5.5. It is the strongest open model on every long-horizon coding benchmark Z.ai reported.
  3. Opus 4.8 still leads the hardest long-horizon work. On SWE-Marathon (ultra-long tasks) GLM-5.2 scores 13.0 to Opus 4.8's 26.0, and on NL2Repo 48.9 to Opus 4.8's 69.7. The frontier gap is real on the most demanding agentic benchmarks, even as GLM-5.2 stays the top open option.
  4. Open weights are live under MIT, with a real 1M context. The 753B-parameter MoE (256 experts, 8 active per token) is downloadable now on Hugging Face and ModelScope under MIT, which is more permissive than GLM-5's Apache-2.0. The 1,048,576-token window is trained on long coding-agent trajectories, with an IndexShare design cutting per-token FLOPs 2.9x at 1M.
  5. Same price as GLM-5.1, and that is the strategic story. API pricing is unchanged from GLM-5.1: $1.40 per million input tokens, $4.40 output, $0.26 cached, a fraction of closed-frontier list rates. A near-frontier coding model that is open-weight and cheap reshapes the build-vs-buy math for teams running coding agents at scale.

01 · The Update · From a benchmark-less launch to a full release.

When GLM-5.2 first appeared on June 13, it was a distribution-first launch: live inside the GLM Coding Plan, but with no API, no chatbot, no public weights, and — conspicuously — no benchmarks. We covered it as exactly that, a coding-plan rollout shipped “before the scoreboard.” Three days later the scoreboard arrived, and it is substantial. Everything Z.ai deferred is now shipped.

The model card is public, the architecture paper is out, the standalone API is priced and live, the Z.ai chatbot hosts GLM-5.2, the MIT-licensed weights are on Hugging Face and ModelScope, and a full cross-vendor benchmark table is published. Independent corroboration showed up too: Arena.ai’s Code Arena ranked GLM-5.2 against the field, and three external labs ran the long-horizon coding benchmarks. The qualifier we attached to every day-one claim — “unverified” — can now be lifted on most of them.

What changed between June 13 and June 16

On launch day the honest position was that GLM-5.2’s measured performance was unknown. That is no longer true. The benchmarks, weights, API, and chatbot all landed within three days, and the results are strong enough that GLM-5.2 has moved from “promising, untested” to a genuine option on any open-weight coding shortlist. The numbers in the rest of this post are sourced from Z.ai’s GLM-5.2 tech blog, its Hugging Face model card, and Arena.ai’s public leaderboard.

02 · Independent Signal · #2 on Code Arena Frontend: the developers’ verdict.

Self-reported benchmarks invite skepticism, so the most valuable GLM-5.2 result is the one Z.ai did not run. Arena.ai’s Code Arena Frontend leaderboard ranks models by blind, head-to-head human preference on real frontend coding tasks. There, GLM-5.2 (Max) lands at #2 with an Elo of 1,595 — behind only Fable 5 and ahead of every other model on the board, including Claude Opus 4.7 (Thinking) at 1,566 and Opus 4.8 (Thinking) at 1,561. Arena puts the gain at +29 points over Opus 4.7 Thinking.

For context, GLM-5.1 sits at #9 on the same board (1,531), so the jump from one generation to the next is large. Arena also reports GLM-5.2 as #2 on the React sub-board and #4 on HTML, and as the top-ranked model in several frontend categories — brand and marketing, reference-based design, data and analytics, consumer product, gaming, and simulations. It is, by Arena’s framing, the best open model on the board by a wide margin over Kimi K2.6 and MiniMax M3.

Code Arena Frontend: top models by human-preference Elo

| Rank | Model | Vendor | Elo | Note |
|------|-------|--------|-----|------|
| #1 | Claude Fable 5 (High) | Anthropic | 1,654 | Not currently sampled, per Arena |
| #2 | GLM-5.2 (Max) | Z.ai | 1,595 | MIT open weights |
| #3 | Claude Opus 4.7 (Thinking) | Anthropic | 1,566 | |
| #4 | Claude Opus 4.8 (Thinking) | Anthropic | 1,561 | |
| #9 | GLM-5.1 | Z.ai | 1,531 | Previous flagship |

Source: Arena.ai Code Arena, Frontend/WebDev leaderboard, June 16, 2026 (Elo rating). Arena notes Fable 5 is not currently being sampled.

Why a human-preference board carries weight

A leaderboard built on blind pairwise human votes is hard to game and independent of the vendor — which is exactly why GLM-5.2’s #2 placement matters more than any number on Z.ai’s own slide. It does not prove GLM-5.2 beats Opus 4.8 everywhere (the agentic benchmarks below show where it does not), but it is strong evidence that for frontend generation, developers prefer its output to most of the closed frontier.
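To make the Elo gaps above more tangible, here is a minimal sketch that converts a rating difference into an expected head-to-head preference rate. It assumes the classic logistic Elo curve with a 400-point scale; Arena's exact rating model is not described in this post, so treat the percentages as a rough reading aid rather than Arena's own numbers. The function name `expected_win_rate` is ours.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise votes for A under the classic logistic Elo curve.

    Assumption: a 400-point scale, as in standard Elo. Arena's actual
    rating model may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings quoted from Arena's Code Arena Frontend board (June 16, 2026).
ratings = {
    "Claude Fable 5 (High)": 1654,
    "GLM-5.2 (Max)": 1595,
    "Claude Opus 4.7 (Thinking)": 1566,
    "Claude Opus 4.8 (Thinking)": 1561,
    "GLM-5.1": 1531,
}

glm = ratings["GLM-5.2 (Max)"]
for name, rating in ratings.items():
    if name != "GLM-5.2 (Max)":
        print(f"GLM-5.2 (Max) vs {name}: {expected_win_rate(glm, rating):.1%}")
```

Under that assumption, a 29-point lead corresponds to being preferred in roughly 54% of pairings: a consistent edge, not a blowout.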

03 · Agentic Coding · The long-horizon coding benchmarks.

GLM-5.2’s pitch is long-horizon agentic work — staying coherent across hours of multi-step coding rather than nailing a single completion — so the benchmarks that matter most are the agentic ones. Three were run by external labs (Proximal, PostTrainBench, and Abundant AI) at 1M context and Max effort, which adds credibility. The pattern is consistent: GLM-5.2 is the top open model on every one, beats GPT-5.5 on most, and sits just below Opus 4.8 — with the gap widening only on the very hardest tasks.

Terminal-Bench 2.1 · 81.0 · Strongest open model

On the Terminus-2 harness, GLM-5.2 scores 81.0 versus GLM-5.1's 63.5, a 17.5-point generational jump. That leaves it four points behind Opus 4.8 (85.0) and ahead of Gemini 3.1 Pro (74.0). On its best-reported harness, Claude Code, GLM-5.2 reaches 82.7.

FrontierSWE · 74.4 · A statistical tie with Opus 4.8

Dominance score (as of June 16): 74.4 against Opus 4.8's 75.1, roughly a 1% gap, while edging GPT-5.5 (72.6) and clearing Opus 4.7 by about 11 points. The benchmark measures open-ended projects at the scale of hours to tens of hours.

PostTrainBench · 34.3 · Second only to Opus 4.8

Each agent gets an H100 and is scored on how much it improves a small model through post-training. GLM-5.2 (34.3) beats GPT-5.5 (28.4) and Opus 4.7, trailing only Opus 4.8 (37.2).

SWE-Marathon · 13.0 · The honest weak spot

Ultra-long-horizon tasks: building compilers, optimizing kernels, shipping production services. GLM-5.2 (13.0) clears GPT-5.5 (12.0) but scores only half of Opus 4.8's 26.0. It is still the top open model here, but the frontier gap is real.

Read together, these say something specific: GLM-5.2 has closed most of the distance to the closed frontier on mainstream agentic coding, but Opus 4.8 still pulls clear on the most demanding, longest tasks. That is a meaningful improvement over GLM-5.1, which our earlier analysis measured at roughly 94.6% of Opus 4.6’s coding score — GLM-5.2 is now comparing itself to Opus 4.8, not 4.6, and holding its own.

04 · The Data · The full benchmark table.

Z.ai published a wide cross-vendor table spanning reasoning, coding, and agentic tool use. The selection below pulls the load-bearing rows and places GLM-5.2 against its predecessor and the three closed frontier models most teams weigh. Across coding, GLM-5.2 is also the strongest open model versus Qwen 3.7 Max, MiniMax M3, and DeepSeek-V4-Pro (not shown). Higher is better in every row.

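GLM-5.2 benchmark scores compared with GLM-5.1, Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro across reasoning, coding, and agentic tasks

| Benchmark | GLM-5.2 | GLM-5.1 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|-----------|---------|---------|----------|---------|----------------|
| Agentic & terminal coding | | | | | |
| Terminal-Bench 2.1 (Terminus-2) | 81.0 | 63.5 | 85.0 | 84.0 | 74.0 |
| SWE-bench Pro | 62.1 | 58.4 | 69.2 | 58.6 | 54.2 |
| FrontierSWE (dominance) | 74.4 | 30.5 | 75.1 | 72.6 | 39.6 |
| NL2Repo | 48.9 | 42.7 | 69.7 | 50.7 | 33.4 |
| SWE-Marathon | 13.0 | 1.0 | 26.0 | 12.0 | 4.0 |
| Reasoning & tool use | | | | | |
| AIME 2026 | 99.2 | 95.3 | 95.7 | 98.3 | 98.2 |
| GPQA-Diamond | 91.2 | 86.2 | 93.6 | 93.6 | 94.3 |
| MCP-Atlas (public) | 76.8 | 71.8 | 77.8 | 75.3 | 69.2 |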

Source: Z.ai GLM-5.2 tech blog (z.ai/blog/glm-5.2), June 16, 2026. Scores are vendor-reported except where run by external labs (FrontierSWE by Proximal, PostTrainBench, SWE-Marathon by Abundant AI), each at 1M context and Max effort.

A fair-reading caveat

These figures are Z.ai’s own table (apart from the externally-run agentic benchmarks and Arena’s independent board). Vendor tables pick favourable harnesses and settings, so treat the closed-model columns as Z.ai’s measurement of them, not necessarily each vendor’s best result. The robust conclusions are the ones the independent signals also support: GLM-5.2 is the top open coder, and it is frontier-adjacent rather than frontier-beating.
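One way to read "frontier-adjacent" off the table is to compute, per row, the point gap and the ratio of GLM-5.2's score to Opus 4.8's. The sketch below does that for the agentic rows, using only numbers from the table above. Dividing benchmark scores is a crude summary (the benchmarks use different scales and scoring rules), so the ratios are a reading aid, not a capability measurement; `relative_to_frontier` is our own helper name.

```python
# (GLM-5.2, Opus 4.8) scores copied from the agentic rows of Z.ai's table.
scores = {
    "Terminal-Bench 2.1": (81.0, 85.0),
    "SWE-bench Pro": (62.1, 69.2),
    "FrontierSWE": (74.4, 75.1),
    "NL2Repo": (48.9, 69.7),
    "SWE-Marathon": (13.0, 26.0),
}


def relative_to_frontier(glm: float, frontier: float) -> float:
    """GLM-5.2's score as a fraction of the frontier model's score."""
    return glm / frontier


for name, (glm, opus) in scores.items():
    gap = opus - glm
    ratio = relative_to_frontier(glm, opus)
    print(f"{name:<20} gap {gap:5.1f} pts   ratio {ratio:.1%}")
```

The ratios fall as the tasks get longer: near parity on FrontierSWE, about half on SWE-Marathon, which is the same pattern the prose describes.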

05 · Architecture · Open weights, the architecture, and a solid 1M context.

Unlike launch day, the weights are real now. GLM-5.2 is published on Hugging Face and ModelScope under the MIT License — the most permissive mainstream open license, and a step looser than the Apache-2.0 that GLM-5 shipped under. Z.ai frames it as “pure open”: no regional limits, no usage borders. The model card and config confirm the shape below, and our GLM-5 architecture analysis covers the lineage this builds on.

Parameters · 753B MoE · 256 experts, 8 active per token

A 753B-parameter mixture-of-experts model: 256 routed experts plus one shared, 8 active per token across 78 layers (the first 3 dense). Slightly larger than GLM-5's 744B base. Weights ship in BF16. Source: Hugging Face config.

Context · 1M tokens · Trained to be usable, not just wide

A 1,048,576-token window, expanded from the 200K of the prior line and trained on long coding-agent trajectories. IndexShare reuses one sparse-attention indexer every four layers, cutting per-token FLOPs 2.9x at 1M length. Addressed as glm-5.2[1m] in Claude Code.

License · MIT · Downloadable today

Live on Hugging Face and ModelScope under MIT, versus GLM-5's Apache-2.0. Runs on transformers, vLLM, SGLang, xLLM, and KTransformers. An improved MTP layer lifts speculative-decoding acceptance length by up to 20%.

Effort · 2 levels · High and Max

Two thinking-effort levels; the API defaults to Max (the deepest), which Z.ai recommends for coding. In Claude Code, low/medium/high map to High and xhigh/max/ultracode map to Max, so deepest reasoning is opt-in.
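For readers less familiar with mixture-of-experts models, the "256 experts, 8 active per token" shape can be illustrated with a generic top-k router. This is a textbook-style sketch, not Z.ai's implementation: the softmax gating, the random logits, and the function name `route_token` are all assumptions made for illustration, and the shared expert is left out.

```python
import math
import random

NUM_ROUTED_EXPERTS = 256  # from the model card, per the article
ACTIVE_PER_TOKEN = 8      # from the model card, per the article


def route_token(router_logits: list[float], k: int = ACTIVE_PER_TOKEN) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalise their weights.

    Generic top-k gating for illustration only; GLM-5.2's real router
    may score and normalise experts differently.
    """
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exp_scores = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exp_scores.values())
    return {i: score / total for i, score in exp_scores.items()}


rng = random.Random(0)  # stand-in for a learned router's output
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
weights = route_token(logits)

print(len(weights), "of", NUM_ROUTED_EXPERTS, "routed experts active")
print(f"active fraction: {ACTIVE_PER_TOKEN / NUM_ROUTED_EXPERTS:.3%}")
```

The point of the design is visible in the last line: each token touches only about 3% of the routed experts, which is why a 753B-parameter model can be served at far less than 753B-parameter compute per token.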
The model id, precisely

In Z.ai’s API the model id is simply glm-5.2, and it already carries the 1M context. The glm-5.2[1m] form is Claude Code’s own naming convention for addressing the 1M window — Z.ai’s own instructions tell Claude Code users to set GLM-5.2[1m] to enable it. Both are correct in their place; there is no separate “1m” model on the API side. Our GLM-4.7 setup guide walks through the provider configuration this rides on.
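The naming and effort rules described above can be captured in a small lookup. This is a sketch of the mapping as this article reports it, not an official client: the helper name `claude_code_settings` and the returned dictionary shape are hypothetical, and the lowercase `glm-5.2[1m]` spelling follows the first form quoted above.

```python
API_MODEL_ID = "glm-5.2"              # id on Z.ai's API; already carries the 1M context
CLAUDE_CODE_MODEL_ID = "glm-5.2[1m]"  # Claude Code's naming for the 1M window

# Claude Code effort setting -> GLM-5.2 thinking-effort level, as reported above.
EFFORT_MAP = {
    "low": "High",
    "medium": "High",
    "high": "High",
    "xhigh": "Max",
    "max": "Max",
    "ultracode": "Max",
}


def claude_code_settings(effort: str) -> dict[str, str]:
    """Return the model id and resulting GLM-5.2 effort level for a Claude Code effort setting."""
    key = effort.lower()
    if key not in EFFORT_MAP:
        raise ValueError(f"unknown Claude Code effort setting: {effort!r}")
    return {"model": CLAUDE_CODE_MODEL_ID, "glm_effort": EFFORT_MAP[key]}


print(claude_code_settings("high"))   # lands on GLM-5.2's High level
print(claude_code_settings("xhigh"))  # lands on GLM-5.2's Max level
```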

06 · Economics · The part that actually moves decisions: price.

Benchmarks decide whether a model is good enough; economics decide whether you switch. GLM-5.2’s most consequential detail is that it costs exactly what GLM-5.1 did — $1.40 per million input tokens, $4.40 output, $0.26 cached — while delivering a generation’s worth of improvement. A near-frontier coding model at that price, with an open-weight fallback, is what reshapes the build-versus-buy calculation. There are three ways to run it, each with a different cost shape.

API · $1.40 in, $4.40 out per M tokens · The pay-per-token path

Z.ai's standalone API, now live, prices GLM-5.2 identically to GLM-5.1, with cached input at $0.26/M. That is a fraction of closed-frontier list rates, and the natural entry point for non-subscribers who skipped the day-one Coding Plan.

Coding Plan · From $18 / month · The metered-prompt path

The subscription that meters prompts (15-20 model calls each) across Lite, Pro, Max, and Team. GLM-5.2 draws 3x quota at peak (14:00-18:00 UTC+8) and 2x off-peak, but a promo bills off-peak at 1x through end of September. Works in Claude Code, Cline, OpenCode, and Z.ai's new ZCode.

Self-host · Your hardware, $0 / token · The MIT open-weight path

The 753B weights on Hugging Face and ModelScope run under transformers, vLLM, or SGLang. No per-token meter and no code leaving your perimeter, at the cost of serving a large MoE yourself. For most teams this is a compliance and data-sovereignty lever, not the default.
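The Coding Plan's quota rules are easy to misread, so here is a small sketch of the multiplier logic as described above. Two details are assumptions on our part: that the peak window is half-open (14:00 up to but not including 18:00, UTC+8), and that "end of September" means September 30, 2026. The function name `quota_multiplier` is ours, not Z.ai's.

```python
from datetime import date, datetime, timedelta, timezone

UTC8 = timezone(timedelta(hours=8))
PROMO_END = date(2026, 9, 30)  # assumption: "end of September" = 2026-09-30


def quota_multiplier(when: datetime) -> int:
    """Quota multiplier for one GLM-5.2 prompt on the Coding Plan.

    `when` must be timezone-aware. Peak (14:00-18:00 UTC+8) costs 3x;
    off-peak costs 2x, or 1x while the promo runs.
    """
    local = when.astimezone(UTC8)
    if 14 <= local.hour < 18:  # assumption: 18:00 itself counts as off-peak
        return 3
    return 1 if local.date() <= PROMO_END else 2


print(quota_multiplier(datetime(2026, 7, 1, 15, 0, tzinfo=UTC8)))   # peak
print(quota_multiplier(datetime(2026, 7, 1, 9, 0, tzinfo=UTC8)))    # off-peak, promo active
print(quota_multiplier(datetime(2026, 10, 1, 9, 0, tzinfo=UTC8)))   # off-peak, promo over
```

For teams outside East Asia the practical takeaway is that the peak window lands at 06:00-10:00 UTC, so batch agent runs scheduled outside it draw a third of the quota during the promo.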
The strategic read

The closed frontier still wins the hardest long-horizon benchmarks, but it now has to justify a large price premium against an open model that ranks #2 on a human-preference coding board and ties it on FrontierSWE. For a team running coding agents at volume, “90-95% of frontier capability at a fraction of the cost, with a self-host escape hatch” is a stronger pitch than another point of benchmark score. That is the lever Chinese open-weight labs keep pulling, and GLM-5.2 pulls it harder than any of its predecessors.
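To turn "a fraction of the cost" into a number for your own workload, the API list prices quoted above can be plugged into a simple estimator. The prices come from this article; the monthly token volumes in the example are hypothetical, and the sketch assumes cached input tokens are counted separately from (not as a subset of) regular input tokens.

```python
# USD per million tokens, as quoted in this article for GLM-5.2.
PRICE_PER_M = {"input": 1.40, "output": 4.40, "cached_input": 0.26}


def api_cost_usd(input_tokens: int, output_tokens: int, cached_input_tokens: int = 0) -> float:
    """Estimated API spend in USD.

    Assumption: `input_tokens` excludes cached tokens, which are billed
    at the cached rate instead.
    """
    total = (
        input_tokens * PRICE_PER_M["input"]
        + output_tokens * PRICE_PER_M["output"]
        + cached_input_tokens * PRICE_PER_M["cached_input"]
    )
    return total / 1_000_000


# Hypothetical month for an agent fleet: heavy prompt reuse, modest output.
monthly = api_cost_usd(
    input_tokens=40_000_000,
    output_tokens=10_000_000,
    cached_input_tokens=160_000_000,
)
print(f"Estimated monthly spend: ${monthly:,.2f}")
```

Running the same token counts against another vendor's list prices gives a like-for-like comparison; cache hit rate tends to dominate the result for agentic workloads, since cached input is the cheapest line item here.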

07 · The Verdict · Where GLM-5.2 wins, and where it does not.

GLM-5.2 is the strongest open-weight coding model available in June 2026, and frontier-adjacent on agentic work — but it is not a frontier-beater across the board, and the honest matrix says so. Here is how it stacks against the realistic alternatives a team would weigh this month, now scored on confirmed numbers rather than launch-day adjectives.

GLM-5.2 · The open-weight value pick · Best value + open

Ranks #2 on Code Arena Frontend, scores 81.0 on Terminal-Bench 2.1, and lands within ~1% of Opus 4.8 on FrontierSWE, all at $1.40/$4.40 per M and with MIT weights you can self-host. Weakest on the very hardest long-horizon tasks (SWE-Marathon, NL2Repo). Pick when: you want near-frontier coding at a fraction of the cost, an open-weight fallback, or both.

Claude Opus 4.8 · The closed capability ceiling · Capability ceiling

Still leads the hardest agentic benchmarks (SWE-bench Pro 69.2, NL2Repo 69.7, SWE-Marathon 26.0), with the deepest Claude Code harness ecosystem and a 1M context. Closed weights, premium pricing. Pick when: maximum capability on the longest, hardest tasks outweighs cost and weights access.

Kimi K2.7-Code · The other open coding specialist · Open-weight rival

Moonshot's coding-focused open-weight model under Modified MIT, with token-efficiency gains and the open Kimi Code CLI from $19/mo. A direct open-weight alternative to GLM-5.2 worth A/B-ing on your own repos. Pick when: you want to compare open coders head-to-head before committing.

Qwen 3.7 Max · The closed benchmark contender · Closed alternative

Alibaba's flagship with a 1M context and competitive coding scores (SWE-bench Pro 60.6), at mid-tier pricing. Closed weights, so no self-hosting. GLM-5.2 edges it on most coding rows and beats it decisively on Code Arena. Pick when: you are already in the Alibaba ecosystem and want a closed option.

08 · Action Guide · What dev teams should do now.

Run your own evaluation — the cost is now near zero. With the API live at GLM-5.1 pricing and weights on Hugging Face, there is no longer a reason to take any benchmark on faith. Point a coding agent at glm-5.2 (or GLM-5.2[1m] in Claude Code), set Max effort, and run one real multi-file refactor or failing-test fix on a branch. Score end-to-end completion without intervention — that is the long-horizon claim, tested directly.
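If you run more than one trial, it helps to record outcomes in a consistent shape so the "completed without intervention" rate is comparable across models. The following is a minimal bookkeeping sketch, not a benchmark harness: the `TrialRun` fields and the sample runs are hypothetical, and "completed" is defined here as tests passing with zero human interventions.

```python
from dataclasses import dataclass


@dataclass
class TrialRun:
    task: str
    tests_passed: bool        # did the branch end with a green test suite?
    human_interventions: int  # times a person had to step in or re-prompt


def unattended_completion_rate(runs: list[TrialRun]) -> float:
    """Share of runs that finished with passing tests and no human help."""
    if not runs:
        return 0.0
    clean = sum(1 for r in runs if r.tests_passed and r.human_interventions == 0)
    return clean / len(runs)


# Hypothetical results from three trials on a scratch branch.
runs = [
    TrialRun("multi-file refactor", tests_passed=True, human_interventions=0),
    TrialRun("failing-test fix", tests_passed=True, human_interventions=1),
    TrialRun("dependency upgrade", tests_passed=False, human_interventions=0),
]
print(f"Unattended completion: {unattended_completion_rate(runs):.0%}")
```

Counting a nudged-but-passing run as a failure is deliberately strict; it is the long-horizon claim that is under test, not whether the model can finish with supervision.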

Match the model to the task tier. The benchmarks suggest a clean split: GLM-5.2 for the bulk of day-to-day agentic coding, where it is frontier-adjacent and far cheaper, and reserve Opus 4.8 for the hardest, longest tasks where it still pulls clear (large-scale refactors, multi-hour autonomous runs). A two-model routing setup captures most of the savings without giving up the ceiling.
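A two-model routing setup can start as a single rule. The sketch below sends long autonomous runs and large-scale refactors to the frontier model and everything else to GLM-5.2. Everything apart from the `glm-5.2` id is an assumption for illustration: the four-hour cutoff, the `CodingTask` fields, and the `opus-4.8` label, which is a placeholder rather than a verified API model id.

```python
from dataclasses import dataclass

GLM_MODEL = "glm-5.2"        # Z.ai API model id, per this article
FRONTIER_MODEL = "opus-4.8"  # placeholder label, NOT a verified API model id
LONG_RUN_HOURS = 4.0         # hypothetical cutoff for a "multi-hour autonomous run"


@dataclass
class CodingTask:
    summary: str
    expected_hours: float
    large_scale_refactor: bool = False


def pick_model(task: CodingTask) -> str:
    """Route the hardest, longest work to the frontier model; default to GLM-5.2."""
    if task.large_scale_refactor or task.expected_hours >= LONG_RUN_HOURS:
        return FRONTIER_MODEL
    return GLM_MODEL


tasks = [
    CodingTask("fix failing unit test", expected_hours=0.5),
    CodingTask("add pagination to list endpoint", expected_hours=1.5),
    CodingTask("migrate monolith to new ORM", expected_hours=12.0, large_scale_refactor=True),
]
for task in tasks:
    print(f"{task.summary:<35} -> {pick_model(task)}")
```

In practice the threshold is the thing to tune: start conservative, log which tasks each model completes unattended, and move the cutoff as your own results come in.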

Treat the open weights as a real lever now, not a someday. Unlike launch day, the MIT weights are downloadable, so a self-hosting or data-residency plan can actually be scoped. For regulated workloads it is a genuine path; for most teams it is a negotiating lever and a compliance fallback. Our AI transformation services cover model selection and deployment, and our web development team ships with these agentic stacks daily.

Re-check the leaderboards in a fortnight. Arena’s Elo will keep moving as vote volume grows, and independent SWE-bench and Terminal-Bench re-runs typically firm up within one to two weeks of an open-weight release. The picture here is strong and unusually well-corroborated for a day-of report, but the open weights mean the community will pressure-test it fast — let the standardize-on-it decision follow that.

Conclusion

The open frontier just got a lot harder to ignore.

Three days ago GLM-5.2 was a flagship with no scoreboard. Now it is the best open-weight coding model on the board, #2 on a human-preference frontend leaderboard above two Claude Opus flagships, within a point of Opus 4.8 on FrontierSWE, and shipping under MIT at unchanged GLM-5.1 prices. The launch-day caution has largely paid off: the family that earned its credibility with GLM-5 backed up the adjectives with numbers.

The honest qualifier is still worth stating. Opus 4.8 keeps the lead on the hardest, longest agentic tasks, and Z.ai’s table is its own measurement of rivals. But the robust, independently-supported conclusion is that an MIT-licensed model you can download today now does most of what the closed frontier does, for a fraction of the price. For any team running coding agents at scale, that is no longer a model to watch — it is one to test this week.

FAQ · GLM-5.2 benchmarks

The questions teams ask about GLM-5.2’s numbers.

GLM-5.2 is frontier-adjacent rather than frontier-beating. It ranks #2 on Arena's Code Arena Frontend board, above Opus 4.7 and 4.8 in thinking mode, and comes within a point of Opus 4.8 on FrontierSWE (74.4 vs 75.1). But Opus 4.8 still leads the hardest agentic benchmarks: SWE-bench Pro (69.2 vs 62.1), NL2Repo (69.7 vs 48.9), and SWE-Marathon (26.0 vs 13.0). The practical summary: GLM-5.2 beats Opus 4.8 on human-preference frontend coding and runs close to it on FrontierSWE and Terminal-Bench 2.1, while Opus 4.8 pulls clear on the longest, most demanding tasks.