GLM-5.2 benchmarks are finally here. On June 16, 2026 — three days after Z.ai pushed the model to its GLM Coding Plan with nothing but adjectives — the company published a full scorecard, opened the weights under the MIT License, and turned on the standalone API and chatbot. The headline a benchmark-less launch could only gesture at is now measured: GLM-5.2 is the strongest open-weight coding model available, and on several agentic benchmarks it trades blows with Claude Opus 4.8.
The single most quotable result comes from outside Z.ai. On Arena.ai’s Code Arena Frontend leaderboard — a human-preference board, not a self-report — GLM-5.2 (Max) ranks #2, behind only Anthropic’s Fable 5 and ahead of Claude Opus 4.7 and Opus 4.8 in thinking mode. An MIT-licensed model you can download today is beating two of the closed frontier’s flagships on frontend coding, as judged by developers.
This post reads the numbers honestly: the Code Arena placement, the agentic and long-horizon coding benchmarks, the full cross-vendor table, the open-weight architecture, the economics, and a clear verdict on where GLM-5.2 wins and where Opus 4.8 still leads. For how the launch itself was sequenced, see our GLM-5.2 launch-day coverage; for the model it improves on, our GLM-5.1 benchmark analysis.
- 01#2 on Code Arena Frontend, above Opus 4.7 and 4.8.On Arena.ai's human-preference Code Arena Frontend board, GLM-5.2 (Max) scores 1,595 Elo for second place — behind only Fable 5 (1,654, which Arena notes is not currently being sampled) and +29 over Opus 4.7 Thinking (1,566), with Opus 4.8 Thinking at 1,561. It is the top-ranked open model on that board by a wide margin.
- 02It trades blows with Opus 4.8 on agentic coding.GLM-5.2 hits 81.0 on Terminal-Bench 2.1 (vs GLM-5.1's 63.5 and Opus 4.8's 85.0) and 74.4 on FrontierSWE — within ~1% of Opus 4.8's 75.1, ahead of GPT-5.5. It is the strongest open model on every long-horizon coding benchmark Z.ai reported.
- 03Opus 4.8 still leads the hardest long-horizon work.On SWE-Marathon (ultra-long tasks) GLM-5.2 scores 13.0 to Opus 4.8's 26.0, and on NL2Repo 48.9 to Opus 4.8's 69.7. The frontier gap is real on the most demanding agentic benchmarks, even as GLM-5.2 stays the top open option.
- 04Open weights are live under MIT, with a real 1M context.The 753B-parameter MoE (256 experts, 8 active per token) is downloadable now on Hugging Face and ModelScope under MIT — more permissive than GLM-5's Apache-2.0. The 1,048,576-token window is trained on long coding-agent trajectories, with an IndexShare design cutting per-token FLOPs 2.9x at 1M.
- 05Same price as GLM-5.1 — that is the strategic story.API pricing is unchanged from GLM-5.1: $1.40 per million input tokens, $4.40 output, $0.26 cached — a fraction of closed-frontier list rates. A near-frontier coding model that is open-weight and cheap reshapes the build-vs-buy math for teams running coding agents at scale.
01 — The UpdateFrom a benchmark-less launch to a full release.
When GLM-5.2 first appeared on June 13, it was a distribution-first launch: live inside the GLM Coding Plan, but with no API, no chatbot, no public weights, and — conspicuously — no benchmarks. We covered it as exactly that, a coding-plan rollout shipped “before the scoreboard.” Three days later the scoreboard arrived, and it is substantial. Everything Z.ai deferred is now shipped.
The model card is public, the architecture paper is out, the standalone API is priced and live, the Z.ai chatbot hosts GLM-5.2, the MIT-licensed weights are on Hugging Face and ModelScope, and a full cross-vendor benchmark table is published. Independent corroboration showed up too: Arena.ai’s Code Arena ranked GLM-5.2 against the field, and three external labs ran the long-horizon coding benchmarks. The qualifier we attached to every day-one claim — “unverified” — can now be lifted on most of them.
On launch day the honest position was that GLM-5.2’s measured performance was unknown. That is no longer true. The benchmarks, weights, API, and chatbot all landed within three days, and the results are strong enough that GLM-5.2 has moved from “promising, untested” to a genuine option on any open-weight coding shortlist. The numbers in the rest of this post are sourced from Z.ai’s GLM-5.2 tech blog, its Hugging Face model card, and Arena.ai’s public leaderboard.
02 — Independent Signal#2 on Code Arena Frontend — the developers’ verdict.
Self-reported benchmarks invite skepticism, so the most valuable GLM-5.2 result is the one Z.ai did not run. Arena.ai’s Code Arena Frontend leaderboard ranks models by blind, head-to-head human preference on real frontend coding tasks. There, GLM-5.2 (Max) lands at #2 with an Elo of 1,595 — behind only Fable 5 and ahead of every other model on the board, including Claude Opus 4.7 (Thinking) at 1,566 and Opus 4.8 (Thinking) at 1,561. Arena puts the gain at +29 points over Opus 4.7 Thinking.
For context, GLM-5.1 sits at #9 on the same board (1,531), so the jump from one generation to the next is large. Arena also reports GLM-5.2 as #2 on the React sub-board and #4 on HTML, and as the top-ranked model in several frontend categories — brand and marketing, reference-based design, data and analytics, consumer product, gaming, and simulations. It is, by Arena’s framing, the best open model on the board by a wide margin over Kimi K2.6 and MiniMax M3.
Code Arena: Frontend — top models by human-preference Elo
Source: Arena.ai Code Arena — Frontend/WebDev leaderboard, June 16, 2026 (Elo rating; bars scaled from a 1,450 baseline to match Arena's own chart). Arena notes Fable 5 is not currently being sampled.A leaderboard built on blind pairwise human votes is hard to game and independent of the vendor — which is exactly why GLM-5.2’s #2 placement matters more than any number on Z.ai’s own slide. It does not prove GLM-5.2 beats Opus 4.8 everywhere (the agentic benchmarks below show where it does not), but it is strong evidence that for frontend generation, developers prefer its output to most of the closed frontier.
03 — Agentic CodingThe long-horizon coding benchmarks.
GLM-5.2’s pitch is long-horizon agentic work — staying coherent across hours of multi-step coding rather than nailing a single completion — so the benchmarks that matter most are the agentic ones. Three were run by external labs (Proximal, PostTrainBench, and Abundant AI) at 1M context and Max effort, which adds credibility. The pattern is consistent: GLM-5.2 is the top open model on every one, beats GPT-5.5 on most, and sits just below Opus 4.8 — with the gap widening only on the very hardest tasks.
Strongest open model
On the Terminus-2 harness, 81.0 versus GLM-5.1's 63.5 — a 17.5-point generational jump. Within four points of Opus 4.8 (85.0) and ahead of Gemini 3.1 Pro (74.0). On its best-reported harness, GLM-5.2 reaches 82.7 in Claude Code.
A statistical tie with Opus 4.8
Dominance score (as of June 16): 74.4 against Opus 4.8's 75.1 — roughly a 1% gap — while edging GPT-5.5 (72.6) and clearing Opus 4.7 by about 11 points. Measures open-ended projects at the scale of hours to tens of hours.
Second only to Opus 4.8
Each agent gets an H100 and is scored on how much it improves a small model through post-training. GLM-5.2 (34.3) beats GPT-5.5 (28.4) and Opus 4.7, trailing only Opus 4.8 (37.2).
The honest weak spot
Ultra-long-horizon tasks — building compilers, optimizing kernels, shipping production services. GLM-5.2 (13.0) clears GPT-5.5 (12.0) but trails Opus 4.8 (26.0) by half. Still the top open model here, but the frontier gap is real.
Read together, these say something specific: GLM-5.2 has closed most of the distance to the closed frontier on mainstream agentic coding, but Opus 4.8 still pulls clear on the most demanding, longest tasks. That is a meaningful improvement over GLM-5.1, which our earlier analysis measured at roughly 94.6% of Opus 4.6’s coding score — GLM-5.2 is now comparing itself to Opus 4.8, not 4.6, and holding its own.
04 — The DataThe full benchmark table.
Z.ai published a wide cross-vendor table spanning reasoning, coding, and agentic tool use. The selection below pulls the load-bearing rows and places GLM-5.2 against its predecessor and the three closed frontier models most teams weigh. Across coding, GLM-5.2 is also the strongest open model versus Qwen 3.7 Max, MiniMax M3, and DeepSeek-V4-Pro (not shown). Higher is better in every row.
| Benchmark | GLM-5.2 | GLM-5.1 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| Agentic & terminal coding | |||||
| Terminal-Bench 2.1 (Terminus-2) | 81.0 | 63.5 | 85.0 | 84.0 | 74.0 |
| SWE-bench Pro | 62.1 | 58.4 | 69.2 | 58.6 | 54.2 |
| FrontierSWE (dominance) | 74.4 | 30.5 | 75.1 | 72.6 | 39.6 |
| NL2Repo | 48.9 | 42.7 | 69.7 | 50.7 | 33.4 |
| SWE-Marathon | 13.0 | 1.0 | 26.0 | 12.0 | 4.0 |
| Reasoning & tool use | |||||
| AIME 2026 | 99.2 | 95.3 | 95.7 | 98.3 | 98.2 |
| GPQA-Diamond | 91.2 | 86.2 | 93.6 | 93.6 | 94.3 |
| MCP-Atlas (public) | 76.8 | 71.8 | 77.8 | 75.3 | 69.2 |
Source: Z.ai GLM-5.2 tech blog (z.ai/blog/glm-5.2), June 16, 2026. Scores are vendor-reported except where run by external labs (FrontierSWE by Proximal, PostTrainBench, SWE-Marathon by Abundant AI), each at 1M context and Max effort.
These figures are Z.ai’s own table (apart from the externally-run agentic benchmarks and Arena’s independent board). Vendor tables pick favourable harnesses and settings, so treat the closed-model columns as Z.ai’s measurement of them, not necessarily each vendor’s best result. The robust conclusions are the ones the independent signals also support: GLM-5.2 is the top open coder, and it is frontier-adjacent rather than frontier-beating.
05 — ArchitectureOpen weights, the architecture, and a solid 1M context.
Unlike launch day, the weights are real now. GLM-5.2 is published on Hugging Face and ModelScope under the MIT License — the most permissive mainstream open license, and a step looser than the Apache-2.0 that GLM-5 shipped under. Z.ai frames it as “pure open”: no regional limits, no usage borders. The model card and config confirm the shape below, and our GLM-5 architecture analysis covers the lineage this builds on.
256 experts, 8 active per token
A 753B-parameter mixture-of-experts model: 256 routed experts plus one shared, 8 active per token across 78 layers (the first 3 dense). Slightly larger than GLM-5's 744B base. Weights ship in BF16.
Trained to be usable, not just wide
A 1,048,576-token window, expanded from the 200K of the prior line and trained on long coding-agent trajectories. IndexShare reuses one sparse-attention indexer every four layers, cutting per-token FLOPs 2.9x at 1M length.
Downloadable today
Live on Hugging Face and ModelScope under MIT. Runs on transformers, vLLM, SGLang, xLLM, and KTransformers. An improved MTP layer lifts speculative-decoding acceptance length by up to 20%.
High and Max
Two thinking-effort levels; the API defaults to Max, which Z.ai recommends for coding. In Claude Code, low/medium/high map to High and xhigh/max/ultracode map to Max — so deepest reasoning is opt-in.
In Z.ai’s API the model id is simply glm-5.2, and it already carries the 1M context. The glm-5.2[1m] form is Claude Code’s own naming convention for addressing the 1M window — Z.ai’s own instructions tell Claude Code users to set GLM-5.2[1m] to enable it. Both are correct in their place; there is no separate “1m” model on the API side. Our GLM-4.7 setup guide walks through the provider configuration this rides on.
06 — EconomicsThe part that actually moves decisions: price.
Benchmarks decide whether a model is good enough; economics decide whether you switch. GLM-5.2’s most consequential detail is that it costs exactly what GLM-5.1 did — $1.40 per million input tokens, $4.40 output, $0.26 cached — while delivering a generation’s worth of improvement. A near-frontier coding model at that price, with an open-weight fallback, is what reshapes the build-versus-buy calculation. There are three ways to run it, each with a different cost shape.
The pay-per-token path
Z.ai's standalone API, now live, prices GLM-5.2 identically to GLM-5.1, with cached input at $0.26/M. A fraction of closed-frontier list rates — the natural entry point for non-subscribers who skipped the day-one Coding Plan.
The metered-prompt path
The subscription that meters prompts (15-20 model calls each) across Lite, Pro, Max, and Team. GLM-5.2 draws 3x quota at peak and 2x off-peak — but a promo bills off-peak at 1x through end of September. Works in Claude Code, Cline, OpenCode, and Z.ai's new ZCode.
The MIT open-weight path
The 753B weights on Hugging Face and ModelScope run under transformers, vLLM, or SGLang. No per-token meter and no code leaving your perimeter — at the cost of serving a large MoE yourself. For most teams a compliance lever, not the default.
The closed frontier still wins the hardest long-horizon benchmarks, but it now has to justify a large price premium against an open model that ranks #2 on a human-preference coding board and ties it on FrontierSWE. For a team running coding agents at volume, “90-95% of frontier capability at a fraction of the cost, with a self-host escape hatch” is a stronger pitch than another point of benchmark score. That is the lever Chinese open-weight labs keep pulling, and GLM-5.2 pulls it harder than any of its predecessors.
07 — The VerdictWhere GLM-5.2 wins — and where it does not.
GLM-5.2 is the strongest open-weight coding model available in June 2026, and frontier-adjacent on agentic work — but it is not a frontier-beater across the board, and the honest matrix says so. Here is how it stacks against the realistic alternatives a team would weigh this month, now scored on confirmed numbers rather than launch-day adjectives.
The open-weight value pick
#2 on Code Arena Frontend, 81.0 on Terminal-Bench 2.1, FrontierSWE within ~1% of Opus 4.8 — at $1.40/$4.40 per M and MIT weights you can self-host. Weakest on the very hardest long-horizon tasks (SWE-Marathon, NL2Repo). Pick when: you want near-frontier coding at a fraction of the cost, an open-weight fallback, or both.
The closed capability ceiling
Still leads the hardest agentic benchmarks — SWE-bench Pro (69.2), NL2Repo (69.7), SWE-Marathon (26.0) — with the deepest Claude Code harness ecosystem and a 1M context. Closed weights, premium pricing. Pick when: maximum capability on the longest, hardest tasks outweighs cost and weights access.
The other open coding specialist
Moonshot's coding-focused open-weight model under Modified MIT, with token-efficiency gains and the open Kimi Code CLI from $19/mo. A direct open-weight alternative to GLM-5.2 worth A/B-ing on your own repos. Pick when: you want to compare open coders head-to-head before committing.
The closed benchmark contender
Alibaba's flagship with a 1M context and competitive coding scores (SWE-bench Pro 60.6), at mid-tier pricing. Closed weights — no self-hosting. GLM-5.2 edges it on most coding rows and beats it decisively on Code Arena. Pick when: you are already in the Alibaba ecosystem and want a closed option.
08 — Action GuideWhat dev teams should do now.
Run your own evaluation — the cost is now near zero. With the API live at GLM-5.1 pricing and weights on Hugging Face, there is no longer a reason to take any benchmark on faith. Point a coding agent at glm-5.2 (or GLM-5.2[1m] in Claude Code), set Max effort, and run one real multi-file refactor or failing-test fix on a branch. Score end-to-end completion without intervention — that is the long-horizon claim, tested directly.
Match the model to the task tier. The benchmarks suggest a clean split: GLM-5.2 for the bulk of day-to-day agentic coding, where it is frontier-adjacent and far cheaper, and reserve Opus 4.8 for the hardest, longest tasks where it still pulls clear (large-scale refactors, multi-hour autonomous runs). A two-model routing setup captures most of the savings without giving up the ceiling.
Treat the open weights as a real lever now, not a someday. Unlike launch day, the MIT weights are downloadable, so a self-hosting or data-residency plan can actually be scoped. For regulated workloads it is a genuine path; for most teams it is a negotiating lever and a compliance fallback. Our AI transformation services cover model selection and deployment, and our web development team ships with these agentic stacks daily.
Re-check the leaderboards in a fortnight. Arena’s Elo will keep moving as vote volume grows, and independent SWE-bench and Terminal-Bench re-runs typically firm up within one to two weeks of an open-weight release. The picture here is strong and unusually well-corroborated for a day-of report, but the open weights mean the community will pressure-test it fast — let the standardize-on-it decision follow that.
The open frontier just got a lot harder to ignore.
Three days ago GLM-5.2 was a flagship with no scoreboard. Now it is the best open-weight coding model on the board, #2 on a human-preference frontend leaderboard above two Claude Opus flagships, within a point of Opus 4.8 on FrontierSWE, and shipping under MIT at unchanged GLM-5.1 prices. The launch-day caution has largely paid off: the family that earned its credibility with GLM-5 backed up the adjectives with numbers.
The honest qualifier is still worth stating. Opus 4.8 keeps the lead on the hardest, longest agentic tasks, and Z.ai’s table is its own measurement of rivals. But the robust, independently-supported conclusion is that an MIT-licensed model you can download today now does most of what the closed frontier does, for a fraction of the price. For any team running coding agents at scale, that is no longer a model to watch — it is one to test this week.