GLM-5.2 API access is no longer a one-vendor decision. Since Z.ai published the model’s weights under an MIT license on June 16, 2026, roughly 20 independent inference hosts have stood up their own endpoints — and on July 3, 2026, OpenRouter alone routed 29 live GLM-5.2 listings with input prices spanning $0.93 to $3.00 per million tokens.
That spread is the whole story. Pick well and you pay about a third less than Z.ai’s own sticker. Pick carelessly and you pay a 2x premium — or worse, your 128K-token agent outputs get silently truncated to 32K because the cheapest route carries the tightest output cap on the market. The cheapest listing and the right listing are not the same thing.
This guide prices every route worth considering. We pulled OpenRouter’s live endpoints API, cross-checked each figure against the provider’s own model page, and layered in Artificial Analysis’s independent speed benchmarks — then built the scoreboard table and decision rows no single source publishes. Subscription plans, the free Flash tier, and the self-hosting reality check round it out.
- 01Z.ai’s list price is the market median, not a premium.The official API lists $1.40 input / $0.26 cached / $4.40 output per Mtok. Fireworks and Novita list the identical sticker on their own pages — the anchor every other route is judged against.
- 02DeepInfra is the market floor — with a catch.At $0.93–0.95 in / $0.18 cached / $3.00 out (fp4), DeepInfra Standard is the cheapest active listing across three independent surfaces. But its OpenRouter route caps output at 32,768 tokens — the tightest ceiling found.
- 03OpenRouter is a marketplace, not a price.29 live endpoints across ~20 provider brands on July 3, 2026. The sub-sticker input band ran $0.93–$1.20 and fluctuates; the full span reached $3.00 in / $10.25 out on premium latency tiers.
- 04Output caps and quantization decide more than price.Max-completion ceilings span 32,768 to 1,048,576 tokens by host, and quantization varies between fp8, fp4, and unstated. No GLM-5.2-specific fp8-vs-fp4 quality benchmark exists yet — run your own evals.
- 05Pay-per-token isn’t the only door.The GLM Coding Plan lists at $18 / $72 / $160 per month for quota-based access, and Z.ai’s pricing table lists GLM-4.7-Flash as fully free — a genuine no-cost tier for validating workloads first.
01 — Why It MattersWhy one model has twenty prices.
GLM-5.2 was announced on June 13, 2026 — GLM Coding Plan subscribers first, API and chatbot access days later — and the open weights landed on Hugging Face and ModelScope under an MIT license on June 16. That license is why this comparison exists at all: anyone with datacenter GPUs can legally serve the model, so within weeks the market filled with independent hosts competing on price, speed, and infrastructure story.
The demand side is simple economics. Anthropic’s Claude Opus 4.8 lists at $5 per million input tokens and $25 per million output. Z.ai’s GLM-5.2 list price is roughly 3.6x cheaper on input and 5.7x cheaper on output. The honest framing — the one our own GLM-5.2 benchmark breakdown lands on — is that GLM-5.2 is near-frontier on many single-shot coding benchmarks at a fraction of the cost, but trails Opus 4.8 on sustained long-horizon agent work. Most teams we work with treat it as a pairing, not a replacement: route high-volume, well-scoped tasks to the cheap model and keep the frontier model for the work that justifies frontier pricing.
Z.ai list input, per Mtok
The official API’s sticker — $1.40 input, $0.26 cached input, $4.40 output. Per Z.ai’s own pricing docs as of July 3, 2026, with cached-input storage listed as limited-time free.
cheaper input at list
Opus 4.8 lists $5 in / $25 out per Mtok. GLM-5.2’s list is roughly 3.6x cheaper on input and 5.7x on output. That gap — not benchmark parity — is why teams shop this market.
license since June 16
Announced June 13, 2026; MIT weights published June 16 on Hugging Face and ModelScope. Permissive licensing is what lets ~20 independent hosts undercut the vendor’s own API.
02 — The Vendor RouteZ.ai native: the anchor price.
Z.ai’s own API is the reference implementation: $1.40 per million input tokens, $0.26 for cached input, $4.40 for output, with cached-input storage listed as “Limited-time Free” on the official pricing page. The model card states a 1M-token context window and a 128K-token maximum output — the most generous output ceiling among the mainstream routes, and the cap most marketplace hosts mirror.
Interestingly, the sticker is not a vendor premium. DeepInfra’s own market analysis, citing Artificial Analysis data, puts the median tracked provider price at exactly Z.ai’s list — $1.40 / $4.40 / $0.26. Fireworks and Novita list the identical figures on their own model pages. In practice, $1.40 in / $4.40 out is the market’s consensus price, and everything else is a deliberate deviation: discounts to win traffic, or premiums to sell speed.
Two more things distinguish the native route. First, the GLM Coding Plan — Z.ai’s subscription — only works against Z.ai’s own endpoint, so plan subscribers who follow our Claude Code setup guide are on this route by definition. Second, data handling: Z.ai’s privacy policy states that services are generally provided from Singapore and that API request content is not stored — “The Company do not store any of the content the Customer or its End Users provide or generate while using our Services.” Its parent company is registered in Beijing, a separate, secondary-sourced fact about legal domicile rather than data flow — we unpack both in the decision rows below.
If you ran the predecessor model through a gateway, note that the playbook from our GLM-4.6 API deployment guide still applies structurally — same OpenAI-compatible surface, same Anthropic-compatible proxy path — but every price in it is now stale. GLM-5.2 repriced the whole market.
03 — The MarketplaceOpenRouter: 29 endpoints, one API key.
OpenRouter is where the price competition becomes visible. On July 3, 2026, its live endpoints API for z-ai/glm-5.2 returned 29 distinct listings across roughly 20 provider brands — DeepInfra, GMI Cloud, Novita, Fireworks, Wafer, Decart, Baidu, SiliconFlow, Together, Cloudflare, and more — several of them listing both a standard and a “fast” tier. One API key, one request format, twenty backends bidding for your traffic.
Because it is a marketplace, OpenRouter has no single GLM-5.2 price — treat every number as a range with a date on it. At our July 3 snapshot, the sub-sticker band ran $0.93–$1.20 per Mtok input across the discounted routes, while the full span stretched from $0.93 to $3.00 on input and $3.00 to $10.25 on output once premium latency tiers are included. These rates fluctuate: Novita’s OpenRouter route, for example, showed $1.09 in / $3.43 out — a live promotional rate roughly 22% below Novita’s own $1.40 / $4.40 sticker, not a list price. At the top end, Wafer’s “fast” tier listed $3.00 in / $10.25 out — about 2.1x Z.ai’s list input and 2.3x its output — sold on lowest end-to-end latency.
One caution on provider counts, because every source publishes a different number. OpenRouter’s raw JSON returned 29 endpoints across ~20 brands; DeepInfra’s July 1 blog counts 25 listed providers from a different snapshot; Artificial Analysis benchmarks 15 high-uptime routes; and models.dev shows 42 rows because it counts subscription SKUs and regional variants as separate entries. None of these is wrong — they measure different scopes — but they are not interchangeable, and any post quoting a single “X providers” headline without a source and date is already stale.
GLM-5.2 input price per Mtok · observed July 3, 2026
Source: OpenRouter live endpoints API + provider pages, observed July 3, 2026 — marketplace rates fluctuate04 — The ScoreboardEvery route worth pricing, triangulated.
The table below is the asset this post exists for. Every cell cross-references OpenRouter’s live endpoint JSON against the provider’s own pricing or model page, as of July 3, 2026 — no single published source combines both surfaces, which is exactly how promotional rates get miscited as list prices. Context is 1,048,576 tokens nearly everywhere; the columns that actually separate the routes are price, output cap, and quantization.
| Route | Input $/Mtok | Cached $/Mtok | Output $/Mtok | Max output | Quant | Notes |
|---|---|---|---|---|---|---|
| List-price routes — the $1.40 / $0.26 / $4.40 sticker | ||||||
| Z.ai (native API) | $1.40 | $0.26 | $4.40 | 131,072 | fp8 | The anchor row. Cached-input storage listed limited-time free; only route the GLM Coding Plan works against. |
| Fireworks AI | $1.40 | $0.26 | $4.40 | not stated | not stated | Day-zero launch partner; weights self-hosted on US infrastructure. Cursor’s GLM-5.2 provider. |
| Novita (own list) | $1.40 | $0.26 | $4.40 | 131,072 | fp8 | Matches the Z.ai sticker exactly on its own model page. |
| Below list — observed July 3, 2026 via OpenRouter (rates fluctuate) | ||||||
| DeepInfra Standard | $0.93–0.95 | $0.18 | $3.00 | 32,768 | fp4 | Cheapest active listing found — but the tightest output cap on the market. $0.93 on OpenRouter, $0.95 on DeepInfra’s own blog. |
| GMI Cloud | $0.98 | $0.182 | $3.08 | 131,072 | fp8 | Lowest blended price on Artificial Analysis’s tracker; keeps the 128K output ceiling. |
| Novita via OpenRouter | $1.09 | $0.20 | $3.43 | 131,072 | fp8 | Live promotional rate, ~22% off Novita’s own sticker — not a list price. |
| Wafer (standard) | $1.20 | $0.20 | $4.10 | 131,072 | fp4 | Mid-band discounter; also sells the priciest fast tier below. |
| Decart | $1.20 | $0.20 | $4.20 | 1,048,576 | fp4 | Lists the full context window as its completion ceiling — verify before relying on it. |
| Premium & speed tiers | ||||||
| DeepInfra Priority | $1.425 | $0.27 | $4.50 | not stated | fp4 | Exactly 1.5x its blog-stated Standard rate, for guaranteed throughput. |
| Fireworks Fast | $2.10 | $0.21 | $6.60 | not stated | not stated | 1.5x Fireworks’ standard OpenRouter listing ($1.40 / $0.14 / $4.40) — same multiplier pattern as DeepInfra. |
| Wafer Fast | $3.00 | $0.50 | $10.25 | 131,072 | fp4 | Highest-priced active listing — roughly 2.1x Z.ai’s list input, 2.3x its output — sold on lowest latency. |
One correction worth making explicit, because several published comparisons repeat it: DeepInfra is often placed mid-pack at around $1.20 in / $4.20 out. That figure actually matches Decart’s route ($1.20 / $4.20) and roughly Wafer’s standard rate ($1.20 / $4.10) — not DeepInfra’s. Three independent surfaces — DeepInfra’s own July 1 blog post, its model page, and OpenRouter’s live endpoint JSON — all put DeepInfra Standard at $0.93–0.95 in / $3.00 out. DeepInfra is the market floor, not a mid-tier option, and any comparison that says otherwise is working from a stale snapshot.
The structural read: the market has settled into three bands. A consensus sticker at $1.40 / $4.40 that Z.ai, Fireworks, and Novita all defend; a discount band between $0.93 and $1.20 where hosts buy traffic with promos and aggressive quantization; and a premium band at 1.5x-plus multiples where the product being sold is latency and guaranteed throughput, not tokens. Where you should sit in that stack depends entirely on workload shape — which is what the decision rows below are for.
05 — The Fine PrintOutput caps and quantization: the silent differentiators.
Price-per-token is the visible axis; the two invisible ones bite harder. First, output caps. Z.ai’s native route allows 128K output tokens, and most marketplace hosts mirror the 131,072 ceiling — Novita, Baidu, GMI Cloud, AtlasCloud, Alibaba, Venice, Wafer, and others. SiliconFlow lists 262,144. A handful — Morph, Friendli, Decart, Phala — list the full 1,048,576-token context as their completion ceiling. And DeepInfra, the cheapest route on the board, caps completions at 32,768 tokens: the lowest of any active listing found.
Second, quantization. OpenRouter’s metadata shows the fleet split roughly three ways: fp8 hosts (GMI Cloud, Novita, Baidu, SiliconFlow, Venice, and Z.ai’s own route as tagged), fp4 hosts (DeepInfra, Wafer, Decart, Parasail, Inceptron), and a group that discloses nothing (Together, Cloudflare, Friendli, Morph, DigitalOcean, among others). NVIDIA publishes an official NVFP4-quantized GLM-5.2 variant for datacenter deployment, so a vendor-blessed fp4 recipe exists — but that is not the same as a quality guarantee for any particular host’s implementation.
On quality impact, the honest answer is qualitative: general LLM literature suggests fp8 typically lands very close to the full-precision baseline, while fp4 is more implementation-dependent — and as of July 3, 2026, no GLM-5.2-specific fp8-vs-fp4 head-to-head benchmark had been published. Nobody can currently tell you what the fp4 discount costs in output quality, including the hosts selling it. If your workload is quality-sensitive, run your own evals on the exact route you plan to buy, not on Z.ai’s reference endpoint.
06 — Decision RowsPick your route by workload, not by headline.
Five buyer profiles cover nearly every team we have priced this market for. Match yours, then verify the live rate on the route before committing — every marketplace number in this post is a July 3, 2026 snapshot.
Cost-first batch work
DeepInfra Standard at $0.93–0.95 in / $3.00 out (fp4) is the cheapest active listing across three independent surfaces. Fits summarization, classification, and short-output codegen — anything comfortably under its 32,768-token output cap.
Sticker-stable first-party routes
Z.ai native and Fireworks both hold the $1.40 / $4.40 list with vendor-stated caps and first-party infrastructure — no marketplace re-routing, no promo churn repricing your workload mid-month. Fireworks says it validated GLM-5.2 independently before its day-zero launch.
Full-context completions
Z.ai’s 128K output cap is the dependable ceiling. Morph, Friendli, Decart, and Phala list the full 1,048,576-token context as their completion cap on OpenRouter — worth testing for extreme generation jobs, but verify the behavior before building on it.
Latency-sensitive UX
Artificial Analysis clocks Together AI as the fastest tracked route at 345.9 tokens/sec. If you need contractual throughput instead, that’s what the 1.5x tiers sell: DeepInfra Priority, Fireworks Fast, and Wafer Fast at the top of the price board.
Compliance-bound teams
Z.ai’s privacy policy states Singapore processing and no storage of API content, while its parent is Beijing-registered — legal domicile and data flow are separate facts. If policy requires US infrastructure, Fireworks self-hosts the weights, and it is the provider Cursor’s team says serves its GLM-5.2 traffic. MIT self-hosting removes external routing entirely.
"Your request runs on Fireworks infrastructure, through the Fireworks inference engine, on weights hosted in-house. The traffic is never forwarded anywhere."— Fireworks AI launch blog, June 16, 2026
The residency question is not hypothetical — it is being asked in public. A Cursor community-forum thread titled “Do GLM 5.2 requests go to China?” drew an official reply from a Cursor team member: “the provider for GLM 5.2 and also other Cursor models is Fireworks Ai” — pointing enterprise users at Fireworks’ own trust center rather than Z.ai. That is the pattern we expect more downstream products to follow: open weights served on domestic infrastructure, with the model’s origin and its data path cleanly separated. If your team needs help pressure-testing that kind of routing decision — model, host, and compliance story together — that comparative eval is exactly where our AI transformation engagements start.
07 — Beyond Pay-Per-TokenPlans, the free tier, and the self-hosting reality.
Metered tokens are not the only way in. For steady daily usage — especially agentic coding through Claude Code or similar tools — Z.ai’s subscription can undercut every per-token route on this board, and at the other extreme, the free Flash tier costs nothing at all.
GLM Coding Plan
Quota-based access on rolling 5-hour windows instead of metered tokens, working with 20+ coding tools including Claude Code. GLM-5.2 draws 3x quota at peak and 2x off-peak — with a promotional 1x off-peak rate Z.ai states runs through the end of September 2026. Billing-term discounts of 10-30% apply on top of list.
GLM-4.7-Flash
Z.ai’s official pricing table lists GLM-4.7-Flash and GLM-4.5-Flash as fully free across input, cached input, and output. A genuine no-cost tier for validating prompts and agent scaffolding before committing GLM-5.2 spend.
MIT weights, datacenter hardware
The weights are free to download; serving them is not. Full-precision VRAM needs run roughly 25x a single A100-80GB, and even the smallest 1-bit quant is ~217 GB. For most teams, the API routes above are the realistic path.
The break-even math between plan and metered billing depends on your usage shape — we run the full calculation in our Coding Plan value analysis. The short version: spiky or low-volume API usage favors metered tokens; daily agentic coding favors the plan. The plan also feeds Z.ai’s wider tooling — ZCode, the free desktop agentic development environment Z.ai shipped the week of July 1, 2026, runs on the same quotas, with a 1.5x-quota launch promotion Z.ai states expires July 31, 2026 (our ZCode guide covers it in full).
If the subscription math fits your usage, you can start a GLM Coding Plan directly on Z.ai — every tier runs against the same native endpoint this comparison prices. Referral link: we earn Z.ai platform credits if you subscribe, and new Z.ai accounts get 10% off their first subscription order.
And if you are tempted by the MIT license to run the model yourself, read why self-hosting GLM-5.2 is unrealistic for most teams first — the hardware bill dwarfs years of API spend at these per-token prices.
08 — ConclusionThe cheapest route is a spec sheet, not a number.
Buy the route that matches your workload — then re-check the price next month.
The GLM-5.2 access market is genuinely competitive three weeks after the MIT weights dropped: a consensus sticker at $1.40 / $4.40 held by Z.ai, Fireworks, and Novita; a discount band down to $0.93 where DeepInfra sits as the verified floor; and premium tiers selling latency at up to $3.00 in / $10.25 out. The spread between the floor and the ceiling is more than 3x on input — for the same weights.
But every cheap route carries a spec sheet. DeepInfra’s floor price comes with a 32,768-token output cap and fp4 quantization whose quality delta nobody has published. Marketplace promos like Novita’s ~22% discount reprice without notice. If your workload needs the full 128K output, predictable behavior, or a clean US data path, the right answer is a list-price route — and that is fine, because list is still roughly 3.6x cheaper than Opus 4.8 on input. GLM-5.2 earns its slot as the high-volume workhorse in a paired stack; the frontier model keeps the long-horizon agent work.
Looking forward, expect the discount band to keep compressing while the 1.5x speed-tier multiplier holds — that pattern (DeepInfra Priority, Fireworks Fast) already looks like market convention, and promotional churn will only accelerate as more hosts fight for OpenRouter’s default-route traffic. Plan your budget on list prices, treat anything below them as a bonus you re-verify monthly, and pin your provider so a marketplace repricing never silently changes what model quality you are actually buying.