The GPT-5.2 to GPT-5.5 migration is not a string-replace on a model identifier. Reasoning-effort defaults shifted, the tool-call schema discriminates errors differently, the cost curve introduces a Pro tier, and structured outputs finally graduate from beta to production. Teams that treat the upgrade as a measurement exercise rather than a search-and-replace ship cleanly; teams that don't end up debugging mystery cost spikes on a Tuesday.
The change set is small in word count and large in surface area. Each axis — reasoning effort, tool-call schema, cost curve, structured outputs — interacts with the others in ways that are easy to miss when you treat them in isolation. A reasoning-effort default that shifts from low to medium changes both quality and cost; a stricter tool-call schema changes error-handling code that was happily silent under the old contract; the new Pro tier invites you to spend more, and frequently it's the right call. None of these are crises; all of them are easy to mishandle.
This playbook covers what actually changed, the per-axis migration steps, the phased-rollout pattern that beats org-wide cut-overs, and four common pitfalls we've seen first-hand on real production workloads. Treat it as a checklist for the next two sprints — not as a one-pager to skim and forget.
- 01. Reasoning-effort default changed — measure before assuming costs are stable. The default mode on many endpoints is now medium rather than low. That single shift can change a workload's per-call output token count meaningfully, so measure before assuming your bill stays put.
- 02. Tool-call schema is stricter in 5.5. Structured inputs are validated harder, and error results are discriminated as a separate result type. Audit your error handling — code that worked because the old schema was permissive may now surface caught errors as uncaught.
- 03. The Pro tier is a viable option for high-reasoning workloads. Cost-per-correct-answer often wins over cost-per-token. For workloads where the model retries on its own under medium effort, paying more per call on Pro to get a single-shot correct answer can be the cheaper end state.
- 04. Structured outputs are stable enough to remove your JSON-mode fallback. JSON-mode fallback patterns from the 5.2 era can be retired. Structured outputs in 5.5 are production-stable and simplify the wrapper code that used to exist purely to defend against malformed JSON.
- 05. Phased rollout per workload beats org-wide cut-over. Don't migrate everything at once. Shadow-test on one workload, measure quality and cost deltas, then graduate to the next. Org-wide flag flips create regressions that are hard to localise.
01 — What's New
5.5 ships in three axes — reasoning, tools, cost.
GPT-5.5 is the kind of release that reads small on the changelog and lands large in production. The headline capability gains — stronger code reasoning, cleaner long-context behaviour, a broader native tool surface — are the parts everyone notices. The migration cost lives elsewhere, in three quieter axes that shift defaults rather than features.
Axis one is reasoning effort. The low / medium / high knob existed in 5.2, but the default behaviour and the cost curve at each level have both moved. Endpoints that used to default to low now default to medium; the per-call output token distribution under each level shifts; and the recommended mapping between business workload class and effort level looks different than it did six months ago.
Axis two is the tool-call schema. Structured inputs are validated more strictly, the error-result type is discriminated as a first-class variant rather than a string field, and the failure modes when your tool-call payload is slightly malformed have changed from soft warnings to hard errors. This is the axis that catches teams off-guard most often, because code that ran clean against the permissive 5.2 schema can surface real bugs that were always there but never triggered.
Axis three is the cost curve. The introduction of the Pro tier alongside Standard and Mini reshuffles the decision tree on which tier serves which workload. Pro is not simply "more expensive" — it shifts the cost basis from per-token to per-correct-answer, and for high-reasoning workloads where the model retries internally under medium effort on Standard, Pro can land cheaper in end-state spend.
The release also moves structured outputs from beta to GA — covered in Section 05 — and adjusts a handful of less-trafficked parameters that most teams will never notice. The four sections after this one cover the three primary axes plus structured outputs, with the phased-rollout pattern and the common pitfalls closing out the playbook.
02 — Reasoning Effort
Low, medium, high — defaults and curves shift.
The reasoning-effort knob is the single most consequential migration concern. In GPT-5.2, endpoint defaults skewed toward low — short responses, minimal internal chain-of-thought, cheapest path. In GPT-5.5, most endpoints default to medium — longer internal deliberation, more output tokens, and a meaningfully different shape of answer for the same prompt. Teams that don't set the effort level explicitly are now silently consuming more reasoning per call than they were last sprint.
That shift is intentional and on the whole an improvement — medium-effort answers are typically more accurate, especially on multi-step problems. But "more accurate at higher cost, silently" is the worst-case migration path: budgets drift, latency creeps, and nobody connects it to the release until month-end billing surfaces the gap.
The right move is to pick the effort level explicitly for each workload class, and to verify the choice against measurement. The matrix below sketches the mapping we recommend across four common workload archetypes.
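As a concrete starting point, here is a minimal sketch of pinning the effort level explicitly per workload class, assuming an OpenAI-style SDK where effort is a per-request `reasoning.effort` parameter — the exact field names on a GPT-5.5 endpoint, and the model identifier itself, are assumptions. The matrix that follows gives the recommended mapping.

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Effort level per workload class -- never rely on the platform default,
// which shifted from low to medium in 5.5.
const EFFORT_BY_WORKLOAD = {
  extraction: "low",
  triage: "medium",
  hardReasoning: "high",
} as const;

async function classify(input: string) {
  return client.responses.create({
    model: "gpt-5.5", // hypothetical identifier
    reasoning: { effort: EFFORT_BY_WORKLOAD.extraction }, // pinned, not defaulted
    input,
  });
}
```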
Routine extraction & classification — pin to low
Pull a structured value from a known prompt shape — entity extraction, sentiment, tagging. The low effort level handles these cleanly in 5.5; explicitly pin to low and audit that cost stays roughly flat with 5.2.

Multi-step analysis with moderate ambiguity — use medium (default)
Customer triage, code review, structured retrieval over ambiguous corpora. Medium is the right default — accept the cost lift; the accuracy gain typically pays back through fewer retries and clearer downstream decisions.

Hard reasoning & adversarial inputs — pin to high
Code with tricky concurrency, legal-style document analysis, complex agentic planning. High effort earns its keep here; under medium the model can compound small errors across hops, while high tends to recover.

Cost-sensitive bulk processing — pin low + sample medium
Tens of thousands of calls per day for moderate-complexity tasks. Pin to low explicitly and route a sampling fraction (5-10%) through medium for quality measurement — that gives you ground truth on whether the cost saving is costing you accuracy.

The measurement step matters as much as the matrix itself. For each workload, run a shadow batch under the old default and under the new default for the same prompts, then compare: per-call output token count, end-to-end latency, downstream accuracy on a labelled sample, and cost per call. Two of those four are visible without ground-truth labels (tokens, latency), so even teams without an evaluation set can take meaningful measurements before flipping any flag.
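A minimal sketch of that shadow batch, with an abstract `runCall` standing in for whatever client wrapper you use — the function names and metric shape are illustrative, not a prescribed harness:

```typescript
// Shadow batch: same prompts under pinned "low" (the 5.2-era default) and
// pinned "medium" (the 5.5 default), comparing the two label-free metrics.
type Sample = { outputTokens: number; latencyMs: number };
type Effort = "low" | "medium";

async function shadowCompare(
  prompts: string[],
  runCall: (prompt: string, effort: Effort) => Promise<Sample>,
) {
  const byEffort: Record<Effort, Sample[]> = { low: [], medium: [] };
  for (const prompt of prompts) {
    for (const effort of ["low", "medium"] as const) {
      byEffort[effort].push(await runCall(prompt, effort));
    }
  }
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  for (const effort of ["low", "medium"] as const) {
    const s = byEffort[effort];
    console.log(
      effort,
      "mean output tokens:", mean(s.map((x) => x.outputTokens)).toFixed(1),
      "mean latency ms:", mean(s.map((x) => x.latencyMs)).toFixed(0),
    );
  }
}
```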
A subtle point that catches teams: the effort knob interacts with the model tier. Medium effort on the Standard tier and medium effort on the Pro tier are not the same thing — Pro allocates more reasoning budget per effort level than Standard does, so a workload that needs high effort on Standard may run cleanly on medium under Pro. Section 04 picks this thread up.
"Measure before you migrate. The single highest-leverage move in this rollout is to capture per-call output token counts under both the old and the new default — that one number tells you most of the story."— Digital Applied agentic engineering team
03 — Tool-Call Schema
Structured inputs and error-result discrimination.
Tool calls in GPT-5.5 are typed harder than in 5.2. Two changes matter for migration: the model is more strict about the shape of structured arguments, and the result type for failed tool calls is now a discriminated variant rather than an overloaded string. Both changes are improvements on the steady state, and both produce a transition period in which previously silent bugs surface as visible errors.
Stricter argument validation
In 5.2, sending a tool call argument that didn't exactly match the declared schema — a number where a string was expected, a missing optional field — could pass through as a soft coercion. In 5.5, the same shape lands as a validation error. The behaviour is more correct; the migration cost is real, because any tool whose arguments were quietly being massaged now needs its schema declaration cleaned up.
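As an illustration — the tool and its fields are hypothetical — here is a declaration cleaned up to the shape a strict validator expects, with every field's type and requiredness made explicit so there is nothing left to coerce:

```typescript
// A tool whose arguments were quietly massaged under 5.2. The tightened
// declaration leaves no room for soft coercion under the stricter 5.5
// validation. Tool shape follows the familiar JSON Schema function-tool
// format; the tool itself is illustrative.
const lookupOrderTool = {
  type: "function",
  name: "lookup_order",
  parameters: {
    type: "object",
    properties: {
      order_id: { type: "string" },          // was untyped; 5.2 coerced numbers
      include_history: { type: "boolean" },  // was implicitly optional
    },
    required: ["order_id", "include_history"],
    additionalProperties: false,             // reject stray fields up front
  },
};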
Discriminated error results
Tool-call results in 5.5 carry an explicit type discriminator — a result is either a success variant carrying the tool's return payload or an error variant carrying a structured error description. In 5.2, errors were typically squeezed into a string field on the success path; consumers learned to string-match for "error" prefixes or to parse JSON inside the success body. That code becomes unnecessary in 5.5 and, more importantly, can become actively wrong: an error result that the 5.2 consumer treated as success becomes a different code path entirely under the new schema.
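A sketch of the consumer-side fix. The exact discriminator and field names on the real result type are an assumption; the shape of the change — a typed union with an explicit error branch, replacing string-matching on the success path — is the point:

```typescript
// 5.5-style result: a discriminated union rather than a string smuggled
// through the success path. Field names are illustrative.
type ToolResult =
  | { type: "success"; payload: unknown }
  | { type: "error"; code: string; message: string };

function consumeToolResult(result: ToolResult) {
  switch (result.type) {
    case "success":
      return handlePayload(result.payload);
    case "error":
      // First-class outcome: structured log, monitoring, per-code retry.
      console.error("tool error", result.code, result.message);
      return scheduleRetry(result.code);
  }
}

// Stand-ins for application code.
declare function handlePayload(payload: unknown): void;
declare function scheduleRetry(code: string): void;
```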
Migration steps for each tool
For every tool registered in your agent surface: re-validate the JSON Schema for the argument shape against the strictest interpretation; add explicit handling for the error variant wherever the tool result is consumed; remove ad-hoc error-string detection in the consumer; and add a shadow assertion that no tool call ever returns the error variant under nominal load — when one does, you've found a real production bug that 5.2 was hiding.
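A minimal sketch of that shadow assertion, with a console warning standing in for your monitoring client and a baseline rate that is purely illustrative:

```typescript
// Track the error-variant rate per tool and alert when it exceeds the
// nominal-load baseline. A spike here is a real bug the permissive 5.2
// schema was hiding, not a model regression.
const BASELINE_ERROR_RATE = 0.001; // assumed baseline -- tune per tool
const MIN_SAMPLE = 100;            // don't alert on tiny samples

const counts = new Map<string, { calls: number; errors: number }>();

function recordToolResult(tool: string, isError: boolean) {
  const c = counts.get(tool) ?? { calls: 0, errors: 0 };
  c.calls += 1;
  if (isError) c.errors += 1;
  counts.set(tool, c);
  const rate = c.errors / c.calls;
  if (c.calls >= MIN_SAMPLE && rate > BASELINE_ERROR_RATE) {
    console.warn(`error-variant rate ${rate.toFixed(4)} on ${tool} exceeds baseline`);
  }
}
```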
Audit every tool schema

JSON Schema validation pass — pre-flight check
Run each declared schema through a strict validator. Tighten optional-vs-required, fix number-vs-string mismatches, remove permissive any types. Cheaper to fix in the schema than in production error handling.

Discriminate error variants — result.type === 'error'
Add explicit branches at every tool-result consumption site. Treat the error variant as a first-class outcome with its own observability — log the structured error payload, surface to monitoring, decide retry policy per error code. Replaces string-matching.

Shadow-assert nominal paths — alert on error-rate spikes
Add an instrumented assertion: under nominal load, the error-variant rate stays below baseline. A spike indicates a real bug surfaced by the stricter schema rather than a regression in the model. Bug discovery, not regression.

04 — Cost Curve
Per-tier pricing and the Pro tier introduction.
The cost curve is where the "axis three" story gets interesting. GPT-5.5 introduces a Pro tier alongside the existing Standard and Mini tiers. Pro is not simply "Standard with more compute" — it shifts the cost-per-call basis in a way that changes which workloads land cheapest under which configuration. Understanding the new decision tree is the difference between a clean migration and a bill spike that nobody can localise.
The illustrative bars below sketch indicative relative pricing across the three tiers for a same-prompt, same-effort workload. Treat the exact numbers as illustrative — always confirm current pricing against OpenAI's pricing page before making procurement decisions. The shape of the curve is the point.
[Chart: Indicative relative cost per call · GPT-5.5 tier × effort matrix. Source: illustrative relative pricing — confirm current rates against OpenAI's pricing page.]

The non-obvious insight is the comparison between Standard at high effort and Pro at medium effort. The two are close on sticker price, and for workloads where the Standard-high configuration tends to retry internally before landing on a correct answer, Pro-medium often arrives at the same answer in fewer total tokens — making the apparent premium illusory once you measure cost-per-correct-answer rather than cost-per-token.
That reframing — cost-per-correct-answer — is the right way to evaluate Pro. For workloads where the bill is dominated by retries, fallbacks, or downstream review cycles caused by wrong answers, paying more per call on Pro to get a clean single-shot answer can land cheaper end-to-end. For workloads where the model is already nailing the first attempt under Standard, Pro is dead weight — same answer, higher cost. The phased rollout in Section 06 is how you tell the two cases apart on your own workloads.
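The arithmetic is worth writing down. With purely hypothetical per-call prices and first-shot accuracy rates — confirm real rates against OpenAI's pricing page — the reframing looks like this:

```typescript
// Cost-per-correct-answer: price the retries in, not just the call.
// Assumes independent attempts, so expected calls until success is 1 / p.
function costPerCorrectAnswer(costPerCall: number, firstShotAccuracy: number) {
  return costPerCall / firstShotAccuracy;
}

// Hypothetical figures for one workload -- not real pricing.
const standardHigh = costPerCorrectAnswer(0.04, 0.75); // ~$0.053 per correct answer
const proMedium = costPerCorrectAnswer(0.05, 0.97);    // ~$0.052 per correct answer

console.log({ standardHigh, proMedium }); // the sticker premium vanishes
```

Under these assumed numbers the tier that costs 25% more per call is the cheaper end state — which is exactly why the measurement, not the price list, has to make the call.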
Mini is the third leg of the stool. For high-volume, low-complexity tasks — bulk classification, entity extraction, language detection — Mini is dramatically cheaper than Standard and typically performs within a margin of error on those task classes. Teams running large extraction pipelines should expect Mini to be a strict cost win on most of their traffic, with Standard reserved for the trickier residual.
Mini — bulk processing tier
Cheapest path for high-volume, low-complexity workloads. Pair with low effort for extraction, classification, simple summarisation. Sample 5-10% through Standard for quality measurement. Cost-dominant for bulk.

Standard — default production tier
Right call for most production workloads. Pick the effort level explicitly per workload class — the matrix in Section 02 gives the starting point. Medium is the new default and is right more often than not.

Pro — frontier reasoning tier
Worth it when cost-per-correct-answer beats cost-per-token. Hard reasoning, complex agentic planning, workloads where retry cycles dominate the bill. Measure end-to-end spend, not sticker price. Per-correct-answer logic.

05 — Structured Outputs
From beta to production-ready.
Structured outputs were a beta feature in 5.2 — useful, but with enough edge cases that most production teams kept their JSON-mode fallback wrapper in place. In 5.5, structured outputs graduate to general availability with strict-mode validation on, robust handling of nested schemas, and reliable behaviour on long generation. The practical consequence is that the fallback wrappers your team wrote in 2025 can be deleted.
That deletion is not just code hygiene. JSON-mode fallback wrappers add latency (a retry path that occasionally fires), a failure mode (malformed JSON that the wrapper's parser doesn't handle), and review burden (the code is plumbing nobody enjoys maintaining). Removing the wrapper makes the integration thinner, faster, and easier to reason about.
What "production-ready" means here
Three properties have firmed up since 5.2: schema-conformance is essentially deterministic — outputs match the declared schema or the call returns an explicit error rather than silently corrupted JSON; nested object validation is reliable even on deeply nested types; and long-generation cases (large arrays, long strings) no longer truncate mid-token in ways that produce invalid output. Those three together are what separate a beta feature from production-ready plumbing.
Migration steps
For each integration point: replace the legacy JSON-mode request shape with a structured-output schema declaration; remove the wrapping try-catch on JSON.parse; remove the retry-on-parse-error path; and audit downstream consumers that may have been quietly tolerating slightly-malformed payloads. The audit step matters — code that swallowed bad JSON gracefully under the old contract may now see no payload at all when the model would have previously emitted a corrupted one.
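A before/after sketch of one integration point. The request shape mirrors today's JSON-schema response format in the OpenAI SDK; whether a GPT-5.5 endpoint keeps these exact field names is an assumption, and the invoice schema is hypothetical:

```typescript
// After (5.5): declare the schema on the request and trust the contract --
// a conforming object or an explicit error, never corrupted JSON. The
// 5.2-era wrapping try/catch and retry-on-parse-error path are deleted.
import OpenAI from "openai";

const client = new OpenAI();

const invoiceSchema = {
  type: "object",
  properties: {
    vendor: { type: "string" },
    total_cents: { type: "integer" },
  },
  required: ["vendor", "total_cents"],
  additionalProperties: false,
};

async function extractInvoice(text: string) {
  const response = await client.responses.create({
    model: "gpt-5.5", // hypothetical identifier
    input: text,
    text: {
      format: { type: "json_schema", name: "invoice", schema: invoiceSchema, strict: true },
    },
  });
  // No defensive parse wrapper: output either conforms or the call errors.
  return JSON.parse(response.output_text) as { vendor: string; total_cents: number };
}
```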
06 — Phased Rollout
Shadow-test before cut-over.
The single biggest migration mistake is the org-wide flag flip on a Friday afternoon. Frontier-model upgrades interact with production code in ways that are hard to predict from the release notes alone, and a regression in one workload during a global cut-over typically gets blamed on whichever workload happened to be looked at first — not on the actual culprit. Phased rollout per-workload eliminates that whole class of mystery debugging.
The four-phase shape below is the pattern we run on every migration engagement. It typically spans one to two sprints depending on workload count, and the parallel shadow phase does most of the de-risking work.
Phase 01 — Shadow in parallel (5.2 + 5.5, same prompts, no traffic shift)
Run both models in parallel on the same inputs. Capture per-call output tokens, latency, schema-validation outcomes, and (where available) downstream quality signals. Zero production risk; full measurement. Duration: 3-5 days per workload.

Phase 02 — Measure & decide per workload (decision matrix, not vibes)
Compare phase-01 metrics against a written acceptance threshold per workload — cost delta, latency delta, quality delta. Document the decision: migrate now, migrate with effort adjustment, or defer with a reason. Written threshold, not gut feel.

Phase 03 — Canary a slice (5-10% of production traffic)
Route a small fraction of real traffic to 5.5 for the workloads that passed phase 02. Watch error rates, p95 latency, and cost telemetry for at least 48 hours before widening. Cheap insurance against measurement gaps in the shadow. 48-hour minimum hold.

Phase 04 — Full cut-over + rollback drill (100% traffic, tested rollback)
Move to 100% on the workload and immediately test the rollback path. The rollback procedure is only real if you've executed it under nominal load — a paper rollback is a wish, not a procedure. See Section 07 for the rollback specifics. Test the rollback, don't just write it.

The order is load-bearing. Skipping the shadow phase saves a week and surfaces problems in production instead. Skipping the canary phase saves another day and removes your last cheap chance to spot regressions before they affect every user. Skipping the rollback drill saves an hour and leaves you with a written procedure that turns out not to work on the afternoon you most need it.
For teams without the bandwidth to run all four phases on every workload, the right compression is to skip phase 03 on workloads with strong phase-02 signal — large cost delta in favour of 5.5, no quality regression, low business risk if something goes wrong. Never skip phases 01 or 04. Our AI digital transformation engagements run this pattern across the migration window and own the measurement step end-to-end.
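For the phase-03 canary, a deterministic hash-based split keeps a given request key on a consistent model while routing a fixed slice to 5.5. A minimal sketch with illustrative identifiers and fractions:

```typescript
import { createHash } from "node:crypto";

const CANARY_FRACTION = 0.05; // widen only after the 48-hour hold

// Stable bucketing: the same workload + request key always lands on the
// same model, so user experience stays consistent within the canary.
function pickModel(workload: string, requestKey: string): string {
  const digest = createHash("sha256")
    .update(`${workload}:${requestKey}`)
    .digest();
  const bucket = digest.readUInt32BE(0) / 0xffffffff; // uniform in [0, 1]
  return bucket < CANARY_FRACTION ? "gpt-5.5" : "gpt-5.2"; // hypothetical ids
}
```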
07 — Common Pitfalls
Four ways the migration surprises.
Four failure modes recur across the migrations we've run. None are show-stoppers; all are easier to pre-empt than to debug.
Pitfall 01 — The silent cost lift from default-shift
Teams that don't set reasoning_effort explicitly run into the medium-default cost lift two to four weeks after migration, typically surfacing at month-end billing review. The fix is preventive: set effort explicitly for every workload as part of the migration, never rely on the platform default. The cost itself isn't the problem; the surprise is.
Pitfall 02 — Tool-call error-variant ignored
Code that worked under 5.2 because error results came back as strings on the success path can silently swallow real errors under the 5.5 discriminated schema. The error variant arrives, the consumer's success-path handler runs, and the downstream effect is a half-completed transaction or a mis-stored record. Mitigation is the schema audit in Section 03 — every tool-result consumer must explicitly handle the error variant.
Pitfall 03 — Pro tier adopted without measurement
Pro is genuinely the right call for some workloads, and a waste for others. The pitfall is rolling out Pro broadly without running the cost-per-correct-answer measurement — teams pay 12× for workloads where Standard would have been indistinguishable in outcome. Always shadow Standard against Pro on the same prompts before moving budget.
Pitfall 04 — Structured-output fallback left in place
The JSON-mode fallback wrapper from the 5.2 era keeps working, so nothing breaks if you leave it in. But it adds latency, a failure mode, and review burden. Six months after the migration, the team has half-deprecated wrappers in five integration points and nobody remembers why. Delete the wrappers as part of the migration, not as a follow-up that never happens.
For the first 72 hours after cut-over, keep the 5.2 model identifier wired behind a single environment variable per workload. Rollback = flip the variable, redeploy, done. Test the path during phase 04 of the rollout — a procedure you've never executed is not a procedure. After 72 hours without incident, the variable can be retired in a follow-up PR. For broader context on AI-in-production patterns, our Codex test-generation pipeline tutorial covers the same fail-open posture applied to CI workflows, and our LLM API pricing index gives the current cost-per-token landscape across providers.
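A minimal sketch of that rollback lever, with illustrative variable and workload names:

```typescript
// One environment variable per workload selects the model identifier.
// Rollback is flip-and-redeploy -- no code change, no PR review on the
// worst afternoon of the quarter.
const MODEL_BY_WORKLOAD: Record<string, string> = {
  triage: process.env.TRIAGE_MODEL ?? "gpt-5.5",
  extraction: process.env.EXTRACTION_MODEL ?? "gpt-5.5",
};

export function modelFor(workload: string): string {
  const model = MODEL_BY_WORKLOAD[workload];
  if (!model) throw new Error(`unknown workload: ${workload}`);
  return model;
}

// Rollback drill (phase 04): set TRIAGE_MODEL=gpt-5.2, redeploy, verify,
// then set it back -- executed once under nominal load, not just written.
```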
For most teams running 5.2 in production today, the GPT-5.5 migration is a one-to-two-sprint exercise that pays back through cleaner tool-call error handling, simpler structured-output plumbing, and a more honest cost picture. Run the measurement, follow the phases, set the defaults explicitly, and the migration is uneventful — which is exactly what you want from a frontier-model upgrade. For deeper background on the agentic-engineering side of these migrations, see our LLM API pricing index for the broader cost landscape and the Codex test-generation pipeline tutorial for the fail-open patterns that apply to any AI-in-production rollout.
Migration cadence beats version-bump anxiety — phased rollouts win every time.
GPT-5.5 is the kind of release that rewards discipline. The capability gains are incremental; the default shifts are substantive. Teams that treat the migration as a measurement exercise — shadow-test, decide per workload, canary, cut-over, rollback drill — ship cleanly and capture the genuine wins (cleaner error handling, simpler structured outputs, honest cost decisions). Teams that flip the model identifier in a config file and hope for the best spend the next month debugging mystery bills and silent regressions.
The broader pattern is more interesting than the specific release. Frontier models will keep shipping point upgrades quarterly for the foreseeable future, and the migration playbook generalises: shadow in parallel, measure per workload, set defaults explicitly, audit the surface area that just got stricter, delete the plumbing that just became unnecessary, and never run a global cut-over on a Friday. That cadence is what separates teams that benefit from frontier upgrades from teams that flinch every time the release notes drop.
One sprint of measurement saves a quarter of mystery debugging. The migration is small if you treat it as small; it becomes large the moment you skip the phases. Pick the workloads, run the shadow, write the thresholds, execute the canary, drill the rollback. The release is uneventful — and uneventful is exactly the goal.