OpenAI Codex Record & Replay lets you demonstrate a macOS workflow one time — Codex watches, then writes a reusable SKILL.md you can run on demand with new inputs. Announced on June 18, 2026 in the Codex desktop app (version 26.616), it turns a one-off demo into an inspectable, editable automation, no scripting required.
The reason this matters is not that screen recording is new — it is that the recorder now captures intent in natural language rather than pixel coordinates. That shift is what separates Record & Replay from a decade of brittle macro tools, and it is why knowledge workers, not just developers, are the audience OpenAI is courting.
This guide covers exactly what shipped, how the demo-to-skill loop works, why the architecture differs from traditional RPA, the reliability math that decides which tasks are safe to automate, six office tasks worth recording first, and the launch catches — macOS-only and a regional exclusion — that European teams need to know about before they plan around it.
- 01 One demo becomes a reusable skill. You perform a workflow once on macOS; Codex turns the recording into a SKILL.md file that specifies when to use the workflow, what inputs it accepts, the steps, and how to verify success. You then run it on demand with new inputs.
- 02 It captures intent, not coordinates. Recording stores actions and window content as JSON that preserves semantic intent rather than pixel positions. At replay, Codex interprets those steps against the current screen, which is why it survives small UI changes that break classic macro recorders.
- 03 SKILL.md is a shared, portable format. The skill file is a human-readable, LLM-interpretable document. SKILL.md is an open format adopted across several major AI coding tools, so skills are not locked to a single vendor, a genuinely useful property when choosing a toolchain.
- 04 Reliability compounds across steps. AI agents averaged 66% success on the OSWorld benchmark in the 2026 AI Index. Short, stable workflows replay well; long ones decay fast because per-step error multiplies. Pick tasks accordingly.
- 05 Launch limits are real but framed as current. At launch Record & Replay is macOS-only and unavailable in the EEA, UK, and Switzerland, and it requires Computer Use. These are current constraints, not stated as permanent, so verify your platform and region before planning.
01 — What Shipped
A demo-to-skill recorder, live in the Codex desktop app.
Record & Replay shipped on June 18, 2026 in the Codex desktop app at version 26.616. The premise is simple: you start a recording, perform a recurring task on your Mac — filing an expense report, submitting a time-off request — and stop the recording when you’re done. Codex turns that single demonstration into an inspectable, editable skill it can run again later with different inputs. The same 26.616 update also added bulk actions for automation run history and the ability to hand a thread off between local and remote hosts.
The feature is available to ChatGPT Plus, Pro, Business, Enterprise, and Edu subscribers. The Codex app itself is free to download, but Record & Replay requires a paid plan — Plus at $20/month, Pro at $200/month, or one of the Business, Enterprise, or Edu tiers. You trigger a recording from the Plugins panel in the desktop app (Plugins, then the “+”, then “Record a skill”) and replay it by referencing the skill name in a new thread.
Show it once
You control when recording starts and stops. Codex captures your actions and window content, storing the workflow as JSON that preserves semantic intent rather than pixel coordinates. Maximum session length is 30 minutes.
Run it on demand
Reference the skill by name in a new thread with new inputs. Codex uses whatever tools are available — Computer Use for desktop GUI control, browser actions for web tasks, and installed plugins for integrations like Slack, Gmail, Notion, and Salesforce.
Context for the timing: OpenAI Codex surpassed 5 million weekly active users as of June 2, 2026, up roughly 6x since the desktop app launched in February 2026. Knowledge workers now make up about 20% of Codex users and are reportedly growing more than 3x faster than developer users. Record & Replay is squarely aimed at that growth: the people most likely to record a repetitive office task are not the engineers who could script it, but the operations, marketing, and finance staff who currently do it by hand.
02 — How It Works
One demonstration becomes an editable SKILL.md.
The artifact Record & Replay produces is a SKILL.md file: a human-readable, LLM-interpretable instruction document. It specifies when to use the workflow, what variable inputs it accepts, what steps to follow, and how to verify successful completion. Because it is a plain document, you can open it, read it, and edit it by hand — fix a mislabelled step, tighten a verification check, or generalise an input — without re-recording from scratch.
Under the hood, recording uses the macOS Screen Recording and Accessibility APIs to capture both your actions and the on-screen window content. That data is stored as JSON that preserves semantic intent — “click the Submit button,” not “click at pixel 412, 880.” At replay time Codex loads the skill and interprets those steps against the current state of the screen, choosing tools as it goes.
How Codex decides when to run a skill
A SKILL.md file requires two frontmatter fields: a name (the skill identifier) and a description (the trigger signal Codex uses to decide when to invoke the skill). To keep its context window lean, Codex initially loads only the name, description, and file path of each skill — keeping the initial skills list under roughly 8,000 characters, about 2% of the context window. The full SKILL.md content is only loaded once a skill is actually selected.
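The two required fields and the lazy-loading behaviour described above can be sketched in a few lines. This is a minimal illustration, not Codex's implementation: it assumes YAML-style frontmatter delimited by `---` lines with simple `key: value` pairs, and the names `SkillSummary`, `parse_frontmatter`, `summarize`, and `listing`, as well as the example skill, are hypothetical. The 8,000-character budget is the approximate figure quoted in the text.

```python
from dataclasses import dataclass
from pathlib import Path

REQUIRED_FIELDS = ("name", "description")
LISTING_BUDGET_CHARS = 8_000  # approximate figure quoted in the article


@dataclass(frozen=True)
class SkillSummary:
    """The only data kept for the initial listing (hypothetical type)."""
    name: str
    description: str
    path: Path


def parse_frontmatter(text: str) -> dict[str, str]:
    """Read simple `key: value` pairs between the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---' frontmatter fence")
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # stop here: the skill body is never parsed
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    raise ValueError("missing closing '---' frontmatter fence")


def summarize(path: Path, text: str) -> SkillSummary:
    """Validate the required fields and keep only what the listing needs."""
    fields = parse_frontmatter(text)
    missing = [f for f in REQUIRED_FIELDS if not fields.get(f)]
    if missing:
        raise ValueError(f"{path}: missing required field(s): {', '.join(missing)}")
    return SkillSummary(fields["name"], fields["description"], path)


def listing(summaries: list[SkillSummary]) -> str:
    """Render the compact listing and check it against the character budget."""
    text = "\n".join(f"{s.name}: {s.description} ({s.path})" for s in summaries)
    if len(text) > LISTING_BUDGET_CHARS:
        raise ValueError(f"listing is {len(text)} chars, over the budget")
    return text


# Hypothetical skill file used only for illustration.
EXAMPLE = """---
name: file-expense-report
description: Use when the user asks to submit an expense report with a receipt.
---
# Steps
1. Open the expense portal.
"""

summary = summarize(Path("skills/file-expense-report/SKILL.md"), EXAMPLE)
print(listing([summary]))
```

The design point is that the description is the only text available when a skill is being matched to a task, so a vague description costs discoverability even if the steps themselves are well written.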
You can invoke a skill explicitly with a $skill-name reference or the /skills command, or implicitly when Codex matches your task description to a skill’s description field. Skills resolve through a storage hierarchy in priority order: a repo’s .agents/skills in the current directory, then parent and root repo folders, then a user-level $HOME/.agents/skills, then an admin-managed /etc/codex/skills, then the system-bundled defaults.
name + description
Every SKILL.md needs a name (the identifier) and a description (the trigger signal Codex matches against your task). The description is what makes a skill discoverable, so write it as a clear when-to-use sentence.
Lazy skill loading
Codex lists only name, description, and path for each skill — keeping the initial listing near 8,000 characters, about 2% of context. Full step content loads only when a skill is selected, so many skills stay cheap.
Per-session ceiling
A single recording session caps at 30 minutes. That is plenty for most office tasks and is itself a nudge toward short, well-scoped workflows — which, as the next section shows, are also the ones that replay most reliably.
03 — Architecture
Why this isn’t the macro recorder you remember.
Programming by demonstration is not a new idea — researchers have chased it since the mid-1980s. Traditional robotic process automation (RPA) tools such as UiPath and Automation Anywhere are its commercial descendants, and they share a well-known weakness: they capture pixel coordinates and rigid selectors that break the moment a button moves, a layout reflows, or a web app ships a redesign. Generalising from a single demonstration to a slightly different screen was the problem that defeated those systems.
Record & Replay routes that generalisation through a language model instead. The SKILL.md captures intent in natural language, and at replay Codex interprets that intent against whatever the screen actually looks like now. That is the meaningful architectural distinction: the recorder stopped storing where you clicked and started storing what you were trying to do.
“OpenAI Codex shipped Record & Replay. This is the first Codex update that makes employee replacement easy to picture. You perform a workflow once on your Mac. Codex watches, turns it into a reusable Skill, then runs the same workflow next time with new inputs.”
Kai (@hqmank), community observer, June 20, 2026
04 — Reliability
The reliability math that decides what to automate.
Before recording anything, it helps to understand where these agents still fail. According to the Stanford 2026 AI Index, AI agents achieved an average 66% task success rate on the OSWorld benchmark — up sharply from 12% a year earlier, but still a 34% failure rate in unsupervised desktop automation. That headline number is per task; the more useful lesson is what happens to multi-step workflows.
Reliability compounds. As an illustrative example cited in coverage of the launch, an agent that is 85% reliable on each individual step of a 10-step workflow succeeds end-to-end only about 20% of the time, because the per-step success rate multiplies across the chain: 0.85 raised to the tenth power is roughly 0.20. (That figure comes from an analysis on Temporal.io’s blog, not a peer-reviewed study, so treat it as illustrative rather than measured.) The takeaway is the same either way: short, stable workflows are far safer candidates than long, fragile ones.
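The compounding arithmetic is short enough to check directly. The sketch below assumes every step succeeds or fails independently with the same probability, which is a simplification; the function name `end_to_end_success` is hypothetical.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# The example from the text: 85% per step across 10 steps.
print(f"{end_to_end_success(0.85, 10):.1%}")  # 19.7%
```

Real workflows rarely have uniform, independent steps, so the result is best read as a rough bound on expectations rather than a forecast.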
[Figure: Estimated per-run success vs workflow length (illustrative). Per-run estimate = 0.96^(step count), calibrated so a 10-step flow lands on the OSWorld 66% average (Stanford 2026 AI Index). Not a measured per-task rate.]
The chart is a deliberately simple model, not measured data: it assumes each step replays at about 96% reliability and compounds that across the workflow, which by design lands a 10-step flow on the OSWorld 66% average. The point is the shape of the curve, not the exact percentages: every additional step you record is another place the replay can stumble. That is the lens to bring to the six tasks below.
05 — Task Selection
Six office tasks worth recording first.
OpenAI’s own examples — filing an expense report, submitting a time-off request — are a good start, but the reliability curve suggests an ordering. The matrix below scores six common office workflows by step count, interface stability, an illustrative per-run estimate (derived from the same 0.96-per-step model above), where a human still needs to check, and an overall fit rating. Record the high-fit ones first; treat the low-fit ones as assisted, not unattended.
| Workflow | Typical steps | Interface stability | Est. per-run | Human still needed | Fit |
|---|---|---|---|---|---|
| Time-off / HR system request | 3 | Static portal | ~88% | Spot-check dates | High |
| Attendance / time-logging entry | 4 | Static portal | ~85% | Periodic audit | High |
| Recurring data export / report download | 5 | Mostly static dashboard | ~82% | Verify file landed | High |
| Expense report submission | 6 | Form + receipt upload | ~78% | Confirm amounts | Medium |
| CRM / Jira ticket creation | 8 | Dynamic fields / pickers | ~72% | Review categorisation | Medium |
| Video / content publishing upload | 12 | Dynamic, multi-screen | ~61% | Review before publish | Low |
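The per-run column can be reproduced from the 0.96-per-step model stated earlier. The sketch below recomputes each estimate from the step count and compares it with the figure quoted in the table; the step counts and percentages are copied from the table, while the names `WORKFLOWS` and `estimate` are hypothetical.

```python
PER_STEP = 0.96  # illustrative per-step reliability from the article's model

# (workflow, typical steps, per-run estimate quoted in the table, in percent)
WORKFLOWS = [
    ("Time-off / HR system request", 3, 88),
    ("Attendance / time-logging entry", 4, 85),
    ("Recurring data export / report download", 5, 82),
    ("Expense report submission", 6, 78),
    ("CRM / Jira ticket creation", 8, 72),
    ("Video / content publishing upload", 12, 61),
]


def estimate(steps: int, per_step: float = PER_STEP) -> float:
    """Compound the per-step reliability across the whole workflow."""
    return per_step ** steps


for name, steps, quoted in WORKFLOWS:
    modelled = round(estimate(steps) * 100)
    print(f"{name:<42} {steps:>2} steps  model {modelled}%  table ~{quoted}%")
```

Swapping `PER_STEP` for a reliability figure measured on your own replays turns the same loop into a planning tool for your own task list.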
Read the table as a queue, not a verdict. The three high-fit rows are short and run against stable portals, so the compound-failure tax is small and a light spot-check is enough. The medium rows are worth recording but keep a human in the loop on the value-bearing step — the dollar amount on an expense claim, the category on a ticket. The video-upload row is deliberately the cautionary case: at a dozen steps across multiple dynamic screens, an unattended replay is more likely to derail, so use it as an assistant that drafts the upload for a human to finish and publish.
None of these per-run figures are promises — they are a planning heuristic. If you want a structured way to score and sequence automations like this across a department, our CRM and workflow automation engagements start with exactly this kind of task triage before any recording happens. For the broader build-versus-buy decision, our AI transformation work frames where agentic automation fits against existing tooling.
06 — Where It Fits
Record & Replay versus RPA and no-code tools.
Record & Replay does not replace every automation tool — it slots into a crowded field alongside RPA macros, no-code platforms like Zapier, and hand-written scripts. The right way to position it is by who can build with it, how it handles change, and where it integrates. The matrix below summarises when each approach wins.
Record & Replay
No technical skill needed — you demonstrate the task. Tolerates small UI changes because it interprets intent at replay. Handles variable inputs and reaches across apps via Computer Use, browser actions, and plugins. macOS-only and paid-plan-gated at launch.
UiPath / Automation Anywhere
Mature, enterprise-governed, and deterministic — but captures coordinates and selectors that break on UI changes, and typically needs specialist developers to build and maintain. Strong where interfaces are frozen and audit trails are mandatory.
Zapier / Make
Best when the systems involved expose clean APIs and webhooks — connections are robust and run server-side without a desktop. Weak when a task only exists in a GUI with no API, which is exactly the gap Record & Replay targets.
Bespoke code
Maximum control and lowest marginal run cost, but the highest skill bar and the most maintenance. Justified for high-volume, high-value pipelines where a brittle demonstration or a per-task SaaS fee would not hold up.
“The cheapest way to teach an agent is starting to be demonstration, not instruction. This aligns with how knowledge actually exists — in muscle memory, not in documentation.”
eesel AI analysis, June 2026
There is a quietly important portability angle here. SKILL.md is an open format adopted across several major AI coding tools, which means a skill is not necessarily trapped inside Codex. In principle a skill authored in one compatible agent can be read by another — a useful hedge for teams that do not want to bet a library of recorded workflows on a single vendor. Treat cross-agent execution as promising rather than guaranteed, and test any skill in the target agent before relying on it. For the wider landscape, our look at how Codex stacks up against Claude Code, Cursor, and Replit sets the context, and the total cost of ownership for agent-based automation versus low-code tools like Zapier is worth running before you commit a department to one approach.
07 — Launch Limits
The catches: platform, geography, and prerequisites.
Three constraints define who can use Record & Replay today. First, it is macOS-only at launch: there is no stated Windows timeline, and while Computer Use itself supports Windows, Record & Replay does not. Second, it requires Computer Use to be available and enabled; enterprise admins can turn Computer Use off, and Record & Replay with it, by setting [features].computer_use = false in a requirements.toml managed-configuration file. Third, and most consequential for European readers, it is unavailable in the European Economic Area, the United Kingdom, and Switzerland.
The geographic exclusion is notable because of its timing. Computer Use became available to EEA, UK, and Swiss users on June 16, 2026 — two days before Record & Replay shipped — yet that expansion did not extend to Record & Replay. OpenAI has not stated a reason. It is worth noting only that the EU AI Act’s Article 50 transparency obligations, which cover agentic AI systems, are scheduled to take effect on August 2, 2026; a staged regional rollout of the most autonomous features would be consistent with that timeline, though OpenAI has not cited it as the cause. Read these as launch conditions, not permanent policy.
08 — Rollout
A practical way to start.
If your team is on macOS, outside the excluded regions, and on a qualifying plan, the sensible rollout is small and measured. Start with one high-fit task from the matrix — a short, stable workflow on a static portal — and record it against a test account with no real credentials on screen. Read the generated SKILL.md, tighten its description and its verification step, then replay it a handful of times with different inputs before you trust it with anything that matters.
From there, expand by reliability, not by enthusiasm. Add the next shortest, most stable task; keep a human checkpoint on any step that moves money, sends an external message, or cannot be undone; and resist the urge to record the twelve-step publishing pipeline until you have seen the short skills hold up over weeks, not hours. The compound-failure math is unforgiving on long chains, so length is the variable to control.
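If length is the variable to control, the compounding model can be inverted to give a step budget. The sketch below finds the longest workflow whose estimated success stays at or above a chosen target. It reuses the illustrative 96% per-step figure, assumes independent steps, and the function name `max_steps` and the target values are hypothetical.

```python
def max_steps(per_step: float, target: float) -> int:
    """Largest step count whose compounded success stays at or above target."""
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    success = 1.0
    # Keep adding steps while one more still meets the target.
    while success * per_step >= target:
        success *= per_step
        steps += 1
    return steps


for target in (0.90, 0.80, 0.66):
    print(f"target {target:.0%}: at most {max_steps(0.96, target)} steps")
```

Under this model an 80% target allows about five steps, which lines up with the high-fit rows of the task matrix and gives a simple rule for when to split a long recording into smaller skills.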
Pick one high-fit task
Choose a short, stable workflow from the matrix — a time-off request or a recurring report download. Short and stable means the compound-failure tax is small enough that a light human spot-check is sufficient.
Record on a test account
Use realistic but non-sensitive data, with no live credentials on screen. Keep the session well under the 30-minute cap. A tight recording produces a cleaner, more reliable SKILL.md.
Edit and verify the skill
Open the SKILL.md, sharpen its description so Codex triggers it correctly, and make sure the verification step actually confirms success. Replay several times with varied inputs before trusting it.
Expand by reliability
Add tasks in ascending order of step count, keeping a human checkpoint on anything irreversible, financial, or customer-facing. Hold off on long, dynamic flows until short skills have proven stable over weeks.
Run this way, Record & Replay is less a replacement for staff than a way to lift the most repetitive minutes out of their day — and the teams that benefit most are the ones who treat it as supervised automation with a clear rollback, not a fire-and-forget robot. If you want help scoping which workflows to record, building the human checkpoints, and wiring the reliable ones into a broader operating system, that is the kind of program our agentic delivery engagements are built around. If you would rather sequence it yourself, our 30/60/90-day rollout plan for agentic workflow automation lays out the same measured cadence step by step.
09 — Conclusion
A genuine step, with honest edges.
Show it once is real — but short, stable, and supervised is the version that works.
Record & Replay is the most concrete sign yet that demonstration, not instruction, is becoming the cheapest way to teach an agent. By capturing intent in natural language rather than coordinates, it sidesteps the brittleness that defeated decades of RPA, and by emitting an open, editable SKILL.md it keeps the result inspectable and at least partly portable across tools — a meaningful advance over the macro recorders it resembles.
The honest framing is the right one. At launch it is macOS-only, unavailable across the EEA, UK, and Switzerland, and gated behind Computer Use and a paid plan. These are current conditions, not stated as permanent. And the reliability picture, anchored by the 66% average success rate on the OSWorld benchmark and the way error compounds across steps, means long unattended workflows remain a poor bet. The value lives in short, stable, supervised tasks.
The forward read is simpler than the hype. As more knowledge workers — already the fastest-growing slice of Codex’s 5-million-plus weekly users — start recording the repetitive minutes of their day, the winners will be the teams that pick the right tasks, keep a human on the value-bearing steps, and expand by measured reliability rather than by ambition. Show it once is genuinely useful. Show it once, verify it twice, and supervise the rest is the version that holds up.