AI Development · New Release · 12 min read · Published June 24, 2026

Demo a workflow once · SKILL.md output · macOS-only at launch

Codex Record & Replay: Show It Once, Skip the Script

On June 18, 2026, OpenAI added Record & Replay to the Codex desktop app (version 26.616). You perform a macOS workflow once, Codex watches, and it writes a reusable, editable SKILL.md you can run on demand with new inputs. This guide covers what shipped, how it differs from old macro recorders, the reliability math that decides which tasks are worth recording, and the launch catches.

Digital Applied Team
Senior strategists · Published Jun 24, 2026
Sources: 6 primary
  • Codex app at launch: 26.616 (shipped Jun 18, 2026)
  • Codex weekly active users: 5M+ (as of Jun 2, 2026; up 6x since Feb)
  • Max recording length: 30 min per session
  • Knowledge-worker share: ~20% of Codex users (growing more than 3x faster than developer users)

OpenAI Codex Record & Replay lets you demonstrate a macOS workflow one time — Codex watches, then writes a reusable SKILL.md you can run on demand with new inputs. Announced on June 18, 2026 in the Codex desktop app (version 26.616), it turns a one-off demo into an inspectable, editable automation, no scripting required.

The reason this matters is not that screen recording is new — it is that the recorder now captures intent in natural language rather than pixel coordinates. That shift is what separates Record & Replay from a decade of brittle macro tools, and it is why knowledge workers, not just developers, are the audience OpenAI is courting.

This guide covers exactly what shipped, how the demo-to-skill loop works, why the architecture differs from traditional RPA, the reliability math that decides which tasks are safe to automate, six office tasks worth recording first, and the launch catches — macOS-only and a regional exclusion — that European teams need to know about before they plan around it.

Key takeaways
  1. One demo becomes a reusable skill. You perform a workflow once on macOS; Codex turns the recording into a SKILL.md file that specifies when to use the workflow, what inputs it accepts, the steps, and how to verify success. You then run it on demand with new inputs.
  2. It captures intent, not coordinates. Recording stores actions and window content as JSON that preserves semantic intent rather than pixel positions. At replay, Codex interprets those steps against the current screen, which is why it survives small UI changes that break classic macro recorders.
  3. SKILL.md is a shared, portable format. The skill file is a human-readable, LLM-interpretable document. SKILL.md is an open format adopted across several major AI coding tools, so skills are not locked to a single vendor, which is a genuinely useful property when choosing a toolchain.
  4. Reliability compounds across steps. AI agents averaged 66% success on the OSWorld benchmark in the 2026 AI Index. Short, stable workflows replay well; long ones decay fast because per-step error multiplies. Pick tasks accordingly.
  5. Launch limits are real but framed as current. At launch Record & Replay is macOS-only and unavailable in the EEA, UK, and Switzerland, and it requires Computer Use. These are current constraints that OpenAI has not described as permanent, so verify your platform and region before planning.

01 · What Shipped
A demo-to-skill recorder, live in the Codex desktop app.

Record & Replay shipped on June 18, 2026 in the Codex desktop app at version 26.616. The premise is simple: you start a recording, perform a recurring task on your Mac — filing an expense report, submitting a time-off request — and stop the recording when you’re done. Codex turns that single demonstration into an inspectable, editable skill it can run again later with different inputs. The same 26.616 update also added bulk actions for automation run history and the ability to hand a thread off between local and remote hosts.

The feature is available to ChatGPT Plus, Pro, Business, Enterprise, and Edu subscribers. The Codex app itself is free to download, but Record & Replay requires a paid plan — Plus at $20/month, Pro at $200/month, or one of the Business, Enterprise, or Edu tiers. You trigger a recording from the Plugins panel in the desktop app (Plugins, then the “+”, then “Record a skill”) and replay it by referencing the skill name in a new thread.

Record: Show it once (macOS Screen Recording + Accessibility APIs)

You control when recording starts and stops. Codex captures your actions and window content, storing the workflow as JSON that preserves semantic intent rather than pixel coordinates. Maximum session length is 30 minutes.

Where to find it: Plugins → + → Record a skill

Replay: Run it on demand (Computer Use + browser actions + plugins)

Reference the skill by name in a new thread with new inputs. Codex uses whatever tools are available: Computer Use for desktop GUI control, browser actions for web tasks, and installed plugins for integrations like Slack, Gmail, Notion, and Salesforce.

Output: an inspectable, editable SKILL.md
Launch snapshot
Record & Replay launched June 18, 2026 in the Codex desktop app, version 26.616. It is macOS-only at launch, requires Computer Use to be available and enabled, and is unavailable to users in the European Economic Area, the United Kingdom, and Switzerland. These are the conditions as of launch — treat them as current, and re-check OpenAI’s documentation for your platform and region before planning a deployment.

Context for the timing: OpenAI Codex surpassed 5 million weekly active users as of June 2, 2026, up roughly 6x since the desktop app launched in February 2026. Knowledge workers now make up about 20% of Codex users and are reportedly growing more than 3x faster than developer users. Record & Replay is squarely aimed at that growth: the people most likely to record a repetitive office task are not the engineers who could script it, but the operations, marketing, and finance staff who currently do it by hand.

02 · How It Works
One demonstration becomes an editable SKILL.md.

The artifact Record & Replay produces is a SKILL.md file: a human-readable, LLM-interpretable instruction document. It specifies when to use the workflow, what variable inputs it accepts, what steps to follow, and how to verify successful completion. Because it is a plain document, you can open it, read it, and edit it by hand — fix a mislabelled step, tighten a verification check, or generalise an input — without re-recording from scratch.

Under the hood, recording uses the macOS Screen Recording and Accessibility APIs to capture both your actions and the on-screen window content. That data is stored as JSON that preserves semantic intent — “click the Submit button,” not “click at pixel 412, 880.” At replay time Codex loads the skill and interprets those steps against the current state of the screen, choosing tools as it goes.

How Codex decides when to run a skill

A SKILL.md file requires two frontmatter fields: a name (the skill identifier) and a description (the trigger signal Codex uses to decide when to invoke the skill). To keep its context window lean, Codex initially loads only the name, description, and file path of each skill — keeping the initial skills list under roughly 8,000 characters, about 2% of the context window. The full SKILL.md content is only loaded once a skill is actually selected.
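To make the two required fields and the lightweight listing concrete, here is a minimal Python sketch. Everything beyond `name` and `description` is an assumption for illustration: the sample skill name `submit-time-off`, the body headings, the file path, and the listing format are hypothetical, and the parser handles only simple `key: value` frontmatter. It is not Codex's actual loader.

```python
# Hypothetical SKILL.md content. Only the `name` and `description`
# frontmatter fields come from the article; the body layout is assumed.
SAMPLE_SKILL = """\
---
name: submit-time-off
description: Use when the user asks to file a time-off request in the HR portal.
---

## Inputs
- start_date
- end_date

## Steps
1. Open the HR portal and choose "Request time off".
2. Enter the start and end dates.
3. Click Submit.

## Verify
A confirmation banner shows the requested dates.
"""

REQUIRED_FIELDS = ("name", "description")
LISTING_BUDGET = 8000  # approximate character budget cited in the article


def parse_frontmatter(text: str) -> dict[str, str]:
    """Read simple `key: value` pairs between the two '---' delimiters."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---' frontmatter delimiter")
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    raise ValueError("missing closing '---' frontmatter delimiter")


def missing_fields(text: str) -> list[str]:
    """Return the required frontmatter fields that are absent or empty."""
    fields = parse_frontmatter(text)
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]


def listing_entry(path: str, text: str) -> str:
    """Build the short entry (name, description, path) loaded up front."""
    fields = parse_frontmatter(text)
    return f"{fields['name']}: {fields['description']} ({path})"


def fits_budget(entries: list[str], budget: int = LISTING_BUDGET) -> bool:
    """Check that the combined listing stays within the character budget."""
    return sum(len(entry) for entry in entries) <= budget
```

Because only the short entry is loaded up front, the full step list costs nothing until the skill is selected, which is why a specific, when-to-use `description` matters more than a long one.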

You can invoke a skill explicitly with a $skill-name reference or the /skills command, or implicitly when Codex matches your task description to a skill’s description field. Skills resolve through a storage hierarchy in priority order: a repo’s .agents/skills in the current directory, then parent and root repo folders, then a user-level $HOME/.agents/skills, then an admin-managed /etc/codex/skills, then the system-bundled defaults.
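The lookup order described above can be sketched as a first-match search. This is an illustrative Python sketch under stated assumptions: it assumes each skill lives in a folder named after the skill containing a SKILL.md file, the function names are hypothetical, and the system-bundled defaults are left out because their location is not given here.

```python
from pathlib import Path
from typing import Optional


def candidate_dirs(cwd: Path, repo_root: Path, home: Path) -> list[Path]:
    """List skill directories in priority order, highest priority first."""
    dirs: list[Path] = []
    current = cwd
    # Walk from the current directory up to the repo root.
    while True:
        dirs.append(current / ".agents" / "skills")
        if current == repo_root or current == current.parent:
            break
        current = current.parent
    dirs.append(home / ".agents" / "skills")   # user-level skills
    dirs.append(Path("/etc/codex/skills"))     # admin-managed skills
    # System-bundled defaults would come last; omitted in this sketch.
    return dirs


def resolve_skill(name: str, dirs: list[Path]) -> Optional[Path]:
    """Return the first SKILL.md found for `name`, or None if absent.

    Assumes a <skills-dir>/<name>/SKILL.md layout (an assumption).
    """
    for directory in dirs:
        candidate = directory / name / "SKILL.md"
        if candidate.is_file():
            return candidate
    return None
```

With this ordering, a repo-level skill shadows a user-level skill of the same name, so a team can override a personal skill by committing its own version.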

Required frontmatter: name + description (2 fields) · Discovery layer

Every SKILL.md needs a name (the identifier) and a description (the trigger signal Codex matches against your task). The description is what makes a skill discoverable, so write it as a clear when-to-use sentence.

Initial context cost: lazy skill loading (~8K chars, ~2% of window)

Codex lists only name, description, and path for each skill, keeping the initial listing near 8,000 characters, about 2% of context. Full step content loads only when a skill is selected, so many skills stay cheap.

Max recording: per-session ceiling of 30 min · Hands-on tested

A single recording session caps at 30 minutes. That is plenty for most office tasks and is itself a nudge toward short, well-scoped workflows, which, as the next section shows, are also the ones that replay most reliably.

03 · Architecture
Why this isn’t the macro recorder you remember.

Programming by demonstration is not a new idea — researchers have chased it since the mid-1980s. Traditional robotic process automation (RPA) tools such as UiPath and Automation Anywhere are its commercial descendants, and they share a well-known weakness: they capture pixel coordinates and rigid selectors that break the moment a button moves, a layout reflows, or a web app ships a redesign. Generalising from a single demonstration to a slightly different screen was the problem that defeated those systems.

Record & Replay routes that generalisation through a language model instead. The SKILL.md captures intent in natural language, and at replay Codex interprets that intent against whatever the screen actually looks like now. That is the meaningful architectural distinction: the recorder stopped storing where you clicked and started storing what you were trying to do.
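A toy Python sketch shows why the stored representation matters. It is not how Codex works internally: the `Element` type, both replay functions, and the screen layouts are hypothetical, and a plain label match stands in for the language-model interpretation described above. The only detail taken from the article is the example coordinate (412, 880).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    role: str
    label: str
    x: int
    y: int


def replay_by_coordinates(screen: list[Element], x: int, y: int) -> Optional[Element]:
    """Classic macro replay: act on whatever now sits at the stored pixel."""
    for element in screen:
        if (element.x, element.y) == (x, y):
            return element
    return None


def replay_by_intent(screen: list[Element], role: str, label: str) -> Optional[Element]:
    """Intent replay (simplified): find the element matching what was meant."""
    for element in screen:
        if element.role == role and element.label.lower() == label.lower():
            return element
    return None


# Layout at recording time: Submit sits at (412, 880).
recorded_screen = [
    Element("button", "Submit", 412, 880),
    Element("button", "Cancel", 300, 880),
]

# Layout after a redesign: Cancel moved into Submit's old position.
redesigned_screen = [
    Element("button", "Cancel", 412, 880),
    Element("button", "Submit", 540, 910),
]
```

On the redesigned screen, `replay_by_coordinates(redesigned_screen, 412, 880)` lands on Cancel, while `replay_by_intent(redesigned_screen, "button", "Submit")` still finds Submit at its new position. A real replay has to cope with renamed labels and changed flows as well, which exact matching cannot do.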

“OpenAI Codex shipped Record & Replay. This is the first Codex update that makes employee replacement easy to picture. You perform a workflow once on your Mac. Codex watches, turns it into a reusable Skill, then runs the same workflow next time with new inputs.”
Kai (@hqmank), community observer, June 20, 2026
The distinction that matters
The shift from coordinate-level to intent-level recording is what makes this more than a smarter macro tool. OpenAI has routed generalisation through its language model rather than through the rule-based heuristics that defeated prior systems — a meaningful architectural distinction, in TechTimes’ analysis of the launch. It is also why a recorded skill can tolerate the small UI drift that would have broken a classic RPA macro.

04 · Reliability
The reliability math that decides what to automate.

Before recording anything, it helps to understand where these agents still fail. According to the Stanford 2026 AI Index, AI agents achieved an average 66% task success rate on the OSWorld benchmark — up sharply from 12% a year earlier, but still a 34% failure rate in unsupervised desktop automation. That headline number is per task; the more useful lesson is what happens to multi-step workflows.

Reliability compounds. As an illustrative example cited in coverage of the launch, an agent that is 85% reliable on each individual step of a 10-step workflow succeeds end-to-end only about 20% of the time (0.85^10 ≈ 0.197), because the per-step success rates multiply across the chain. (That figure comes from an analysis on Temporal.io’s blog, not a peer-reviewed study, so treat it as illustrative rather than measured.) The takeaway is the same either way: short, stable workflows are far safer candidates than long, fragile ones.
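The arithmetic behind that example is short enough to check directly. The sketch below assumes every step succeeds or fails independently with the same probability, a simplification real workflows will not match exactly, and the function names are illustrative.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach a target end-to-end rate."""
    return target ** (1 / steps)


print(f"{end_to_end_success(0.85, 10):.1%}")  # prints 19.7%
print(f"{required_per_step(0.90, 10):.2%}")   # prints 98.95%
```

Run in reverse, the same formula shows how demanding long chains are: a 90% end-to-end rate on ten steps needs each step to succeed roughly 99% of the time.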

Estimated per-run success vs workflow length · illustrative

Illustrative: per-run estimate = 0.96^(step count). Calibrated so a 10-step flow lands on the OSWorld 66% average (Stanford 2026 AI Index). Not a measured per-task rate.

| Workflow length | Example | Est. per-run success |
| --- | --- | --- |
| 3 steps | Short, stable, e.g. time-off request | ~88% |
| 5 steps | Recurring export / report download | ~82% |
| 6 steps | Expense report submission | ~78% |
| 8 steps | CRM / Jira ticket creation | ~72% |
| 10 steps | OSWorld benchmark average | ~66% |
| 12 steps | Video / content publishing upload | ~61% |

The bars above are a deliberately simple model, not measured data: they assume each step replays at about 96% reliability and compound that across the workflow, which by design lands a 10-step flow on the OSWorld 66% average. The point is the shape of the curve, not the exact percentages — every additional step you record is another place the replay can stumble. That is the lens to bring to the six tasks below.
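The illustrative figures can be reproduced in a few lines of Python. The 0.96 per-step rate is the article's modelling assumption, not a measurement, and `max_steps_for` is a hypothetical helper for turning a target success rate into a step budget.

```python
PER_STEP = 0.96  # illustrative per-step reliability from the model above


def estimate(steps: int, per_step: float = PER_STEP) -> float:
    """Illustrative per-run success for a workflow of `steps` steps."""
    return per_step ** steps


def max_steps_for(target: float, per_step: float = PER_STEP) -> int:
    """Largest step count whose estimate still meets `target`."""
    if not (0 < per_step < 1) or not (0 < target <= 1):
        raise ValueError("per_step must be in (0, 1) and target in (0, 1]")
    steps = 0
    while estimate(steps + 1, per_step) >= target:
        steps += 1
    return steps


for n in (3, 5, 6, 8, 10, 12):
    print(f"{n:>2} steps: ~{estimate(n):.0%}")
```

Under this model, `max_steps_for(0.80)` returns 5: a team that wants at least an 80% unattended success rate should keep recordings to about five steps, or split longer flows into shorter skills with a human check between them.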

Read the numbers honestly
The 66% OSWorld figure is a third-party benchmark result reported in the Stanford 2026 AI Index, and the compound-failure illustration originates on Temporal.io’s blog — neither is an OpenAI claim about Record & Replay specifically. Use them as a planning lens for task selection, not as a promised success rate. Keep a human checkpoint on anything financial, irreversible, or customer-facing.

05 · Task Selection
Six office tasks worth recording first.

OpenAI’s own examples — filing an expense report, submitting a time-off request — are a good start, but the reliability curve suggests an ordering. The matrix below scores six common office workflows by step count, interface stability, an illustrative per-run estimate (derived from the same 0.96-per-step model above), where a human still needs to check, and an overall fit rating. Record the high-fit ones first; treat the low-fit ones as assisted, not unattended.

Six office workflows scored for Record & Replay fit by step count, interface stability, illustrative per-run success estimate, human checkpoints, and overall fit rating. Per-run estimates are derived from an illustrative 0.96-per-step compound model and are not measured rates.
| Workflow | Typical steps | Interface stability | Est. per-run | Human still needed | Fit |
| --- | --- | --- | --- | --- | --- |
| Time-off / HR system request | 3 | Static portal | ~88% | Spot-check dates | High |
| Attendance / time-logging entry | 4 | Static portal | ~85% | Periodic audit | High |
| Recurring data export / report download | 5 | Mostly static dashboard | ~82% | Verify file landed | High |
| Expense report submission | 6 | Form + receipt upload | ~78% | Confirm amounts | Medium |
| CRM / Jira ticket creation | 8 | Dynamic fields / pickers | ~72% | Review categorisation | Medium |
| Video / content publishing upload | 12 | Dynamic, multi-screen | ~61% | Review before publish | Low |

Read the table as a queue, not a verdict. The three high-fit rows are short and run against stable portals, so the compound-failure tax is small and a light spot-check is enough. The medium rows are worth recording but keep a human in the loop on the value-bearing step — the dollar amount on an expense claim, the category on a ticket. The video-upload row is deliberately the cautionary case: at a dozen steps across multiple dynamic screens, an unattended replay is more likely to derail, so use it as an assistant that drafts the upload for a human to finish and publish.
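The same triage can be expressed as a small scoring script. The step counts and checkpoints come from the table; the fit thresholds (80% and 70%) are hypothetical values chosen here because they reproduce the table's ratings. The table itself also weighs interface stability, which this sketch ignores.

```python
from dataclasses import dataclass

PER_STEP = 0.96  # illustrative per-step reliability, not a measured rate


@dataclass(frozen=True)
class Workflow:
    name: str
    steps: int
    checkpoint: str


WORKFLOWS = [
    Workflow("Expense report submission", 6, "Confirm amounts"),
    Workflow("Time-off / HR system request", 3, "Spot-check dates"),
    Workflow("Video / content publishing upload", 12, "Review before publish"),
    Workflow("Attendance / time-logging entry", 4, "Periodic audit"),
    Workflow("CRM / Jira ticket creation", 8, "Review categorisation"),
    Workflow("Recurring data export / report download", 5, "Verify file landed"),
]


def fit(estimate: float) -> str:
    """Map an estimate to a fit rating (thresholds are assumptions)."""
    if estimate >= 0.80:
        return "High"
    if estimate >= 0.70:
        return "Medium"
    return "Low"


def triage(workflows: list[Workflow]) -> list[tuple[str, int, int, str, str]]:
    """Return rows ordered shortest-first: name, steps, est. %, fit, checkpoint."""
    rows = []
    for workflow in sorted(workflows, key=lambda w: w.steps):
        estimate = PER_STEP ** workflow.steps
        rows.append((workflow.name, workflow.steps, round(estimate * 100),
                     fit(estimate), workflow.checkpoint))
    return rows


for row in triage(WORKFLOWS):
    print(row)
```

Swapping in your own workflows and step counts gives a first-pass recording queue; the checkpoint column is the part to argue about with the people who own each process.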

None of these per-run figures are promises — they are a planning heuristic. If you want a structured way to score and sequence automations like this across a department, our CRM and workflow automation engagements start with exactly this kind of task triage before any recording happens. For the broader build-versus-buy decision, our AI transformation work frames where agentic automation fits against existing tooling.

06 · Where It Fits
Record & Replay versus RPA and no-code tools.

Record & Replay does not replace every automation tool — it slots into a crowded field alongside RPA macros, no-code platforms like Zapier, and hand-written scripts. The right way to position it is by who can build with it, how it handles change, and where it integrates. The matrix below summarises when each approach wins.

Demonstration: Record & Replay

No technical skill needed: you demonstrate the task. Tolerates small UI changes because it interprets intent at replay. Handles variable inputs and reaches across apps via Computer Use, browser actions, and plugins. macOS-only and paid-plan-gated at launch.

Pick for tacit office tasks.

RPA macros: UiPath / Automation Anywhere

Mature, enterprise-governed, and deterministic, but it captures coordinates and selectors that break on UI changes, and it typically needs specialist developers to build and maintain. Strong where interfaces are frozen and audit trails are mandatory.

Pick for frozen, regulated UIs.

No-code automation: Zapier / Make

Best when the systems involved expose clean APIs and webhooks: connections are robust and run server-side without a desktop. Weak when a task only exists in a GUI with no API, which is exactly the gap Record & Replay targets.

Pick when APIs exist.

Custom scripts: Bespoke code

Maximum control and lowest marginal run cost, but the highest skill bar and the most maintenance. Justified for high-volume, high-value pipelines where a brittle demonstration or a per-task SaaS fee would not hold up.

Pick for high-volume pipelines.
“The cheapest way to teach an agent is starting to be demonstration, not instruction. This aligns with how knowledge actually exists — in muscle memory, not in documentation.”
eesel AI analysis, June 2026

There is a quietly important portability angle here. SKILL.md is an open format adopted across several major AI coding tools, which means a skill is not necessarily trapped inside Codex. In principle a skill authored in one compatible agent can be read by another — a useful hedge for teams that do not want to bet a library of recorded workflows on a single vendor. Treat cross-agent execution as promising rather than guaranteed, and test any skill in the target agent before relying on it. For the wider landscape, our look at how Codex stacks up against Claude Code, Cursor, and Replit sets the context, and the total cost of ownership for agent-based automation versus low-code tools like Zapier is worth running before you commit a department to one approach.

07 · Launch Limits
The catches: platform, geography, and prerequisites.

Three constraints define who can use Record & Replay today. First, it is macOS-only at launch: there is no stated Windows timeline, and although Computer Use itself supports Windows, Record & Replay does not. Second, it requires Computer Use to be available and enabled; enterprise admins can disable both Computer Use and Record & Replay by setting computer_use = false under [features] in a requirements.toml managed-configuration file. Third, and most consequential for European readers, it is unavailable in the European Economic Area, the United Kingdom, and Switzerland.

The geographic exclusion is notable because of its timing. Computer Use became available to EEA, UK, and Swiss users on June 16, 2026 — two days before Record & Replay shipped — yet that expansion did not extend to Record & Replay. OpenAI has not stated a reason. It is worth noting only that the EU AI Act’s Article 50 transparency obligations, which cover agentic AI systems, are scheduled to take effect on August 2, 2026; a staged regional rollout of the most autonomous features would be consistent with that timeline, though OpenAI has not cited it as the cause. Read these as launch conditions, not permanent policy.

Security note
Recording captures whatever is visible on screen. OpenAI’s documentation recommends using realistic but non-sensitive test data and avoiding exposing credentials, tokens, or customer records during capture — anything on screen during a recording can end up reflected in the skill. Record against a sandbox or test account where you can, and review the generated SKILL.md before you save or share it.
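Part of that review can be automated. The Python sketch below scans skill text for a few credential-like patterns before the file is saved or shared. The patterns are illustrative and far from exhaustive, the function name is hypothetical, and a clean result does not prove the file is safe, so treat it as a supplement to reading the skill yourself.

```python
import re

# Illustrative patterns only; extend them for your own systems.
PATTERNS = {
    "email address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "bearer token": re.compile(r"(?i)bearer\s+[a-z0-9._-]{16,}"),
    "password assignment": re.compile(r"(?i)password\s*[:=]\s*\S+"),
}


def find_sensitive(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern label) for each suspicious line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits


sample = (
    "Step 1: log in as jane@example.com\n"
    "Step 2: click Submit\n"
    "password: hunter2"
)
print(find_sensitive(sample))
# prints [(1, 'email address'), (3, 'password assignment')]
```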

08 · Rollout
A practical way to start.

If your team is on macOS, outside the excluded regions, and on a qualifying plan, the sensible rollout is small and measured. Start with one high-fit task from the matrix — a short, stable workflow on a static portal — and record it against a test account with no real credentials on screen. Read the generated SKILL.md, tighten its description and its verification step, then replay it a handful of times with different inputs before you trust it with anything that matters.

From there, expand by reliability, not by enthusiasm. Add the next shortest, most stable task; keep a human checkpoint on any step that moves money, sends an external message, or cannot be undone; and resist the urge to record the twelve-step publishing pipeline until you have seen the short skills hold up over weeks, not hours. The compound-failure math is unforgiving on long chains, so length is the variable to control.
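That expansion rule, shortest first with a human checkpoint on risky steps, can be written down as a simple queue. The candidate tasks, their step counts, and their risk flags below are hypothetical examples; which steps count as moving money or being irreversible is a judgement for the process owner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    steps: int
    moves_money: bool = False
    sends_external_message: bool = False
    irreversible: bool = False

    @property
    def needs_human_checkpoint(self) -> bool:
        """True when any step falls into one of the three risk categories."""
        return self.moves_money or self.sends_external_message or self.irreversible


def rollout_queue(candidates: list[Candidate]) -> list[Candidate]:
    """Order candidates by ascending step count (shortest first)."""
    return sorted(candidates, key=lambda candidate: candidate.steps)


# Hypothetical candidates with illustrative flags.
candidates = [
    Candidate("Publishing upload", 12, sends_external_message=True, irreversible=True),
    Candidate("Expense report", 6, moves_money=True),
    Candidate("Time-off request", 3),
    Candidate("Report download", 5),
]

for candidate in rollout_queue(candidates):
    gate = "human checkpoint" if candidate.needs_human_checkpoint else "spot-check"
    print(f"{candidate.steps:>2} steps  {candidate.name}: {gate}")
```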

Step 1: Pick one high-fit task (start narrow)

Choose a short, stable workflow from the matrix, such as a time-off request or a recurring report download. Short and stable means the compound-failure tax is small enough that a light human spot-check is sufficient.

Step 2: Record on a test account (protect credentials)

Use realistic but non-sensitive data, with no live credentials on screen. Keep the session well under the 30-minute cap. A tight recording produces a cleaner, more reliable SKILL.md.

Step 3: Edit and verify the skill (inspect, don't assume)

Open the SKILL.md, sharpen its description so Codex triggers it correctly, and make sure the verification step actually confirms success. Replay several times with varied inputs before trusting it.

Step 4: Expand by reliability (grow the library carefully)

Add tasks in ascending order of step count, keeping a human checkpoint on anything irreversible, financial, or customer-facing. Hold off on long, dynamic flows until short skills have proven stable over weeks.

Run this way, Record & Replay is less a replacement for staff than a way to lift the most repetitive minutes out of their day — and the teams that benefit most are the ones who treat it as supervised automation with a clear rollback, not a fire-and-forget robot. If you want help scoping which workflows to record, building the human checkpoints, and wiring the reliable ones into a broader operating system, that is the kind of program our agentic delivery engagements are built around. If you would rather sequence it yourself, our 30/60/90-day rollout plan for agentic workflow automation lays out the same measured cadence step by step.

09 · Conclusion
A genuine step, with honest edges.

The shape of demonstration-driven automation, June 2026

Show it once is real — but short, stable, and supervised is the version that works.

Record & Replay is the most concrete sign yet that demonstration, not instruction, is becoming the cheapest way to teach an agent. By capturing intent in natural language rather than coordinates, it sidesteps the brittleness that defeated decades of RPA, and by emitting an open, editable SKILL.md it keeps the result inspectable and at least partly portable across tools — a meaningful advance over the macro recorders it resembles.

The honest framing is the right one. At launch it is macOS-only, unavailable across the EEA, UK, and Switzerland, and gated behind Computer Use and a paid plan — current conditions, not stated as permanent. And the reliability picture, anchored by the 66% OSWorld benchmark and the way error compounds across steps, means long unattended workflows remain a poor bet. The value lives in short, stable, supervised tasks.

The forward read is simpler than the hype. As more knowledge workers — already the fastest-growing slice of Codex’s 5-million-plus weekly users — start recording the repetitive minutes of their day, the winners will be the teams that pick the right tasks, keep a human on the value-bearing steps, and expand by measured reliability rather than by ambition. Show it once is genuinely useful. Show it once, verify it twice, and supervise the rest is the version that holds up.

FAQ · Codex Record & Replay

The questions we get every week.

What is Codex Record & Replay?

Record & Replay is a feature in OpenAI's Codex desktop app that lets you demonstrate a macOS workflow once: Codex watches and turns the demonstration into a reusable, editable skill it can run again with new inputs. It was announced on June 18, 2026 in the Codex desktop app at version 26.616. The same update also added bulk actions for automation run history and the ability to hand a thread off between local and remote hosts. You start a recording, perform a recurring task such as filing an expense report or submitting a time-off request, and stop when you're done; Codex produces a SKILL.md file you can inspect, edit, and replay on demand.