Grok Imagine 1.5 is an AI image-to-video model that xAI shipped as an API preview on June 3, 2026 — and for marketing and creative teams, it reframes a familiar problem. A campaign already owns hundreds of still assets: product shots, brand photography, key art. The new model turns any one of them into a moving, sound-on clip from a single natural-language prompt, at a per-second rate measured in cents.
That is the part worth slowing down on. Most coverage of the June 3 release reads it as an AI-model release note — resolution ceilings, rate limits, leaderboard Elo. This guide reads it as a production decision. Where does the creative judgement live when the camera move is a sentence? What does a finished, social-ready ad clip actually cost once you account for audio and image input? And where do the brand-safety guardrails bite when generated video runs as paid media?
Below: what the preview actually ships, an honest per-second cost comparison against Runway, Kling, and Veo, the still-to-shot workflow step by step, the Hotshot acquisition that produced the model, and the disclosure rules that govern running AI video in ads. Every figure traces to a primary source where one exists, and is softened where it does not.
- 01. An API preview, not a consumer launch. Grok Imagine 1.5 shipped June 3, 2026 as an image-to-video preview on the xAI API. Treat capabilities, pricing, and availability as provisional: a wider consumer rollout was not confirmed at the time of writing.
- 02. Audio is bundled into the per-second rate. xAI lists $0.08/sec at 480p and $0.14/sec at 720p, with native audio generation included at no extra charge and a $0.01 image input cost. That bundling is the differentiator most cost comparisons miss.
- 03. It debuted at number one on the image-to-video arena. Artificial Analysis, an independent benchmarking site, placed Grok Imagine Video 1.5 first on its image-to-video arena at the time of retrieval, ahead of Runway, Kling, Seedance, and Veo. Arena scores are live and move daily.
- 04. The workflow runs from still asset to chained shots. Feed one image plus a motion prompt, then chain multiple shots into a longer scene with a consistent look. For brands, that maps directly onto turning owned photography into short-form video.
- 05. Brand-safe use means disclosure, not just licensing. xAI's policy permits commercial use of outputs but prohibits deceptive attribution and using real people's likenesses without consent. Running generated video as paid media carries the same disclosure floor as any AI-made creative.
01 - What Shipped
A single image, a prompt, and a scene in motion.
The release is grok-imagine-video-1.5-preview, exposed through the xAI API. Per xAI's own description, the model takes a single starting frame plus a natural-language prompt and animates the scene — camera moves, atmosphere, and physics — while staying faithful to the input image's lighting and detail. This is an image-to-video model specifically: it animates an existing still rather than generating a clip from text alone. xAI maintains a separate text-to-video endpoint; the two should not be conflated.
For a brand team, that input constraint is a feature, not a limitation. Text-to-video starts from nothing and hopes the brand shows up; image-to-video starts from your actual product shot, your actual key art, your approved photography — and adds motion on top. The creative ground truth is locked before the model runs.
One still asset
A single starting frame — a product photo, brand image, or key-art still. The model preserves the source lighting and detail rather than reinventing the scene from scratch.
A motion prompt
Plain-language direction for camera move, pacing, and sound design, with resolution and clip length set separately. The prompt is where the creative decision now lives.
A sound-on clip
A rendered clip at 480p or 720p with native audio generated in the same pass — unusual among image-to-video APIs, where audio is typically a separate step or omitted entirely.
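As a rough illustration of those three parts, the sketch below models one generation job in Python. The model ID and the two resolutions come from this article; the class name, field names, file path, default clip length, and payload shape are hypothetical and are not xAI's documented request schema, so check the live API reference before wiring anything up.

```python
from dataclasses import dataclass

MODEL_ID = "grok-imagine-video-1.5-preview"  # model name as given in this article
SUPPORTED_RESOLUTIONS = ("480p", "720p")     # per the article: 720p max, 480p also supported


@dataclass(frozen=True)
class ImageToVideoJob:
    """One still, one motion prompt, one clip. All field names are hypothetical."""
    image_path: str
    motion_prompt: str
    resolution: str = "720p"
    duration_seconds: int = 6  # illustrative default, not a documented value

    def __post_init__(self) -> None:
        if self.resolution not in SUPPORTED_RESOLUTIONS:
            raise ValueError(f"unsupported resolution: {self.resolution!r}")
        if not self.motion_prompt.strip():
            raise ValueError("motion prompt must not be empty")
        if self.duration_seconds <= 0:
            raise ValueError("duration must be positive")

    def to_payload(self) -> dict:
        """Build an illustrative payload; NOT xAI's documented request schema."""
        return {
            "model": MODEL_ID,
            "image": self.image_path,
            "prompt": self.motion_prompt,
            "resolution": self.resolution,
            "duration_seconds": self.duration_seconds,
        }


job = ImageToVideoJob(
    image_path="stills/hero_bottle.jpg",  # hypothetical asset path
    motion_prompt="Slow push-in on the bottle, soft morning light, gentle room tone",
)
payload = job.to_payload()
```

Validating the job before it is sent keeps a typo in the resolution or an empty prompt from turning into a billed generation.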
grok-imagine-video-1.5-2026-05-30). Per xAI's documentation: maximum output is 720p (480p also supported), audio generation is included in every generation, and the API rate limit is 60 requests per minute. Because this is a preview, do not assume it is production-locked or available on a consumer tier; verify the live model card before building a campaign on it.
Grok Imagine 1.5 supports multi-shot sequencing: stage each frame, animate it, and chain shots into longer scenes that hold a consistent look across an entire project. That single capability is what moves the model from a novelty clip generator to a plausible piece of ad-production infrastructure. A 6-second product teaser is rarely one shot, and brand consistency across cuts is the whole game.
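A batch job that fans out many generations can run into the 60 requests per minute limit quoted above. One way to stay under it is a client-side guard. The following is a minimal sketch, assuming the limit behaves like a rolling 60-second window (the documentation as quoted here does not say whether it does); the class and method names are illustrative.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Client-side guard allowing at most `max_calls` per `window_seconds`."""

    def __init__(self, max_calls: int = 60, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock
        self._sent: Deque[float] = deque()  # timestamps of recent requests

    def wait_time(self) -> float:
        """Seconds to wait before the next request may be sent (0.0 if none)."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_calls:
            return 0.0
        # Window is full: wait until the oldest request ages out.
        return self.window_seconds - (now - self._sent[0])

    def record(self) -> None:
        """Record that a request was just sent."""
        self._sent.append(self._clock())
```

In a real loop you would call `time.sleep(limiter.wait_time())` before each request and `limiter.record()` right after sending it. The injectable `clock` is there only so the logic can be checked without actually waiting.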
02 - The Pricing
The number that matters: audio is in the rate.
xAI lists the preview at $0.08 per second at 480p and $0.14 per second at 720p, with a $0.01 image input cost. The headline figure is cheap on its own. The figure that actually changes the production math is the one buried in the docs: audio generation is included at no additional charge, which is unusual among the major image-to-video APIs.
Figure: Grok Imagine 1.5 cost build-up · brand ad clip. Source: xAI Grok Imagine video docs (preview, June 3, 2026).
Why does bundled audio matter so much to the cost-per-asset calculation? Because a silent clip is not a finished ad. On a stack where video and voice are billed separately, a 15-second spot needs a soundtrack, a voiceover, or sound design layered on afterward: a second tool, a second invoice, a second round of approvals. When audio renders in the same pass, the per-second rate is closer to the true cost of a deliverable, not just the cost of moving pixels.
One honest caveat: this is preview pricing, and the "audio included" line is vendor-stated. If the rate card changes when the model reaches general availability, the differentiator can narrow or disappear. The right posture is to treat the current numbers as a snapshot — run a small batch, measure your real cost-per-finished-clip, and re-check the live pricing page before any scaled commitment.
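To make the arithmetic concrete, here is a small sketch of the sticker cost of a single clip at the preview rates quoted above. It assumes the $0.01 image input cost is charged once per generation, which the article does not spell out; the function name is illustrative.

```python
from decimal import Decimal

# Preview rates as quoted in this article (USD); re-check the live pricing page.
RATE_PER_SECOND = {"480p": Decimal("0.08"), "720p": Decimal("0.14")}
IMAGE_INPUT_COST = Decimal("0.01")  # assumed to apply once per generation


def clip_cost(seconds: int, resolution: str = "720p") -> Decimal:
    """Sticker cost of one image-to-video generation, audio included in the rate."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return RATE_PER_SECOND[resolution] * seconds + IMAGE_INPUT_COST


print(clip_cost(6, "480p"))   # 6 * 0.08 + 0.01 = 0.49
print(clip_cost(15, "720p"))  # 15 * 0.14 + 0.01 = 2.11
```

`Decimal` is used rather than floats so that cent-level sums stay exact when many clips are added up.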
"Give it a starting frame and a prompt describing the motion, and it animates the scene, including camera moves, atmosphere, and physics, while staying faithful to your source image." (xAI, Grok Imagine 1.5 announcement)
03 - The Workflow
From owned still asset to chained shots.
The reason this release matters to a creative team is not the model card — it is the workflow the model card implies. Here is the concrete production path, and crucially, where the human creative decision sits at each step. The model removes the rendering labor; it does not remove the judgement.
Pick the still
Start from an approved, on-brand asset — a product shot, hero image, or key-art frame. The model preserves its lighting and detail, so asset selection is the first creative decision and the one that anchors brand fidelity.
Prompt the motion
Write natural-language shot direction: the camera move, the pacing, the sound design. This is where art direction now lives — the prompt is the storyboard, and the difference between a generic pan and a brand-right move is the wording.
Chain the shots
Stage and animate additional frames, then sequence them into a longer scene with a consistent look. A 6-second teaser is rarely one shot — multi-shot sequencing is what makes a real ad rather than a single moving image.
Read across those three steps and the pattern is clear: the model collapses the slow, expensive middle — the actual rendering of motion — while leaving the two ends that carry brand risk firmly with the team. Asset choice and prompt direction are creative; sequencing and final cut are editorial. A team that treats Grok Imagine 1.5 as a faster camera, not an autonomous creative director, gets the cost benefit without surrendering the brand. For teams building this into a repeatable pipeline, our content engine work is exactly this kind of operationalization — turning a capable model into a governed, on-brand production line.
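The three steps above can be sketched as a shot list that a team reviews before anything is rendered. This is an illustration only: the `Shot` structure, the file paths, and the two-second shot lengths are invented (the article does not state a minimum clip length), and it assumes each shot is a separate generation with its own starting frame, billed at the quoted 720p preview rate.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List

RATE_720P = Decimal("0.14")         # USD per second, preview rate quoted in the article
IMAGE_INPUT_COST = Decimal("0.01")  # USD per starting frame, as quoted in the article


@dataclass(frozen=True)
class Shot:
    still: str          # approved source asset (step 1)
    motion_prompt: str  # shot direction in plain language (step 2)
    seconds: int


def plan_sequence(shots: List[Shot]) -> dict:
    """Summarise a chained sequence (step 3): shot count, length, estimated cost."""
    if not shots:
        raise ValueError("a sequence needs at least one shot")
    total_seconds = sum(s.seconds for s in shots)
    # Assumption: every shot is its own generation with its own starting frame.
    cost = RATE_720P * total_seconds + IMAGE_INPUT_COST * len(shots)
    return {"shots": len(shots), "seconds": total_seconds, "cost_usd": cost}


teaser = [
    Shot("stills/pack_front.jpg", "Slow orbit around the pack, soft studio hum", 2),
    Shot("stills/pour.jpg", "Close-up pour, crisp liquid sound, shallow focus", 2),
    Shot("stills/logo_card.jpg", "Gentle push-in on the logo, short musical sting", 2),
]
plan = plan_sequence(teaser)
print(plan)  # 3 shots, 6 seconds, 0.87 USD under the assumptions above
```

Keeping the stills and prompts in a reviewable structure like this is one way to make the two human decisions, asset choice and shot direction, visible before any compute is spent.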
It is also worth situating this against the broader market. Grok Imagine is one entrant in a fast-moving field. Our prior coverage maps the landscape this release slots into: how the leading AI video generators compare across Runway, Kling, and Veo, and the short-form distribution angle on AI video creation for YouTube Shorts and social brand content.
04 - Cost Comparison
What a finished clip actually costs.
The table below compares image-to-video API rates for brand ad production. Two warnings before you read it. First, only the Grok Imagine figures come from a primary vendor page on the retrieval date; the Runway, Kling, and Veo numbers are drawn from vendor and secondary sources that move week to week — treat them as directional ranges, not anchors, and verify each provider's live pricing page before you budget. Second, the "audio" column is what reframes the comparison: a cheaper per-second rate with audio sold separately is not necessarily cheaper per finished, sound-on clip.
| Provider (i2v) | Approx. rate & audio | Brand read |
|---|---|---|
| Grok Imagine 1.5 | $0.08/sec (480p) · $0.14/sec (720p) · audio included | Source-faithful motion from an owned still, multi-shot chaining, and bundled audio in the rate. Debuted #1 on the Artificial Analysis image-to-video arena. Preview status is the main caveat. |
| Runway Gen-4.5 | ~$0.15/sec · audio handled separately | Long track record and deep creative-control tooling. Audio is not bundled into the i2v rate, so a finished sound-on clip carries an added step. Verify live credit pricing before budgeting. |
| Kling 3.0 | ~$0.075–$0.10/sec · audio via separate credits | Often the lowest sticker rate for silent clips and strong on temporal consistency. The savings can erode once audio is added back in. Cross-check the official Kling API pricing. |
| Google Veo 3.1 | ~$0.03/sec (Lite) to ~$0.40/sec (Quality) | Tiered: a very low-cost Lite entry without audio up to a premium Quality tier. Deep Google Cloud ecosystem fit. Verify rates directly on Google's pricing pages before committing. |
Here is the original read most coverage skips. On a naive per-second-of-pixels basis, Grok Imagine 1.5 at $0.14 for 720p looks mid-pack — cheaper than Veo Quality, comparable to Runway, pricier than Kling's floor and Veo Lite. But brands do not ship silent video. Once you require a sound-on deliverable, the providers that bill audio separately need that cost added back, and the ranking tightens considerably. The arena result reinforces the point: a top-of-table quality position at a rate that already includes audio is a genuinely competitive cost-per-finished-asset, not just a cheap per-second clip.
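One way to make "add the audio cost back" concrete is to ask how much a separately billed audio step may cost per second before a silent provider stops being cheaper than Grok's bundled 720p rate. The sketch below uses the approximate rates from the table; it ignores image input fees, and it does not assume any particular audio price, because none is given here. The function name is illustrative.

```python
from decimal import Decimal

# Approximate per-second rates from the comparison table (USD). Only the Grok
# figure comes from a primary vendor page; the others are directional.
GROK_720P_WITH_AUDIO = Decimal("0.14")
SILENT_RATES = {
    "Runway Gen-4.5": Decimal("0.15"),
    "Kling 3.0 (high end of range)": Decimal("0.10"),
    "Kling 3.0 (low end of range)": Decimal("0.075"),
    "Veo 3.1 Lite": Decimal("0.03"),
}


def breakeven_audio_cost(silent_rate: Decimal) -> Decimal:
    """Per-second audio add-on at which a silent provider matches Grok's bundled rate.

    A result of zero or less means the silent rate is already at or above
    Grok's bundled rate before any audio is added.
    """
    return GROK_720P_WITH_AUDIO - silent_rate


for name, rate in SILENT_RATES.items():
    print(f"{name}: {breakeven_audio_cost(rate)} USD/sec")
```

On the table's figures, Kling stays cheaper only while audio costs less than roughly $0.04 to $0.065 per second, Veo Lite while it costs less than about $0.11 per second, and Runway's approximate rate is already above Grok's before audio is counted. Those thresholds move whenever any vendor's rate card does.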
Project that forward and the strategic shift is about volume, not hero spots. When a sound-on, on-brand 15-second clip lands near a couple of dollars in compute, the constraint on social video output stops being budget and starts being creative direction and approval throughput. The teams that win this cycle will not be the ones with the biggest production budget; they will be the ones with the tightest prompt libraries, the clearest brand guardrails, and the fastest review loops.
05 - The Origin
The Hotshot acquisition behind the model.
Grok Imagine 1.5 did not appear from nowhere. It is a direct product of xAI's acquisition of Hotshot, a San Francisco-based AI video generation startup, in March 2025. Hotshot had built three video foundation models before being folded into xAI, and that team and its model lineage are the engine underneath the Grok Imagine video pipeline today.
The detail matters for anyone deciding how much weight to put on a preview. A first-place arena debut reads very differently when it comes from a team with a multi-year head start in video generation than when it comes from a standing start. The capability is not a lucky one-off; it is the visible result of an acquired research program, now distributed at xAI's API scale and price point.
06 - The Other Half
The voice half of the creative stack.
On the same day as the video preview, xAI announced a partnership making Grok the default voice engine for Vapi's core voices. Read alone, that looks like a developer-platform story. Read alongside the video release, it is the second half of a creative production stack: motion on one side, voice on the other, both priced to undercut the most expensive suites on the market.
For a brand, the implication is workflow consolidation. Custom voice cloning via the Grok Voice API is positioned for narration, podcasts, advertising, and voiceover use cases — the same deliverables that the video model feeds. A short-form ad that needs a consistent brand voice across spots can, in principle, source both the visuals and the voiceover from the same vendor stack rather than stitching together three tools.
Voice agents on Vapi
xAI states Grok now serves as the default engine for Vapi's core voices, used across over 2.5 million voice agents. (Reporting around the platform has cited a higher figure since; the count keeps growing.)
Vapi Series B
TechCrunch reported Vapi raised a $50M Series B in May 2026 at a $500M post-money valuation, with enterprise adopters including Amazon Ring. Scale context for the partnership underneath the voice layer.
Motion plus voice
Grok Voice cloning targets narration, advertising, and voiceover — the same outputs the image-to-video model produces visuals for. One vendor stack for a sound-on brand clip rather than three separate tools.
A note on naming, because it trips people up: the consumer voice mode inside the Grok app and the developer Grok Voice Agent API are distinct products built on the same underlying voice stack. The Vapi partnership runs on the developer API and text-to-speech endpoint, not the chatbot's voice mode. For brand work, the developer side is the relevant one — it is what supports custom voice cloning for advertising.
07 - Brand Safety
Commercial rights come with a disclosure floor.
The good news for brands: xAI's Acceptable Use Policy explicitly allows commercial use of generated outputs, including video. The constraint is in the conditions, not the permission. The policy prohibits copyright and IP violation, depicting real persons without their consent, and — most relevant for advertising — deceptive attribution of AI-generated outputs. xAI also requests attribution to Grok per its brand guidelines.
Translate that into the realities of running paid media. If a generated clip implies a real person endorses a product, or passes off AI video as authentic footage in a way that misleads, you are outside the policy and likely outside the ad platforms' own synthetic-media rules. The regulatory floor here is the same one that governs all AI-generated creative: transparency about AI use, and no unconsented likenesses. This is not a Grok-specific tax — it is the cost of doing business with generative video at all, and it belongs in the brief, not the post-mortem.
08 - The Decision
When to reach for it, and when not to.
A new model at number one is not a mandate to switch everything. The useful question is per-workload: which jobs does an image-to-video preview at this price actually fit, and which should stay on a proven, GA-stable tool?
High-volume sound-on clips from owned stills
Turning existing product and brand photography into a steady stream of short social spots is the sweet spot: bundled audio, cheap per-second rate, multi-shot chaining. Run a small batch, measure cost-per-finished-clip, then scale.
Flagship spots with zero tolerance for variance
A preview model is the wrong place to stake a tentpole campaign. Capabilities can change before GA, and high-stakes hero work wants a tool you have stress-tested. Keep flagship production on proven ground for now.
Budget-sensitive bulk where audio is needed
When clips must ship with sound, the bundled-audio rate often beats a cheaper silent provider once you add audio back. Compare on cost-per-finished-clip, not per-second-of-pixels, before defaulting to the lowest sticker price.
Sectors with strict ad-disclosure exposure
Generated video in finance, health, or political adjacency carries real disclosure and likeness risk. The model is usable, but the diligence — consent, transparency, documentation — has to be airtight before it runs as paid media.
For most marketing teams, the right first move is small and measurable: take five to ten owned stills, write motion prompts, generate sound-on clips, and put the real numbers next to your current production cost and turnaround. The preview status means you should not rebuild your whole pipeline around it yet — but it also means the cost of finding out whether it fits your workflow is genuinely trivial. If you want help standing up that evaluation as a governed, repeatable process, our social media and content engine engagements are built for exactly this kind of test-and-operationalize loop.
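For the pilot itself, the number worth tracking is spend per approved clip, since rejected takes still cost money. The sketch below is one way to compute it, assuming the quoted preview rates and a once-per-generation image input fee; the `Take` record and the sample batch are invented for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List

RATE_PER_SECOND = {"480p": Decimal("0.08"), "720p": Decimal("0.14")}  # quoted preview rates
IMAGE_INPUT_COST = Decimal("0.01")  # assumed to apply once per generation


@dataclass(frozen=True)
class Take:
    still: str
    seconds: int
    resolution: str
    approved: bool  # did the clip pass brand review?


def cost_per_finished_clip(takes: List[Take]) -> Decimal:
    """Total spend across all takes divided by the number of approved clips."""
    approved = sum(1 for t in takes if t.approved)
    if approved == 0:
        raise ValueError("no approved clips in this batch")
    spend = sum(
        (RATE_PER_SECOND[t.resolution] * t.seconds + IMAGE_INPUT_COST for t in takes),
        Decimal("0"),
    )
    return (spend / approved).quantize(Decimal("0.01"))


# Hypothetical pilot: four 6-second takes at 720p, two of which pass review.
batch = [
    Take("stills/a.jpg", 6, "720p", True),
    Take("stills/a.jpg", 6, "720p", False),
    Take("stills/b.jpg", 6, "720p", False),
    Take("stills/b.jpg", 6, "720p", True),
]
print(cost_per_finished_clip(batch))  # 3.40 total / 2 approved = 1.70
```

In this made-up batch the sticker price of one take is $0.85, but the cost per finished clip is $1.70, which is the figure to set against current production cost and turnaround.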
09 - Conclusion
A faster camera, not a new creative director.
The constraint on social video output just moved from budget to creative direction.
Grok Imagine 1.5 is, at its core, a production-economics story. Animating an owned still into a sound-on, multi-shot clip for cents per second — with a first-place arena debut and audio bundled into the rate — changes the math on how much brand video a team can ship, not whether they can ship it at all.
The honest framing keeps two caveats in view. It is a preview, so pricing and availability can move before general availability, and the "audio included" differentiator is vendor-stated and worth re-checking. And the competitor comparison rests on rates that shift weekly — the only figures here from a primary page are Grok's own. The right response is not to switch wholesale; it is to run a small, measured pilot on owned assets and read your real cost-per-finished-clip.
The broader signal is the one worth carrying forward. When a sound-on, on-brand clip lands near the cost of a coffee, the bottleneck on social video stops being money and becomes prompt craft, brand guardrails, and approval speed. The model is a faster camera. The creative direction — still — is the job.