Grok Imagine Video 1.5: xAI tops the image-to-video leaderboard, beats Sora and Veo with native audio, and costs 86% less than Sora 2 Pro

AI Tools 🟢 Beginner ⏱️ 14 min read 📅 2026-06-20

🔎 Sora dies, Grok takes the throne

On April 26, 2026, OpenAI shut down the consumer Sora app. On June 17, 2026, xAI launched Grok Imagine Video 1.5 and took the number one spot on the Image-to-Video Arena leaderboard.

Less than two months. That's all the time it took for xAI to turn OpenAI's strategic void into technical leadership in a key segment of video generation.

The timing is no coincidence. Sora 2 Pro cost $30/minute and its API will permanently close on September 24, 2026 according to TechTimes. Creators who had invested in Sora workflows find themselves without a tool and without a clear alternative at OpenAI.

xAI targeted exactly this gap. Grok Imagine Video 1.5 arrives with a triple value proposition: the number one ranking on the reference benchmark, synchronized native audio in a single pass, and a price of $4.20/minute. That's an 86% reduction compared to Sora 2 Pro, according to Gagadget.


The key points

  • Grok Imagine Video 1.5 is number 1 on the Image-to-Video Arena with an Elo score between 1330 and 1421 depending on the source, ahead of Seedance 2.0 (ByteDance), Veo 3.1 (Google), and the former Sora 2 Pro.
  • The audio is generated in a single pass: no need to go through a separate TTS model. Lip-synced dialogue, sound effects, and music all come out of a single request.
  • The pricing disrupts the market: $4.20/minute at xAI compared to $30/min for Sora 2 Pro, and 75 to 87% cheaper than Google Veo 3.1 depending on audio usage.
  • OpenAI discontinued the Sora app in April 2026 and the Sora 2 API will sunset in September 2026. The segment is up for grabs.
  • Available via the xAI API and third-party platforms like Replicate and fal.ai.

| Model | Type | Estimated price (June 2026, check on site) | Native audio | Arena Elo score |
|---|---|---|---|---|
| Grok Imagine Video 1.5 | Image → Video 720p | $4.20/min ($0.14/sec 720p) | Yes, single-pass | ~1330-1421 |
| Seedance 2.0 | Text/Image → Video 720p | ~$8-12/min (estimated) | No | 1454 (overall t2v) |
| Veo 3.1 Audio | Text/Image → Video 1080p | ~$17-34/min depending on config | Yes | 1396-1402 |
| Kling 2.0 Pro | Text/Image → Video | ~$10-15/min (estimated) | Partial | 1347 |
| Sora 2 Pro (sunset) | Text/Image → Video | $30/min | No | N/A (removed) |

What really makes the difference with Grok Imagine Video 1.5

Grok Imagine Video 1.5 doesn't just generate video. It solves a concrete problem that all creators know: post-generation audio synchronization.

Native single-pass audio, the real game-changer

Until now, the standard workflow for an AI video clip consisted of three steps: generate the video, generate the audio separately (ElevenLabs, native Grok voice, etc.), and then synchronize the two with a video editing tool.

Grok Imagine Video 1.5 removes steps 2 and 3. The model generates video and audio in a single pass, according to ThePlanetTools. This includes dialogue with lip-sync, contextual sound effects, and background music.

The gain isn't just technical. It's a gain in time, cost, and above all, consistency. A footstep falling at exactly the right moment, a glass being set down with the right sound — that's what separates an amateur render from a professionally usable result.

The Aurora-2 engine under the hood

This model is built on Aurora-2, the multimodal engine introduced by xAI in early 2026. Aurora-2 is also the engine behind the xAI voice model that Artificial Analysis ranked first among voice reasoning models in early 2026, ahead of Google and Amazon.

Aurora-2's any-to-any architecture explains why xAI went from zero video capabilities to the top of the leaderboard in a matter of months. It is the same kind of any-to-any approach that Google takes with Gemini Omni on the competitor side; the difference is that xAI chose to aggressively optimize pricing.


The Arena Ranking in Detail: Who Beats Who?

Artificial Analysis's Image-to-Video Arena has become the industry's benchmark standard. The principle: blind human evaluations (Elo) comparing the outputs of different models from the same input image.
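To make the Elo numbers easier to read, here is a minimal sketch of the standard logistic Elo model. It assumes the conventional 400-point scale and a K-factor of 32. The article does not confirm the exact rating method Artificial Analysis uses, so the function names and constants below are illustrative only.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that model A is preferred over model B under logistic Elo."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return both ratings after one blind comparison (zero-sum update)."""
    expected_a = elo_expected_score(rating_a, rating_b)
    # K=32 is a conventional default, assumed here rather than taken from the Arena.
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta


# Example with two scores quoted in this article: 1421 vs 1402.
p = elo_expected_score(1421, 1402)
print(f"Expected preference rate for the 1421-rated model: {p:.1%}")
```

Under these assumptions, a 19-point gap means the higher-rated model wins only about 53% of blind comparisons. Small Elo differences near the top of a leaderboard are therefore close calls, not blowouts.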

The June 2026 Scores

Grok Imagine Video 1.5 reaches an Elo score of around 1330 according to Gagadget and DailyBeirut, and up to 1421 in Artificial Analysis's overall video ranking. The difference is explained by the benchmark scopes: the 1330 score specifically concerns the image-to-video evaluation, while 1421 also incorporates text-to-video evaluations via the parent model grok-imagine-video.

The overall video ranking, as updated in June 2026:

  1. Seedance 2.0 (ByteDance) — 1454 Elo
  2. HappyHorse 1.0 (Alibaba) — 1444 Elo
  3. Grok Imagine Video 720p (xAI) — 1421 Elo
  4. Veo 3.1 Audio 1080p (Google) — 1402 Elo
  5. Veo 3.1 Audio (Google) — 1396 Elo

The nuance is important. Seedance 2.0 leads on pure text-to-video, but in image-to-video — the use case most requested by creators — Grok 1.5 takes the lead. This is confirmed by Tesorb: the May 31, 2026 launch via the xAI API propelled the model ahead of Veo 3.1 and Sora 2 Pro in this specific segment.

The Fall of Sora, the Rise of xAI

In January 2026, xAI's Grok Imagine generated 1.245 billion videos in a single month. An unimaginable figure a year earlier, when xAI had no video product. The strategy was clear: volume first via the X (Twitter) integration, then a technical quality upgrade with version 1.5.

Meanwhile, OpenAI was retiring Sora. The standalone app closed on April 26, 2026, and the Sora 2 API is scheduled for sunset on September 24, 2026. Creators who had built pipelines around Sora must migrate. xAI has positioned Grok Imagine Video 1.5 as the natural destination for this migration.


Price comparison: why $4.20/minute changes everything

Pricing is often the deciding factor for mass adoption. And on this point, xAI is not playing in the same league.

Real cost comparison table

The prices below come from fal.ai for Grok and from VidGuru for the Veo comparison.

| Model | Cost per second (720p) | Cost for 5 sec | Cost for 15 sec | Cost per minute |
|---|---|---|---|---|
| Grok Imagine Video 1.5 | $0.14 | $0.70 (+ $0.01 image) | $2.10 (+ $0.01 image) | ~$4.20 |
| Veo 3.1 with audio | ~$0.50-0.80 | ~$2.50-4.00 | ~$7.50-12.00 | ~$17-34 |
| Sora 2 Pro | $0.50 | $2.50 | $7.50 | $30.00 |

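At the $4.20/minute rate, Grok is about 75 to 87% cheaper than Veo 3.1 depending on audio usage, and 86% cheaper than Sora 2 Pro. The per-second and per-minute columns come from different sources and do not multiply out exactly: 60 × $0.14 is $8.40, so read the $4.20 figure as the per-minute price cited from Gagadget and the $0.14 figure as fal.ai's listed rate. Each input image adds only $0.01 on fal.ai. The cost is strictly linear with duration: no tiers and no volume discounts.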
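The percentages can be re-derived from the per-minute prices quoted in this article. The sketch below only restates that arithmetic. The helper name `percent_cheaper` is illustrative, and the prices are the article's estimates, not verified list prices.

```python
def percent_cheaper(price: float, reference_price: float) -> float:
    """How much cheaper `price` is than `reference_price`, in percent."""
    if reference_price <= 0:
        raise ValueError("reference_price must be positive")
    return (1.0 - price / reference_price) * 100.0


# Per-minute prices as quoted in this article (estimates, June 2026).
GROK_PER_MIN = 4.20
SORA_2_PRO_PER_MIN = 30.00
VEO_AUDIO_PER_MIN_LOW = 17.00
VEO_AUDIO_PER_MIN_HIGH = 34.00

print(f"vs Sora 2 Pro:       {percent_cheaper(GROK_PER_MIN, SORA_2_PRO_PER_MIN):.1f}%")
print(f"vs Veo 3.1 (low):    {percent_cheaper(GROK_PER_MIN, VEO_AUDIO_PER_MIN_LOW):.1f}%")
print(f"vs Veo 3.1 (high):   {percent_cheaper(GROK_PER_MIN, VEO_AUDIO_PER_MIN_HIGH):.1f}%")
```

Swapping in current prices from each provider's pricing page is the only change needed to refresh the comparison.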

What these prices mean in practice

For a YouTube creator producing 10 five-second 720p clips per week: with Sora 2 Pro, that was $25/week ($100/month). With Grok 1.5 at $0.70 per clip, it's $7/week ($28/month). The difference pays for a Hostinger subscription to host their website.

For a marketing agency generating 100 ten-second 720p clips per month (1,000 seconds of video): you go from $500/month with Sora 2 Pro to about $70/month with Grok at the $4.20/minute rate, or about $140/month at fal.ai's $0.14/second. Savings on that scale call into question the very existence of premium pricing in the 720p segment.


Grok 1.5 vs. the competition: who holds up?

Against Google Veo 3.1

Google Veo 3.1 offers 1080p resolution with native audio, which Grok 1.5 does not yet do (capped at 720p according to Morphic). Veo 3.1 remains relevant for high-definition productions.

But the price of Veo 3.1 with audio is massively higher. And in image-to-video at 720p, Grok's Elo score surpasses Veo 3.1 in human evaluations. For 80% of use cases (social media, shorts, digital ads), 720p is more than enough.

Veo 3.1 keeps the advantage in pure text-to-video and 1080p resolution. Grok 1.5 dominates in pricing and image-to-video. It's a market share split that is taking shape, not total domination.

Against Seedance 2.0 (ByteDance)

Seedance 2.0 leads the overall video leaderboard with 1454 Elo, but it is primarily a text-to-video model. Its image-to-video integration is less documented and less optimized than that of Grok 1.5.

Seedance remains the number one choice if you are starting from scratch (pure text). But if you already have a reference image — a character, a product, a storyboard — Grok 1.5 is more consistent and more predictable in the rendering.

Against Kling 2.0 Pro

Kling 2.0 Pro (Kuaishou) scores 1347 Elo, below Grok 1.5. No documented native audio. Mid-range pricing. It's a solid number two choice, but it does not threaten Grok's position in image-to-video.

To follow the evolution of this comparison, our guide to the best AI video generators is updated every month with the latest Arena scores.

How to use Grok Imagine Video 1.5 in practice

Via the xAI API

The model has been available directly via the xAI API since May 31, 2026. We need to distinguish between two endpoints:

  • grok-imagine-video-1.5: image-to-video with synchronized audio (the model generating the buzz)
  • grok-imagine-video: text/image/video-to-video (a more general-purpose model)
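The split between the two identifiers can be expressed as a small routing helper. This is only a sketch: the two model names come from the list above, while `choose_model` and its input labels are hypothetical and not part of any xAI SDK. It deliberately makes no API call, since the request format is not described here.

```python
# Model identifiers as named in this article.
IMAGE_TO_VIDEO_MODEL = "grok-imagine-video-1.5"  # image-to-video with synchronized audio
GENERAL_MODEL = "grok-imagine-video"             # text/image/video-to-video


def choose_model(input_kind: str) -> str:
    """Pick a model identifier from the kind of input you start with.

    Hypothetical helper: routes image inputs to the 1.5 model and
    text or video inputs to the general-purpose model.
    """
    kind = input_kind.strip().lower()
    if kind == "image":
        return IMAGE_TO_VIDEO_MODEL
    if kind in {"text", "video"}:
        return GENERAL_MODEL
    raise ValueError(f"unsupported input kind: {input_kind!r}")


print(choose_model("image"))
print(choose_model("text"))
```

Keeping the routing in one place makes it easy to adjust if xAI later changes which model accepts which input.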

The xAI API is the most direct route for developers looking to integrate video generation into their own applications. For those building agents or automated pipelines, xAI's approach is reminiscent of what Grok Build does on the coding side — an API-first integration, without any frills.

Via Replicate and fal.ai

For creators who don't want to manage APIs directly, Replicate and fal.ai offer ready-to-use interfaces.

On fal.ai, pricing is transparent: $0.08/sec in 480p, $0.14/sec in 720p. A 5-second clip in 720p costs $0.70, plus $0.01 for the input image. Audio is included in the price if generated.
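As a quick budgeting aid, the fal.ai rates quoted above can be wrapped in a small calculator. The rates and the $0.01 image fee are the figures from this article (check the fal.ai pricing page before relying on them). The function name `clip_cost` is illustrative. `Decimal` is used to avoid floating-point rounding on currency.

```python
from decimal import Decimal

# fal.ai rates as quoted in this article (USD per second of output).
FAL_RATE_PER_SECOND = {"480p": Decimal("0.08"), "720p": Decimal("0.14")}
FAL_IMAGE_FEE = Decimal("0.01")  # per input image, as quoted in this article


def clip_cost(seconds: int, resolution: str = "720p", input_images: int = 1) -> Decimal:
    """Estimated cost of one clip: linear in duration plus a flat per-image fee."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if input_images < 0:
        raise ValueError("input_images cannot be negative")
    if resolution not in FAL_RATE_PER_SECOND:
        raise ValueError(f"no quoted rate for resolution {resolution!r}")
    return FAL_RATE_PER_SECOND[resolution] * seconds + FAL_IMAGE_FEE * input_images


print(clip_cost(5))           # 5-second 720p clip with one input image
print(clip_cost(15, "480p"))  # 15-second 480p clip with one input image
```

Multiplying the result by a weekly or monthly clip count gives a first-order budget. It does not account for failed generations or retries.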

This is the ideal format for testing and small volumes. For large-scale production, the direct API remains more cost-effective.

Technical limitations to be aware of

Grok Imagine Video 1.5 generates clips up to 15 seconds in 720p 24 FPS. No 1080p, no 4K. No video extension beyond 15 seconds in a single pass (although Morphic mentions a video extension capability, the exact details remain unclear).

For longer formats, you will need to combine several clips in a video editing tool — which remains the industry standard, even with the best AI tools on the market.
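Planning a longer video around the 15-second cap comes down to splitting the target duration into clips. A minimal sketch follows, assuming the 15-second limit stated above. `plan_segments` is a hypothetical helper and does not perform the stitching itself.

```python
def plan_segments(total_seconds: int, max_clip_seconds: int = 15) -> list[int]:
    """Split a target duration into clip lengths that each fit the per-clip cap."""
    if total_seconds <= 0 or max_clip_seconds <= 0:
        raise ValueError("durations must be positive")
    full_clips, remainder = divmod(total_seconds, max_clip_seconds)
    segments = [max_clip_seconds] * full_clips
    if remainder:
        segments.append(remainder)  # final, shorter clip
    return segments


# A 40-second sequence needs three generations.
print(plan_segments(40))  # [15, 15, 10]
```

In practice each segment needs its own reference image or prompt, and visual continuity between clips still has to be handled in the editing tool.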


The impact for content creators

For YouTubers and shorts creators

Native audio is a game-changer. A 5-second clip with synchronized voiceover, sound effects, and music, generated in a single prompt for $0.70 — that's a workflow that didn't exist six months ago.

For YouTube creators who optimize their titles, thumbnails, and scripts with AI, Grok 1.5 completes the chain by adding synchronized B-roll generation. Instead of searching for stock videos and adding audio manually, everything comes out of a single prompt.

For agencies and brands

The pricing makes AI video generation viable at scale. A campaign of 50 personalized 10-second clips for social media costs about $35 in raw video with Grok 1.5 at $4.20/minute, compared to $250 with Sora 2 Pro.

This is the tipping point where AI video generation goes from a demo gadget to an everyday production tool. Agencies that were hesitating because of Sora's costs no longer have an excuse.

For developers and SaaS products

The xAI API opens the door to video integration in products that couldn't afford to pay $30/minute. An SEO tool could generate illustrative videos for each article. An e-commerce tool could create personalized product demos on the fly.

The marginal cost of video moves much closer to the marginal cost of text. It sounds like a detail, but it is structurally transformative.


xAI's strategy: aggressive pricing + ecosystem integration

Why xAI can afford these prices

xAI doesn't need to monetize every video request. The strategy is ecosystem-driven: Grok Imagine Video fuels engagement on X (Twitter), generates additional training data, and attracts developers to the xAI API.

It's the same logic as Google with Veo integrated into Gemini, or ByteDance with Seedance integrated into Douyin/TikTok. The difference: xAI starts from behind and compensates with price. When you can't beat Google on resolution or ByteDance on training volume, you beat everyone on price.

The role of X (Twitter) in the strategy

In January 2026, Grok Imagine was already generating 1.245 billion videos per month via the X integration. This massive volume provided the data and feedback necessary to improve the model up to version 1.5.

No competitor has this integrated distribution channel. Even Veo, integrated into the Google ecosystem, doesn't benefit from a social network dedicated to short-form content. X serves simultaneously as a laboratory, a distribution channel, and a data source — a structural advantage that neither OpenAI nor Google can easily replicate.


❌ Common mistakes

Mistake 1: Confusing text-to-video and image-to-video

Arena rankings distinguish between these two categories. Seedance 2.0 leads in text-to-video (1454 Elo), Grok 1.5 leads in image-to-video (~1330-1421 Elo). Choosing a model solely based on its overall score without checking the input type leads to disappointing results. Always check the benchmark's scope.

Mistake 2: Ignoring the 720p limit

Grok Imagine Video 1.5 is capped at 720p 24 FPS. If your deliverable requires 1080p or 4K (broadcast, cinema, large-format display), this model is not sufficient. Veo 3.1 in 1080p remains the suitable choice in this case. Failing to check technical specifications before producing a batch of 100 clips is a costly mistake in terms of rework.
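A pre-flight check before launching a batch can catch this early. The sketch below compares a deliverable's requirements against the 720p / 24 FPS cap described in this article. The `Deliverable` class and `spec_problems` function are hypothetical names for illustration.

```python
from dataclasses import dataclass

# Output limits for Grok Imagine Video 1.5 as described in this article.
GROK_15_MAX_HEIGHT = 720
GROK_15_FPS = 24


@dataclass(frozen=True)
class Deliverable:
    height: int  # required vertical resolution in pixels, e.g. 1080
    fps: int     # required frame rate


def spec_problems(deliverable: Deliverable) -> list[str]:
    """List the reasons a deliverable cannot be met natively (empty if it fits)."""
    problems = []
    if deliverable.height > GROK_15_MAX_HEIGHT:
        problems.append(f"needs {deliverable.height}p, model is capped at {GROK_15_MAX_HEIGHT}p")
    if deliverable.fps > GROK_15_FPS:
        problems.append(f"needs {deliverable.fps} FPS, model outputs {GROK_15_FPS} FPS")
    return problems


print(spec_problems(Deliverable(height=720, fps=24)))   # fits: empty list
print(spec_problems(Deliverable(height=1080, fps=30)))  # two problems reported
```

Running a check like this on the brief, before generating 100 clips, costs nothing and avoids the rework described above.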

Mistake 3: Using Sora 2 Pro for new projects

The Sora 2 API will sunset on September 24, 2026. Starting a new project on it in June 2026 means committing to a forced migration in three months. Even though Sora 2 remains available for now, OpenAI's signal is clear: the video segment is no longer a priority.

Mistake 4: Underestimating the cost of separate audio

Comparing only the price of the video without audio gives a false picture of the total cost. With Veo 3.1 without audio, the price is more competitive. But add an external TTS and a synchronization tool, and the total cost often exceeds that of Grok 1.5 in single-pass. Always factor the cost of audio into your calculations.


❓ Frequently Asked Questions

Does Grok Imagine Video 1.5 also generate videos from text?

Yes, via the grok-imagine-video endpoint (not the 1.5 version). But it's the image-to-video mode of the 1.5 that dominates the Arena leaderboard. For pure text-to-video, Seedance 2.0 remains ahead with 1454 Elo.

Is the native audio truly viable for professional use?

For short clips (5-15 seconds), yes. The lip-sync and sound effects are consistent according to Arena evaluation feedback. For complex dialogue or long narration, a dedicated TTS remains preferable. Grok 1.5's native audio excels at effects and ambiance, not extended narration.

Can Grok Imagine Video 1.5 be used for free?

xAI has not announced a free tier for this model. For free alternatives, check out our guide to the best free video AIs. Platforms like fal.ai offer test credits, but not sustained free usage.

How does Grok 1.5 compare to open-source options like LTX?

Open-source models (LTX, Wan 2.1) offer more control and flexibility, but their image-to-video quality remains below proprietary models. Wan 2.1 T2V 480p scores 1353 Elo, but that's in text-to-video and at 480p. For raw image-to-video quality, Grok 1.5 dominates.

What are the alternatives if I want 1080p with audio?

Google Veo 3.1 Audio 1080p (1402 Elo) is the best current option. The price is significantly higher, but 1080p has 1.5 times the vertical resolution of 720p (2.25 times the pixels). It's the classic quality/price trade-off. Keep an eye on our overview of AI news, as xAI could upgrade the resolution quickly.


✅ Conclusion

Grok Imagine Video 1.5 doesn't just add another video model to the market: it redefines the value for money of the segment and takes advantage of OpenAI's strategic withdrawal from Sora. Native single-pass audio removes a major bottleneck in creative workflows. At $4.20/minute in 720p, AI video generation becomes viable for volume productions. For creators looking to integrate video into their tool stack, including those trying to cut the running costs of a website in 2026, Grok 1.5 has become the default option in image-to-video. This quarter, the segment has a new leader, and competitors will have to adapt.