Gemini 3.5 Live Translate is Google's real-time multilingual CX engine — a streaming speech-to-speech model that, per Google, listens in one language and speaks back in another across 70+ languages with automatic detection, staying only a few seconds behind the speaker. It launched on June 9, 2026 across Google Translate, the Gemini Live API, Google AI Studio, and Google Meet.
The headline most coverage led with — 70+ languages, "a few seconds" of lag — is the least interesting part of the story. The real shift is architectural: Google dropped the cascaded transcribe-then-translate-then-synthesize pipeline that powered Meet's earlier translation in favour of a single audio-to-audio model. That one change is why Google Meet jumps from 5 English-only language pairs to 2,000+ combinations in a single meeting, and why the translated voice can — by Google's account — track the speaker's own pacing instead of sounding flat.
This guide covers what actually shipped and on which surfaces, the architecture gap in plain English, a build-vs-buy decision matrix for multilingual CX teams, how to wire the public-preview Live API, the SynthID watermarking angle that maps directly onto the EU AI Act, and where the field — including DeepL Voice — sits today. Every number below is sourced to a primary or named-secondary citation, and the vendor-stated claims are labelled as such.
- 01 · A single audio-to-audio model, four surfaces, one day. Gemini 3.5 Live Translate launched June 9, 2026 across Google Translate (iOS/Android), the Gemini Live API (public preview), Google AI Studio (public preview), and Google Meet (private preview for select Workspace customers).
- 02 · 70+ languages with automatic detection. No manual language configuration — the model detects the spoken language on the fly and supports translation across 70+ languages, a meaningful step up from turn-based systems that required a pre-set pair.
- 03 · Google Meet jumps from 5 languages to 2,000+ combinations. Meet's prior speech translation (GA January 27, 2026) supported 5 English-only pairs. The new system supports 2,000+ language combinations in a single meeting — by removing the intermediate text step.
- 04 · The Live API is live for builders today; Meet is private preview. Any team with a developer can stand up a multilingual voice agent now via the public-preview Live API. Enterprise marketers can pursue the Meet private preview, with broader rollout planned for the second half of 2026.
- 05 · SynthID watermarking lines up with the EU AI Act. All generated audio is watermarked with Google's SynthID. The EU AI Act's transparency obligation on AI-generated content becomes enforceable August 2, 2026 — a regulatory-readiness signal for enterprise procurement and legal teams.
01 — What Launched
One model, four surfaces, in preview.
On June 9, 2026, Google released Gemini 3.5 Live Translate — model ID gemini-3.5-live-translate-preview — and rolled it onto four surfaces at once. The consumer Google Translate apps on iOS and Android picked up a live mode; the Gemini Live API and Google AI Studio opened it in public preview for developers; and Google Meet got it in private preview for select business Workspace customers, with broader enterprise rollout planned for the second half of 2026.
The model accepts spoken audio and returns spoken audio in the target language, detecting the source language automatically across 70+ languages. It is, importantly, a preview release: Google has not committed to production-grade SLAs, stable pricing, or general availability. Treat it as a strong signal of where multilingual CX is heading and a viable surface to pilot — not a finished product to build a mission-critical SLA on.
Google Translate
Tap 'Live translate' in the bottom-left corner with headphones connected. A new Android 'Listening Mode' lets users hold the phone to the ear like a call — no Pixel Buds required, removing the earbud dependency the earlier feature carried.
Gemini Live API
Audio-only, real-time translation via a translationConfig block. Available today in public preview through the Live API and Google AI Studio. Integration partners include Agora, Fishjam, LiveKit, Pipecat, and Vision Agents.
Google Meet
Speech translation via a new button in the Meet control row. Currently private preview for select business Workspace customers; broader enterprise rollout is planned for the second half of 2026.
02 — The Architecture Gap
Why audio-to-audio beats the three-hop pipeline.
Most coverage says "continuous translation" and moves on. The detail that actually matters is what got removed. Older real-time translation — including Meet's previous system — ran a cascaded three-step pipeline: transcribe the speech to text (STT), translate that text, then synthesize the translated text back to speech (TTS). Each hop adds latency and a place for errors to compound — a mistranscription becomes a mistranslation becomes a confidently wrong spoken sentence.
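To make the compounding concrete, here is a minimal sketch. The per-hop latencies and accuracies are invented for illustration and are not measurements of Meet or any other product; treating hop errors as independent is also a simplifying assumption.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Hop:
    name: str
    latency_ms: float  # hypothetical processing delay for this hop
    accuracy: float    # hypothetical chance this hop's output is correct


def cascade_latency_ms(hops):
    """Hops run one after another, so their delays add up."""
    return sum(h.latency_ms for h in hops)


def cascade_accuracy(hops):
    """With independent errors, an utterance survives only if every hop gets it right."""
    return prod(h.accuracy for h in hops)


# Illustrative values only; these are NOT benchmarks of any real system.
cascaded = [
    Hop("speech-to-text", 300, 0.95),
    Hop("text translation", 200, 0.95),
    Hop("text-to-speech", 300, 0.99),
]

total_latency = cascade_latency_ms(cascaded)
end_to_end = cascade_accuracy(cascaded)
print(f"{total_latency:.0f} ms total, {end_to_end:.1%} end-to-end accuracy")
```

The point of the sketch is the shape rather than the numbers: latency is a sum and accuracy is a product, so the chain as a whole is slower than any single hop and less reliable than its weakest one.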
Gemini 3.5 Live Translate replaces all three hops with a single audio-to-audio model: speech goes in, speech comes out, with no intermediate text representation. Google describes the result as streaming continuously rather than waiting for sentence boundaries — which is what lets the system, in Google's words, stay "just a few seconds behind the speaker throughout the session." There is no independent latency benchmark for that claim as of this writing, so treat the "few seconds" figure as Google-stated.
What changed: Meet translation, old pipeline vs Gemini 3.5 Live Translate
Source: Google Blog (Jun 9, 2026) + Google Workspace Updates (Jan 27, 2026 GA). Latency and language figures are vendor-stated.
The second-order effect is the one CX teams should care about. When translation no longer routes through a text bottleneck, the translated audio can — per Google — preserve the speaker's intonation, pacing, and pitch rather than producing flat synthesized output. That matters because tone carries meaning: a reassuring support agent and a curt one say the same words. We haven't seen an independent A/B listening test of that claim, so it remains Google-stated — but the architectural reason it's plausible (no TTS voice substitution step) is sound.
"balancing the trade-off between waiting for context to improve quality and translating immediately to stay in sync with the speaker" — Anuda Weerasinghe and Tony Lu, Google (via Technobezz)
That quote from Google's product and engineering leads is the whole game in one sentence. Every real-time translation system lives on a spectrum between waiting (more context, better quality, more lag) and translating immediately (lower lag, riskier on ambiguous phrasing). The continuous-stream architecture is Google's bet that you can sit closer to the immediate end without the quality cliff that turn-based systems hit at every sentence boundary. The consumer side of this evolution is worth watching alongside Google Translate's live headphone mode, which this release extends.
03 — Google Meet
From 5 languages to 2,000+ combinations, in 133 days.
Here is the speed story. Google Meet's speech translation feature reached general availability on January 27, 2026, supporting exactly five language pairs, all paired with English: English to and from Spanish, French, German, Portuguese, and Italian, on Workspace Business Standard/Plus and Enterprise Standard/Plus plans. That was the state of the art for Meet roughly four months ago.
On June 9, 2026, that system was effectively obsoleted: 70+ languages, 2,000+ combinations in a single meeting, automatic detection. From GA to superseded in 133 days. For enterprise marketers and CX leaders, the lesson isn't the specific numbers; it's the tempo. Translation capability that you'd have scoped as a year-long roadmap item in 2024 is now shipping and re-shipping in under five months. The table below makes the jump concrete.
| Dimension | Meet · GA (Jan 2026) | Meet · Gemini 3.5 Live Translate (Jun 2026) |
|---|---|---|
| Architecture | Cascaded: STT → text translate → TTS (3 hops) | Single audio-to-audio model (0 intermediate text hops) |
| Languages supported | 5 | 70+ (automatic detection) |
| Combinations per meeting | 5 (English-only pairs) | 2,000+ |
| Latency mode | Turn-based (waits for sentence end) | Continuous streaming · "a few seconds" behind (Google-stated) |
| Voice preservation | Limited (synthesized voice, not speaker-matched) | Preserves intonation, pacing, pitch (Google-stated) |
| Rollout status | GA · Business Std/Plus, Enterprise Std/Plus | Private preview → H2 2026 broader rollout planned |
One caveat worth setting for stakeholders: the January GA system is generally available across eligible plans today, while the new model is private-preview only. If you're planning multilingual all-hands or customer calls in the next quarter, the five-language GA system is what you can actually rely on right now; the 2,000+ combination experience is a pilot you sign up for, not a switch you flip.
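The two headline figures in this section can be sanity-checked with a few lines of arithmetic. The dates come from the article; reading "combinations" as language pairs among exactly 70 languages is an assumption, since Google does not say how it counts.

```python
from datetime import date
from math import comb, perm

ga_date = date(2026, 1, 27)     # Meet speech translation GA, per the article
launch_date = date(2026, 6, 9)  # Gemini 3.5 Live Translate launch, per the article
days_between = (launch_date - ga_date).days

languages = 70                        # lower bound of the "70+" figure (assumption: exactly 70)
unordered_pairs = comb(languages, 2)  # A<->B counted once
directed_pairs = perm(languages, 2)   # A->B and B->A counted separately

print(days_between, unordered_pairs, directed_pairs)
```

The date difference comes out at 133 days, matching the figure above. With 70 languages there are 2,415 unordered pairs and 4,830 directed ones, so "2,000+" is consistent with counting each language pair once, though that interpretation is ours rather than Google's.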
04 — Build vs Buy
Four access paths, one decision matrix.
The most useful framing for a CX or marketing leader isn't "is this good?" — it's "which door do I walk through, and when?" There are four distinct ways to put Gemini 3.5 Live Translate to work, and they sort cleanly along two axes: how much engineering you have, and whether your channel is a custom app, a consumer touchpoint, or video meetings. The matrix below maps each path to a realistic timeline and its key constraint.
| Buyer type | Channel | Access path | Timeline | Key constraint |
|---|---|---|---|---|
| Developer / startup | Custom voice app | Gemini Live API public preview | Today | Must handle a raw PCM audio pipeline |
| Developer / startup | Consumer touchpoint (kiosk, event) | Translate Android "Listening Mode" | Today | Single-device, in-person only |
| Enterprise marketer | Video meetings (sales, support, CS) | Google Meet private preview | June 2026 sign-up | Workspace eligibility required |
| Enterprise marketer | Full CX voice stack | Via Agora / LiveKit / Pipecat | Today (partner platforms) | Partner SLA / pricing varies |
| Non-developer team | Internal multilingual meetings | Google Meet → H2 2026 broader rollout | H2 2026 (wait) | No dev required — standard Meet |
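For teams that want to embed this triage in an internal tool or intake form, the matrix above can be restated as a simple lookup. This is only a sketch: the shortened lowercase keys and the `recommend` helper are hypothetical names introduced here, and the values simply repeat the table rows.

```python
from typing import NamedTuple


class AccessPath(NamedTuple):
    path: str
    timeline: str
    constraint: str


# Keys are (buyer type, channel) in lowercase; values restate the article's matrix rows.
MATRIX = {
    ("developer", "custom voice app"): AccessPath(
        "Gemini Live API public preview", "Today", "Must handle a raw PCM audio pipeline"),
    ("developer", "consumer touchpoint"): AccessPath(
        'Translate Android "Listening Mode"', "Today", "Single-device, in-person only"),
    ("enterprise marketer", "video meetings"): AccessPath(
        "Google Meet private preview", "June 2026 sign-up", "Workspace eligibility required"),
    ("enterprise marketer", "full cx voice stack"): AccessPath(
        "Via Agora / LiveKit / Pipecat", "Today (partner platforms)", "Partner SLA / pricing varies"),
    ("non-developer team", "internal multilingual meetings"): AccessPath(
        "Google Meet broader rollout", "H2 2026 (wait)", "No dev required, standard Meet"),
}


def recommend(buyer: str, channel: str) -> AccessPath:
    """Look up a matrix row; raise a readable error for combinations it does not cover."""
    key = (buyer.strip().lower(), channel.strip().lower())
    try:
        return MATRIX[key]
    except KeyError:
        raise ValueError(f"No matrix row for buyer={buyer!r}, channel={channel!r}") from None


print(recommend("Developer", "Custom voice app"))
```

Raising on an unknown combination, rather than guessing a default, keeps the lookup honest about the cases the matrix does not address.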
Custom voice agent via the Live API
If you have engineering capacity and a contact-centre, webinar, or event use case, the public-preview Live API is the move — you can stand up a multilingual voice agent today. The trade-off is owning the raw audio pipeline and accepting preview-grade stability.
Pilot Google Meet
If your need is multilingual sales, support, or CS calls and you're on an eligible Workspace plan, sign up for the Meet private preview now. Keep the January GA five-language system as the reliable fallback until broader rollout.
Embed via RTC partners
For a full CX voice stack you don't want tied to Meet, the Live API integrates through Agora, Fishjam, LiveKit, Pipecat, and Vision Agents. This is the path for embedding translation into an existing voice pipeline — but partner SLAs and pricing vary, so scope those before committing.
Wait for Meet GA
If you have no developer and the use case is internal multilingual meetings, the disciplined move is to wait for the broader Meet rollout planned for H2 2026. Don't over-build for a capability that's about to arrive as a standard Meet button.
Our read: most marketing and CX teams should be in one of two camps right now. If you have engineering capacity, prototype a single high-value voice touchpoint on the Live API this quarter — the learning is worth more than the polish. If you don't, sign up for the Meet preview to evaluate, but plan your reliable multilingual coverage around what's GA today. Either way, the voice model is one layer; standing up a production-grade voice agent infrastructure stack needs turn-taking, telephony, observability, and fallback handling around it.
05 — Wiring the API
The translationConfig block, decoded.
For builders, the Live API is refreshingly narrow by design. You configure translation through a translationConfig block inside generationConfig, with two key fields: targetLanguageCode (a BCP-47 tag like "pl", "es", defaulting to "en") and echoTargetLanguage (a boolean controlling whether input already in the target language is echoed back or silenced). Source language detection is automatic.
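As a rough illustration of the shape described above, the sketch below builds the configuration as a plain dictionary. Only `generationConfig`, `translationConfig`, `targetLanguageCode`, `echoTargetLanguage`, and the model ID come from this article; the outer `setup` / `model` envelope and the `build_translation_setup` helper are assumptions made for the example, so check the current Live API reference before relying on the exact message layout.

```python
import json

MODEL_ID = "gemini-3.5-live-translate-preview"  # model ID as given in the article


def build_translation_setup(target_language_code="en", *, echo_target_language):
    """Build a payload with a translationConfig block inside generationConfig.

    ASSUMPTION: the top-level "setup" / "model" envelope is illustrative only;
    the translationConfig field names are the ones named in the article.
    """
    if not isinstance(target_language_code, str) or not target_language_code.strip():
        raise ValueError("target_language_code must be a non-empty BCP-47 tag, e.g. 'es'")
    if not isinstance(echo_target_language, bool):
        raise TypeError("echo_target_language must be a bool")
    return {
        "setup": {
            "model": MODEL_ID,
            "generationConfig": {
                "translationConfig": {
                    "targetLanguageCode": target_language_code.strip(),
                    "echoTargetLanguage": echo_target_language,
                }
            },
        }
    }


payload = build_translation_setup("pl", echo_target_language=True)
print(json.dumps(payload, indent=2))
```

`echo_target_language` is deliberately a required keyword argument here: the article does not state the API's default for that field, so the sketch makes the caller choose rather than guess one.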
The audio specs are exact and worth getting right before you debug a silent stream: input is raw 16-bit PCM at 16 kHz, mono, little-endian; output is raw 16-bit PCM at 24 kHz, mono, little-endian; and the recommended chunk size is 100 ms. Optional inputAudioTranscription and outputAudioTranscription flags return text transcripts alongside the translated audio — useful for accessibility, logging, and compliance trails.
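Those specs translate directly into byte counts. The sketch below, which uses only the Python standard library, slices raw input PCM into 100 ms chunks and wraps raw output PCM in a WAV container for playback. The helper names are hypothetical, and how chunks are actually sent to or received from the Live API is deliberately left out.

```python
import io
import wave

BYTES_PER_SAMPLE = 2     # 16-bit PCM
INPUT_RATE_HZ = 16_000   # input sample rate stated in the article
OUTPUT_RATE_HZ = 24_000  # output sample rate stated in the article
CHUNK_MS = 100           # recommended chunk duration


def chunk_size_bytes(sample_rate_hz, chunk_ms=CHUNK_MS):
    """Bytes in one mono 16-bit chunk of the given duration."""
    return sample_rate_hz * chunk_ms // 1000 * BYTES_PER_SAMPLE


def iter_chunks(pcm, sample_rate_hz=INPUT_RATE_HZ, chunk_ms=CHUNK_MS):
    """Yield successive chunks of raw PCM; the final chunk may be shorter."""
    if len(pcm) % BYTES_PER_SAMPLE:
        raise ValueError("PCM byte length must be a multiple of 2 for 16-bit samples")
    size = chunk_size_bytes(sample_rate_hz, chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def pcm_to_wav_bytes(pcm, sample_rate_hz=OUTPUT_RATE_HZ):
    """Wrap headerless mono 16-bit PCM in a WAV container so ordinary players can open it."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(BYTES_PER_SAMPLE)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm)
    return buffer.getvalue()


one_second_of_silence = bytes(INPUT_RATE_HZ * BYTES_PER_SAMPLE)
chunks = list(iter_chunks(one_second_of_silence))
print(len(chunks), len(chunks[0]), chunk_size_bytes(OUTPUT_RATE_HZ))
```

At 16 kHz a 100 ms chunk is 1,600 samples, or 3,200 bytes; at 24 kHz the same duration is 4,800 bytes. Mixing up the two rates is an easy way to get audio that plays at the wrong speed, which is why the sketch keeps them as separate named constants.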
Input: raw 16-bit PCM, 16 kHz, mono
Little-endian, recommended 100 ms chunks. Text input is explicitly unsupported in translation mode — the model accepts audio only, to hold its real-time latency targets.
Output: raw 16-bit PCM, 24 kHz, mono
Little-endian translated speech. Optional inputAudioTranscription and outputAudioTranscription flags add text transcripts for accessibility, logging, and compliance.
131K input · 65K output
Input limit 131,072, output 65,536 per the model card. Function calling, system instructions, and Search grounding are unavailable in translation mode — it is a single-purpose translation surface.
v1alpha endpoint. Tokens can lock the translation configuration by default, which prevents end users from tampering with the target language or other settings — the right pattern for a kiosk, event device, or public-facing app where you don't control the client.
Two limitations to plan around. First, translation mode strips the features you might reflexively reach for — no function calling, no system instructions, no Search grounding — because the model is optimised for one job. If your agent needs to do things as well as translate, you'll orchestrate translation as one node in a larger pipeline, not as the whole agent. Second, Google's own docs flag three preview-stage limitations: voice replication can drift across long pauses, language detection can struggle with heavy accents and similar language pairs, and background-audio filtering is available but incomplete. Build your pilot around clean-audio, single-speaker scenarios first.
06 — SynthID & Compliance
The watermark that maps onto the EU AI Act.
Here's the angle nobody else is connecting. Every audio output Gemini 3.5 Live Translate generates is watermarked with Google's SynthID — an imperceptible marker embedded directly into the audio waveform to flag the content as AI-generated. On its own, that's a responsible-AI footnote. In the context of the EU AI Act, it's a procurement differentiator.
The EU AI Act's transparency obligation requiring machine-readable labels on AI-generated content (Article 50) becomes enforceable on August 2, 2026. Enterprises deploying AI voice translation into European customer interactions will need a story for how generated audio is labelled. SynthID watermarking on every output means the model arrives ahead of that deadline by default — which is exactly the kind of vendor-risk question enterprise legal and procurement teams ask, and which most launch coverage skipped entirely.
07 — The Field
DeepL, the scale story, and what's real.
Google isn't alone in voice. DeepL — long known for text translation — launched DeepL Voice on April 16, 2026, with real-time voice-to-voice translation across 40+ languages targeting enterprise meetings and customer service. The important architectural distinction: DeepL Voice currently runs a cascaded STT → translate → TTS pipeline, not an end-to-end audio model. DeepL's roadmap mentions an end-to-end model in development, but as of this writing it has not shipped. So the head-to-head isn't "two end-to-end models" — it's Google's audio-to-audio approach versus DeepL's cascaded one, with DeepL's historically strong translation quality as its differentiator.
For scale context, Google Translate as a whole processes roughly one trillion words per month across Translate, Search, Lens, and Circle to Search, serving over a billion monthly users across nearly 250 languages and covering about 95% of the world's population. Notably, more than a third of Google Translate's live-translate sessions last longer than five minutes — a signal that people use it for real conversations, not just quick lookups. That's the install base Gemini 3.5 Live Translate is being pushed into.
"over a trillion words being translated for billions of users across our products every month" — Google (Translate product team), 20th anniversary blog
One real-world deployment worth naming carefully: per a Google partner announcement, the Southeast Asian ride-hailing platform Grab is testing Gemini 3.5 Live Translate for real-time driver-passenger communication at pickups, on a platform Google describes as carrying over 10 million voice calls per month. That figure comes from Google's partner announcement rather than Grab's own filings, so treat it as indicative of scale rather than an audited number. The pattern it illustrates, though, is the one CX teams should internalise: a language mismatch at the moment of a pickup is the kind of silent friction that quietly fails interactions no dashboard flags.
08 — The CX Playbook
Where language mismatch silently costs you.
Strip away the model specs and the question for a CX leader is simple: where in your customer journey does a language mismatch cause churn that never shows up as a complaint? Unanswered support calls. Abandoned chats. Tickets that get opened, half-understood, and quietly closed. The business case for multilingual CX has always leaned on the intuition that people prefer to transact in their own language — an idea long popularised in localisation research — and real-time voice translation is the first technology that lets a small team serve that preference live, without a multilingual headcount.
The strategic move isn't "turn on translation everywhere." It's to map the two or three touchpoints where language is currently a silent failure point, pilot one of them on the access path that matches your resources, and instrument it so you can actually measure the difference — resolution rate, call completion, conversion on previously-stalled segments. Real-time voice is one layer of a multilingual CX strategy; the web layer still needs your multilingual SEO and hreflang strategy done right in parallel, or you'll translate the conversation and lose the customer before they ever start it.
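Instrumenting a pilot can start very simply. The sketch below compares resolution rate between a baseline segment and a pilot segment; the counts are hypothetical placeholders rather than real results, and a production analysis would also want a significance test and comparable segments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentStats:
    calls: int
    resolved: int

    @property
    def resolution_rate(self):
        if self.calls == 0:
            raise ValueError("cannot compute a rate with zero calls")
        return self.resolved / self.calls


def absolute_lift_points(baseline, pilot):
    """Percentage-point change in resolution rate, pilot minus baseline."""
    return (pilot.resolution_rate - baseline.resolution_rate) * 100


# Hypothetical counts for illustration only; substitute your own call logs.
baseline = SegmentStats(calls=400, resolved=220)
pilot = SegmentStats(calls=400, resolved=260)

lift = absolute_lift_points(baseline, pilot)
print(f"baseline {baseline.resolution_rate:.1%}, pilot {pilot.resolution_rate:.1%}, "
      f"lift {lift:.1f} points")
```

The same structure works for call completion or conversion: swap the numerator and keep the comparison between a translated segment and an otherwise similar untranslated one.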
09 — Conclusion
The capability is here; the discipline is the differentiator.
The model is the easy part — knowing which door to walk through is the strategy.
Gemini 3.5 Live Translate is a genuine step change in real-time translation, and the reason is architectural, not numeric: removing the cascaded transcribe-translate-synthesize pipeline in favour of a single audio-to-audio model is what unlocks 70+ languages with automatic detection, 2,000+ combinations in a single Meet call, and a translated voice that — by Google's account — keeps the speaker's pacing.
The honest framing matters. The latency, voice-preservation, and partner-scale claims are Google-stated and not yet independently benchmarked, and the model is in preview — no production SLA, no stable pricing, no GA. That's not a reason to wait; it's a reason to pilot rather than bet the contact centre. The Live API is live for builders today, the Meet private preview is open for enterprise evaluation, and the reliable GA fallback is still the five-language system from January.
The differentiator for CX teams won't be access to the model — everyone gets that. It'll be the discipline of mapping where language mismatch silently costs you, picking the access path that matches your engineering reality, layering it onto the rest of your multilingual footprint, and measuring the result. The capability has arrived faster than most roadmaps planned for. The strategy is still yours to get right.