Marketing · New Release · 12 min read · Published Jun 10, 2026

Streaming speech-to-speech · 70+ languages · 2,000+ Meet combinations

Gemini 3.5 Live Translate: 70+ Languages, real-time multilingual CX

Google launched Gemini 3.5 Live Translate on June 9, 2026: a streaming speech-to-speech model that translates across 70+ languages with automatic detection. Google Meet leaps from 5 language pairs, each tied to English, to 2,000+ combinations. The Live API is in public preview today; Meet is in private preview for select Workspace customers.

Digital Applied Team
Senior strategists · Published Jun 10, 2026
Sources: Google + 9to5Google, MarkTechPost
Languages (auto-detect): 70+ · single audio-to-audio model
Meet combinations (new): 2,000+ · up from 5 English-only pairs · 400× vs old
Pipeline hops removed: 3 → 0 · no intermediate text step
Live API status: Preview · Live API public preview, Meet private preview

Gemini 3.5 Live Translate is Google's real-time multilingual CX engine — a streaming speech-to-speech model that, per Google, listens in one language and speaks back in another across 70+ languages with automatic detection, staying only a few seconds behind the speaker. It launched on June 9, 2026 across Google Translate, the Gemini Live API, Google AI Studio, and Google Meet.

The headline most coverage led with (70+ languages, "a few seconds" of lag) is the least interesting part of the story. The real shift is architectural: Google dropped the cascaded transcribe-then-translate-then-synthesize pipeline that powered Meet's earlier translation in favour of a single audio-to-audio model. That one change is why Google Meet jumps from 5 English-only language pairs to 2,000+ combinations in a single meeting, and why the translated voice can, by Google's account, track the speaker's own pacing instead of sounding flat.

This guide covers what actually shipped and on which surfaces, the architecture gap in plain English, a build-vs-buy decision matrix for multilingual CX teams, how to wire the public-preview Live API, the SynthID watermarking angle that maps directly onto the EU AI Act, and where the field — including DeepL Voice — sits today. Every number below is sourced to a primary or named-secondary citation, and the vendor-stated claims are labelled as such.

Key takeaways
  1. A single audio-to-audio model, four surfaces, one day. Gemini 3.5 Live Translate launched June 9, 2026 across Google Translate (iOS/Android), the Gemini Live API (public preview), Google AI Studio (public preview), and Google Meet (private preview for select Workspace customers).
  2. 70+ languages with automatic detection. No manual language configuration: the model detects the spoken language on the fly and supports translation across 70+ languages, a meaningful step up from turn-based systems that required a pre-set pair.
  3. Google Meet jumps from 5 languages to 2,000+ combinations. Meet's prior speech translation (GA January 27, 2026) supported 5 English-only pairs. The new system supports 2,000+ language combinations in a single meeting by removing the intermediate text step.
  4. The Live API is live for builders today; Meet is private preview. Any team with a developer can stand up a multilingual voice agent now via the public-preview Live API. Enterprise marketers can pursue the Meet private preview, with broader rollout planned for the second half of 2026.
  5. SynthID watermarking lines up with the EU AI Act. All generated audio is watermarked with Google's SynthID. The EU AI Act's transparency obligation on AI-generated content becomes enforceable August 2, 2026, a regulatory-readiness signal for enterprise procurement and legal teams.

01 · What Launched · One model, four surfaces, in preview.

On June 9, 2026, Google released Gemini 3.5 Live Translate — model ID gemini-3.5-live-translate-preview — and rolled it onto four surfaces at once. The consumer Google Translate apps on iOS and Android picked up a live mode; the Gemini Live API and Google AI Studio opened it in public preview for developers; and Google Meet got it in private preview for select business Workspace customers, with broader enterprise rollout planned for the second half of 2026.

The model accepts spoken audio and returns spoken audio in the target language, detecting the source language automatically across 70+ languages. It is, importantly, a preview release: Google has not committed to production-grade SLAs, stable pricing, or general availability. Treat it as a strong signal of where multilingual CX is heading and a viable surface to pilot — not a finished product to build a mission-critical SLA on.

Consumer · live now
Google Translate
iOS / Android · headphones or Listening Mode

Tap 'Live translate' in the bottom-left corner with headphones connected. A new Android 'Listening Mode' lets users hold the phone to the ear like a call — no Pixel Buds required, removing the earbud dependency the earlier feature carried.

Any connected headphones
Builder · public preview
Gemini Live API
gemini-3.5-live-translate-preview · AI Studio

Audio-only, real-time translation via a translationConfig block. Available today in public preview through the Live API and Google AI Studio. Integration partners include Agora, Fishjam, LiveKit, Pipecat, and Vision Agents.

ai.google.dev/gemini-api
Enterprise · private preview
Google Meet
select Workspace customers · broader rollout planned H2 2026

Speech translation via a new button in the Meet control row. Currently private preview for select business Workspace customers; broader enterprise rollout is planned for the second half of 2026.

Private preview now
Launch snapshot
Gemini 3.5 Live Translate launched June 9, 2026 across Google Translate, the Gemini Live API and AI Studio (public preview), and Google Meet (private preview). The model card, capability matrix, and safety documentation are public on the Google AI for Developers model page. Per the docs, input and output token limits are 131,072 and 65,536 respectively, and the model is described as a "low-latency, audio-to-audio model optimized for real-time translation of spoken conversations."

02 · The Architecture Gap · Why audio-to-audio beats the three-hop pipeline.

Most coverage says "continuous translation" and moves on. The detail that actually matters is what got removed. Older real-time translation — including Meet's previous system — ran a cascaded three-step pipeline: transcribe the speech to text (STT), translate that text, then synthesize the translated text back to speech (TTS). Each hop adds latency and a place for errors to compound — a mistranscription becomes a mistranslation becomes a confidently wrong spoken sentence.
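To make the compounding argument concrete, here is a minimal sketch under stated assumptions: hop errors are treated as independent, hops run strictly one after another, and the per-hop accuracy and latency figures are hypothetical placeholders rather than measurements of Meet or any Google system.

```python
from math import prod

# Hypothetical per-hop figures for a cascaded pipeline (illustrative only,
# not measured values for any real system).
HOPS = {
    "stt":       {"accuracy": 0.95, "latency_ms": 300},
    "translate": {"accuracy": 0.95, "latency_ms": 200},
    "tts":       {"accuracy": 0.99, "latency_ms": 250},
}

def cascade_accuracy(hops):
    """Chance that every hop is right, assuming independent errors."""
    return prod(h["accuracy"] for h in hops.values())

def cascade_latency_ms(hops):
    """Latency adds up when each hop waits for the previous one."""
    return sum(h["latency_ms"] for h in hops.values())

end_to_end_accuracy = cascade_accuracy(HOPS)
end_to_end_latency = cascade_latency_ms(HOPS)
print(f"accuracy {end_to_end_accuracy:.3f}, latency {end_to_end_latency} ms")
# prints "accuracy 0.893, latency 750 ms"
```

Even with each hop at 95% or better, the chain as a whole lands below its weakest hop, which is the structural cost a single-model design is meant to avoid.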

Gemini 3.5 Live Translate replaces all three hops with a single audio-to-audio model: speech goes in, speech comes out, with no intermediate text representation. Google describes the result as streaming continuously rather than waiting for sentence boundaries — which is what lets the system, in Google's words, stay "just a few seconds behind the speaker throughout the session." There is no independent latency benchmark for that claim as of this writing, so treat the "few seconds" figure as Google-stated.

What changed: Meet translation, old pipeline vs Gemini 3.5 Live Translate

Source: Google Blog (Jun 9, 2026) + Google Workspace Updates (Jan 27, 2026 GA). Latency and language figures are vendor-stated.
| Metric | Change | Detail |
| --- | --- | --- |
| Pipeline hops | 3 → 0 | Cascaded STT → translate → TTS vs single model |
| Languages (Meet) | 5 → 70+ | 5 English-only pairs → 70+ with auto-detect |
| Combinations per meeting (Meet) | 5 → 2,000+ | 5 → 2,000+ language combinations |
| Latency (Google-stated) | few sec | Turn-based wait → continuous, 'a few seconds' behind |

The second-order effect is the one CX teams should care about. When translation no longer routes through a text bottleneck, the translated audio can — per Google — preserve the speaker's intonation, pacing, and pitch rather than producing flat synthesized output. That matters because tone carries meaning: a reassuring support agent and a curt one say the same words. We haven't seen an independent A/B listening test of that claim, so it remains Google-stated — but the architectural reason it's plausible (no TTS voice substitution step) is sound.

"balancing the trade-off between waiting for context to improve quality and translating immediately to stay in sync with the speaker"— Anuda Weerasinghe and Tony Lu, Google (via Technobezz)

That quote from Google's product and engineering leads is the whole game in one sentence. Every real-time translation system lives on a spectrum between waiting (more context, better quality, more lag) and translating immediately (lower lag, riskier on ambiguous phrasing). The continuous-stream architecture is Google's bet that you can sit closer to the immediate end without the quality cliff that turn-based systems hit at every sentence boundary. The consumer side of this evolution is worth watching alongside Google Translate's live headphone mode, which this release extends.

03 · Google Meet · From 5 languages to 2,000+ combinations, in 133 days.

Here is the speed story. Google Meet's speech translation feature reached general availability on January 27, 2026, supporting exactly five language pairs, all English-only: English to and from Spanish, French, German, Portuguese, and Italian, on Workspace Business Standard/Plus and Enterprise Standard/Plus plans. That was the state of the art for Meet roughly four months ago.

On June 9, 2026, that system was effectively obsoleted: 70+ languages, 2,000+ combinations in a single meeting, automatic detection. From GA to superseded in 133 days. For enterprise marketers and CX leaders, the lesson isn't the specific numbers; it's the tempo. Translation capability that you'd have scoped as a year-long roadmap item in 2024 is now shipping and re-shipping in under five months. The table below makes the jump concrete.

Google Meet translation: the January 2026 GA system compared with Gemini 3.5 Live Translate, June 2026.
| Dimension | Meet · GA (Jan 2026) | Meet · Gemini 3.5 Live Translate (Jun 2026) |
| --- | --- | --- |
| Architecture | Cascaded: STT → text translate → TTS (3 hops) | Single audio-to-audio model (0 intermediate text hops) |
| Languages supported | 5 | 70+ (automatic detection) |
| Combinations per meeting | 5 (English-only pairs) | 2,000+ |
| Latency mode | Turn-based (waits for sentence end) | Continuous streaming · "a few seconds" behind (Google-stated) |
| Voice preservation | Limited (synthesized voice, not speaker-matched) | Preserves intonation, pacing, pitch (Google-stated) |
| Rollout status | GA · Business Std/Plus, Enterprise Std/Plus | Private preview → H2 2026 broader rollout planned |

One caveat worth setting for stakeholders: the January GA system is generally available across eligible plans today, while the new model is private-preview only. If you're planning multilingual all-hands or customer calls in the next quarter, the five-language GA system is what you can actually rely on right now; the 2,000+ combination experience is a pilot you sign up for, not a switch you flip.
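The headline figures in this section can be checked with a few lines of arithmetic. The sketch below is a sanity check, not Google's counting method: it assumes exactly 70 languages (the stated floor of "70+") and shows that either way of counting pairs clears the 2,000+ figure. How Google actually counts a "combination" is not stated in the source.

```python
from datetime import date
from math import comb

# Dates as given in the article.
ga_date = date(2026, 1, 27)      # Meet speech translation GA
launch_date = date(2026, 6, 9)   # Gemini 3.5 Live Translate launch
days_between = (launch_date - ga_date).days

# Assumption: exactly 70 languages, the lower bound of "70+".
languages = 70
unordered_pairs = comb(languages, 2)         # A<->B counted once
ordered_pairs = languages * (languages - 1)  # A->B and B->A counted separately

# The "400x" stat card: 2,000 combinations against the old 5 pairs.
multiple_vs_old = 2000 // 5

print(days_between, unordered_pairs, ordered_pairs, multiple_vs_old)
# prints "133 2415 4830 400"
```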

04 · Build vs Buy · Four access paths, one decision matrix.

The most useful framing for a CX or marketing leader isn't "is this good?" — it's "which door do I walk through, and when?" There are four distinct ways to put Gemini 3.5 Live Translate to work, and they sort cleanly along two axes: how much engineering you have, and whether your channel is a custom app, a consumer touchpoint, or video meetings. The matrix below maps each path to a realistic timeline and its key constraint.

Build-vs-buy decision matrix for deploying Gemini 3.5 Live Translate across four access paths, with timeline and key constraint for each.
| Buyer type | Channel | Access path | Timeline | Key constraint |
| --- | --- | --- | --- | --- |
| Developer / startup | Custom voice app | Gemini Live API public preview | Today | Must handle a raw PCM audio pipeline |
| Developer / startup | Consumer touchpoint (kiosk, event) | Translate Android "Listening Mode" | Today | Single-device, in-person only |
| Enterprise marketer | Video meetings (sales, support, CS) | Google Meet private preview | June 2026 sign-up | Workspace eligibility required |
| Enterprise marketer | Full CX voice stack | Via Agora / LiveKit / Pipecat | Today (partner platforms) | Partner SLA / pricing varies |
| Non-developer team | Internal multilingual meetings | Google Meet → H2 2026 broader rollout | H2 2026 (wait) | No dev required; standard Meet |

Have a developer, need it now
Custom voice agent via the Live API

If you have engineering capacity and a contact-centre, webinar, or event use case, the public-preview Live API is the move — you can stand up a multilingual voice agent today. The trade-off is owning the raw audio pipeline and accepting preview-grade stability.

Build on the Live API
Enterprise, video-meeting use case
Pilot Google Meet

If your need is multilingual sales, support, or CS calls and you're on an eligible Workspace plan, sign up for the Meet private preview now. Keep the January GA five-language system as the reliable fallback until broader rollout.

Sign up for Meet preview
Production CX stack, no Google lock-in
Embed via RTC partners

For a full CX voice stack you don't want tied to Meet, the Live API integrates through Agora, Fishjam, LiveKit, Pipecat, and Vision Agents. This is the path for embedding translation into an existing voice pipeline — but partner SLAs and pricing vary, so scope those before committing.

Integrate via a partner
No engineering, internal-only need
Wait for the broader Meet rollout

If you have no developer and the use case is internal multilingual meetings, the disciplined move is to wait for the broader Meet rollout planned for H2 2026. Don't over-build for a capability that's about to arrive as a standard Meet button.

Wait for H2 2026

Our read: most marketing and CX teams should be in one of two camps right now. If you have engineering capacity, prototype a single high-value voice touchpoint on the Live API this quarter — the learning is worth more than the polish. If you don't, sign up for the Meet preview to evaluate, but plan your reliable multilingual coverage around what's GA today. Either way, the voice model is one layer; standing up a production-grade voice agent infrastructure stack needs turn-taking, telephony, observability, and fallback handling around it.

05 · Wiring the API · The translationConfig block, decoded.

For builders, the Live API is refreshingly narrow by design. You configure translation through a translationConfig block inside generationConfig, with two key fields: targetLanguageCode (a BCP-47 tag such as "pl" or "es", defaulting to "en") and echoTargetLanguage (a boolean controlling whether input already in the target language is echoed back or silenced). Source language detection is automatic.
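As a rough illustration of that block, the sketch below assembles a session setup payload as a plain Python dict. Only the field names translationConfig, generationConfig, targetLanguageCode, and echoTargetLanguage, plus the model ID, come from this article. The outer "setup" envelope, the "models/" prefix, and the build_setup_message helper are assumptions made for illustration, so check the current Live API reference before relying on the exact shape.

```python
import json

MODEL_ID = "gemini-3.5-live-translate-preview"  # model ID as given in the article

def build_setup_message(target_language_code="en", echo_target_language=False):
    """Hypothetical helper: build a setup payload for a translation session.

    The envelope ("setup", "models/" prefix) is an assumed wire shape.
    """
    return {
        "setup": {
            "model": f"models/{MODEL_ID}",
            "generationConfig": {
                "translationConfig": {
                    # BCP-47 tag of the language to speak back.
                    "targetLanguageCode": target_language_code,
                    # Whether speech already in the target language is echoed.
                    "echoTargetLanguage": echo_target_language,
                }
            },
        }
    }

setup_message = build_setup_message("es")
print(json.dumps(setup_message, indent=2))
```

Note there is no source-language field in the sketch: per the article, detection is automatic.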

The audio specs are exact and worth getting right before you debug a silent stream: input is raw 16-bit PCM at 16 kHz, mono, little-endian; output is raw 16-bit PCM at 24 kHz, mono, little-endian; and the recommended chunk size is 100 ms. Optional inputAudioTranscription and outputAudioTranscription flags return text transcripts alongside the translated audio — useful for accessibility, logging, and compliance trails.
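Those numbers translate directly into buffer sizes. The sketch below derives the byte count of a 100 ms chunk at each sample rate and slices a raw PCM buffer accordingly. The helper names are hypothetical, and how chunks are actually sent over the Live API session is deliberately left out.

```python
INPUT_RATE_HZ = 16_000    # input: 16-bit PCM, mono, little-endian
OUTPUT_RATE_HZ = 24_000   # output: 16-bit PCM, mono, little-endian
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHUNK_MS = 100            # recommended chunk size

def chunk_bytes(rate_hz, chunk_ms=CHUNK_MS):
    """Bytes in one mono 16-bit chunk of the given duration."""
    return rate_hz * BYTES_PER_SAMPLE * chunk_ms // 1000

def iter_chunks(pcm: bytes, size: int):
    """Yield consecutive slices of raw PCM; the last one may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]

input_chunk = chunk_bytes(INPUT_RATE_HZ)    # 3200 bytes per 100 ms
output_chunk = chunk_bytes(OUTPUT_RATE_HZ)  # 4800 bytes per 100 ms

# One second of silence at the input format, split into 100 ms chunks.
one_second = bytes(INPUT_RATE_HZ * BYTES_PER_SAMPLE)
chunks = list(iter_chunks(one_second, input_chunk))
print(input_chunk, output_chunk, len(chunks))
# prints "3200 4800 10"
```

A mismatch between these sizes and what the capture device actually delivers (for example 44.1 kHz stereo) is a common cause of the silent stream mentioned above, so resample before chunking.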

Audio input
Raw 16-bit PCM, mono · 16 kHz

Little-endian, recommended 100 ms chunks. Text input is explicitly unsupported in translation mode — the model accepts audio only, to hold its real-time latency targets.

Audio-only in
Audio output
Raw 16-bit PCM, mono · 24 kHz

Little-endian translated speech. Optional inputAudioTranscription and outputAudioTranscription flags add text transcripts for accessibility, logging, and compliance.

Transcripts optional
Token limits
131K input · 65K output

Input limit 131,072, output 65,536 per the model card. Function calling, system instructions, and Search grounding are unavailable in translation mode — it is a single-purpose translation surface.

Translation mode only
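Because the output described above is raw PCM with no header, most players will not open it directly. One way to inspect translated audio during a pilot is to wrap it in a WAV container using Python's standard wave module. This is a debugging aid under the stated output format (16-bit, mono, 24 kHz), and pcm_to_wav_bytes is a hypothetical helper name.

```python
import io
import wave

def pcm_to_wav_bytes(pcm: bytes, rate_hz: int = 24_000) -> bytes:
    """Wrap raw 16-bit mono little-endian PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(rate_hz)
        wav.writeframes(pcm)
    return buf.getvalue()

# Half a second of silence standing in for translated audio.
half_second = bytes(24_000 * 2 // 2)
wav_bytes = pcm_to_wav_bytes(half_second)

# Read it back to confirm the header matches the expected format.
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    params = (check.getnchannels(), check.getsampwidth(),
              check.getframerate(), check.getnframes())
print(params)
# prints "(1, 2, 24000, 12000)"
```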
Security note for client-side apps
For client-side applications, the Live API supports ephemeral token authentication on the v1alpha endpoint. Tokens can lock the translation configuration by default, which prevents end users from tampering with the target language or other settings — the right pattern for a kiosk, event device, or public-facing app where you don't control the client.

Two limitations to plan around. First, translation mode strips the features you might reflexively reach for — no function calling, no system instructions, no Search grounding — because the model is optimised for one job. If your agent needs to do things as well as translate, you'll orchestrate translation as one node in a larger pipeline, not as the whole agent. Second, Google's own docs flag three preview-stage limitations: voice replication can drift across long pauses, language detection can struggle with heavy accents and similar language pairs, and background-audio filtering is available but incomplete. Build your pilot around clean-audio, single-speaker scenarios first.
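One way to avoid discovering the first limitation at runtime is a small preflight check on whatever session config your code assembles. The sketch below is a local guard, not part of any Google SDK. The key names "tools" and "systemInstruction" are assumptions about how those features would appear in a config dict, and preflight is a hypothetical helper.

```python
# Assumed config keys for features the article says are unavailable
# in translation mode. Adjust to match your actual config shape.
UNSUPPORTED_IN_TRANSLATION_MODE = {
    "tools": "function calling and Search grounding",
    "systemInstruction": "system instructions",
}

def preflight(setup: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks clean."""
    problems = []
    for key, feature in UNSUPPORTED_IN_TRANSLATION_MODE.items():
        if setup.get(key):
            problems.append(f"{key}: {feature} unavailable in translation mode")
    return problems

clean = {"model": "models/gemini-3.5-live-translate-preview"}
risky = {**clean, "tools": [{"googleSearch": {}}], "systemInstruction": {"parts": ["Be brief"]}}

print(preflight(clean))
print(len(preflight(risky)))
```

If the agent genuinely needs tools or instructions, that logic lives in a separate node of the pipeline, with translation called as its own step.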

06 · SynthID & Compliance · The watermark that maps onto the EU AI Act.

Here's the angle nobody else is connecting. Every audio output Gemini 3.5 Live Translate generates is watermarked with Google's SynthID — an imperceptible marker embedded directly into the audio waveform to flag the content as AI-generated. On its own, that's a responsible-AI footnote. In the context of the EU AI Act, it's a procurement differentiator.

The EU AI Act's transparency obligation requiring machine-readable labels on AI-generated content (Article 50) becomes enforceable on August 2, 2026. Enterprises deploying AI voice translation into European customer interactions will need a story for how generated audio is labelled. SynthID watermarking on every output means the model arrives ahead of that deadline by default — which is exactly the kind of vendor-risk question enterprise legal and procurement teams ask, and which most launch coverage skipped entirely.

For legal & procurement teams
The relevant EU AI Act provision for AI-content labelling is Article 50 (transparency), enforceable August 2, 2026 — not Article 73, which covers a different obligation. Google's SynthID audio watermarking is described as embedded directly into the waveform on every generated output. Verify how this maps to your specific compliance obligations with counsel before relying on it as a control; vendor watermarking is one input to a labelling strategy, not the entirety of one.

07 · The Field · DeepL, the scale story, and what's real.

Google isn't alone in voice. DeepL — long known for text translation — launched DeepL Voice on April 16, 2026, with real-time voice-to-voice translation across 40+ languages targeting enterprise meetings and customer service. The important architectural distinction: DeepL Voice currently runs a cascaded STT → translate → TTS pipeline, not an end-to-end audio model. DeepL's roadmap mentions an end-to-end model in development, but as of this writing it has not shipped. So the head-to-head isn't "two end-to-end models" — it's Google's audio-to-audio approach versus DeepL's cascaded one, with DeepL's historically strong translation quality as its differentiator.

For scale context, Google Translate as a whole processes roughly one trillion words per month across Translate, Search, Lens, and Circle to Search, serving over a billion monthly users across nearly 250 languages and covering about 95% of the world's population. Notably, more than a third of Google Translate's live-translate sessions last longer than five minutes — a signal that people use it for real conversations, not just quick lookups. That's the install base Gemini 3.5 Live Translate is being pushed into.

"over a trillion words being translated for billions of users across our products every month"— Google (Translate product team), 20th anniversary blog

One real-world deployment worth naming carefully: per a Google partner announcement, the Southeast Asian ride-hailing platform Grab is testing Gemini 3.5 Live Translate for real-time driver-passenger communication at pickups, on a platform Google describes as carrying over 10 million voice calls per month. That figure comes from Google's partner announcement rather than Grab's own filings, so treat it as indicative of scale rather than an audited number. The pattern it illustrates, though, is the one CX teams should internalise: a language mismatch at the moment of a pickup is the kind of silent friction that quietly fails interactions no dashboard flags.

08 · The CX Playbook · Where language mismatch silently costs you.

Strip away the model specs and the question for a CX leader is simple: where in your customer journey does a language mismatch cause churn that never shows up as a complaint? Unanswered support calls. Abandoned chats. Tickets that get opened, half-understood, and quietly closed. The business case for multilingual CX has always leaned on the intuition that people prefer to transact in their own language — an idea long popularised in localisation research — and real-time voice translation is the first technology that lets a small team serve that preference live, without a multilingual headcount.

The strategic move isn't "turn on translation everywhere." It's to map the two or three touchpoints where language is currently a silent failure point, pilot one of them on the access path that matches your resources, and instrument it so you can actually measure the difference — resolution rate, call completion, conversion on previously-stalled segments. Real-time voice is one layer of a multilingual CX strategy; the web layer still needs your multilingual SEO and hreflang strategy done right in parallel, or you'll translate the conversation and lose the customer before they ever start it.
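Measuring the difference can start very simply. The sketch below compares a baseline period with a pilot period on one metric, resolution rate, and reports the relative lift. All counts are hypothetical placeholders, and a real evaluation would also need comparable segments and enough volume for the difference to be meaningful.

```python
def rate(successes: int, total: int) -> float:
    """Share of interactions that succeeded; 0.0 when there is no volume."""
    return successes / total if total else 0.0

def relative_lift(baseline: float, pilot: float) -> float:
    """Relative change of the pilot rate against the baseline rate."""
    if baseline == 0:
        raise ValueError("baseline rate is zero; lift is undefined")
    return (pilot - baseline) / baseline

# Hypothetical counts for one previously stalled language segment.
baseline = {"resolved": 120, "calls": 400}
pilot = {"resolved": 150, "calls": 400}

baseline_rate = rate(baseline["resolved"], baseline["calls"])
pilot_rate = rate(pilot["resolved"], pilot["calls"])
lift = relative_lift(baseline_rate, pilot_rate)

print(f"baseline {baseline_rate:.1%}, pilot {pilot_rate:.1%}, lift {lift:.1%}")
```

The same two functions work for call completion or conversion; the point is to fix the metric and the segment before the pilot starts.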

The DA angle
The teams that win here won't be the ones who adopt the flashiest demo — they'll be the ones who pick the right access path for their resources and pilot a single high-value touchpoint with real measurement. Our AI transformation engagements start with exactly that: mapping where language friction costs you, choosing build-vs-buy honestly, and standing up a measured pilot — delivered in days, not quarters.

09 · Conclusion · The capability is here; the discipline is the differentiator.

Real-time multilingual CX, June 2026

The model is the easy part — knowing which door to walk through is the strategy.

Gemini 3.5 Live Translate is a genuine step change in real-time translation, and the reason is architectural, not numeric: removing the cascaded transcribe-translate-synthesize pipeline in favour of a single audio-to-audio model is what unlocks 70+ languages with automatic detection, 2,000+ combinations in a single Meet call, and a translated voice that — by Google's account — keeps the speaker's pacing.

The honest framing matters. The latency, voice-preservation, and partner-scale claims are Google-stated and not yet independently benchmarked, and the model is in preview — no production SLA, no stable pricing, no GA. That's not a reason to wait; it's a reason to pilot rather than bet the contact centre. The Live API is live for builders today, the Meet private preview is open for enterprise evaluation, and the reliable GA fallback is still the five-language system from January.

The differentiator for CX teams won't be access to the model — everyone gets that. It'll be the discipline of mapping where language mismatch silently costs you, picking the access path that matches your engineering reality, layering it onto the rest of your multilingual footprint, and measuring the result. The capability has arrived faster than most roadmaps planned for. The strategy is still yours to get right.

FAQ · Gemini 3.5 Live Translate

The questions we get every week.

What is Gemini 3.5 Live Translate?

Gemini 3.5 Live Translate is Google's real-time, speech-to-speech translation model (model ID gemini-3.5-live-translate-preview). It listens in one language and speaks back in another across 70+ languages, detecting the source language automatically and staying — per Google — only a few seconds behind the speaker. It launched on June 9, 2026 across four surfaces: the Google Translate apps for iOS and Android, the Gemini Live API (public preview), Google AI Studio (public preview), and Google Meet (private preview for select business Workspace customers). It is a preview release, so Google has not committed to production SLAs, stable pricing, or general availability yet.