Google launched Gemini 3.1 Flash Live on March 26, 2026, and it's not a minor update. The model scores 90.8% on ComplexFuncBench Audio, a roughly 20-point jump over the previous generation, while hitting sub-second time-to-first-token at 960ms. That lands inside the window where AI voice latency still feels conversational, and it's the first time a Google voice model has genuinely challenged ElevenLabs and Cartesia on the metrics that matter for real-time voice agent pipelines.
The voice AI race just got significantly more interesting — and Google, for once, looks like it showed up early.
Rating: 8.4/10 ⭐⭐⭐⭐
What Is Gemini 3.1 Flash Live?
Gemini 3.1 Flash Live is Google’s highest-quality real-time audio and voice AI model, built specifically for natural, low-latency dialogue. Developed by the Gemini team at Google DeepMind, it launched on March 26, 2026 with same-day availability across its consumer, developer, and enterprise surfaces, an unusually broad simultaneous rollout for a Google AI release.
The model is available across three tiers: developers get API access via the Gemini Live API in Google AI Studio, enterprises get it through Gemini Enterprise for Customer Experience, and everyday users experience it embedded inside Gemini Live and Search Live (now available in 200+ countries).
The one-line differentiator: it’s the only voice AI that collapses transcription, reasoning, and synthesis into a single audio-to-audio pipeline at sub-second latency — no separate STT, LLM, and TTS stack required.
The Story: 960ms Latency and a 90.8% Benchmark Nobody Else Is Talking About
Every voice AI review leads with “it sounds natural.” That’s table stakes. Here’s the number that actually matters for anyone building production voice agents: 960 milliseconds time-to-first-token at minimum thinking level.
Human conversational response time averages 200–300ms for simple exchanges, but tolerance for AI voice latency sits closer to 1,000–1,500ms before it starts feeling robotic. At 960ms, Gemini 3.1 Flash Live sits right at the edge of that window — and it’s the first time a frontier AI voice model has gotten there without sacrificing reasoning capability.
The benchmark story is equally compelling. On ComplexFuncBench Audio — a benchmark specifically designed to test multi-step function calling with complex constraints using audio input only — the model scored 90.8%. The previous Gemini Flash generation scored roughly 71-72% on the same benchmark. That’s not an incremental improvement. That’s a generational leap in the model’s ability to hear a multi-step instruction, maintain context, call external tools, and respond — all in a live audio stream.
On Scale AI’s Audio MultiChallenge, which tests complex instruction following and long-horizon reasoning through real-world audio conditions (interruptions, background noise, hesitations), the model scored 36.1% with thinking enabled. For context: this benchmark is specifically designed to be brutal, testing scenarios that break most production voice systems. 36.1% in a field where even top models typically cluster in the 20–30% range is significant.
The practical implication: you can now build a voice agent that hears “Schedule a meeting with John next Tuesday after 2pm, avoid anything that conflicts with the marketing sync, and send him a calendar invite with the Zoom link” — and actually execute it reliably, in real-time, without the agent breaking down on the constraint chain.
Benchmark Performance
| Benchmark | Gemini 3.1 Flash Live | Previous Gemini Flash | What It Tests |
|---|---|---|---|
| ComplexFuncBench Audio | 90.8% | ~71% | Multi-step function calling via audio input |
| Audio MultiChallenge (thinking on) | 36.1% | ~24% | Complex instruction following in noisy real-world audio |
| Time-to-First-Token (min thinking) | ~960ms | ~1,400ms | Conversational latency |
| Time-to-First-Token (max thinking) | ~2,980ms | ~3,500ms | Complex reasoning latency |
| Languages Supported | 90+ | ~70 | Multilingual real-time dialogue |
| Context Window (conversation thread) | 2x previous model | Baseline | Longer coherent conversations |
Sources: Google Blog (March 26, 2026), ComplexFuncBench GitHub, Scale AI Audio MultiChallenge Leaderboard. Previous model scores are approximate based on published comparisons.
Pricing
Gemini 3.1 Flash Live is currently in preview. API pricing via Google AI Studio:
| Input/Output Type | Price per 1M Tokens | Equivalent Per-Minute Rate |
|---|---|---|
| Text Input | $0.75 | — |
| Audio Input | $3.00 | $0.005/min |
| Image/Video Input | $1.00 | $0.002/min |
| Text Output | $4.50 | — |
| Audio Output | $12.00 | $0.018/min |
Source: Google AI Studio Pricing. Prices current as of March 2026, preview pricing subject to change at GA.
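The per-minute equivalents in the table follow directly from the token rates once you assume a fixed audio tokenization rate. A quick sanity check in Python (the tokens-per-second figures here are inferred from the published rates, not official numbers):

```python
# Back out the implied audio tokenization rate from the published prices.
# Assumption: per-minute rate = price per token * tokens per minute.

def implied_tokens_per_minute(price_per_million: float, price_per_minute: float) -> float:
    """Infer tokens/minute from a per-million-token price and a per-minute price."""
    price_per_token = price_per_million / 1_000_000
    return price_per_minute / price_per_token

# Audio input: $3.00 per 1M tokens, quoted at $0.005/min
tokens_in = implied_tokens_per_minute(3.00, 0.005)    # ~1667 tokens/min (~28/sec)

# Audio output: $12.00 per 1M tokens, quoted at $0.018/min
tokens_out = implied_tokens_per_minute(12.00, 0.018)  # 1500 tokens/min (25/sec)

print(round(tokens_in), round(tokens_out))
```

If the implied rates look inconsistent with your own bill, check whether Google changed the tokenization at GA; preview numbers are the only ones available as of this writing.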
How Google’s Pricing Compares
| Provider | Entry API Cost | Real-Time Agent Cost | Best For |
|---|---|---|---|
| Gemini 3.1 Flash Live | $0.005/min audio in | $0.018/min audio out | Real-time conversational agents |
| ElevenLabs (API) | ~$0.015–$0.030/min | ~$0.05–$0.12/min (agent) | High-quality TTS, voice cloning |
| Cartesia | ~$0.005/min (credits) | ~$0.015–$0.025/min | Ultra-low latency TTS |
| OpenAI TTS (gpt-4o-mini) | ~$0.015/min audio out | ~$0.015/min | General TTS with GPT integration |
| PlayAI | ~$0.019/min | Custom/enterprise | Voice cloning + agent workflows |
Competitor pricing sourced from published rate cards as of March 2026. Per-minute equivalents are approximated from character/token rates where exact per-minute pricing wasn’t published.
The headline: Google’s audio output rate of $0.018/min is competitive with OpenAI and cheaper than most ElevenLabs agent tiers at scale. At 10,000 agent-hours per month, the savings over premium ElevenLabs plans run into five figures.
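The five-figure claim is simple arithmetic. A sketch of the comparison, where the ElevenLabs rate is an assumed mid-range agent figure from the comparison table below (real enterprise contracts are negotiated and vary):

```python
# Monthly cost comparison at 10,000 agent-hours, using the per-minute
# rates quoted in this review. The ElevenLabs figure is an assumed
# mid-range agent-tier rate ($0.08/min), not a published contract price.

AGENT_HOURS_PER_MONTH = 10_000
MINUTES = AGENT_HOURS_PER_MONTH * 60  # 600,000 agent-minutes

gemini_cost = MINUTES * 0.018    # Gemini audio output at $0.018/min
elevenlabs_cost = MINUTES * 0.08 # assumed ElevenLabs agent-tier rate

savings = elevenlabs_cost - gemini_cost
print(f"Gemini: ${gemini_cost:,.0f}/mo  ElevenLabs: ${elevenlabs_cost:,.0f}/mo  Savings: ${savings:,.0f}/mo")
```

At these assumed rates the monthly delta is $37,200, comfortably in five figures; even halving the ElevenLabs assumption keeps the savings above $10,000/month.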
One more point in Google's favor: Google AI Studio has a free tier with rate-limited API access, so you can prototype at zero cost. Start here if you want to build with the API; no credit card is required for the free tier.
Key Features
1. Native Audio-to-Audio Pipeline
Most voice AI stacks chain three separate systems: speech-to-text (STT) → large language model (LLM) → text-to-speech (TTS). Each handoff adds latency and degrades naturalness. Gemini 3.1 Flash Live processes audio natively end-to-end — it hears, reasons, and speaks without intermediate text transcription. This is why it hits 960ms latency when a traditional three-stack system might take 2–4 seconds for the same query. The limitation: you lose the intermediate text transcript unless you explicitly request it, which can complicate logging and compliance workflows.
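The latency argument can be made concrete with a rough budget. The per-stage timings below are illustrative assumptions for a typical cloud-hosted three-stack pipeline, not measured values:

```python
# Rough latency budget: chained STT -> LLM -> TTS vs. native audio-to-audio.
# Stage timings are illustrative assumptions, not benchmarks.

chained_stack_ms = {
    "stt_final_transcript": 600,  # STT waits for end-of-utterance + finalization
    "llm_first_token": 1200,      # LLM time-to-first-token on the transcript
    "tts_first_audio": 400,       # TTS synthesis of the first audio frame
    "network_hops": 200,          # extra round trips between three services
}

native_ms = {"audio_to_audio_ttft": 960}  # single model, single hop

print(sum(chained_stack_ms.values()), "ms chained vs",
      sum(native_ms.values()), "ms native")
```

The point isn't the exact numbers; it's that a chained stack pays serialization and network costs at every handoff, while a native pipeline pays them once.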
2. Barge-In (Interruption Handling)
The model supports real-time interruption — a user can cut off the AI mid-sentence and it immediately stops, processes the new input, and responds. Via the WebSocket API, interruption detection happens at the stream level without requiring client-side VAD. This is the feature that separates “voice assistant” quality from “phone tree” quality. The caveat: interrupt sensitivity is tunable but not yet fully configurable in preview — aggressive environments may trigger false positives.
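Even with server-side interruption detection, the client still has to stop playback and flush queued audio the moment an interruption event arrives, or stale output keeps playing over the user. A minimal sketch of that client-side handling (the event names and shapes here are hypothetical placeholders, not the actual Live API wire format):

```python
# Minimal client-side barge-in handler. Event names ("audio",
# "interrupted", "turn_complete") are hypothetical placeholders,
# not the real Live API message format.

class PlaybackState:
    def __init__(self):
        self.buffer: list[bytes] = []  # queued audio chunks awaiting playback
        self.playing = False

    def handle_event(self, event: dict) -> None:
        if event["type"] == "audio":
            self.buffer.append(event["data"])
            self.playing = True
        elif event["type"] == "interrupted":
            # User barged in: drop everything queued and stop immediately.
            self.buffer.clear()
            self.playing = False
        elif event["type"] == "turn_complete":
            self.playing = False

state = PlaybackState()
state.handle_event({"type": "audio", "data": b"\x00" * 320})
state.handle_event({"type": "interrupted"})
print(len(state.buffer), state.playing)  # 0 False
```

The flush-on-interrupt step is the part teams most often miss; without it, barge-in detection on the server still feels broken to the user.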
3. Tonal and Acoustic Intelligence
Gemini 3.1 Flash Live can detect acoustic nuances: pitch changes, pace shifts, frustration markers, hesitation. When deployed in Gemini Enterprise for Customer Experience, the model dynamically adjusts its response tone when it detects user frustration — slowing down, simplifying language, or escalating to a human agent. This goes beyond keyword detection (“I’m upset!”) and works on raw acoustic features. Limitation: the emotional detection is trained on English-dominant data; accuracy degrades in lower-resource languages.
4. Multimodal Simultaneous Input
Unlike standard TTS tools, Gemini 3.1 Flash Live accepts simultaneous audio, video, and text streams in real-time. A developer can pipe a camera feed alongside audio and have the model respond to both — “what am I looking at?” while showing it something. This is the foundation of Google’s “vibe coding” demo, where users voice-code while the model watches their screen. Limitation: video input pricing ($0.002/min) adds cost, and bandwidth requirements for simultaneous streams are non-trivial in constrained network environments.
5. SynthID Watermarking (Built-In)
Every audio output from Gemini 3.1 Flash Live is automatically embedded with SynthID — Google DeepMind’s imperceptible watermark that survives re-recording, compression, and speed changes. This matters for enterprises in regulated industries and for anyone building public-facing voice applications. The watermark enables reliable AI-content detection for compliance and audit purposes. Limitation: SynthID cannot be disabled via the API — if you’re building a product that requires watermark-free audio (e.g., voice cloning for personal use), you’ll need a different tool.
6. Doubled Conversation Memory
Gemini Live now follows the thread of conversation for twice as long as the previous model. For users engaging in extended brainstorming sessions or complex multi-turn problem-solving, this means the model maintains context through longer interactions without losing the thread. Developers working with the API get this expanded context window automatically — no configuration required.
Who Is It For / Who Should Look Elsewhere
Use Gemini 3.1 Flash Live if you:
- Are building real-time voice agents that need to execute multi-step tasks (scheduling, CRM lookups, customer support workflows) via audio input — the 90.8% ComplexFuncBench score makes it the top tool for agentic voice.
- Need multilingual voice AI at scale — 90+ language support with inherent multilinguality makes localization a configuration problem, not an engineering one.
- Are cost-sensitive at scale — $0.018/min audio output undercuts most ElevenLabs enterprise tiers by 40–60% at high volume.
- Are building on Google’s stack — if your infrastructure lives in GCP, the integration into Vertex AI and Gemini Enterprise is seamless.
- Need watermarked/compliant AI audio — SynthID is built-in with no extra configuration for regulated industries.
Look elsewhere if you:
- Need studio-quality voice cloning — Gemini 3.1 Flash Live doesn’t clone voices. ElevenLabs and PlayAI are the standard here.
- Want production-stable API endpoints right now — it’s still in preview, and Google has a documented history of deprecating preview endpoints quickly (see Controversy section).
- Are building standard TTS pipelines (not real-time) — for pre-generated audio at maximum quality, ElevenLabs or Cartesia’s standard TTS APIs have more configuration options and more voice variety.
- Need granular voice customization — speed, pitch, style, and custom voices aren’t yet as configurable as ElevenLabs’ extensive voice design tools.
5-Way Comparison: Gemini 3.1 Flash Live vs. ElevenLabs vs. PlayAI vs. Cartesia vs. OpenAI TTS
| Feature | Gemini 3.1 Flash Live | ElevenLabs | PlayAI | Cartesia | OpenAI TTS |
|---|---|---|---|---|---|
| Type | Real-time audio-to-audio | TTS + Voice Agent | TTS + Voice Agent | Ultra-low latency TTS | TTS (streaming) |
| Latency (TTFT) | ~960ms | ~300–500ms (TTS only) | ~400–800ms | ~80–150ms (TTS only) | ~500–1,000ms |
| Audio Output Pricing | $0.018/min | $0.05–$0.12/min (agent) | ~$0.019/min | ~$0.015–$0.025/min | ~$0.015/min |
| Voice Cloning | No | Yes (industry-leading) | Yes | Yes | No |
| Real-Time Interruption | Yes (native barge-in) | Yes (agent tier) | Yes (agent tier) | Yes (streaming) | Limited |
| Multimodal Input | Audio + Video + Text | Text only | Text only | Text only | Text only |
| Languages | 90+ | 29+ | 40+ | 20+ | 13 |
| AI Watermarking | Yes (SynthID, built-in) | No | No | No | No |
| Function Calling via Audio | Yes (90.8% benchmark) | Limited | Yes (workflow) | No | No |
| Free Tier | Yes (AI Studio) | Yes (10K chars/mo) | Yes (12.5K chars/mo) | Yes (10K credits) | No (pay-per-use) |
| Enterprise Tier | Yes (Gemini Enterprise) | Yes ($1,320+/mo) | Yes (custom) | Yes (custom) | Yes (via OpenAI) |
| Best For | Agentic real-time voice | Voice quality + cloning | Voice agents + content | Ultra-low latency apps | GPT-integrated TTS |
| Review | You’re reading it | — | — | — | ChatGPT Review → |
Latency figures represent end-to-end conversational latency for real-time agent use cases. TTS-only latency (for pre-scripted generation) is lower for Cartesia and ElevenLabs. Pricing is approximate and subject to change.
The Controversy: Google’s Graveyard Problem and What It Means for Flash Live
Let’s be honest about what Google’s track record means for anyone planning to build on this API.
Google has killed hundreds of products. The “Google Graveyard” at killedbygoogle.com is a long list — Google Reader (2013), Google+ (2019), Stadia (2023), Google Podcasts (2024), Google Assistant (being replaced by Gemini in March 2026 as of this writing). This isn’t a partisan critique; it’s a documented operational pattern.
In the AI space specifically, Google’s deprecation cadence for preview models is aggressive:
- `gemini-2.5-flash-lite-preview-09-2025` — scheduled shutdown March 31, 2026 (four days from now)
- `gemini-2.0-flash` and `gemini-2.0-flash-lite` — scheduled discontinuation June 1, 2026
- `gemini-3-pro` — shut down March 9, 2026; Vertex AI followed March 23, 2026
- Multiple Gemini 2.5 Flash and Pro preview endpoints deprecated July 15, 2025
Gemini 3.1 Flash Live is currently in preview. That’s the same designation that’s preceded every one of these deprecations. Google publishes deprecation notices, but the windows are often 30–90 days — not long enough for enterprise teams to complete migrations without planning ahead.
The practical advice: If you’re building a production voice agent on Gemini 3.1 Flash Live today, architect it with an abstraction layer. Don’t hard-code to a specific model endpoint. Build your pipeline so swapping the underlying model is a config change, not a refactor. This is good API hygiene generally, but it’s especially critical when building on Google AI infrastructure.
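In practice, the abstraction-layer advice reduces to a small interface that application code depends on, with the concrete model selected by configuration. A sketch (the class and method names are this review's invention, not any SDK's):

```python
# A thin provider-agnostic interface so swapping the underlying model is
# a config change, not a refactor. All names here are illustrative.

from abc import ABC, abstractmethod

class VoiceAgentBackend(ABC):
    @abstractmethod
    def stream_audio(self, chunk: bytes) -> None: ...
    @abstractmethod
    def model_id(self) -> str: ...

class GeminiLiveBackend(VoiceAgentBackend):
    def __init__(self, model: str):
        self._model = model  # swap via config if Google deprecates the endpoint

    def stream_audio(self, chunk: bytes) -> None:
        pass  # forward to the Live API WebSocket in a real implementation

    def model_id(self) -> str:
        return self._model

def make_backend(config: dict) -> VoiceAgentBackend:
    # Application code never names a provider class directly.
    registry = {"gemini-live": GeminiLiveBackend}
    return registry[config["provider"]](config["model"])

backend = make_backend({"provider": "gemini-live",
                        "model": "gemini-3.1-flash-live-preview"})
print(backend.model_id())
```

When a deprecation notice lands, adding a second entry to the registry and flipping the config is the entire migration at the application layer; only the backend class changes.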
There’s also a subtler concern: pricing during preview isn’t guaranteed at GA. Google has changed pricing at general availability in the past. The $0.018/min audio output rate is competitive today — but build your business case on current numbers with the understanding that they could move.
None of this is a reason not to use the model. The benchmarks are real, the latency is real, and the enterprise partnerships with Verizon, LiveKit, and The Home Depot suggest this isn’t going to die quietly. But “trust but verify” is the appropriate stance.
Also worth flagging for comparison: see our Mistral Voxtral TTS review — Mistral is a competitor that’s been more transparent about API stability commitments. And check our Claude Sonnet 4.6 review for how Anthropic handles model versioning differently from Google.
Pros and Cons
Pros
- Sub-second latency at 960ms — genuinely competitive with specialized voice AI tools, not just “fast for a frontier model”
- 90.8% ComplexFuncBench Audio score — best-in-class for multi-step agentic voice tasks, a practical capability gap over competitors
- Native audio-to-audio architecture — eliminates the latency tax of STT → LLM → TTS chaining
- Competitive pricing at $0.018/min audio output — significantly cheaper than ElevenLabs agent tiers at scale
- 90+ language support inherently — not bolted-on multilingual but trained multilingually from the ground up
- SynthID watermarking built-in — zero-config compliance feature for regulated industries and responsible deployment
- Free tier via Google AI Studio — prototype and test without a credit card
Cons
- Still in preview — no SLA, no guaranteed uptime, and Google’s history suggests preview endpoints don’t always survive to GA
- No voice cloning — ElevenLabs and Cartesia have a significant lead for custom voice applications
- Limited voice customization — can’t fine-tune style, pitch, or accent the way dedicated TTS APIs allow
- SynthID can’t be disabled — a compliance feature that becomes a constraint for use cases requiring watermark-free audio
- Preview pricing subject to change — the $0.018/min rate isn’t locked in until GA, and historical patterns suggest it could rise
Getting Started with Gemini 3.1 Flash Live
Here’s how to go from zero to your first voice interaction in under 15 minutes:
1. Create a Google AI Studio account
Go to aistudio.google.com/live and sign in with a Google account. The free tier gives you rate-limited API access, no credit card required. If you want production-level throughput or enterprise SLAs, Google Cloud's pay-as-you-go billing is the next step.

2. Get your API key
In Google AI Studio, navigate to "Get API Key" in the left sidebar. Copy your key and store it in an environment variable: `GEMINI_API_KEY=your_key_here`. Do not hardcode it into your codebase.

3. Set up the WebSocket connection
Gemini 3.1 Flash Live uses a stateful WebSocket (not standard HTTP requests). Google's Gemini Live API docs provide Python and JavaScript SDK examples. The key parameter: set the model to `gemini-3.1-flash-live-preview` and specify your audio configuration (input format, sample rate, output voice).

4. Send audio in 20–40ms chunks
For minimum latency, stream your microphone input in 20–40ms chunks rather than waiting to capture full utterances. Enable barge-in by configuring `interrupt_on_speech: true` in your session config. This is what makes conversations feel natural rather than turn-based.

5. Add tool/function calling
The real power of Gemini 3.1 Flash Live is its agentic capability. Define your tools (calendar API, CRM lookup, database query) in the session configuration using the standard Gemini function calling schema. The model will invoke them mid-conversation when appropriate, with no separate orchestration layer required. Test with simple tools first, then expand complexity.
For enterprise deployments, explore Gemini Enterprise for Customer Experience which provides dedicated throughput, SLAs, and built-in integrations with common CX platforms including contact center infrastructure. Verizon and The Home Depot are already in production on this tier.
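The tool definitions mentioned in the setup steps follow the standard Gemini function-calling shape: a declaration with a name, description, and JSON-schema-style typed parameters, passed in the session config. A sketch using the scheduling example from earlier in this review (the `schedule_meeting` tool and its fields are hypothetical, not a real API):

```python
# A function declaration in the Gemini function-calling schema shape
# (name / description / parameters). The schedule_meeting tool and its
# fields are hypothetical examples invented for this review.

schedule_meeting = {
    "name": "schedule_meeting",
    "description": "Create a calendar event and send invites.",
    "parameters": {
        "type": "object",
        "properties": {
            "attendee": {"type": "string", "description": "Invitee name or email"},
            "earliest_start": {"type": "string", "description": "ISO 8601 datetime"},
            "avoid_conflicts_with": {"type": "string",
                                     "description": "Event that must not overlap"},
            "include_video_link": {"type": "boolean"},
        },
        "required": ["attendee", "earliest_start"],
    },
}

# Tools ride along in the session configuration; the model emits a tool
# call mid-conversation, your code executes it and streams back the result.
session_tools = {"tools": [{"function_declarations": [schedule_meeting]}]}
print(session_tools["tools"][0]["function_declarations"][0]["name"])
```

This is how the "schedule a meeting with John next Tuesday" example from earlier becomes executable: the model fills these typed parameters from the audio stream and hands your code a structured call.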
Frequently Asked Questions
What is Gemini 3.1 Flash Live?
Gemini 3.1 Flash Live is Google’s highest-quality real-time audio and voice AI model, launched on March 26, 2026. It’s designed for natural, low-latency real-time dialogue and is available to developers via the Gemini Live API in Google AI Studio, to enterprises via Gemini Enterprise for Customer Experience, and to everyday users through Gemini Live and Search Live.
How much does Gemini 3.1 Flash Live cost?
Gemini 3.1 Flash Live costs $3.00 per million tokens (or $0.005/minute) for audio input and $12.00 per million tokens (or $0.018/minute) for audio output. Text input is $0.75/million tokens and text output is $4.50/million tokens. The API is currently available in preview via Google AI Studio with a free tier for prototyping.
How does Gemini 3.1 Flash Live compare to ElevenLabs?
Gemini 3.1 Flash Live excels at real-time conversational AI with native multimodal input, sub-second latency (~960ms), and 90+ language support. ElevenLabs leads in voice quality for pre-generated TTS and has superior voice cloning capabilities. For live voice agents at scale, Gemini is more cost-effective; for studio-quality audio production, ElevenLabs remains the benchmark.
What benchmark scores did Gemini 3.1 Flash Live achieve?
Gemini 3.1 Flash Live scored 90.8% on ComplexFuncBench Audio — a roughly 20-point improvement over the previous model — and 36.1% on Scale AI’s Audio MultiChallenge with thinking enabled. These benchmarks test multi-step function calling and complex instruction following in real-world audio conditions respectively.
What is the latency of Gemini 3.1 Flash Live?
Gemini 3.1 Flash Live achieves a time-to-first-token of approximately 960ms (~1 second) at its lowest thinking level. With higher-level thinking enabled, response time extends to approximately 2.98 seconds. For optimal performance, Google recommends sending audio in 20–40ms chunks via its WebSocket streaming interface.
Can I use Gemini 3.1 Flash Live for free?
Yes. Developers can access Gemini 3.1 Flash Live in preview via Google AI Studio’s free tier with rate-limited API access — no credit card required. Everyday users experience the model for free through Gemini Live and Search Live. Paid usage is billed per token/minute at published rates.
Does Gemini 3.1 Flash Live support voice cloning?
No. Gemini 3.1 Flash Live is an audio-to-audio conversational model, not a voice cloning tool. It generates natural-sounding voice output but does not replicate specific individual voices. For voice cloning, ElevenLabs and Cartesia are better options.
How many languages does Gemini 3.1 Flash Live support?
Gemini 3.1 Flash Live supports real-time multimodal conversations in over 90 languages. Its launch also enabled the global expansion of Search Live to more than 200 countries and territories.
Is there a watermark on audio generated by Gemini 3.1 Flash Live?
Yes. All audio generated by Gemini 3.1 Flash Live is automatically embedded with SynthID — Google DeepMind’s imperceptible AI watermarking technology. The watermark survives re-recording, compression, and speed changes, enabling reliable detection of AI-generated content. It cannot be disabled via the API.
Is Gemini 3.1 Flash Live worth it for developers building voice agents?
For developers building real-time voice agents that need to handle complex tasks, multilingual support, and tool-use at scale, Gemini 3.1 Flash Live is a compelling choice. Its 90.8% ComplexFuncBench score, sub-second latency, barge-in capability, and competitive per-minute pricing are genuine advantages. The main risk is preview status — build with an abstraction layer to avoid being burned by deprecation.
Final Verdict
Gemini 3.1 Flash Live is the most capable real-time voice AI model Google has ever shipped — and that’s not faint praise anymore. A 90.8% ComplexFuncBench Audio score and 960ms time-to-first-token aren’t marketing numbers; they represent a genuine capability step that puts it ahead of every multi-stack voice agent pipeline we’ve tested.
The pricing makes sense at scale. If you’re running voice agents at volume, $0.018/min for audio output will save you real money over ElevenLabs enterprise pricing, while delivering better agentic task completion than anything in that bracket.
The recommendation is straightforward: build on it now if you need agentic real-time voice, but build smart. Abstraction layers, model-agnostic architecture, and a migration plan aren’t paranoia — they’re table stakes when working with Google’s preview APIs. The Google Graveyard is real, and gemini-2.5-flash-lite-preview-09-2025 is being shut down in four days. Don’t say you weren’t warned.
Who should use it today: enterprise teams building customer experience voice agents, developers who need multilingual real-time dialogue, anyone who’s been paying ElevenLabs enterprise rates for agent minutes. Start with the free tier on Google AI Studio — you can go from API key to working voice agent in under an hour.
Who should wait: teams that need voice cloning, production-critical workloads that can’t tolerate preview instability, or anyone who needs watermark-free audio output.
For the full voice AI landscape, see our Mistral Voxtral TTS review and our ChatGPT review for context on how Google’s voice strategy fits against the broader AI market. And if you’re evaluating frontier AI models more broadly, our Claude Sonnet 4.6 review covers the mid-tier reasoning benchmark that Gemini Flash is increasingly competing with.
Rating: 8.4/10 — Best-in-class for agentic real-time voice. One point held back for preview status; half a point for the Google Graveyard discount we apply to all their preview products until they prove GA stability.