How Gemini 3.1 Flash is humanizing AI voices

For years, AI voices have suffered from the “uncanny valley” of sound. They were technically perfect but emotionally hollow: the kind of voices that could read you the weather but couldn’t exactly tell you a joke.

Google’s announcement of Gemini 3.1 Flash TTS signals the end of that era. By introducing granular “Audio Tags,” Google isn’t just making AI voices clearer; it is giving developers a director’s chair and a script full of stage directions.

Pacing is punctuation

The most “human” thing about our speech isn’t the words we choose, but the spaces between them. It’s the hesitant pause before a big reveal or the way we speed up when we’re excited.

With the new audio tags in Gemini 3.1 Flash, you can now embed natural language commands directly into the text. You can tell the AI to whisper a specific secret, shout a warning, or slow down for emphasis. It’s no longer about just “reading” the text; it’s about performing it.
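
To make the idea concrete, here is a minimal sketch of what a script with inline directions might look like before it is handed to a speech model. The square-bracket syntax, the helper names `tag` and `build_script`, and the sample lines are all assumptions for illustration, not Google's documented tag format, and the sketch makes no API call; it only assembles the text.

```python
# Hypothetical sketch: compose a script with inline, natural-language directions.
# The "[direction] text" convention is an assumption, not a documented tag format.

def tag(direction: str, text: str) -> str:
    """Prefix one line of text with a bracketed direction."""
    return f"[{direction.strip()}] {text.strip()}"


def build_script(lines: list[tuple[str, str]]) -> str:
    """Join (direction, text) pairs into one script; an empty direction means no tag."""
    return "\n".join(tag(d, t) if d else t.strip() for d, t in lines)


script = build_script([
    ("", "I have something to tell you."),
    ("whispering", "The launch is tomorrow."),
    ("shouting", "Watch out!"),
    ("slowly, for emphasis", "This changes everything."),
])
print(script)
```

The point of keeping directions inline is that the performance notes travel with the words they apply to, so a single block of text carries both the script and its stage directions.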

Multi-speaker ‘Scene Direction’

One of the coolest features buried in this update is Scene Direction. In the new Google AI Studio playground, you can set the stage by defining an environment.

If you tell Gemini the characters are in a crowded café, the model understands the world-building context. It allows characters to react to one another naturally across multiple turns, maintaining their “in-character” tone and accent without the developer having to manually re-adjust settings for every single line of dialogue.
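
As a rough sketch of that workflow, the snippet below states the scene once and then lists speaker turns beneath it, rather than repeating settings on every line. The `Turn` class, the `build_scene_prompt` helper, the `Scene:` header, and the bracketed directions are hypothetical names and conventions chosen for illustration; they are not taken from Google AI Studio or the Gemini API.

```python
# Hypothetical sketch: one scene description shared by every turn of dialogue.
# None of these names or formats come from Google's tooling.
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    text: str
    direction: str = ""  # optional per-line direction, e.g. "laughing"


def build_scene_prompt(scene: str, turns: list[Turn]) -> str:
    """Put the scene at the top once, then one line per speaker turn."""
    lines = [f"Scene: {scene.strip()}", ""]
    for turn in turns:
        prefix = f"[{turn.direction}] " if turn.direction else ""
        lines.append(f"{turn.speaker}: {prefix}{turn.text}")
    return "\n".join(lines)


prompt = build_scene_prompt(
    "Two old friends catching up in a crowded café.",
    [
        Turn("Ana", "I can barely hear you in here!", "raising her voice"),
        Turn("Ben", "Then lean in. I have news."),
        Turn("Ana", "Go on, tell me.", "excited"),
    ],
)
print(prompt)
```

Only the lines that need a specific delivery carry a direction; the rest rely on the shared scene for tone, which mirrors the article's point about not re-adjusting settings for each line.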

Why this matters for the ‘Personal Assistant’

We just covered Gemini coming to the Mac desktop earlier today and becoming more omnipresent for Apple fans. But an assistant that sounds like a robot is just a tool. An assistant that can sigh when you have too many meetings, or sound genuinely upbeat when you finish a project, is a companion.

By lowering the cost and latency of high-quality speech (Gemini 3.1 Flash TTS currently sits in the “most attractive” quadrant for quality vs. price), Google is making it possible for every app on your phone or laptop to have a personality, not just a voice. The “Audio” era of AI is finally catching up to the “Text” era, and things are about to get a lot more expressive.

About Robby Payne