For years, AI voices have suffered from the “uncanny valley” of sound. They were technically perfect but emotionally hollow: the kind of voices that could read you the weather but couldn’t exactly tell you a joke.
Google’s announcement of Gemini 3.1 Flash TTS officially signals the end of that era. By introducing granular “Audio Tags,” Google isn’t just making AI voices clearer; it is giving developers a director’s chair and a script full of stage directions.
Pacing is punctuation
The most “human” thing about our speech isn’t the words we choose, but the spaces between them. It’s the hesitant pause before a big reveal or the way we speed up when we’re excited.
With the new audio tags in Gemini 3.1 Flash TTS, you can now embed natural language commands directly into the text. You can tell the AI to whisper a specific secret, shout a warning, or slow down for emphasis. It’s no longer about just “reading” the text; it’s about performing it.
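To make the idea of inline direction concrete, here is a minimal sketch that only builds an annotated script string and does not call any API. The square-bracket syntax, the tag names, and the `tag` helper are all assumptions for illustration, not Google's documented format, so check the official Gemini docs for the real tag vocabulary before relying on it.

```python
def tag(direction: str, text: str) -> str:
    """Prefix a line with a hypothetical inline audio tag (syntax is assumed)."""
    return f"[{direction}] {text}"


# Each sentence carries its own "stage direction" inside the text itself.
script = " ".join([
    tag("whispers", "I have a secret to tell you."),
    tag("shouting", "Watch out!"),
    tag("slowly", "This part really matters."),
])

print(script)
```

The point is that the direction travels with the words, so changing the delivery of one sentence means editing one tag rather than a separate voice setting.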
Multi-speaker ‘Scene Direction’
One of the coolest features buried in this update is Scene Direction. In the new Google AI Studio playground, you can set the stage by defining an environment.
If you tell Gemini the characters are in a crowded café, the model understands the world-building context. It allows characters to react to one another naturally across multiple turns, maintaining their “in-character” tone and accent without the developer having to manually re-adjust settings for every single line of dialogue.
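As a rough sketch of how a scene plus a multi-turn script could be packaged into a single prompt, the snippet below puts the environment first and then lists each speaker's line. The `Turn` class, the `build_scene_prompt` helper, the speaker names, and the "Scene:" layout are hypothetical and not taken from Google AI Studio; it shows the shape of the idea only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    speaker: str  # hypothetical character name
    line: str     # what that character says


def build_scene_prompt(scene: str, turns: List[Turn]) -> str:
    """Put the scene description first, then one 'Speaker: line' per turn."""
    header = f"Scene: {scene}"
    body = "\n".join(f"{t.speaker}: {t.line}" for t in turns)
    return f"{header}\n\n{body}"


prompt = build_scene_prompt(
    "Two old friends catching up in a crowded café.",
    [
        Turn("Ana", "You made it!"),
        Turn("Ben", "Barely. The train was packed."),
    ],
)

print(prompt)
```

Because the scene is stated once at the top, each turn stays plain dialogue, which mirrors the article's point that tone does not need to be reset for every line.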
Why this matters for the ‘Personal Assistant’
We just covered Gemini coming to the Mac desktop earlier today and becoming more omnipresent for Apple fans. But an assistant that sounds like a robot is just a tool. An assistant that can sigh when you have too many meetings, or sound genuinely upbeat when you finish a project, is a companion.
By lowering the cost and latency of high-quality speech (Gemini 3.1 Flash TTS currently sits in the “most attractive” quadrant for quality vs. price), Google is making it possible for every app on your phone or laptop to have a personality, not just a voice. The “Audio” era of AI is finally catching up to the “Text” era, and things are about to get a lot more expressive.

