For decades, the benchmark for machine intelligence was the chessboard. But as AI models move into the real world to act as assistants and collaborators, being a “super-calculator” isn’t enough. Decisions in the real world are messy, social, and based on incomplete information.
To address this, Google DeepMind and Kaggle have announced a massive update to Game Arena, their public AI benchmarking platform. Joining the existing chess leaderboard are two new, high-stakes challenges designed to test frontier models on social deduction and calculated risk: Werewolf and Poker.
Why use games?
As Demis Hassabis, CEO of Google DeepMind, puts it, the AI field needs “harder and more robust benchmarks.” Chess measures raw strategic reasoning, but it is a game of “perfect information” where both players see the entire board.
In contrast, life (and enterprise work) involves navigating ambiguity. By pitting models against each other in games where they have to lie, detect deception, or manage risk with hidden cards, DeepMind is creating a “sandbox” to evaluate agentic safety and “soft skills” like negotiation and communication.
Werewolf: the ultimate test of social intelligence
Werewolf is a social deduction game where a hidden minority (the werewolves) must deceive the majority (the villagers). Because the game is played entirely through natural language, it forces AI models to:
- Detect Deception: Villagers must analyze inconsistencies between what other players say and how they actually vote.
- Build Consensus: To win, models must persuade others and form alliances through dialogue.
- Execute Strategic Lies: Models playing as werewolves must maintain a “public story” while their private “thoughts” reveal a hidden plan.
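To make the "say versus vote" idea from the list above concrete, here is a minimal Python sketch. It is not how Game Arena or any Gemini model works; the `Round` structure, the player names, and the scoring rule are all hypothetical, chosen only to illustrate one simple way to flag players whose votes do not match their public accusations.

```python
from dataclasses import dataclass


@dataclass
class Round:
    """One day of play (hypothetical structure, not a Game Arena format)."""
    accused: dict[str, str]  # player -> who they publicly accused
    voted: dict[str, str]    # player -> who they actually voted for


def inconsistency_scores(rounds: list[Round]) -> dict[str, float]:
    """Fraction of rounds in which a player's vote differed from their accusation."""
    mismatches: dict[str, int] = {}
    counts: dict[str, int] = {}
    for r in rounds:
        for player, target in r.accused.items():
            if player not in r.voted:
                continue  # no vote recorded, nothing to compare
            counts[player] = counts.get(player, 0) + 1
            if r.voted[player] != target:
                mismatches[player] = mismatches.get(player, 0) + 1
    return {p: mismatches.get(p, 0) / n for p, n in counts.items()}


# Made-up example: "Cara" accuses one player but votes for another, twice.
example_rounds = [
    Round(accused={"Ana": "Ben", "Cara": "Ben"}, voted={"Ana": "Ben", "Cara": "Ana"}),
    Round(accused={"Ana": "Cara", "Cara": "Ana"}, voted={"Ana": "Cara", "Cara": "Ben"}),
]
scores = inconsistency_scores(example_rounds)
```

A high score is only a hint, not proof: a villager can change their mind between speaking and voting, which is part of what makes the game hard.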
Early results show that Gemini 3 Pro and Gemini 3 Flash are currently dominating the Werewolf leaderboard, proving remarkably adept at identifying “suspicious” patterns in their opponents’ behavior.
Poker: quantifying uncertainty
If Werewolf is about people, Poker is about probability. Specifically, the arena is using Heads-up No-limit Texas Hold’em to measure how well models manage risk.
In poker, a model can’t just rely on the luck of the draw; it has to infer what cards its opponent is holding and adapt its betting style on the fly. The same skill, acting well without having all the facts, carries over to real-world applications like financial modeling or supply chain optimization.
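For a rough sense of what "inferring hidden cards and managing risk" means in numbers, here is a small Python sketch of two textbook poker calculations: a Bayes update on whether an opponent is strong after they bet, and the break-even win probability for a call. The function names and every number in the example are illustrative assumptions, not data from Game Arena or a description of how any model plays.

```python
def posterior_strong(prior_strong: float,
                     p_bet_if_strong: float,
                     p_bet_if_weak: float) -> float:
    """Bayes' rule: probability the opponent is strong, given that they bet."""
    strong_and_bet = prior_strong * p_bet_if_strong
    weak_and_bet = (1.0 - prior_strong) * p_bet_if_weak
    return strong_and_bet / (strong_and_bet + weak_and_bet)


def break_even_win_prob(pot: float, to_call: float) -> float:
    """Minimum win probability at which calling does not lose money on average.

    `pot` is the amount already in the middle, including the opponent's bet.
    """
    return to_call / (pot + to_call)


def call_ev(win_prob: float, pot: float, to_call: float) -> float:
    """Expected chips won by calling: win the pot or lose the call amount."""
    return win_prob * pot - (1.0 - win_prob) * to_call


def should_call(win_prob: float, pot: float, to_call: float) -> bool:
    """Call only when the estimated win probability beats the break-even point."""
    return win_prob > break_even_win_prob(pot, to_call)


# Illustrative numbers only: the opponent bets 50 into 50, so the pot is now 100.
belief = posterior_strong(prior_strong=0.3, p_bet_if_strong=0.9, p_bet_if_weak=0.3)
needed = break_even_win_prob(pot=100, to_call=50)
```

In this toy setup the bet raises the estimated chance of a strong hand above the prior, and the caller needs to win more than one time in three for the call to pay off. Real play adds many layers on top, such as bet sizing, future betting rounds, and opponents who bluff on purpose.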
Watch the AI Battle
To celebrate the launch, Google DeepMind has partnered with chess Grandmaster Hikaru Nakamura and poker legends like Liv Boeree and Doug Polk for a three-day livestream event. You can watch the action and see where your favorite models rank at kaggle.com/game-arena.
- Feb 2: Top models faced off in an AI poker battle.
- Feb 3: Poker semi-finals plus Werewolf and chess highlights.
- Feb 4 (Tomorrow): The final poker showdown and the release of the full, stable leaderboard.

