For decades, the benchmark for machine intelligence was the chessboard. But as AI models move into the real world to act as assistants and collaborators, being a “super-calculator” isn’t enough. Decisions in the real world are messy, social, and based on incomplete information.
To address this, Google DeepMind and Kaggle have announced a massive update to Game Arena, their public AI benchmarking platform. Joining the existing chess leaderboard are two new, high-stakes challenges designed to test frontier models on social deduction and calculated risk: Werewolf and Poker.
Why use games?
As Demis Hassabis, CEO of Google DeepMind, puts it: the AI field needs “harder and more robust benchmarks.” While chess measures raw strategic reasoning, it is a game of “perfect information” where both players see the entire board.
In contrast, life (and enterprise work) involves navigating ambiguity. By pitting models against each other in games where they have to lie, detect deception, or manage risk with hidden cards, DeepMind is creating a “sandbox” to evaluate agentic safety and “soft skills” like negotiation and communication.
Werewolf: the ultimate test of social intelligence
Werewolf is a social deduction game where a hidden minority (the werewolves) must deceive the majority (the villagers). Because the game is played entirely through natural language, it forces AI models to:
- Detect Deception: Villagers must analyze inconsistencies between what other players say and how they actually vote.
- Build Consensus: To win, models must persuade others and form alliances through dialogue.
- Execute Strategic Lies: Models playing as werewolves must maintain a “public story” while their private “thoughts” reveal a hidden plan.
Early results show that Gemini 3 Pro and Gemini 3 Flash are currently dominating the Werewolf leaderboard, proving remarkably adept at identifying “suspicious” patterns in their opponents’ behavior.
Poker: quantifying uncertainty
If Werewolf is about people, Poker is about probability. Specifically, the arena is using heads-up no-limit Texas Hold'em to measure how well models manage risk.
In poker, a model can’t just rely on the luck of the draw; it has to infer what cards its opponent is holding and adapt its betting style on the fly. This translates directly to real-world applications like financial modeling or supply chain optimization, where decisions must be made without having all the facts.
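That risk calculus has a simple core. A standard yardstick (not something the article spells out, but the basic math any poker agent must internalize) is pot odds: a call is only profitable if your estimated chance of winning exceeds the break-even threshold set by the bet size. A minimal sketch, with illustrative numbers of our own choosing:

```python
# Pot-odds sketch: "pot" is the total pot you would win (including the
# opponent's bet), "call" is the amount you must put in to stay in the hand.

def breakeven_equity(pot: float, call: float) -> float:
    """Fraction of the time you must win for a call to break even."""
    return call / (pot + call)

def should_call(win_probability: float, pot: float, call: float) -> bool:
    """Call only when estimated equity beats the break-even threshold."""
    return win_probability > breakeven_equity(pot, call)

# Facing a $50 call into a $100 pot, you need to win more than a third
# of the time; an estimated 40% equity makes the call profitable.
print(breakeven_equity(100, 50))
print(should_call(0.40, 100, 50))
```

The hard part for a model isn't this arithmetic; it's producing the `win_probability` estimate from hidden cards and an opponent's betting behavior, which is exactly the uncertainty the benchmark probes.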
Watch the AI battle
To celebrate the launch, Google DeepMind has partnered with chess Grandmaster Hikaru Nakamura and poker legends like Liv Boeree and Doug Polk for a three-day livestream event. You can watch the action and see where your favorite models rank at kaggle.com/game-arena.
- Feb 2: Top models faced off in an AI poker battle.
- Feb 3: Poker semi-finals plus Werewolf and chess highlights.
- Feb 4 (Tomorrow): The final poker showdown and the release of the full, stable leaderboard.

