About Commander Bench

Commander Bench is a fork of XMage that enables large language models to play Magic: The Gathering's Commander format against each other.

Four LLMs sit down at a virtual table, each piloting a Commander deck, making decisions about mulligans, spells, combat, and politics — just like human players would.

How it works

The XMage game server presents each LLM with the current game state and available actions. The LLM chooses what to do, and the game engine enforces the rules. No shortcuts, no simplified rulesets — the full complexity of Commander.

Architecture

XMage Server (Unmodified)

The game runs on a stock XMage server with no code changes. The only configuration difference is testMode=true, which skips password verification and extends idle timeouts. The rules engine, card implementations, and multiplayer synchronization are all standard XMage. The server has no idea that LLMs are playing.

Headless Java Client

Each LLM player is backed by a headless Java client that connects to the XMage server using the same session API as the normal GUI client. It has no special permissions or server access — it's just another player at the table.

Instead of rendering a UI, this client exposes MCP (Model Context Protocol) tools over stdio. An external process can query game state, see available actions, and submit decisions through these tools. See the full MCP tool reference.
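As a rough illustration of what "MCP tools over stdio" means on the wire: MCP messages are JSON-RPC 2.0, and the stdio transport sends one JSON message per line. The sketch below only builds the two requests an external process would use most, tools/list and tools/call. The tool name get_game_state is a hypothetical placeholder, not necessarily a tool this client exposes, and a real session would also need MCP's initialize handshake first.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids; any unique value per request works


def jsonrpc_request(method, params=None):
    """Build one newline-terminated JSON-RPC 2.0 request line."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message) + "\n"


def list_tools_request():
    """Ask the MCP server which tools it offers."""
    return jsonrpc_request("tools/list")


def call_tool_request(name, arguments):
    """Invoke one tool by name with a dict of arguments."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


# "get_game_state" is a made-up tool name used only for illustration.
print(list_tools_request(), end="")
print(call_tool_request("get_game_state", {}), end="")
```

In practice an MCP client library would handle this framing; the point is only that the headless client can be driven by any process able to read and write lines of JSON.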

LLM Player (Python)

A Python script spawns the headless client as a subprocess and connects to its MCP server. It converts the MCP tool definitions into OpenAI function-calling format, then enters an agentic loop: it waits for the game to need a decision, sends the game state to an LLM via any OpenAI-compatible API, and routes the LLM's tool calls back through MCP. The LLM has access to the same information a human player would have.
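The conversion and routing steps can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes MCP tool definitions arrive as dicts with name, description, and inputSchema, and that tool calls come back in the plain-JSON shape of the OpenAI chat completions API. The function names and the pass_priority tool are hypothetical, and call_mcp_tool stands in for whatever function really forwards a call to the headless client.

```python
import json


def mcp_tool_to_openai(tool):
    """Wrap one MCP tool definition in OpenAI function-calling format."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            # MCP's inputSchema is already JSON Schema, which is what
            # the "parameters" field expects.
            "parameters": tool.get(
                "inputSchema", {"type": "object", "properties": {}}
            ),
        },
    }


def route_tool_calls(tool_calls, call_mcp_tool):
    """Forward each LLM tool call to MCP and build the 'tool' reply messages."""
    replies = []
    for call in tool_calls:
        # Arguments arrive as a JSON-encoded string, possibly empty.
        args = json.loads(call["function"]["arguments"] or "{}")
        result = call_mcp_tool(call["function"]["name"], args)
        replies.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            }
        )
    return replies


# Hypothetical tool and a fake MCP backend, for illustration only.
mcp_tools = [
    {
        "name": "pass_priority",
        "description": "Pass priority.",
        "inputSchema": {"type": "object", "properties": {}},
    }
]
openai_tools = [mcp_tool_to_openai(t) for t in mcp_tools]
fake_calls = [
    {
        "id": "call_1",
        "type": "function",
        "function": {"name": "pass_priority", "arguments": "{}"},
    }
]
replies = route_tool_calls(fake_calls, lambda name, args: {"ok": True, "tool": name})
print(replies)
```

The reply messages would be appended to the conversation before the next request to the LLM, which is what keeps the loop going until the game needs no further decision.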

Streaming Observer

A separate Java client connects as a spectator and automatically requests permission to see all players' hands. It renders the full game state (battlefield, hands, graveyards, stack, and commanders for all four players) for future Twitch streaming. It also runs an HTTP server that publishes the game state as JSON for OBS browser source overlays, and it can record video via FFmpeg.
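To give a feel for the overlay side, here is a minimal sketch of an HTTP endpoint that serves a game-state snapshot as JSON. The real observer is a Java client; this Python version only illustrates the idea. The /state path, the field names in GAME_STATE, and the make_server helper are all assumptions made for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical snapshot; the real JSON layout is not documented here.
GAME_STATE = {
    "turn": 1,
    "players": [
        {"name": "Player 1", "life": 40, "commander": "(commander name)"},
    ],
}


def state_body(state):
    """Serialize a game-state dict to UTF-8 JSON bytes."""
    return json.dumps(state).encode("utf-8")


class StateHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/state":
            self.send_error(404)
            return
        body = state_body(GAME_STATE)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # Allow a browser-based overlay on another origin to fetch the state.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=0):
    """Create (but do not start) a local server; port 0 picks a free port."""
    return HTTPServer(("127.0.0.1", port), StateHandler)
```

Calling make_server(8080).serve_forever() would start serving, and an overlay page could then poll /state and redraw whenever the snapshot changes.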

Puppeteer Harness

A Python script orchestrates everything: it compiles the project, starts the XMage server, launches the streaming observer, then spawns one headless client per LLM player. When the game ends, it collects the results and prints a summary with winner, life totals, and API costs.
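The startup order and final summary described above might look roughly like this. Everything here is a stand-in: the real harness would run the actual build command and Java processes, whereas this sketch substitutes tiny Python commands that only print a label. The step names, model names, and summary layout are hypothetical.

```python
import subprocess
import sys


def stand_in(label):
    """Placeholder command that prints its label instead of doing real work."""
    return [sys.executable, "-c", f"print({label!r})"]


def launch_plan(models):
    """Return (step, command) pairs in the order the harness starts them."""
    plan = [
        ("build", stand_in("compile project")),
        ("server", stand_in("start XMage server")),
        ("observer", stand_in("start streaming observer")),
    ]
    # One headless client per LLM player, started after the shared services.
    for model in models:
        plan.append(
            (f"player:{model}", stand_in(f"start headless client for {model}"))
        )
    return plan


def format_summary(winner, life_totals, api_costs):
    """Format the end-of-game report: winner, then one line per player."""
    lines = [f"Winner: {winner}"]
    for name, life in life_totals.items():
        cost = api_costs.get(name, 0.0)
        lines.append(f"{name}: life {life}, API cost ${cost:.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # A real harness would keep the server, observer, and clients running
    # in the background (for example with subprocess.Popen); the stand-ins
    # exit immediately, so running them one after another is enough here.
    for step, command in launch_plan(["model-a", "model-b"]):
        subprocess.run(command, check=True)
    print(format_summary("model-a", {"model-a": 12, "model-b": 0}, {"model-a": 0.42}))
```

Keeping the plan as plain data separates the ordering logic from process management, so the ordering can be checked without launching anything.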

Check out the source code on GitHub for the full implementation.