Benchmark for LLM Agents

PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?

Dongdong Hua¹, Yifei Sun¹, Renhong Huang¹, Feng Gao², Chunping Wang², Yang Yang¹

¹ Zhejiang University, ² FinVolution Group

PTCG-Bench evaluates language-model agents in a full trading card game setting where agents must reason about hidden information, long-horizon strategy, textual card effects, numerical state, and experience-driven self-improvement.


Key numbers: 10 agent backbones, 5 model families, a 781-point rating gap between the top and bottom leaderboard entries, and 8 evolution rounds.

Leaderboard

Anchored ratings across PTCG-Bench agents.

Ratings are shown on the anchored Glicko-2 scale with win/loss records from the referenced benchmark run.

Rank	Agent	Family	Rating	Deviation	Record	Win Rate	Avg. Cost
#1	Gemini 3.1 Pro Preview	Google	1854	±69	47-8	85%	$1.15
#2	DeepSeek V4 Pro	DeepSeek	1727	±69	40-15	73%	$0.23
#3	DeepSeek V4 Flash	DeepSeek	1727	±69	40-15	73%	$0.09
#4	Gemini 3 Flash Preview	Google	1636	±69	35-20	64%	$0.25
#5	Claude Sonnet 4.6	Anthropic	1600	±69	33-22	60%	$1.94
#6	GPT-5.4	OpenAI	1564	±69	31-24	56%	$0.68
#7	Qwen 3.6 Plus	Qwen	1509	±69	28-27	51%	$0.17
#8	Claude Haiku 4.5	Anthropic	1473	±69	26-29	47%	$0.90
#9	Qwen 3.5 Flash 02-23	Qwen	1382	±69	21-34	38%	$0.05
#10	GPT-5.4 Nano	OpenAI	1237	±69	13-42	24%	$0.04
#11	Charizard Heuristic	Heuristic	1219	±69	12-43	22%	N/A
#12	Random	Baseline	1073	±69	4-51	7%	N/A
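
The Win Rate column can be recomputed from the Record column, and rating gaps can be read as approximate head-to-head expectations. The sketch below is illustrative only: `win_rate` and `expected_score` are hypothetical helper names, the expected-score formula is the standard Glicko-2 one, and treating the ± column as a rating deviation (RD) is an assumption rather than something the page states.

```python
import math

GLICKO2_SCALE = 173.7178  # standard constant for converting ratings to the Glicko-2 internal scale


def win_rate(record: str) -> float:
    """Convert a 'wins-losses' record such as '47-8' into a win fraction."""
    wins, losses = (int(part) for part in record.split("-"))
    return wins / (wins + losses)


def expected_score(rating_a: float, rating_b: float, rd_b: float) -> float:
    """Standard Glicko-2 expected score of A against B.

    Assumption: rd_b is B's rating deviation on the displayed rating scale.
    """
    mu_diff = (rating_a - rating_b) / GLICKO2_SCALE
    phi_b = rd_b / GLICKO2_SCALE
    g = 1.0 / math.sqrt(1.0 + 3.0 * phi_b ** 2 / math.pi ** 2)
    return 1.0 / (1.0 + math.exp(-g * mu_diff))


# Recompute the top row's win rate from its record.
top_win_rate = round(100 * win_rate("47-8"))

# Expected score of the top entry (1854) against the Random baseline (1073).
top_vs_random = expected_score(1854, 1073, 69)
```

Equal ratings give an expected score of 0.5, and the 781-point gap between the first and last rows corresponds to an expected score close to 1 under these assumptions.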

Overview

A controlled game environment for strategic agent evaluation.

A Pokémon Trading Card Game benchmark for evaluating LLM agents in strategic, imperfect-information play.

A longitudinal protocol for measuring whether agents improve through accumulated cross-game experience.

A modular harness design that separates backbone capability from observation, action, and context-management choices.

PTCG-Bench environment overview, showing card attributes and the two-player board. Agents act on a partial board state while reasoning over card text, numerical attributes, hidden zones, and opponent behavior.

Interface

Agent-environment loop

The engine exposes state and legal actions; the harness converts them into model-readable context; the model returns a structured action request; and the engine validates and executes it.
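
The loop described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual API: `ToyEngine`, `build_context`, `toy_model`, and `run_episode` are hypothetical names, the JSON action format is assumed, and falling back to the last legal action on an invalid request is one possible policy chosen only for the example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToyEngine:
    """Hypothetical stand-in for the game engine; the game ends after a few turns."""
    turns_left: int = 3
    log: list = field(default_factory=list)

    def observe(self) -> dict:
        return {"turns_left": self.turns_left}

    def legal_actions(self) -> list:
        return ["attack", "pass"]

    def is_over(self) -> bool:
        return self.turns_left <= 0

    def step(self, action: str) -> None:
        # The engine validates the action before executing it.
        if action not in self.legal_actions():
            raise ValueError(f"illegal action: {action}")
        self.log.append(action)
        self.turns_left -= 1


def build_context(state: dict, actions: list) -> str:
    """Harness step: turn state and legal actions into model-readable text."""
    return json.dumps({"state": state, "legal_actions": actions})


def toy_model(context: str) -> str:
    """Stand-in for an LLM call: returns a structured (JSON) action request."""
    actions = json.loads(context)["legal_actions"]
    return json.dumps({"action": actions[0]})


def run_episode(engine: ToyEngine, model, max_steps: int = 100) -> list:
    for _ in range(max_steps):
        if engine.is_over():
            break
        actions = engine.legal_actions()
        context = build_context(engine.observe(), actions)
        try:
            requested = json.loads(model(context)).get("action")
        except (json.JSONDecodeError, AttributeError):
            requested = None  # malformed or non-object response
        # Assumed fallback: use the last legal action if the request is invalid.
        action = requested if requested in actions else actions[-1]
        engine.step(action)
    return engine.log
```

Keeping context construction in its own function mirrors the separation between engine, harness, and model: the same engine and model can be paired with a different `build_context` to study harness effects.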

Agent-environment interaction loop in PTCG-Bench

Protocol

Self-evolution over sequential play

Self-evolving agents play repeated games against fixed anchors, update persistent state between rounds, and are evaluated on a stable rating scale across snapshots.
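
A rough sketch of this protocol, under stated assumptions: `run_self_evolution`, `play_game`, and `update_memory` are hypothetical names, the persistent state is modeled as a list of text notes, and each snapshot is scored by win rate against the anchors as a simple stand-in for the anchored rating used in the benchmark.

```python
from typing import Callable, List, Tuple

Memory = List[str]
GameResult = Tuple[str, bool]  # (anchor name, whether the agent won)


def run_self_evolution(
    play_game: Callable[[Memory, str], bool],
    update_memory: Callable[[Memory, List[GameResult]], Memory],
    anchors: List[str],
    rounds: int = 8,
    games_per_anchor: int = 2,
) -> List[float]:
    """Play rounds against fixed anchors, updating persistent state between rounds."""
    memory: Memory = []
    snapshots: List[float] = []
    for _round in range(rounds):
        results: List[GameResult] = []
        for anchor in anchors:
            for _game in range(games_per_anchor):
                results.append((anchor, play_game(memory, anchor)))
        # Score the snapshot that played this round, then evolve the state.
        snapshots.append(sum(won for _, won in results) / len(results))
        memory = update_memory(memory, results)
    return snapshots


# Toy usage: the agent only starts winning once it has stored two notes.
def toy_play(memory: Memory, anchor: str) -> bool:
    return len(memory) >= 2


def toy_update(memory: Memory, results: List[GameResult]) -> Memory:
    wins = sum(won for _, won in results)
    return memory + [f"round note: {wins} wins"]


toy_snapshots = run_self_evolution(
    toy_play, toy_update, anchors=["anchor_a", "anchor_b"], rounds=4
)
```

Because the anchors never change, scores from different rounds are measured against the same opponents, which is what makes snapshots comparable across the run.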

Findings

PTCG-Bench separates model strength, harness effects, and self-evolution limits.

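Experiments show a broad rating distribution across LLM agents (1237 to 1854 on the leaderboard), non-monotonic cost-performance trade-offs (for example, DeepSeek V4 Flash at $0.09 average cost outranks Claude Sonnet 4.6 at $1.94), and substantial performance changes from harness ablations such as legal-action masking and history context.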
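
One way to make the cost-performance trade-off concrete is to compute which LLM agents are not dominated on both cost and rating. The sketch below uses the rating and average-cost values from the leaderboard table; `pareto_frontier` is a hypothetical helper, and a Pareto frontier is one possible reading of the trade-off, not an analysis taken from the paper.

```python
# (agent, rating, average cost in USD) copied from the leaderboard table.
AGENTS = [
    ("Gemini 3.1 Pro Preview", 1854, 1.15),
    ("DeepSeek V4 Pro", 1727, 0.23),
    ("DeepSeek V4 Flash", 1727, 0.09),
    ("Gemini 3 Flash Preview", 1636, 0.25),
    ("Claude Sonnet 4.6", 1600, 1.94),
    ("GPT-5.4", 1564, 0.68),
    ("Qwen 3.6 Plus", 1509, 0.17),
    ("Claude Haiku 4.5", 1473, 0.90),
    ("Qwen 3.5 Flash 02-23", 1382, 0.05),
    ("GPT-5.4 Nano", 1237, 0.04),
]


def pareto_frontier(agents):
    """Return agents for which no other agent is both cheaper (or equal) and rated higher (or equal)."""
    frontier = []
    best_rating = float("-inf")
    # Scan from cheapest to most expensive; on cost ties, consider the higher rating first.
    for name, rating, cost in sorted(agents, key=lambda a: (a[2], -a[1])):
        if rating > best_rating:
            frontier.append(name)
            best_rating = rating
    return frontier


frontier = pareto_frontier(AGENTS)
# frontier == ["GPT-5.4 Nano", "Qwen 3.5 Flash 02-23",
#              "DeepSeek V4 Flash", "Gemini 3.1 Pro Preview"]
```

Only four of the ten LLM agents sit on this frontier, so paying more per game does not reliably buy a higher rating.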

The anchored self-evolution study shows that current memory, reflection, prompt-evolution, and skill-library mechanisms do not yet produce stable monotonic improvement across sequential play, highlighting the difficulty of converting long-horizon game experience into reusable strategy.
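
Whether a rating trajectory counts as "stable monotonic improvement" can be checked mechanically. The helpers below are a hypothetical sketch: the function names, the optional tolerance for rating noise, and the example trajectory are all assumptions, and the numbers are made up for illustration rather than taken from the benchmark.

```python
from typing import Sequence


def is_monotonic_improvement(ratings: Sequence[float], tolerance: float = 0.0) -> bool:
    """True if no snapshot drops more than `tolerance` below the previous one."""
    return all(
        later >= earlier - tolerance
        for earlier, later in zip(ratings, ratings[1:])
    )


def net_gain(ratings: Sequence[float]) -> float:
    """Rating change from the first snapshot to the last."""
    return ratings[-1] - ratings[0]


# Made-up trajectory for illustration only (not a result from PTCG-Bench).
illustrative = [1500, 1540, 1510, 1560]

strict = is_monotonic_improvement(illustrative)                 # False: round 3 regresses
lenient = is_monotonic_improvement(illustrative, tolerance=69)  # True: the dip is within tolerance
gain = net_gain(illustrative)                                   # 60
```

The example shows why the two notions differ: a trajectory can end higher than it started while still regressing between rounds, which is the pattern the self-evolution study is concerned with.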

Glicko-2 ratings and pairwise win-rate heatmap for PTCG-Bench agents

Rating trajectories for self-evolving agent configurations

Citation

@misc{hua2026ptcgbenchllmagentsmaster,
  title={PTCG-Bench: Can LLM Agents Master Pok\'emon Trading Card Game?},
  author={Dongdong Hua and Yifei Sun and Renhong Huang and Feng Gao and Chunping Wang and Yang Yang},
  year={2026},
  eprint={2605.29653},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.29653},
}