Why Synthetic Environments?
Training smaller models for agentic tool use and multi-agent collaboration requires labelled examples of the behaviours you want to teach: identifying missing information, requesting it specifically, combining asymmetric clues, avoiding premature solving, repairing miscommunication, and acting only when ready. Those behaviours are difficult and expensive to collect from naturalistic human-AI interaction at scale.
A synthetic environment solves this by construction. The game is designed so that correct behaviour is unambiguous and the Game Master can label it automatically. Wrong answers are not just marked incorrect: each maps to a distinct failure mode that calls for a different training correction. Entering 473 and entering 437 are not the same mistake, and the training signal should reflect that.
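To make the idea of failure-specific labels concrete, here is a minimal sketch of how a Game Master might turn a wrong entry into a diagnosis rather than a bare "incorrect". Everything in it is an assumption made for illustration: the target code 374, the `Diagnosis` type, the `diagnose` function, and the label names are hypothetical, and the environment's actual failure taxonomy is not specified in this section.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    label: str               # hypothetical failure-mode name
    wrong_positions: tuple   # indices where the entry differs from the target


def diagnose(entered: str, target: str) -> Diagnosis:
    """Label an entry against the target code (illustrative taxonomy only)."""
    if entered == target:
        return Diagnosis("correct", ())
    if len(entered) != len(target) or not entered.isdigit():
        return Diagnosis("malformed_entry", ())

    wrong = tuple(i for i, (e, t) in enumerate(zip(entered, target)) if e != t)

    if Counter(entered) == Counter(target):
        # Right digits, wrong order: distinguish a single swap from a full scramble.
        label = "pair_swap" if len(wrong) == 2 else "sequence_scrambled"
    else:
        # At least one digit does not belong to the solution at all.
        label = "wrong_digit"
    return Diagnosis(label, wrong)


TARGET = "374"  # assumed target, chosen only for this example

print(diagnose("473", TARGET))  # Diagnosis(label='pair_swap', wrong_positions=(0, 2))
print(diagnose("437", TARGET))  # Diagnosis(label='sequence_scrambled', wrong_positions=(0, 1, 2))
```

Under this assumed target, 473 and 437 receive different labels even though both contain the right digits, which is the kind of distinction a plain pass/fail signal would lose.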
The asymmetric information structure is the key design decision. Neither agent can solve the puzzle alone, so collaboration is not merely encouraged: it is the only path to the win state. The environment therefore does not just observe whether agents collaborate; it makes collaboration the only viable strategy and then measures how agents find and execute it.
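The "neither agent can solve it alone" property can be checked mechanically when a puzzle is generated. The sketch below assumes a three-digit code and models each clue as a predicate over candidate codes; the clue sets, `candidates`, and `requires_collaboration` are hypothetical names invented for this example, not part of the environment as described.

```python
from itertools import product
from typing import Callable, Iterable, List

Clue = Callable[[str], bool]


def candidates(clues: Iterable[Clue], length: int = 3) -> List[str]:
    """Return every digit string of the given length consistent with all clues."""
    clue_list = list(clues)
    codes = ("".join(digits) for digits in product("0123456789", repeat=length))
    return [code for code in codes if all(clue(code) for clue in clue_list)]


def requires_collaboration(clues_a: List[Clue], clues_b: List[Clue]) -> bool:
    """True if each agent alone is left guessing but the pooled clues pin one code."""
    return (
        len(candidates(clues_a)) > 1
        and len(candidates(clues_b)) > 1
        and len(candidates(clues_a + clues_b)) == 1
    )


# Assumed clue split for illustration only.
agent_a = [lambda c: c[0] == "3", lambda c: sum(map(int, c)) == 14]
agent_b = [lambda c: c[-1] == "4"]

print(len(candidates(agent_a)))                  # 8
print(len(candidates(agent_b)))                  # 100
print(candidates(agent_a + agent_b))             # ['374']
print(requires_collaboration(agent_a, agent_b))  # True
```

A generator could reject any clue split for which `requires_collaboration` returns False, so that every puzzle it emits cannot be solved from one agent's clues alone.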
The environment also separates capabilities that are often conflated. Stages 1 through 4 are designed to isolate and then recombine distinct skills: baseline collaboration, role flexibility, distributed synthesis, and transfer to a new puzzle surface. Difficulty is not the point. Actions, choices, sequencing, and repair under asymmetric information are the point.