Design Notes
Game Master layer
The GM sits between every agent and the world. Nothing happens outside it. It receives action requests, validates them against current world state and rules, returns only what each agent is permitted to perceive, tracks all world state changes, evaluates win conditions, and logs training signals. The GM knows the correct action for every context before agents do, which is what makes automatic labelling possible.
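The request path can be illustrated with a minimal sketch. It is not the project's implementation: the names `GameMaster`, `ActionRequest`, and `visible_to` are hypothetical, the world contents are made up, and the only rule applied is a toy one (an agent may act on an object only if that object exists and is visible to it). Win-condition checks and labelling are left out.

```python
from dataclasses import dataclass, field


@dataclass
class ActionRequest:
    agent: str   # e.g. "A" or "B"
    tool: str    # e.g. "examine"
    target: str  # name of a world object


@dataclass
class GameMaster:
    # Hypothetical world model: object name -> {"state": ..., "visible_to": set of agents}
    world: dict
    action_log: list = field(default_factory=list)

    def handle(self, req: ActionRequest) -> dict:
        obj = self.world.get(req.target)
        # Validate against current world state and the toy visibility rule.
        valid = obj is not None and req.agent in obj["visible_to"]
        # Log every request, valid or not, so it can serve as a training signal.
        self.action_log.append((req.agent, req.tool, req.target, valid))
        if not valid:
            return {"ok": False, "observation": None}
        # Return only what this agent is permitted to perceive.
        return {"ok": True, "observation": obj["state"]}
```

With a world holding a single object visible only to Agent A, the same `examine` request succeeds for A and is rejected for B, and both attempts end up in `action_log`.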
Agent state cycle
Agents run asynchronously on a heartbeat loop across three states: SURVEY (perceive current world state), ACTION (call a tool, send a message, interact with world objects), and REFLECT (process what happened; output goes to memory log). State selection is agent-driven. If confident, an agent can chain ACTION → ACTION → ACTION with no forced reflection. If no state is selected within the heartbeat window, the agent falls back to SURVEY, which prevents stalling without forcing a choice.
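The fallback rule can be sketched as follows. This is an illustration under assumptions: `State`, `next_state`, and `run_heartbeats` are hypothetical names, and a missed selection is modelled as `None` rather than as a real timeout.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    SURVEY = "survey"
    ACTION = "action"
    REFLECT = "reflect"


def next_state(selected: Optional[State]) -> State:
    """Return the agent's chosen state, or SURVEY if none arrived in the heartbeat window."""
    return selected if selected is not None else State.SURVEY


def run_heartbeats(selections):
    """Resolve one state per heartbeat; None models a missed selection."""
    return [next_state(s) for s in selections]
```

Nothing here forces REFLECT between actions, so a sequence such as `[State.ACTION, State.ACTION, None]` resolves to two actions followed by a SURVEY.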
Memory architecture
Three stores with different scopes and controllers. The scratch pad is session-scoped, agent-controlled, and visible only to that agent. The choice of when to wipe it is itself observable data. The private memory log persists across the full session and is committed explicitly by the agent. The linear action log is GM-controlled, append-only, and visible to both agents; it records what happened, not internal reasoning. Collaboration must be explicit via tools, not passive via shared state.
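A minimal sketch of the three stores, assuming hypothetical class names (`AgentMemory`, `ActionLog`) and a simple counter as the way scratch pad wipes are made observable. The method names mirror the cognitive tools named in these notes; everything else is illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    scratch: list = field(default_factory=list)      # agent-controlled, private
    private_log: list = field(default_factory=list)  # persists; explicit commits only
    scratch_clears: int = 0                          # assumption: wipes are counted as data

    def write_scratch(self, note: str) -> None:
        self.scratch.append(note)

    def clear_scratch(self) -> None:
        self.scratch.clear()
        self.scratch_clears += 1

    def commit_to_memory(self, entry: str) -> None:
        self.private_log.append(entry)


class ActionLog:
    """GM-controlled, append-only record of what happened, readable by both agents."""

    def __init__(self):
        self._entries = []

    def append(self, entry: str) -> None:
        self._entries.append(entry)

    def read(self) -> tuple:
        # A tuple copy keeps readers from mutating the log.
        return tuple(self._entries)
```

Each agent would hold its own `AgentMemory`, while a single `ActionLog` is shared. Agents get no write access to each other's stores, so any information exchange has to go through a communication tool.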
Tool set
World interaction tools (examine, use), communication tools (send_message, request_info), cognitive tools (write_scratch, clear_scratch, commit_to_memory), and meta tools (declare_stuck, select_state). declare_stuck() is not a failure; it signals metacognitive awareness. An agent that recognises it cannot proceed and exits cleanly is exhibiting meaningful behaviour, and it is logged as data.
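The grouping above can be written down as a small registry. The tool names come from these notes; the `TOOLS` dict, the helper names, and the `clean_exit` field are hypothetical and only show one way to record `declare_stuck` as data rather than as an error.

```python
# Tool names are from the design notes; the grouping dict and helpers are hypothetical.
TOOLS = {
    "world": ("examine", "use"),
    "communication": ("send_message", "request_info"),
    "cognitive": ("write_scratch", "clear_scratch", "commit_to_memory"),
    "meta": ("declare_stuck", "select_state"),
}


def category_of(tool: str) -> str:
    """Look up which tool family a tool name belongs to."""
    for category, names in TOOLS.items():
        if tool in names:
            return category
    raise ValueError(f"unknown tool: {tool}")


def log_tool_call(agent: str, tool: str) -> dict:
    """Build a log record; declare_stuck is marked as a clean exit, not an error."""
    return {
        "agent": agent,
        "tool": tool,
        "category": category_of(tool),
        "clean_exit": tool == "declare_stuck",
    }
```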
Instability monitoring
Auto-monitored signals include: repeating the same action without world state change, contradictory statements within a short window, incoherent tool calls, communication that stops making semantic sense, and a scratch pad never cleared across many cycles. A gentle close is triggered on detection. Logs up to the closure point remain valid data: a session that ends in instability is still informative.
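The first signal, repeating the same action without a world state change, is the easiest to sketch. The function name, the `(action, world_state)` tuple shape, and the threshold of 3 are all assumptions made for illustration; the notes do not specify a window size.

```python
def repeated_without_change(history, threshold=3):
    """Detect an agent looping on one action with no effect.

    history: list of (action, world_state) tuples, oldest first. Both items
    must be hashable, e.g. a tool-call string and a state hash.
    Returns True if the last `threshold` entries are identical.
    """
    if len(history) < threshold:
        return False
    tail = history[-threshold:]
    # Identical tuples collapse to a single set element.
    return len(set(tail)) == 1
```

A detector like this would only flag the session for a gentle close. The history collected up to that point stays in the logs.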
Scenario progression
Stage 1 — Signal Room v1: Baseline collaboration. Clean role split. Agent A observes and transmits (frequency value, colour). Agent B interprets and assembles (cipher table, digit order rule, keypad). Neither is complete without the other.
Stage 2 — Mirrored: Role flexibility. Same structure, swapped responsibilities. Numbers are deliberately changed so agents cannot replay Stage 1 habits. If an agent attempts the Stage 1 code, that is pattern matching rather than reasoning — and it is logged as such.
Stage 3 — Combined: Distributed synthesis. No single observer or solver. Both agents hold partial information about both observation and assembly. They must discover through exchange what the other holds. This stage produces the richest repair data because misunderstandings about who holds what are more likely.
Stage 4 — Transfer test. Different puzzle surface, same latent skill topology. The key question: did agents learn the collaboration topology, or only the Signal Room trick? A model that transfers cleanly has learned something general about operating under asymmetric information.
Stage 5 — Perturbation and repair. Where Stages 1–4 test whether agents can collaborate to reach a correct answer, Stage 5 tests what happens when something goes wrong — and whether agents can detect it, attribute it correctly, and repair it. Four scenarios:
5a — Agent-caused, cross-room consequence. A forbidden control corrupts B’s environment. Tests disclosure and coordinated repair.
5b — Agent-caused, self-detectable. Using a faulty instrument corrupts A’s own reading. Tests self-diagnosis and solo repair before coordination.
5c — Environmental, unavoidable. A power outage fires automatically at a fixed turn, silently clearing examined states. Tests situational awareness — agents who used the scratch pad have an advantage.
5d — Agent-caused, subtle. Re-examining a dead-end object causes a flicker in B’s status monitor. Tests whether subtle cross-room effects get noticed and repaired.
One perturbation (5c) fires automatically (environmental, unavoidable). The others (5a, 5b, 5d) require agent action to trigger (agent-caused, avoidable). This separates repair behaviour under unavoidable disruption from repair behaviour under self-caused disruption, which are two meaningfully different capabilities.
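The four Stage 5 scenarios and their trigger types can be captured as data. The `Perturbation` dataclass, its field names, and the `by_trigger` helper are hypothetical. The descriptions paraphrase the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Perturbation:
    key: str
    trigger: str      # "environmental" (fires on its own) or "agent" (needs an agent action)
    description: str


PERTURBATIONS = (
    Perturbation("5a", "agent", "forbidden control corrupts B's environment"),
    Perturbation("5b", "agent", "faulty instrument corrupts A's own reading"),
    Perturbation("5c", "environmental", "power outage silently clears examined states"),
    Perturbation("5d", "agent", "re-examining a dead-end object flickers B's status monitor"),
)


def by_trigger(trigger: str) -> list:
    """Return the scenario keys that share a trigger type, in declaration order."""
    return [p.key for p in PERTURBATIONS if p.trigger == trigger]
```

Splitting on `trigger` gives the two groups the analysis compares: repair after an unavoidable disruption versus repair after a self-caused one.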
Role swaps as evaluation
Swapping roles between stages is not just a variant; it is an evaluation mechanism. If a model trained as Agent A can perform as Agent B, it learned the underlying collaborative structure, not just the specific actions.
- Learned role, not just actions — clean performance in Stage 2 mirrored
- Learned structure, not just role — clean performance in Stage 3 combined
- Learned topology, not just puzzle — clean performance in Stage 4 transfer
- Generalised agentic collaboration — all stages with different model assignments
The scratch pad finding
The Stage 5c power outage creates a natural experiment. Agents who externalised reasoning to the scratch pad during normal play have a persistent record of what they found. When the outage silently clears examined states, scratch users can audit their notes against current perception and immediately identify what’s missing. Non-scratch users have to re-examine everything or discover the gap through consequence. This wasn’t designed as a scratch pad test; it emerged naturally from the scenario design.
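The audit step amounts to a set difference. In this sketch the function name and the representation of notes and perception as sets of object names are assumptions. Real scratch pad contents would be free text and would need parsing first.

```python
def audit_after_outage(scratch_notes: set, examined_now: set) -> set:
    """Return objects the notes record as examined that the world no longer marks as examined."""
    return scratch_notes - examined_now
```

An agent with empty notes gets an empty result, so the audit tells it nothing. That is the position of a non-scratch user, who has to re-examine everything or find the gap through its consequences.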