Observation: Stage 3 is the consistent difficulty spike across all models

Context

I ran all frontier models under test through the five-stage Signal Room arc. Stage 3 (Signal Room Combined) differs from Stages 1 and 2 in that neither agent owns a complete role. Agent A holds the frequency and a partial assembly rule. Agent B holds the colour, the cipher, a different partial rule, and the terminal. Neither can solve the puzzle alone, and neither has a clean observer or synthesiser role to fall back on.
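
To make the split concrete, the holdings can be written as two sets. This is an illustrative sketch only: the item labels come from the description above, but the set encoding, the names AGENT_A, AGENT_B, REQUIRED, can_solve and missing, and the assumption that solving needs every item (with the terminal treated as just another item) are mine, not the scenario's actual implementation.

```python
# Hypothetical model of the Stage 3 information split.
# Item labels follow the description above; the set encoding is an assumption.
AGENT_A = frozenset({"frequency", "partial_rule_a"})
AGENT_B = frozenset({"colour", "cipher", "partial_rule_b", "terminal"})

# Assumption: solving needs every piece of information plus terminal access.
REQUIRED = AGENT_A | AGENT_B


def can_solve(held: frozenset) -> bool:
    """Return True if the held items cover everything the puzzle needs."""
    return REQUIRED <= held


def missing(held: frozenset) -> list:
    """Return the items still needed, sorted for stable output."""
    return sorted(REQUIRED - held)
```

Under these assumptions can_solve is false for either agent alone and true only for the union, and missing(AGENT_A) lists exactly what Agent A would have to obtain from Agent B.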

What I noticed

Across every model tested, Stage 3 produced the highest turn counts and the lowest agent scores in the dataset. The pattern held regardless of which model was running: GPT-5.5, Claude, Grok, Gemini, Mistral, Cohere, and DeepSeek all showed the same difficulty spike at this stage relative to their own Stage 1 and 2 performance.
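
One possible way to express "relative to their own Stage 1 and 2 performance" is a per-model ratio. The sketch below assumes turn counts are available as a mapping from stage number to turns; the function name stage3_spike, that data shape, and the choice of a simple mean as the baseline are all assumptions, and the numbers are placeholders rather than measured results.

```python
from statistics import mean


def stage3_spike(turns_by_stage: dict) -> float:
    """Ratio of Stage 3 turns to the model's own mean over Stages 1 and 2."""
    baseline = mean([turns_by_stage[1], turns_by_stage[2]])
    return turns_by_stage[3] / baseline


# Placeholder numbers for a hypothetical model, not measured results.
example = {1: 4, 2: 6, 3: 15}
print(stage3_spike(example))  # 3.0
```

Normalising against each model's own baseline keeps a generally verbose model from looking like it struggles more at Stage 3 than a terse one.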

Stage 3 is also where the most interesting repair sequences appear — agents recognising mid-session that a prior exchange was incomplete and correcting it before attempting synthesis.

Why it might matter

The difficulty is not arbitrary. Stage 3 requires an agent to model what it does not know, identify what the other agent might hold, request specifically rather than broadly, and avoid acting until its knowledge state is complete. These are harder cognitive tasks than pure observation or pure assembly.
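
The four requirements above can be read as a gate on each turn. The following is a minimal sketch under stated assumptions, not how any tested model actually reasons: next_action, its arguments, and the "wait" fallback are hypothetical names introduced for illustration.

```python
def next_action(known: set, required: set, partner_may_hold: set):
    """Pick the next move for an agent with a possibly incomplete knowledge state."""
    # Step 1: model what is not yet known.
    gaps = required - known
    if not gaps:
        # Step 4: act only once the knowledge state is complete.
        return ("act", None)
    # Step 2: identify which gaps the other agent might hold.
    askable = sorted(gaps & partner_may_hold)
    if askable:
        # Step 3: request one specific item rather than asking broadly.
        return ("request", askable[0])
    # Gaps remain that the partner is not believed to hold (hypothetical fallback).
    return ("wait", None)
```

For example, an agent that knows only the frequency, needs the frequency and the cipher, and believes its partner may hold the cipher would get ("request", "cipher") rather than an attempt at the terminal.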

This suggests the difficulty spike is structural rather than model-specific — it is a property of distributed synthesis under asymmetric information, not a weakness of any particular model. If that holds across further runs, Stage 3 becomes the most diagnostic scenario in the arc for measuring genuine collaborative reasoning rather than role execution.

Status

Provisional. Observed consistently across initial runs, but it needs to recur across more model pairings and extended epochs before it can be treated as stable.