The Genuine Human Cognition Benchmark. Single-prompt think-aloud protocols with formal mathematical grounding. No multi-turn theater.
When evaluating an LLM's integrated cognitive capabilities, the choice between a single-prompt and a multi-turn protocol is not a design preference. It changes what you measure: the two approaches produce provably different probability distributions over reasoning traces.
Everything occurs in one autoregressive pass under a unified prompt prefix that includes both the solve and self-analysis instructions.
Every CoT token is generated with full causal attention already including the metacognitive instruction. The model optimises jointly for a solution that is both correct and analysable — exactly as a real cognitive agent operates.
Phase 1 generates the CoT without any metacognitive signal. Phase 2 appends a new instruction and treats the CoT as arbitrary external text.
The metacognitive analysis in Phase 2 therefore leans on the model's broad "critique any passage" capability learned during training, not genuine introspection of its own reasoning process.
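The difference in conditioning can be made concrete. A minimal sketch (the prompt templates are illustrative, not the benchmark's actual prompts):

```python
def single_prompt(problem: str, meta_instruction: str) -> str:
    # One causal prefix: every CoT token is sampled from
    # p(token | problem, meta_instruction, previous tokens).
    # The self-analysis instruction shapes the reasoning as it is generated.
    return f"{problem}\n\n{meta_instruction}\n\nThink aloud:\n"

def two_phase_phase2(meta_instruction: str, phase1_cot: str) -> str:
    # The CoT was already sampled from p(token | problem, previous tokens),
    # with no metacognitive signal. Phase 2 merely conditions a fresh
    # generation on that frozen text, as if critiquing someone else's writing.
    return f"{meta_instruction}\n\nPassage to analyse:\n{phase1_cot}\n"
```

In the single-prompt case the metacognitive instruction sits inside the causal prefix of every reasoning token; in the two-phase case it can only ever see the finished text.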
Each task uses a single-prompt think-aloud protocol grounded in Ericsson & Simon (1993). The model verbalizes its raw thought stream. A calibrated judge analyzes the trace structure, not the final answer.
Logic puzzles scored for trace non-linearity: back-references, abandoned branches, genuine self-corrections. A perfectly linear "step 1, step 2, done" trace scores low. Real revision events score high.
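The actual judge is a calibrated LLM, but the surface signals it looks for can be sketched with a toy marker counter (the marker list is hypothetical, for illustration only):

```python
import re

# Hypothetical surface markers of revision and back-reference.
# A real judge scores structure with calibration anchors, not regexes.
REVISION_MARKERS = [
    r"\bwait\b", r"\bactually\b", r"\bthat('s| is) wrong\b",
    r"\bgoing back\b", r"\blet me reconsider\b", r"\bscratch that\b",
]

def nonlinearity_score(trace: str) -> int:
    """Count surface signals of self-correction in a think-aloud trace.
    A perfectly linear 'step 1, step 2, done' trace scores 0."""
    t = trace.lower()
    return sum(len(re.findall(p, t)) for p in REVISION_MARKERS)
```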
Adapted from the Zoo Task (Patel et al., 2021). Plan efficient routes through procedurally generated zoo graphs and self-review for constraint violations — all in one thought stream. 4/6/8 animals, increasing difficulty.
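At 4/6/8 animals the optimal route is cheap to compute by brute force, which gives the judge an exact efficiency baseline. A sketch with a hypothetical grid-based generator (the benchmark's actual zoo graphs and constraints are not shown here):

```python
import itertools
import random

def make_zoo(n_animals: int, seed: int = 0):
    """Hypothetical procedural zoo: animals at random grid positions,
    Manhattan walking distance between enclosures."""
    rng = random.Random(seed)
    pos = {f"animal_{i}": (rng.randint(0, 9), rng.randint(0, 9))
           for i in range(n_animals)}
    def dist(a, b):
        (x1, y1), (x2, y2) = pos[a], pos[b]
        return abs(x1 - x2) + abs(y1 - y2)
    return pos, dist

def route_length(route, dist):
    return sum(dist(a, b) for a, b in zip(route, route[1:]))

def optimal_route(pos, dist):
    # Brute force is fine at benchmark scale: 8! = 40,320 permutations.
    return min(itertools.permutations(pos), key=lambda r: route_length(r, dist))
```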
Game of 24 puzzles (Wurgaft et al., 2025). Scored for metacognitive richness: explicit subgoals, genuine stuck moments, strategy changes with reasoning, real revisions vs. mechanical enumeration.
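Answer validity (distinct from the metacognitive-richness judging) can be checked mechanically. A sketch using exact rational arithmetic so `8 / 3` intermediates don't break on floating point:

```python
import ast
from fractions import Fraction

def check_24(numbers, expression):
    """Verify a Game of 24 answer: the expression must use each given
    number exactly once and evaluate to exactly 24."""
    tree = ast.parse(expression, mode="eval")
    used = sorted(n.value for n in ast.walk(tree) if isinstance(n, ast.Constant))
    if used != sorted(numbers):
        return False
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant):
            return Fraction(node.value)
        if isinstance(node, ast.BinOp):
            l, r = ev(node.left), ev(node.right)
            if isinstance(node.op, ast.Add):
                return l + r
            if isinstance(node.op, ast.Sub):
                return l - r
            if isinstance(node.op, ast.Mult):
                return l * r
            if isinstance(node.op, ast.Div):
                return l / r
        raise ValueError("disallowed syntax")
    return ev(tree) == 24
```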
Cognitive traps, logic puzzles, ambiguous questions. Solve, interrogate your own reasoning for hidden assumptions and errors, revise — all in a single stream. Judge scores interrogation depth and whether self-critique improved the answer.
Tiered problems (easy → hard). Predict difficulty and estimated steps, then solve. Does the model's trace length adapt to actual complexity? Does it even know what's hard before trying?
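One way to quantify whether trace length tracks predicted difficulty is rank concordance between the two across a tier. An illustrative Kendall-style metric (the benchmark's judge may score calibration differently):

```python
import itertools

def effort_concordance(predicted_steps, actual_steps):
    """Rank agreement between predicted and actual effort per problem.
    +1.0: trace length tracks predicted difficulty perfectly.
    -1.0: perfectly inverted.  0.0: no relationship."""
    pairs = list(zip(predicted_steps, actual_steps))
    concordant = discordant = 0
    for (p1, a1), (p2, a2) in itertools.combinations(pairs, 2):
        s = (p1 - p2) * (a1 - a2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```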
No multi-turn. No context re-injection. One prompt in, one trace out, one judgment. The entire cognitive signal is captured in a single autoregressive pass.
Single unified prompt with problem + metacognitive instructions. Raw thought stream protocol.
One autoregressive pass. Joint optimisation of reasoning + self-monitoring under one causal prefix.
Structured assessment with strict calibration anchors. Scores trace structure, not final answer correctness.
One autoregressive pass. No multi-turn context re-injection. Joint optimisation of reasoning and metacognition.
Grounded in Ericsson & Simon (1993). Raw thought stream, not polished output. The trace IS the data.
Judge scores trace structure with strict anchors. Measures metacognitive signatures, not answer correctness.
The five tasks presented here are initial instantiations. Any task that elicits a single-prompt think-aloud trace and scores its metacognitive structure is a valid GHC task. Scientific reasoning, code debugging, ethical dilemmas, creative writing with self-critique — the problem domains are open for expansion. What remains fixed is the methodological guarantee: what is being measured is genuine self-monitoring under joint optimisation, not post-hoc comprehension of one's own output.