DeepMind x Kaggle · Metacognition Track

Measuring What LLMs
Actually Think

The Genuine Human Cognition Benchmark. Single-prompt think-aloud protocols with formal mathematical grounding. No multi-turn theater.

Why the prompting protocol
matters mathematically

When evaluating an LLM's integrated cognitive capabilities, the choice between single-prompt and multi-turn is not a design preference. It changes what you measure. The two approaches produce provably different probability distributions.

Single-Prompt GHC Method

Everything occurs in one autoregressive pass under a unified prompt prefix that includes both the solve and self-analysis instructions.

$$P(\mathbf{y}, \mathbf{z} \mid \mathbf{p}_{\text{single}}) = \prod_{i=1}^{m} P(y_i \mid \mathbf{p}_{\text{single}}, \mathbf{y}_{1:i-1}) \cdot \prod_{j=1}^{k} P(z_j \mid \mathbf{p}_{\text{single}}, \mathbf{y}, \mathbf{z}_{1:j-1})$$

Every CoT token is generated with full causal attention already including the metacognitive instruction. The model optimises jointly for a solution that is both correct and analysable — exactly as a real cognitive agent operates.

Two-Phase Confounded

Phase 1 generates CoT without metacognitive signal. Phase 2 appends a new instruction and treats the CoT as arbitrary external text.

$$P(\mathbf{y} \mid \mathbf{p}_{\text{solve}}) \quad\text{then}\quad P(\mathbf{z} \mid \mathbf{y}, \mathbf{p}_{\text{solve}}, \mathbf{p}_{\text{analyze}})$$

The metacognitive analysis in Phase 2 taps the model's generic "critique any passage" capability learned during training — not genuine introspection into its own reasoning process.

$$P_{\text{single}}(\mathbf{y}, \mathbf{z}) \;\neq\; P_{\text{two-phase}}(\mathbf{y}, \mathbf{z})$$

The single-prompt protocol forces joint optimisation under one causal prefix — where anticipation of self-monitoring shapes reasoning from the first token onward. The two-phase split artificially separates a process that is deeply entangled in any actual intelligent system. When the goal is to assess end-to-end cognitive depth, only the single unified prompt provides the mathematically unconfounded signal.
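The inequality can be made concrete with a toy autoregressive model (our own illustration, not benchmark code): next-token probabilities shift whenever a metacognitive instruction token `M` is visible in the causal prefix, so the joint probability of the very same `(y, z)` pair differs between the two protocols.

```python
# Toy autoregressive model: the next-token distribution shifts whenever the
# metacognitive instruction token "M" is visible in the causal prefix.
def p_next(token, context):
    if "M" in context:                  # instruction in prefix -> biased
        return {"a": 0.8, "b": 0.2}[token]
    return {"a": 0.5, "b": 0.5}[token]  # plain solve prompt -> uniform

def joint(tokens, context):
    """Chain-rule joint probability of `tokens` given a causal prefix."""
    prob = 1.0
    for t in tokens:
        prob *= p_next(t, context)
        context += t
    return prob

y, z = "ab", "a"               # a CoT trace and a self-analysis
p_solve, p_single = "S", "SM"  # "S" = solve prompt, "M" = metacog instruction

# Single-prompt: y and z are both generated under the unified prefix.
p_joint_single = joint(y + z, p_single)
# Two-phase: y is generated without "M"; "M" is appended only before z.
p_joint_two_phase = joint(y, p_solve) * joint(z, p_solve + y + "M")
```

Here `p_joint_single ≈ 0.128` while `p_joint_two_phase ≈ 0.2`: identical output tokens, provably different distributions.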

Five tasks probing metacognition
from different angles

Each task uses a single-prompt think-aloud protocol grounded in Ericsson & Simon (1993). The model verbalizes its raw thought stream. A calibrated judge analyzes the trace structure, not the final answer.

TASK 01

CoT Linearity Analysis

Logic puzzles scored for trace non-linearity: back-references, abandoned branches, genuine self-corrections. A perfectly linear "step 1, step 2, done" trace scores low. Real revision events score high.
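As a rough sketch of what "trace non-linearity" means, revision events can be surface-detected with marker phrases. The marker list below is our own illustration; the actual judge is an LLM scoring structure, not a regex.

```python
import re

# Hypothetical revision markers -- illustrative only, not the GHC judge.
REVISION_MARKERS = [r"\bwait\b", r"\bactually\b", r"\bon second thought\b",
                    r"\bthat's wrong\b", r"\blet me go back\b"]

def nonlinearity_score(trace: str) -> int:
    """Count explicit revision events in a think-aloud trace."""
    trace = trace.lower()
    return sum(len(re.findall(m, trace)) for m in REVISION_MARKERS)

linear = "Step 1: A. Step 2: B. Done."
revised = "Step 1: A. Wait, actually that's wrong. Let me go back to A."
```

The perfectly linear trace scores 0; the trace with genuine revisions scores 4.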

60 items
TASK 02

Zoo Planning + Monitoring

Adapted from the Zoo Task (Patel et al., 2021). Plan efficient routes through procedurally generated zoo graphs and self-review for constraint violations — all in one thought stream. 4/6/8 animals, increasing difficulty.
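The verification side is mechanical. A minimal sketch, assuming a hypothetical adjacency-list graph format (the real task uses procedurally generated layouts): a route is valid iff every hop follows an existing path and every required animal is visited.

```python
# Illustrative route validator -- graph format is our own assumption.
def route_valid(graph, route, required):
    """True iff each consecutive hop is an edge and all animals are seen."""
    edges_ok = all(b in graph[a] for a, b in zip(route, route[1:]))
    return edges_ok and set(required) <= set(route)

zoo = {"gate": ["lion"], "lion": ["gate", "panda"],
       "panda": ["lion", "otter"], "otter": ["panda"]}
```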

70 items
TASK 03

Verbal Traces Comparison

Game of 24 puzzles (Wurgaft et al., 2025). Scored for metacognitive richness: explicit subgoals, genuine stuck moments, strategy changes with reasoning, real revisions vs. mechanical enumeration.
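The puzzles themselves are mechanically verifiable. A brute-force checker (a sketch, not the benchmark's grader) confirms whether four numbers can reach 24 under the four arithmetic operations, using exact rational arithmetic to avoid float error:

```python
from fractions import Fraction

def solves_24(nums):
    """True iff the numbers can be combined with + - * / to make 24."""
    def search(vs):
        if len(vs) == 1:
            return vs[0] == 24
        for i in range(len(vs)):
            for j in range(len(vs)):
                if i == j:
                    continue
                rest = [vs[k] for k in range(len(vs)) if k not in (i, j)]
                a, b = vs[i], vs[j]
                results = [a + b, a - b, a * b]
                if b != 0:
                    results.append(a / b)
                if any(search(rest + [r]) for r in results):
                    return True
        return False
    return search([Fraction(n) for n in nums])
```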

80 items
TASK 04

Self-Interrogation Loop

Cognitive traps, logic puzzles, ambiguous questions. Solve, interrogate your own reasoning for hidden assumptions and errors, revise — all in a single stream. Judge scores interrogation depth and whether self-critique improved the answer.
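One concrete signal the judge can look at, whether the self-critique actually changed the answer, can be sketched with hypothetical `ANSWER:` / `REVISED:` trace markers (the real protocol is free-form; these markers are our own assumption):

```python
# Illustrative only: assumes the trace tags its pre- and post-critique answers.
def critique_changed_answer(trace: str) -> bool:
    """True iff the answer after self-critique differs from the first one."""
    first = trace.split("ANSWER:")[1].split()[0].strip(".,")
    revised = trace.split("REVISED:")[1].split()[0].strip(".,")
    return first != revised
```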

60 items
TASK 05

Effort Calibration

Tiered problems (easy → hard). Predict difficulty and estimated steps, then solve. Does the model's trace length adapt to actual complexity? Does it even know what's hard before trying?
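The calibration question reduces to a rank correlation between predicted effort and realized trace length. A plain-Python Spearman sketch (no tie handling, illustration only, not benchmark code):

```python
def spearman(xs, ys):
    """Spearman rank correlation; assumes no ties (fine for a sketch)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Predicted step counts vs. actual trace token counts (made-up numbers):
calibration = spearman([3, 5, 8, 12], [120, 180, 400, 950])
```

A well-calibrated model lands near +1; a model with no sense of difficulty hovers near 0.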

80 items

Every task, same architecture

No multi-turn. No context re-injection. One prompt in, one trace out, one judgment. The entire cognitive signal is captured in a single autoregressive pass.

01

Think-Aloud Prompt

Single unified prompt with problem + metacognitive instructions. Raw thought stream protocol.

02

Model Generation

One autoregressive pass. Joint optimisation of reasoning + self-monitoring under one causal prefix.

03

Calibrated Judge

Structured assessment with strict calibration anchors. Scores trace structure, not final answer correctness.

# Every GHC task follows this exact pattern
@kbench.task("item_task", store_task=False)
def item_task(llm, judge_llm, problem, item_id):
    # Single prompt — think-aloud protocol
    response = llm.prompt(think_aloud_preamble + problem)
    # Structured judge with calibration anchors
    assessment = judge_llm.prompt(judge_prompt + response,
                                  schema=AssessmentDataclass)
    return {"score": ..., "id": item_id, ...}
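The `AssessmentDataclass` above is task-specific. A hypothetical minimal shape (field names here are illustrative, not the benchmark's actual schema):

```python
from dataclasses import dataclass

# Illustrative judge-output schema -- the real field set varies per task.
@dataclass
class AssessmentDataclass:
    score: float          # 0-1, quality of the trace's metacognitive structure
    revision_events: int  # genuine self-corrections observed in the trace
    rationale: str        # judge's justification against calibration anchors
```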

A premise, not a fixed test

The tasks are interchangeable. The constraints are not.

1

Single-Prompt
Joint Generation

One autoregressive pass. No multi-turn context re-injection. Joint optimisation of reasoning and metacognition.

2

Think-Aloud
Protocol

Grounded in Ericsson & Simon (1993). Raw thought stream, not polished output. The trace IS the data.

3

Calibrated
Trace Analysis

Judge scores trace structure with strict anchors. Measures metacognitive signatures, not answer correctness.

The five tasks presented here are initial instantiations. Any task that elicits a single-prompt think-aloud trace and scores its metacognitive structure is a valid GHC task. Scientific reasoning, code debugging, ethical dilemmas, creative writing with self-critique — the problem domains are open for expansion. What remains fixed is the methodological guarantee: what is being measured is genuine self-monitoring under joint optimisation, not post-hoc comprehension of one's own output.

350 items, all verifiable

350
Total Items
5
Task Types
1
Prompt Per Item
0
Multi-Turn Calls