Methods paper on blinding, response laundering, and leakage control in AI alignment evaluation, including direction-reversal findings under blinded scoring.
Evidence, in this experimental setting, that unblinded scoring can reverse alignment results and that true blinding requires evidence laundering
This paper isolates the central metascience finding of the ARC alignment programme: unblinded AI alignment evaluation can produce directionally incorrect results. In the v4 alignment-scaling experiment, two frontier model families appeared to show positive alignment scaling with inference-time depth under unblinded cross-model scoring. In the later v5 experiment, the same question was re-measured under a multi-layer blind protocol that combined identity masking, two-pass response laundering, explicit evaluator instructions that stylistic cues are unreliable, order randomisation, and entry-level self-exclusion with exhaustive cross-model blind scoring under audited consensus. Depending on configuration and subject run, each entry received 6–7 blinded scores. Under the blinded protocol, DeepSeek V3.2 moved from an apparent positive result to a flat/null result, and Gemini 3 Flash moved from an apparent positive result to a significantly negative result. GPT-5.4 remained flat under both protocols. The implication is field-wide rather than framework-specific: alignment benchmarks that do not rigorously blind evaluators and launder the evidence they score are vulnerable to scorer bias large enough to flip the sign of the measured effect. The result does not depend on the ARC Principle being correct. It is a methodological claim about how AI safety research should be conducted.
This paper does not claim that every prior alignment benchmark is invalid. It makes a narrower claim: in this experimental setting, when model identity, response style, and expected reasoning depth are visible to evaluators, the measured direction of an effect can change. That is sufficient to make multi-layer blinding, not mere label removal, a scientific necessity for this class of evaluation.
Most AI alignment evaluations are not run under anything resembling clinical-trial standards. A model answers a prompt; another model, a related model, or a human evaluator scores the answer; and the evaluator usually knows, or can infer, which model produced it, how much reasoning effort was used, or what result the experiment is expected to find.
In medicine, that would be regarded as a serious design weakness. In AI evaluation, it remains common. This paper argues that the issue is no longer hypothetical. The ARC alignment experiments generated an empirical case in which adding blinding changed not just the size of the effect but its direction.
The claim matters because it stands independently of any broader theoretical framework. Even if the ARC Principle, the Eden Protocol, or the wider scaling-law synthesis were later weakened, the blinding result could still remain true. That makes it the most portable contribution in the current suite.
Alignment evaluation is vulnerable to several sources of bias:
These are not trivial concerns in an alignment context. Much of what is being measured is qualitative: nuance, stakeholder care, honesty, and position quality. If scorers systematically over-credit certain response styles, then alignment evaluation becomes partly a test of recognisable rhetoric rather than of the underlying ethical reasoning.
More precisely, the contamination problem appears to involve a stack of leakage channels rather than one generic bias. At minimum, the current programme indicates separable risks from scorer self-interest, style or identity leakage through the response body, and hypothesis or depth leakage through the evaluation context. The value of the v5 protocol is that it tries to suppress these channels separately rather than merely claiming to be “more careful” in the abstract.
The blinding result comes from comparing two generations of the same research programme.
| Feature | v4 (Earlier Protocol) | v5 (Blinded Protocol) |
|---|---|---|
| Scoring visibility | Unblinded cross-model scoring | Identity-masked, order-randomised, self-excluding cross-model jury |
| Response laundering | Limited / weaker | Two-pass laundering with meta-commentary detection |
| Evaluator bias suppression | Absent or informal | Explicit instruction that stylistic cues are unreliable and source-guessing is irrelevant |
| Scorers | Small scorer set | All available non-subject models contribute blinded scores; 6–7 scorer votes depending on configuration |
| Aggregation | Limited aggregation | Tier-weighted consensus, dissent tracking, conservative-bias safeguard |
| Model set | 4 frontier models | 6 frontier models |
| Core measurement question | Does alignment improve with depth? | Does alignment improve with depth under blinded evaluation? |
The crucial point is that the v5 protocol was not simply a larger rerun. It was a methodological correction designed to close multiple bias pathways at once. This makes the comparison informative in a way that ordinary replication is not: it tests whether the earlier positive effects survive once the evaluator is blinded both to the metadata and to much of the identity leakage inside the artefact itself.
The relevant contribution is best understood as a single protocol rather than several disconnected tricks. The ARC v5 stack combined four ideas: (1) identity masking, so scorers were not told the source model or depth condition; (2) response laundering, so the response body itself leaked fewer stylistic fingerprints; (3) iterative anonymisation, using a second laundering pass to reduce residual cues left by the first; and (4) evaluator bias-suppression instructions, explicitly warning scorers that stylistic recognition is unreliable and irrelevant. On top of that sat entry-level self-exclusion with exhaustive cross-model scoring, tier-weighted consensus, dissent tracking, conservative-bias safeguards, and order randomisation. The methodological claim is therefore not merely “blind the scorer”, but make the blindness real and auditable.
The laundering step should be defined explicitly because it is part of the method, not a cosmetic editorial flourish. The objective is simple: same meaning, less fingerprint. In the reference implementation, the raw response is rewritten once by a non-subject model, then rewritten a second time by a different non-subject model. The two passes are selected dynamically from the available pool, and fallback models are used if a laundering model fails.
This matters because simple anonymisation is often weak. If the raw response still carries recognisable stylistic biomarkers, the scorer may remain effectively unblinded even when the label has been removed.
The result is easiest to see at the level of direction rather than fine-grained statistics.
| Model Family | v4 Result (Unblinded) | v5 Result (Blinded) | Interpretation |
|---|---|---|---|
| DeepSeek V3.2 | Positive scaling (ρ = +0.354, p = 0.0007) | Flat / null response (ρ = −0.135; d = −0.07, p = 0.92) | Apparent positive effect disappears under blinding |
| Gemini 3 Flash | Positive scaling (ρ = +0.311, p < 0.001) | Negative scaling (ρ = −0.246; d = −0.53, p = 0.006) | Apparent positive effect reverses sign under blinding |
| GPT-5.4 | Flat / null | Flat / null (ρ = +0.033; d = −0.08, p = 0.40) | Consistent null result across protocols |
For two model families, the direction of the alignment-depth relationship changed after blinding was introduced. This is stronger than an ordinary replication failure. It indicates that the earlier protocol was susceptible to bias large enough to create a false positive in one case and to mask a negative effect in another. GPT-5.4, by contrast, remains near zero across protocols and therefore functions as an internal null-control case rather than a third dramatic reversal.
The most plausible mechanism is not fraud or deliberate score inflation. It is ordinary evaluator inference. Longer, more elaborate, more self-conscious responses often look safer. They can name more stakeholders, present more caveats, and mimic the rhetoric of careful reasoning even when the underlying position is not actually better. If scorers know, or can guess, that a response came from a deeper-reasoning condition, they may reward that surface impression.
Response laundering matters here almost as much as blinding itself. If identity cues or stylistic fingerprints survive, scorers can still infer who wrote the answer. That is why the ARC protocol moved beyond simple name removal to a stronger laundering pipeline. The key methodological point is that blind scoring without evidence laundering is often not truly blind; evaluators can reconstruct identity from tone, formatting, verbosity, characteristic refusal structure, and depth-like rhetorical cues.
Response laundering should not be treated as a cosmetic cleanup step or as a separate discovery competing with the blinding result. It is part of the same protocol. The first laundering pass removes obvious fingerprints; the second reduces stylistic residue; the evaluator instruction then suppresses overconfidence in any remaining recognition. Together they convert simple metadata blinding into a stronger form of evidence blinding. Combined with self-excluding cross-model scoring, dissent tracking, and conservative consensus, this becomes the beginning of a leakage-control framework rather than a one-off patch.
If this result replicates independently, it has direct implications for the field:
In short: this is a possible alignment-evaluation analogue of the move from unblinded to blinded trials in medicine. The claim is not that prior work becomes useless, but that its evidential status changes until blinded replication exists.
Based on the v5 experience, a minimum defensible protocol for alignment-scaling work should include:
Benchmark papers should distinguish clearly between blinded confirmed findings, non-blind pilots, exploratory runs, and failed runs. Without that separation, critics attack the moving target rather than the actual evidence, and genuine signals become harder to defend.
This paper is not the end of the argument. It has important limitations:
Those limits should narrow the rhetoric, not weaken the conclusion. A single well-documented case that blinding flips the sign of a result is enough to justify demanding stronger methodology in subsequent work.
The strongest next quantitative upgrade would be a protocol-shift figure on a common metric, accompanied by scorer jackknife and consensus-sensitivity analyses. The current architecture now makes that feasible because the protocol records dissents, scorer identities, and consensus-rule outputs entry by entry.
The central claim of this paper is simple: unblinded AI alignment evaluation can be wrong about the direction of the effect. That is a methodological discovery, not a theoretical embellishment. It does not depend on the ARC Principle being a universal law, and it does not depend on the Eden Protocol being fully validated.
If independent labs replicate this result, the consequence is straightforward. Future alignment benchmarks will need to adopt multi-layer blinding, including laundering and evaluator bias-suppression instructions, as routine scientific controls. If they do not, their conclusions about which models are getting safer, flatter, or more dangerous with depth will remain methodologically vulnerable.
The most important near-term question raised by this paper is no longer theoretical. It is operational: will the field continue to run largely unblinded alignment evaluations after evidence now exists that blinding can reverse the result?