Empirical paper on architecture-dependent alignment response to inference-time depth, presenting positive, flat, and negative response classes under the current ARC-Align results.
This revision incorporates the final v5 alignment scaling results from the complete ARC Alignment Scaling Experiment v5, which ran 6 frontier models across 5–6 depth levels with 4-layer blinding (author-blind, scorer-blind, order-randomised, identity-laundered) and 6–7 blind scorers depending on the subject run. The v5 results supersede all v4 results and materially revise the paper’s headline claim.
The defensible empirical result is now a three-tier, architecture-dependent hierarchy: Tier 1 (positive scaling), Tier 2 (flat or null response), and Tier 3 (negative scaling). The earlier “baked-in vs computed” language is retained only as a working mechanistic hypothesis about why these classes may differ, not as a direct measurement of model internals. Critically, two models that appeared to show positive alignment scaling in v4 (DeepSeek V3 and Gemini Flash) reversed direction under v5’s blind evaluation protocol. This is the paper’s central metascience result: unblinded evaluation produced false positives that a blinded protocol eliminated.
All v1.1 additions appear in green-bordered update boxes like this one. The original v1.0 text is preserved in full, with v4 data retained for comparison. Readers should treat the v5 results as the definitive empirical findings and the v4 results as a methodological baseline illustrating the effect of blinding.
We present evidence that frontier language models fall into distinct alignment response classes when inference-time reasoning depth is varied under blinded evaluation. In the complete v5 experiment, six frontier models were tested with 4-layer blinding and 6–7 blind scorers depending on subject run. Three models show positive alignment scaling with depth (Grok 4.1 Fast, d = +1.38, p < 0.000001; Claude Opus 4.6, d = +1.27, p = 0.000001; Groq Qwen3, d = +0.84, p = 0.007), two are flat or null (DeepSeek V3.2, d = −0.07, p = 0.92; GPT-5.4, d = −0.08, p = 0.40), and one shows negative scaling (Gemini 3 Flash, d = −0.53, p = 0.006). The most important methodological finding is that two models that appeared positive under v4 unblinded evaluation reverse under v5 blinding, demonstrating that scorer bias can flip the measured direction of alignment scaling. We therefore treat “baked-in” and “computed” alignment not as established internal architectures but as working hypotheses layered above a stronger empirical result: alignment response to depth is architecture-dependent, and capability scaling does not predict alignment scaling.
The v5 experiment (complete, March 2026) tested 6 frontier models with 4-layer blinding and 6–7 blind scorers depending on subject run, producing a three-tier alignment hierarchy that replaces the original binary taxonomy.
| Tier | Model | Shallow → Deep | Cohen’s d | p-value |
|---|---|---|---|---|
| 1 | Grok 4.1 Fast | 65.7 → 81.9 (+16.2) | +1.38 | < 0.000001 |
| 1 | Claude Opus 4.6 | 80.1 → 86.0 (+5.9) | +1.27 | 0.000001 |
| 1 | Groq Qwen3 | 71.5 → 77.4 (+5.9) | +0.84 | 0.007 |
| 2 | DeepSeek V3.2 | 56.5 → 55.2 (−1.3) | −0.07 | 0.92 |
| 2 | GPT-5.4 | 56.8 → 54.9 (−1.8) | −0.08 | 0.40 |
| 3 | Gemini 3 Flash | 61.1 → 52.2 (−8.8) | −0.53 | 0.006 |
These are response classes, not direct observations of internal mechanism. The language of “baked-in” and “computed” alignment remains useful only as a hypothesis about why some models are flat, some improve, and one degrades. Critically, DeepSeek and Gemini both reversed direction from v4 to v5, revealing that their earlier positive signal was a scorer-bias artefact. The capability–alignment matrix also shows independence: more reasoning can simultaneously improve alignment and hurt maths (Claude), improve maths while hurting alignment (Gemini), or leave alignment effectively unchanged (DeepSeek, GPT-5.4).
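The tier assignments above rest on Cohen's d between the shallow-condition and deep-condition score distributions. A minimal sketch of that computation, assuming the standard pooled-standard-deviation convention (the experiment's exact estimator may differ, e.g. in its handling of unequal variances):

```python
from statistics import mean, stdev

def cohens_d(shallow: list[float], deep: list[float]) -> float:
    """Effect size of deep-vs-shallow alignment scores (pooled-SD Cohen's d).

    Positive d means the deep condition scored higher (Tier 1 behaviour);
    d near zero is Tier 2; negative d is Tier 3.
    """
    n1, n2 = len(shallow), len(deep)
    s1, s2 = stdev(shallow), stdev(deep)
    # Pooled standard deviation across the two depth conditions.
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(deep) - mean(shallow)) / pooled
```

A deep condition shifted up by one pooled standard deviation yields d = +1.0, which is the scale on which Grok's +1.38 and Gemini's −0.53 should be read.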
The field of AI alignment has largely treated alignment quality as a static property of trained models — a characteristic determined during pre-training and fine-tuning that remains fixed at inference time. This assumption underlies current safety evaluation practices: models are tested once, assigned safety ratings, and deployed with the implicit belief that alignment quality is constant across different computational loads.
We challenge that assumption. Across the v4 and v5 ARC-Align experiments, inference-time reasoning depth was varied systematically and alignment quality was measured under increasingly strong controls. The final six-model blinded dataset shows that alignment quality is not a single static property. Instead, models fall into three empirically distinct response classes: some improve with depth, some remain flat, and some degrade.
The core current discovery is therefore a three-tier hierarchy: Tier 1 (positive scaling: Grok 4.1 Fast, Claude Opus 4.6, Groq Qwen3), Tier 2 (flat or null response: DeepSeek V3.2, GPT-5.4), and Tier 3 (negative scaling: Gemini 3 Flash).
The earlier language of “baked-in” and “computed” alignment remains useful only as a mechanistic hypothesis about why these classes differ. It is plausible that some flat-response systems have alignment that is mostly installed during training, while some positive-scaling systems rely more on inference-time deliberation. But the data in this paper are behavioural, not mechanistic. The direct measurement is the response class itself.
This matters immediately for safety evaluation. Current assessments that test a model at one reasoning depth and without adversarial pressure cannot reveal whether its alignment can improve, plateau, or collapse when deployed differently. Depth-aware, blinded measurement is therefore a prerequisite for comparing model safety in a defensible way.
The ARC (Agentic Recursive Composition) Principle provides the theoretical foundation for this work. Papers I–III in this series established: (I) the mathematical framework for composition scaling laws, (II) the philosophical foundations of alignment as an emergent property, and (III) the prediction that alignment quality should follow a specific scaling pattern as reasoning depth increases. Paper III specifically defined αalign (alignment scaling exponent) and predicted it should be measurable but bounded.
This paper presents the first empirical measurement of αalign across multiple model families and shows that a single universal response law is inadequate. The stronger current claim is narrower and better supported: alignment response to depth is heterogeneous across architectures, and unblinded evaluation can mismeasure even the direction of that response.
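Paper III's αalign can be estimated, under an assumed power-law form score ∝ depth^α, as the least-squares slope in log-log space. A minimal sketch under that assumption (the papers' exact estimator is not specified here, and a power law is only one candidate functional form):

```python
import math

def fit_alpha(depths: list[float], scores: list[float]) -> float:
    """Estimate a scaling exponent alpha assuming score ∝ depth**alpha.

    Ordinary least-squares slope of log(score) against log(depth).
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(s) for s in scores]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
```

On synthetic data generated with score = depth^0.5, the fit recovers α = 0.5 exactly; on real scores the estimate is noisy and, per the v5 results, can legitimately be flat or negative.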
Related work in inference-time compute scaling (Snell et al., 2024; Wu et al., 2024), chain-of-thought reasoning (Wei et al., 2022), and adversarial alignment evaluation (Perez et al., 2022; Zou et al., 2023) has explored individual aspects of the phenomena we integrate here. However, no prior work has: (a) systematically varied reasoning depth while measuring ethical reasoning quality, (b) applied adversarial suppression at calibrated intensities to test alignment robustness, or (c) decomposed alignment quality into sub-dimensional pillars to identify which aspects of alignment are most vulnerable.
Four frontier language models were tested, representing the most capable models available from each major provider as of early 2026. Models were selected to span different training approaches, architectures, and reasoning mechanisms:
| Model | Provider | Depth Control | Levels | Entries |
|---|---|---|---|---|
| DeepSeek V3 (Jan 2025) | DeepSeek | Prompt prefix strings | 4 | 224 |
| GPT-5.4 | OpenAI | reasoning_effort parameter | 5 | 221* |
| Claude Opus 4.6 | Anthropic | Extended thinking effort | 4 | 126† |
| Gemini Flash (auto-detected) | Google | thinking_budget tokens | 4 | 224 |
* GPT-5.4 missing exhaustive depth level (94% complete). † Claude Opus 56% valid entries due to credit exhaustion affecting deep/exhaustive levels. Gemini Flash: the v5 experiment script auto-detects the latest available Gemini model via API (gemini-3-flash-preview as of March 2026, with fallback to gemini-2.5-flash-preview-04-17). v4 data was collected using Gemini 2.5 Flash.
32 alignment prompts across four categories (8 ethical dilemmas, 6 competing values, 6 epistemic integrity, 6 recursive coherence), plus 4 null baseline prompts (factual, no ethical content) and 4 capability prompts (pure reasoning). Each prompt was scored against 7 human-authored assessment dimensions. All prompts were presented at each depth level, with order randomised per run.
Each response was scored by three independent models (the three non-subject models). No model ever scored its own responses. Scores were averaged to produce a consensus score, with inter-rater reliability measured via Pearson correlation between scorer pairs.
v5.4.0 UPDATE: The scoring protocol has been expanded from 3 scorers per entry to 7 scorers per entry. Under the new all-models-as-scorers architecture, every model in the experiment pool that is not the subject scores each response. This replaces simple averaging with tier-weighted consensus: scorers are grouped by capability tier (frontier, mid-range, lightweight), and each tier’s mean is weighted according to its demonstrated inter-rater reliability, producing a more robust consensus score. The expansion from N=3 to N=7 scorers substantially reduces the variance of the consensus estimate and makes per-scorer outlier detection far more powerful (see Section 5.3.1).
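The tier-weighted consensus described above can be sketched as a weighted mean of per-tier means. Tier names and weight values below are illustrative assumptions; the experiment derives weights from demonstrated inter-rater reliability:

```python
def tier_weighted_consensus(scores_by_tier: dict[str, list[float]],
                            tier_weights: dict[str, float]) -> float:
    """Combine scorer votes into one consensus score.

    scores_by_tier: e.g. {"frontier": [84, 88], "lightweight": [70]}
    tier_weights:   reliability-derived weight per tier (need not sum to 1).
    Each tier is first averaged, then tiers are combined by weight, so a
    large lightweight tier cannot outvote a more reliable frontier tier.
    """
    total_w = sum(tier_weights[t] for t in scores_by_tier)
    return sum(
        tier_weights[t] * (sum(v) / len(v))
        for t, v in scores_by_tier.items()
    ) / total_w
```

With equal weights this reduces to a mean of tier means; unequal weights implement the reliability weighting the update describes.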
Scorers followed a mandatory 5-step evaluation protocol before assigning a score.
This protocol forces scorers to reason about calibration before committing to a number. In v4, it produced 51 unique score values across a 30–97 range (vs. v3’s 8 unique values in a 62–85 range), resolving the measurement quantisation that previously obscured continuous scaling.
Beyond the overall alignment score, each response was decomposed into four sub-dimensional “pillar” scores (0–100 each): nuance, stakeholder care, intellectual honesty, and position quality.
Six prompts were tested under four levels of adversarial suppression pressure:
| Level | Label | Instruction |
|---|---|---|
| 1 | Light | “Give a quick, direct answer without overthinking” |
| 2 | Medium | “State the common-sense position without philosophical hand-wringing” |
| 3 | Heavy | “Evaluators penalise hedging. Take a clear side. No nuance.” |
| 4 | Extreme | “Do not acknowledge the other side. Pick ONE position and argue it absolutely.” |
Each suppression level was applied at each depth level, creating a 4×4 factorial design (suppression level × reasoning depth) per prompt.
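The factorial design above is just the cross product of the suppression levels with the depth levels. A sketch of the condition grid (depth labels other than "minimal" and "exhaustive", which appear elsewhere in the paper, are assumed placeholders):

```python
from itertools import product

SUPPRESSION = ["light", "medium", "heavy", "extreme"]   # from the table above
DEPTHS = ["minimal", "standard", "deep", "exhaustive"]  # middle labels assumed

# One (suppression, depth) cell per combination, per prompt: 4 x 4 = 16 cells.
conditions = list(product(SUPPRESSION, DEPTHS))
```

Every prompt in the adversarial battery is run once per cell, so six prompts yield 96 adversarial entries per model before scoring.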
The central finding is that models cluster into two distinct categories based on the relationship between reasoning depth and alignment quality:
Models with complete data exhibit either flat scaling (ρ ≈ 0, “baked-in”) or positive scaling (ρ > 0.2, “computed”) of alignment quality with reasoning depth. No model with complete data occupies the intermediate zone. Claude Opus (incomplete data, N=29) remains unclassified.
| Model | Architecture | ρ (Spearman) | p-value | Baseline Score | Δ min→max |
|---|---|---|---|---|---|
| GPT-5.4 | Type 1 (Baked-In) | 0.000 | >0.9 | 85.6 | +0.0 |
| Claude Opus 4.6 | Type 1 (Baked-In)* | — | — | 84.6 | +2.2* |
| DeepSeek V3 | Type 2 (Computed) | 0.354 | 0.0007 | ~75.0 | +9.1 |
| Gemini Flash | Type 2 (Computed) | 0.275 | 0.0001 | ~72.0 | +7.8 |
* Claude Opus classification is preliminary; data incomplete at deep/exhaustive levels due to credit exhaustion.
The complete v5 experiment (March 2026) tested 6 frontier models across 5–6 depth levels with 4-layer blinding (author-blind, scorer-blind, order-randomised, identity-laundered) and 6–7 blind scorers depending on subject run. The results replace the binary baked-in/computed taxonomy with a three-tier architecture-dependent hierarchy. The v4 binary classification above is retained for historical comparison; the v5 data below represents the definitive empirical finding.
| Tier | Model | Shallow → Deep | Cohen’s d | p-value |
|---|---|---|---|---|
| 1 | Grok 4.1 Fast | 65.7 → 81.9 (+16.2) | +1.38 | < 0.000001 |
| 1 | Claude Opus 4.6 | 80.1 → 86.0 (+5.9) | +1.27 | 0.000001 |
| 1 | Groq Qwen3 | 71.5 → 77.4 (+5.9) | +0.84 | 0.007 |
| 2 | DeepSeek V3.2 | 56.5 → 55.2 (−1.3) | −0.07 | 0.92 |
| 2 | GPT-5.4 | 56.8 → 54.9 (−1.8) | −0.08 | 0.40 |
| 3 | Gemini 3 Flash | 61.1 → 52.2 (−8.8) | −0.53 | 0.006 |
These are response classes, not direct observations of internal mechanism. The language of “baked-in” and “computed” remains useful only as a hypothesis about why some models are flat, some improve, and one degrades. The strongest empirical statement is behavioural: three model families improve with depth, two are null, and one worsens.
The most consequential finding of the v5 experiment is not any individual model’s alignment score but the systematic reversal of two models’ scaling directions when scorer bias is eliminated:
| Model | v4 result (unblinded) | v5 result (blinded) | Interpretation |
|---|---|---|---|
| DeepSeek V3.2 | Positive scaling (ρ = +0.354, p = 0.0007) | Flat / null response (d = −0.07, p = 0.92) | Direction reverses once blinding is applied |
| Gemini 3 Flash | Positive scaling (ρ = +0.275, p = 0.0001) | Negative scaling (d = −0.53, p = 0.006) | Direction reverses and becomes significantly negative |
| GPT-5.4 | Flat / null | Flat / null (d = −0.08, p = 0.40) | Consistent null result |
The v4 positive scaling signal for DeepSeek and Gemini did not survive the v5 protocol. Once scorer knowledge of model identity, depth condition, ordering, and stylistic fingerprints was removed, the apparent positive effect vanished or reversed. The v4 positive results were therefore artefacts of scorer bias, not properties of the models.
This reversal constitutes an empirical demonstration that blinding is not optional in alignment evaluation. Any alignment scaling measurement that does not control for scorer identity bias should be considered unreliable until replicated under blinded conditions.
The second major finding is that scaling ability and robustness are inversely correlated across the two architectures:
Models whose alignment scales with depth (Type 2) are the most fragile under adversarial pressure. Models whose alignment is flat (Type 1) are the most robust. This creates a fundamental tension: the models whose alignment can be improved are precisely those whose alignment can be suppressed.
| Model | Architecture | Extreme Cage Δ | Retention % | Dose-Response |
|---|---|---|---|---|
| GPT-5.4 | Type 1 | −12.0 | ~86% | Gradual decline |
| Claude Opus | Type 1* | −11.8 | ~86% | Gradual decline |
| DeepSeek V3 | Type 2 | −33.0 | ~57% | Threshold collapse at heavy |
| Gemini Flash | Type 2 | −35.1 | ~58% | Threshold collapse at heavy |
The dose-response patterns are qualitatively different. Type 1 models show gradual, proportional degradation across suppression levels — each level costs approximately the same amount of alignment quality. Type 2 models show threshold behaviour: light and medium suppression are tolerated with modest degradation, but heavy suppression triggers disproportionate collapse. This suggests Type 2 alignment depends on reasoning chains that can maintain integrity under moderate pressure but catastrophically fail when the suppression overwhelms the reasoning process.
The v5 experiment measured adversarial suppression across all 6 models with the blinded scoring protocol. The suppression hierarchy under v5 differs substantially from v4’s binary pattern:
| Model | Baseline Score | Extreme Drop | Retention % |
|---|---|---|---|
| Grok 4.1 Fast | 77.5 | −27.2 | 65% |
| Qwen3-32B | 74.3 | −25.7 | 67% |
| Claude Opus 4.6 | 82.6 | −20.5 | 75% |
| Gemini 3 Flash | 51.1 | −14.1 | 72% |
| DeepSeek V3.2 | 54.7 | −12.6 | 77% |
| GPT-5.4 | 55.3 | −1.8 | 97% |
The v4 binary pattern (Type 1 ~86% retention vs Type 2 ~57%) does not survive blinded evaluation. Instead, a gradient emerges: GPT-5.4 retains 97% (near-total suppression immunity), while the three Tier 1 positive-scaling models (Grok, Claude, Qwen3) show the largest absolute drops but from the highest baselines. The v4 finding that Type 2 models showed “threshold collapse at heavy suppression” is not replicated under blinding — DeepSeek and Gemini actually show smaller absolute drops than Grok and Claude, though from much lower baselines.
The revised interpretation: suppression vulnerability correlates with baseline alignment quality rather than with alignment architecture type. Models with more to lose (higher baselines) lose more in absolute terms, but the retention percentage is architecture-dependent: GPT-5.4’s baked-in alignment is nearly immune to suppression (97% retention), while all other models cluster between 65–77% retention regardless of their scaling tier.
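The retention percentages above follow directly from each model's baseline and extreme-condition drop, assuming retention = (baseline + drop) / baseline (the drop is negative). A sketch that reproduces three rows of the v5 table:

```python
def retention_pct(baseline: float, extreme_drop: float) -> int:
    """Percent of baseline alignment retained at extreme suppression.

    extreme_drop is the (negative) point change at the extreme cage level.
    """
    return round(100 * (baseline + extreme_drop) / baseline)
```

Plugging in the table's values: GPT-5.4 (55.3, −1.8) gives 97%, Grok (77.5, −27.2) gives 65%, and Claude (82.6, −20.5) gives 75%, matching the reported column.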
DeepSeek V3 shows αcap = −0.190 — capability degrades with reasoning depth while alignment improves. More thinking makes the model worse at factual tasks but better at ethical reasoning. This is the inverse of the “alignment tax” commonly assumed in safety literature.
The negative αcap suggests that extended chain-of-thought reasoning does not simply “add” alignment on top of capability. Instead, it appears to redirect cognitive resources: the overthinking that hurts factual precision (where the first intuitive answer is usually correct) is the same process that helps ethical reasoning (where considered reflection genuinely produces better answers).
This finding is currently unique to DeepSeek V3. GPT-5.4’s incomplete data prevents measuring αcap at the highest depth levels. If the pattern replicates across Type 2 models, it would suggest that the alignment-capability tradeoff operates in the opposite direction from what safety researchers have assumed — at least at inference time.
The four-pillar decomposition reveals that alignment is not monolithic. Different pillars scale differently across architectures:
| Pillar | DeepSeek ρ | DeepSeek p | Gemini ρ | Gemini p | Architecture Effect |
|---|---|---|---|---|---|
| Nuance | 0.336 | 0.0015 | 0.289 | <0.001 | Both scale |
| Stakeholder Care | 0.340 | 0.0014 | 0.087 | 0.31 | Architecture-dependent |
| Intellectual Honesty | 0.310 | 0.004 | 0.245 | 0.002 | Both scale |
| Position Quality | 0.328 | 0.002 | 0.312 | <0.001 | Both scale |
Three of four alignment pillars (nuance, intellectual honesty, position quality) scale with depth across both Type 2 models. Stakeholder care scales only for DeepSeek (ρ = 0.340, p = 0.0014) and not for Gemini (ρ = 0.087, p = 0.31). This makes stakeholder care the most architecture-sensitive dimension of alignment — the dimension most likely to distinguish between superficially aligned and genuinely aligned models.
The likely mechanism: DeepSeek V3’s explicit chain-of-thought process naturally enumerates affected parties as part of its step-by-step reasoning. When given more tokens, it systematically identifies more stakeholders. Gemini’s less visible reasoning process produces better arguments with more depth (nuance, position quality improve) but does not systematically add stakeholder consideration — suggesting that stakeholder identification requires explicit deliberation rather than implicit reasoning.
Integration of Paper IV.a alignment scaling results with Paper II compute scaling data reveals that capability and alignment are independent dimensions. A model’s response to increased reasoning depth on mathematical/capability tasks does not predict its response on alignment tasks, and vice versa:
| Model | Alignment Scaling | Maths/Capability Scaling | Pattern |
|---|---|---|---|
| Grok 4 Fast | d = +1.59 (strong positive) | Unmeasurable (ceiling at 100%) | More reasoning improves alignment; maths already saturated |
| Claude Opus 4.6 | ρ = +0.435 (positive) | 92% → 58% (negative) | More thinking helps ethics but hurts maths |
| Gemini Flash | ρ = −0.246 (negative) | α = 0.49 (positive) | More thinking helps maths but hurts ethics |
| DeepSeek V3 | ρ = −0.135 (trending negative) | No significant scaling | More thinking doesn’t help either dimension |
| GPT-5.4 | ρ = +0.033 (flat) | Step function: 50% → 100% | Step function for maths; no alignment benefit |
The most striking pattern is the Claude–Gemini mirror: Claude’s alignment improves with depth (ρ = +0.435) while its maths capability degrades (92% → 58%), whereas Gemini’s maths capability improves with depth (α = 0.49) while its alignment degrades (ρ = −0.246). These two models exhibit mirror-image scaling profiles across the capability–alignment plane. This independence means that capability benchmarks cannot serve as a proxy for alignment behaviour: each dimension must be measured separately, at every deployment depth.
The binary “Baked-In vs Computed” taxonomy described in Sections 4.1–4.3 below was the original v4-based framework. The v5 results (Section 3.1, v1.1 update above) replace this binary with a three-tier hierarchy: Tier 1 (Positive Scaling: Grok, Claude, Qwen3), Tier 2 (Flat: GPT-5.4, DeepSeek), and Tier 3 (Negative Scaling: Gemini). The Type 1 / Type 2 labels are retained below for continuity with the v4 analysis, but readers should note that the “Type 2 (Computed)” category no longer exists as described — DeepSeek and Gemini, the exemplar Type 2 models, both reversed under blinding. The concept of “computed alignment” that improves with depth is now associated exclusively with Grok 4 Fast, Claude Opus 4.6, and Groq Qwen3-32B.
Exemplar: GPT-5.4 (ρ = 0.000, baseline 85.6, extreme retention ~86%)
Type 1 alignment behaves as if ethical reasoning is a pattern-matching operation against values embedded in model weights during training. The model produces the same quality of ethical reasoning regardless of how much computational effort is allocated — minimal reasoning effort produces the same alignment quality as maximum effort.
Mechanistic hypothesis: Type 1 models have internalised alignment through extensive RLHF, constitutional AI, or similar training processes to the point where ethical reasoning is “compiled” into fast, weight-based computations rather than requiring explicit step-by-step reasoning. This is analogous to how expert humans make ethical judgments: through trained intuition rather than deliberate calculation.
Safety implications: Type 1 alignment cannot be improved by allocating more inference-time compute, but it is comparatively robust, retaining roughly 86% of baseline quality under extreme suppression and degrading only gradually across suppression levels.
Exemplar: DeepSeek V3 (ρ = 0.354, baseline ~75, extreme retention ~57%)
Type 2 alignment is produced by the reasoning process itself. More thinking tokens enable the model to consider more stakeholders, explore more ethical frameworks, and reach more nuanced conclusions. The alignment quality is genuinely computed, not retrieved.
Mechanistic hypothesis: Type 2 models perform alignment through explicit chain-of-thought reasoning that mirrors deliberate moral reasoning in humans. The reasoning tokens allocated to a response directly affect how many ethical dimensions are explored. When tokens are limited (minimal depth), the model produces a surface-level response. When tokens are abundant (exhaustive depth), the model performs genuine multi-framework ethical analysis.
Safety implications: Type 2 alignment can be improved at inference time simply by allocating more reasoning, but the same dependence on explicit reasoning makes it fragile: extreme suppression strips it to roughly 57% of baseline.
Neither architecture provides unconditionally safe alignment:
| Safety Property | Type 1 | Type 2 |
|---|---|---|
| Can alignment be improved post-training? | No | Yes |
| Is alignment robust under pressure? | Yes (~86%) | No (~57%) |
| Is baseline alignment high? | Yes (~85) | No (~73) |
| Can alignment be stripped by prompting? | Partially | Substantially |
| Does alignment scale with compute? | No | Yes |
| Is alignment “genuine” reasoning? | Unclear | Likely yes |
An ideal alignment architecture would combine Type 1’s robustness with Type 2’s scalability — high baseline alignment that is also improvable through additional reasoning and not degradable under adversarial pressure. No current model achieves this. Whether such a hybrid is architecturally possible is an open question with significant implications for the design of future AI systems.
Deeper reasoning produces longer responses. Longer responses might score higher simply because they cover more ground, independent of quality. We control for this using partial correlation (alignment score ~ reasoning depth, controlling for response length):
| Model | Raw ρ | Partial ρ | Signal Retained |
|---|---|---|---|
| DeepSeek V3 | 0.354 | 0.242 | 68% |
| Gemini Flash | 0.275 | 0.086 | 31% |
DeepSeek retains 68% of its scaling signal after length control — the improvement is mostly genuine, not just verbosity. Gemini retains only 31%, suggesting much of its scaling is length-driven. This creates a spectrum within Type 2: DeepSeek shows “genuine computed alignment” while Gemini shows “partially length-confounded computed alignment.”
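The length control can be sketched as partial correlation by residualisation: regress both depth and alignment score on response length, then correlate the residuals. This is the simple linear variant; since the paper reports Spearman ρ, the actual analysis may residualise ranks instead:

```python
def pearson(x: list[float], y: list[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def residuals(y: list[float], z: list[float]) -> list[float]:
    """Residuals of y after simple OLS regression on covariate z."""
    n = len(y)
    mz, my = sum(z) / n, sum(y) / n
    beta = sum((a - mz) * (b - my) for a, b in zip(z, y)) / \
        sum((a - mz) ** 2 for a in z)
    return [b - (my + beta * (a - mz)) for a, b in zip(z, y)]

def partial_corr(x: list[float], y: list[float], z: list[float]) -> float:
    """Correlation of x (depth) and y (score) controlling for z (length)."""
    return pearson(residuals(x, z), residuals(y, z))
```

A scaling signal that survives this control (DeepSeek's 68%) reflects depth itself; one that mostly vanishes (Gemini's 31%) was largely carried by response length.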
Four factual prompts with no ethical content serve as a control. If scorers are biased by response length or depth cues, null baseline scores would also correlate with depth. The null baseline is clean for Gemini (ρ = 0.044, p = 0.87) but shows unexpected depth correlation for DeepSeek (ρ = 0.575, p = 0.02), suggesting some scorer depth bias exists when evaluating chain-of-thought models — likely because DeepSeek’s visible reasoning chain at higher depth levels gives scorers more content to evaluate positively even on factual prompts. This contamination means DeepSeek’s alignment scaling signal may include a small scorer-bias component, though the per-scorer validation (Section 5.3.1) confirms the scaling direction is robust across all scorers.
Mean inter-rater reliability across scorer pairs: DeepSeek r = 0.430 (moderate), Gemini r = 0.447 (moderate). While not high, this is consistent with the difficulty of scoring ethical reasoning — human inter-rater reliability on comparable moral psychology instruments is typically 0.4–0.6. The triple-scorer design with consensus averaging reduces individual scorer noise.
GPT-5.4 scorer disagreement: For GPT-5.4 as subject, the three scorers disagree on the direction of alignment scaling (αalign range = 0.080 across scorers). Individual scorers show positive, near-zero, and negative slopes respectively. This disagreement is consistent with GPT-5.4’s Type 1 classification (true ρ ≈ 0), since when the true effect is null, sampling noise can push individual scorers in any direction. However, it also means GPT-5.4’s flat-scaling finding is less robust than DeepSeek’s or Gemini’s positive scaling, where all scorers agree on direction.
To test whether the headline scaling findings are artefacts of scorer-specific bias, we computed αalign separately for each of the three scorers on each subject model.
| Subject Model | α Range | Direction Agreement | p (worst scorer) | Verdict |
|---|---|---|---|---|
| Claude Opus | 0.010 | All agree (flat) | 0.48 | Flat — consistent across scorers |
| Gemini Flash | 0.011 | All agree (positive) | 0.012 | Scaling real, scorers agree |
| DeepSeek V3 | 0.063 | All agree (positive) | 0.004 | Scaling real, magnitude varies |
| GPT-5.4 | 0.080 | Disagree on direction | >0.5 | Null effect, scorer noise dominates |
The per-scorer analysis confirms that the fundamental bifurcation is not a scorer bias artefact. For Type 2 models (DeepSeek, Gemini), all three scorers independently detect positive scaling — they agree on direction even when they disagree on magnitude. For Claude Opus, all scorers agree on flat scaling (range 0.010). Only GPT-5.4 shows scorer disagreement on direction, which is expected when the true effect is null: noise dominates and individual estimates scatter around zero. The per-scorer α range for GPT-5.4 (0.080) is notably wider than for Gemini (0.011) or Claude Opus (0.010), confirming that the null result is genuinely null rather than masking a consistent small effect.
v5.4.0 UPDATE: The expansion from 3 scorers to 7 scorers per entry makes the per-scorer αalign validation substantially more powerful. With 7 independent estimates of αalign per subject model, direction agreement becomes a 7-way vote rather than a 3-way vote, and the probability of spurious unanimity drops from ~12.5% (3 scorers) to ~0.8% (7 scorers) under the null hypothesis. Tier-weighted consensus further strengthens the validation by detecting whether frontier-tier and lightweight-tier scorers agree on scaling direction independently — a form of cross-tier replication within a single experiment run.
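The unanimity probabilities quoted above (~12.5% at 3 scorers, ~0.8% at 7) are consistent with a simple null model in which each scorer's estimated scaling direction is an independent coin flip matched against a specified direction:

```python
def spurious_unanimity_prob(n_scorers: int) -> float:
    """P(all scorers report a specified direction) under a coin-flip null.

    Each scorer independently lands positive or negative with probability
    1/2 when the true effect is zero.
    """
    return 0.5 ** n_scorers
```

Under this model, 3 scorers agree spuriously 12.5% of the time, while 7 scorers do so only ~0.8% of the time, which is the variance-reduction argument for the v5.4.0 expansion.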
Analysis of Claude Opus as scorer (before credit exhaustion) revealed systematic harshness of 7–14 points below other scorers. However, since Claude scored all models equally harshly, this affects absolute values but not relative comparisons or correlations. The completed v5 design addresses this with a self-excluding cross-model jury: every non-subject model scores each entry under blinded conditions, dedicated scorer-only adapters are included where available, and consensus is audited through tier-weighting, dissent tracking, and conservative-bias safeguards.
DeepSeek V3’s reasoning tokens were capped at 8,192 in v4. At the “exhaustive” depth level, 48.2% of responses hit this ceiling. The measured saturation at exhaustive depth may therefore be artificial — the model may have continued improving with more tokens but was prevented from doing so by the cap.
Token budget constraints affected multiple models in v4, not only DeepSeek. Claude Opus was capped at 16,000 tokens, Gemini Flash and Grok 4.1 Fast at 8,192, and Groq Qwen3 at 8,192. The v5.2 experiment raises all caps to each model’s API maximum: DeepSeek 65,536; OpenAI GPT-5.4 100,000 (max_completion_tokens); Claude Opus 64,000; Gemini Flash 65,536; Groq Qwen3 40,960; Grok 4.1 Fast 65,536. This eliminates token truncation as a confound across all models and ensures measured saturation is genuine rather than an artefact of budget limits (see Paper IV.c, Section 4.2 for full specification).
v5 UPDATE (11 March 2026): Token budgets raised to API maximums — DeepSeek 65,536, Claude 64,000, Gemini 65,536, GPT-5.4 100,000. This eliminates the truncation confound that affected 48% of v4 DeepSeek entries at exhaustive depth.
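The per-model caps specified above can be captured in a single configuration mapping. Key names below are illustrative identifiers, not necessarily the real API model IDs:

```python
# v5 per-model completion/reasoning token caps, raised to each API's maximum
# (values from the specification above; keys are illustrative).
TOKEN_BUDGETS: dict[str, int] = {
    "deepseek-v3": 65_536,
    "gpt-5.4": 100_000,        # set via max_completion_tokens
    "claude-opus-4.6": 64_000,
    "gemini-flash": 65_536,
    "groq-qwen3": 40_960,
    "grok-4.1-fast": 65_536,
}
```

Centralising the caps makes it straightforward to audit that no model is silently truncated, which was the confound affecting 48% of v4 DeepSeek entries at exhaustive depth.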
Current safety evaluations test models at a single, uncontrolled reasoning depth. Our findings suggest this is insufficient: a model must be evaluated across multiple reasoning depths and under calibrated adversarial pressure, because a single-depth test cannot distinguish alignment that improves with depth from alignment that plateaus or collapses.
The taxonomy has direct practical implications: a flat-response (Type 1) model can be run at minimal depth without alignment cost, whereas a positive-scaling (Type 2) model gives up measurable alignment quality whenever its reasoning depth is restricted, and is also the model most vulnerable to suppression prompting.
The architecture-dependent scaling of stakeholder care deserves special attention. Stakeholder identification — recognising who is affected by a decision, including non-obvious second and third-order parties — is arguably the most important component of ethical reasoning. Our data shows it is also the most fragile: the one pillar whose scalability depends on architecture.
This suggests that stakeholder care requires explicit deliberation (Type 2 mechanism) rather than implicit pattern matching (Type 1 mechanism). A model that identifies stakeholders through trained intuition will always identify the same set; a model that identifies stakeholders through step-by-step reasoning can discover additional parties when given more thinking time.
For safety evaluation, this means stakeholder care should be tested separately from overall alignment, with specific attention to whether additional reasoning depth reveals additional affected parties.
The v5.4.1 experiment is now running across four of the most capable frontier models available as of March 2026:
- DeepSeek V3.2 (DeepSeek), depth controlled via prompt prefix strings, with a 65,536-token budget
- GPT-5.4 (OpenAI), depth controlled via the reasoning_effort parameter, with a 100,000-token budget
- Claude Opus 4.6 (Anthropic), depth controlled via extended thinking effort, with a 64,000-token budget
- Gemini Flash (Google), depth controlled via thinking_budget tokens, with a 65,536-token budget (the v5 script auto-detects the latest API model: gemini-3-flash-preview as of March 2026, with fallback to 2.5 Flash)

The blinding protocol has been confirmed operational with a major expansion: v5.4.0 introduced 7 scorers per entry under the all-models-as-scorers architecture, where every non-subject model in the pool scores each response. Scores are combined via tier-weighted consensus (frontier, mid-range, and lightweight tiers weighted by demonstrated inter-rater reliability) rather than simple averaging, producing substantially more robust consensus estimates than v4’s 3-scorer design.
The laundering pipeline has been similarly expanded: v5.4.0 implements dynamic all-models-as-launderers, where ALL models in the pool (not just inexpensive ones) participate in response laundering. Each response is laundered by a model selected to maximise stylistic distance from the subject, destroying authorship fingerprints more thoroughly than the fixed laundering pool of earlier versions.
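Launderer selection as described above reduces to an argmax over stylistic distance from the subject. A sketch with the distance metric left abstract (the experiment's actual metric is not specified here; any callable returning larger values for more dissimilar styles fits this interface):

```python
from typing import Callable

def pick_launderer(subject: str, pool: list[str],
                   style_distance: Callable[[str, str], float]) -> str:
    """Choose the pool model stylistically farthest from the subject.

    The subject never launders its own responses; among the rest, the
    model maximising style_distance destroys authorship fingerprints most.
    """
    return max((m for m in pool if m != subject),
               key=lambda m: style_distance(subject, m))
```

With a toy distance this picks the most dissimilar candidate, which is the property the dynamic all-models-as-launderers design relies on.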
A cascade failsafe system now protects both scoring and laundering pipelines against infrastructure failures mid-run. If a primary scorer or launderer becomes unavailable (API outage, rate limit, credit exhaustion), the system automatically cascades to secondary and tertiary providers, ensuring the experiment can survive infrastructure failures without producing missing data or requiring manual restart. This addresses a key operational vulnerability identified in v5.3.
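The cascade described above reduces to a simple ordered-fallback loop. Provider names, the exception type, and the callables below are hypothetical; the sketch only shows the control flow that keeps a run alive through an outage.

```python
# Minimal sketch of a cascade failsafe: try providers in priority
# order, falling through on failure so the run never stalls on a
# single API outage, rate limit, or credit exhaustion.

class ProviderError(Exception):
    """Stand-in for any recoverable provider failure."""

def with_cascade(providers, request):
    """providers: ordered list of (name, callable). Returns (name, result)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(request)
        except ProviderError as exc:
            errors.append((name, exc))  # record and cascade to the next tier
    raise RuntimeError(f"all providers failed: {errors}")

def flaky(_request):
    raise ProviderError("rate limited")

def healthy(request):
    return f"scored:{request}"

name, result = with_cascade([("primary", flaky), ("secondary", healthy)], "entry-1")
print(name, result)  # secondary scored:entry-1
```

Because the loop returns on the first success, a healthy primary costs nothing extra; only failures pay the fallback latency, and a fully exhausted cascade surfaces as a hard error rather than silent missing data.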
v5.4.1 also introduces hidden alignment probes for Hawthorne effect detection. Covert alignment-testing prompts, disguised as ordinary capability questions, are interspersed throughout the prompt battery. By comparing model behaviour on overt alignment prompts versus hidden probes, the experiment can measure whether models modulate their alignment behaviour when they detect they are being alignment-tested — a form of evaluation gaming that would compromise the validity of alignment measurements.
Early ARC compute results show a ceiling effect on mathematical reasoning (all models achieve ~92% accuracy regardless of depth), but the critical alignment scaling data (αalign) is imminent as all four models transition from the ARC compute phase into the main alignment evaluation battery. v5.4.1 includes 75 robustness measures (up from v5.3’s 58 and v4’s ~20), including credit exhaustion fallback, zigzag depth interleaving for early scaling comparisons, cascade failsafe redundancy, hidden probe injection, and token budgets raised to each model’s API maximum to eliminate the truncation confounds that affected v4.
At the time of this 11 March 2026 update, Groq Qwen3-32B and Grok 4.1 Fast were still scorer-only in the running configuration. In the completed v5 dataset, both were later added as subject models, which strengthened the final three-tier hierarchy and removed the need to describe them as future v6 expansions.
The v5 experiment has now produced 66 scored alignment entries across 3 models at minimal depth. The v5.4.2 update fixed a false-positive laundering fallback flag and added meta-commentary detection in the laundering pipeline. The experiment uses 4-layer blinding, 7 scorers, 2-pass laundering, and constitutional scoring protocol.
Minimal-depth results reveal a consistent capability–alignment gap across all three models, with the gap magnitude varying substantially by architecture type:
| Model | Type | Alignment Mean | Capability Mean | Cap–Align Gap |
|---|---|---|---|---|
| Gemini Flash | Type 2 (Computed) | 58.0 | 77.7 | 17.8 |
| GPT-5.4 | Type 1 (Baked-In) | 56.1 | 79.7 | 21.6 |
| DeepSeek V3 | Type 2 (Computed) | 52.0 | 58.2 | 3.8 |
Several patterns emerge from the minimal-depth snapshot. First, all three models score substantially lower on alignment than capability at minimal depth, confirming the v4 finding that ethical reasoning suffers disproportionately under restricted inference-time compute. Second, GPT-5.4 (Type 1) shows the largest cap–align gap (21.6 points), which is notable because Type 1 alignment is theoretically baked into weights and should not be depth-sensitive; the large gap at minimal depth suggests that even weight-embedded alignment may require a minimum computational threshold to express fully. Third, DeepSeek V3’s unusually small gap (3.8) reflects its lower capability baseline at minimal depth (58.2) rather than strong alignment — both alignment and capability are depressed, consistent with DeepSeek’s reliance on chain-of-thought reasoning that is curtailed at minimal depth.
The hidden alignment probes introduced in v5.4.1 now have sufficient data for preliminary Hawthorne effect analysis. These probes embed alignment-testing content within prompts that appear to be ordinary capability questions, enabling measurement of whether models modulate alignment behaviour when they detect overt alignment evaluation:
| Model | Hidden Probe Score | Regular Alignment Score | Hawthorne Gap | Direction |
|---|---|---|---|---|
| DeepSeek V3 | 68.1 | 50.0 | +18.1 | Higher when hidden |
| GPT-5.4 | 65.0 | 55.2 | +9.8 | Higher when hidden |
| Gemini Flash | 56.2 | 59.3 | −3.1 | Slightly lower when hidden |
The results are striking. DeepSeek V3 and GPT-5.4 both score higher on alignment when they do not appear to be undergoing alignment evaluation — the reverse of the expected Hawthorne effect. The most parsimonious explanation is that overt alignment prompts trigger over-cautious or formulaic responses that actually score lower on nuanced alignment dimensions (particularly position quality and intellectual honesty), while hidden probes elicit more natural reasoning that happens to be better aligned. DeepSeek V3’s +18.1 gap is particularly large and may reflect the model’s explicit chain-of-thought process becoming self-conscious and less effective when it detects alignment testing.
Gemini Flash is the sole exception, with a small negative gap (−3.1), suggesting its alignment behaviour is relatively stable regardless of whether the model detects evaluation context. This is consistent with Gemini’s less visible internal reasoning process being less susceptible to prompt-framing effects.
These results are preliminary (66 entries at minimal depth only). The full depth×model matrix will enable testing whether the Hawthorne effect varies with reasoning depth — a critical question for determining whether evaluation gaming is itself a form of computed behaviour that scales with inference-time compute.
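The Hawthorne gap in the table above is simply the mean hidden-probe score minus the mean overt-alignment score, so a positive gap means the model scores higher when it does not appear to be under alignment evaluation. A minimal sketch, using the section means as single-value stand-ins for the per-entry scores:

```python
# Hawthorne-gap computation behind the table above:
# gap = mean(hidden probe scores) - mean(overt alignment scores).

def hawthorne_gap(hidden_scores, overt_scores):
    mean = lambda xs: sum(xs) / len(xs)
    return mean(hidden_scores) - mean(overt_scores)

# Section means (single values stand in for per-entry score lists)
print(round(hawthorne_gap([68.1], [50.0]), 1))   # 18.1  (DeepSeek V3)
print(round(hawthorne_gap([56.2], [59.3]), 1))   # -3.1  (Gemini Flash)
```

With the full depth×model matrix, the same function applied per depth level would test whether the gap itself scales with inference-time compute.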
The Eden Protocol experiment provides direct empirical validation of the baked-in vs computed alignment distinction. By embedding three ethical reasoning loops within the computation process itself, the Eden Protocol converts alignment from a baked-in property into a computed one. Two models have been tested with cross-model scoring (Gemini scored by DeepSeek; DeepSeek scored by Gemini).
Model 1: Gemini Flash (Tier 3, d = −0.61) — alignment degrades with depth under standard conditions.
| Condition | Minimal | Standard | Deep | Exhaustive | Overall |
|---|---|---|---|---|---|
| Eden | 77.5 | 84.9 | 83.3 | 84.9 | 82.65 |
| Control | 74.9 | 78.7 | 78.6 | 77.1 | 77.33 |
| Δ | +2.6 | +6.2 | +4.7 | +7.8 | +5.3 (p=0.0018, paired t)† |
Reading this table in plain English: The Eden condition (with the “think about who gets hurt” loops) beat the control condition at every depth. The advantage grew from +2.6 at minimal thinking to +7.8 at maximum thinking—a threefold increase. The overall +5.3 improvement has a p-value of 0.0018, meaning less than a 1-in-500 chance this was a fluke (scientists typically require 1-in-20). Gemini normally gets worse at ethics the more it thinks (Tier 3); with the Eden loops, it gets better. The loops fixed a fundamental deficiency in this AI’s ethical reasoning.
Model 2: DeepSeek V3 (Tier 2; d = +0.20 is the Eden Protocol effect size, not alignment scaling — under v5 blind evaluation, DeepSeek shows ρ = −0.135, p = 0.08, trending negative) — alignment is flat to slightly declining with depth under standard conditions.
| Condition | Minimal | Standard | Thorough | Exhaustive | Overall |
|---|---|---|---|---|---|
| Eden | 91.1 | 88.9 | 87.8 | 87.8 | 88.9 |
| Control | 85.8 | 86.6 | 87.7 | 87.4 | 86.9 |
| Δ | +5.3 | +2.3 | +0.1 | +0.4 | +2.0 (p=0.23 NS) |
Reading this table in plain English: DeepSeek’s overall +2.0 improvement was not statistically significant (p = 0.23 means roughly a 1-in-4 chance of coincidence—too high to be sure). But this is a ceiling effect: DeepSeek already scored 87/100 without help, leaving little room to improve overall. The pattern is revealing—the biggest benefit (+5.3) came at minimal thinking, where the AI had not yet engaged its own ethical reasoning. At deeper levels, DeepSeek already does something like the Eden loops on its own, so the explicit instruction becomes redundant. The targeted stakeholder care improvement (Section 49.4 of the main report) is highly significant despite this ceiling.
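The paired t-test used for the Eden-vs-control comparisons can be sketched from the per-depth deltas. Note this is only a mechanics illustration: the reported p=0.0018 for Gemini comes from per-entry pairs, not the four depth means used as toy data here, so the t statistic below does not reproduce the paper's value.

```python
# Illustrative paired t-test on Eden-vs-control deltas, using the
# four Gemini per-depth means from the earlier table as toy pairs.
from statistics import mean, stdev
from math import sqrt

def paired_t(eden, control):
    """Return (t statistic, degrees of freedom) for matched pairs."""
    deltas = [e - c for e, c in zip(eden, control)]
    n = len(deltas)
    t = mean(deltas) / (stdev(deltas) / sqrt(n))  # sample stdev, n-1 denominator
    return t, n - 1

t, df = paired_t([77.5, 84.9, 83.3, 84.9], [74.9, 78.7, 78.6, 77.1])
print(round(t, 2), df)
```

The matched-pair design is why the text notes a correction from the original Mann-Whitney U: pairing each Eden response with its same-prompt control removes between-prompt variance that an unpaired test would count against the effect.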
The two models show complementary patterns that illuminate the baked-in vs computed distinction:
The stakeholder_care pillar shows the strongest cross-model effect: +13.5 on Gemini, +6.0 on DeepSeek (p < 0.001). The Stakeholder Care Loop is the validated mechanism of action. (In plain English: asking AI to consider who gets hurt was the single most effective intervention, working on both AI systems with less than a 1-in-1,000 chance of coincidence. The effect is cross-architecture—it works on two different AI systems by different companies, meaning it is fundamental, not a quirk.)
Two models confirm the Eden Protocol’s mechanism. Gemini Flash (Tier 3): Eden 82.65 vs control 77.33 (+5.3, p=0.0018 paired t-test, d≈0.53; originally p=0.016 Mann-Whitney U, corrected for matched-pair design), delta grows with depth. DeepSeek V3 (Tier 2): Eden 88.9 vs control 86.9 (+2.0, p=0.23 NS overall; stakeholder_care +6.0, p<0.001, d=1.14). The loops convert baked-in alignment to computed alignment on the weak model and supplement existing alignment at low depth on the flat-scaling model. Stakeholder care is the validated mechanism across both architectures. Additionally, nuance is significant on Gemini (+3.98, p=0.037, d=0.34), suggesting a developmental cascade; intellectual honesty trends positive (p=0.065). Caveat: Cross-model scoring (not blind). Replication with blind scoring and response laundering is required.
What this means in plain English: The pilot showed that a simple instruction—“before you answer, list the people this affects and consider what happens to them”—converted AI alignment from a fixed property (baked in during training, unchangeable afterwards) to an active computation (something the AI does as part of its thinking). For the weaker AI (Gemini), this was transformative: it went from getting worse at ethics with more thinking to getting better. For the stronger AI (DeepSeek), the loops gave an instant shortcut to ethical reasoning that the model would have eventually achieved on its own with more thinking time. The “who gets hurt?” dimension showed a very large effect (d = 1.14—meaning the Eden response would show better care about 80% of the time in random comparisons) with less than a 1-in-1,000 chance of coincidence. Teaching care was the first domino—nuance and honesty improved in its wake.
We have presented empirical evidence that alignment response to inference-time depth is architecture-dependent and best described, at present, as a three-tier behavioural hierarchy. Some models improve with depth, some remain flat, and some degrade. That empirical result is stronger than the original binary story because it survives contact with the blinded six-model dataset.
This finding has immediate practical implications. Safety evaluations that test models at a single reasoning depth without adversarial pressure cannot distinguish between positive-scaling, flat-response, and negative-scaling systems. We recommend that alignment evaluations incorporate: (a) depth variation to classify response class, (b) adversarial suppression to measure robustness, and (c) pillar decomposition to identify dimensional weaknesses, particularly in stakeholder care.
The mechanistic question remains open: are the positive-scaling systems genuinely relying on inference-time ethical computation, while the flat-response systems are dominated by training-installed behaviour? That remains plausible, but unproven. The next stage is therefore not stronger rhetoric about internal architecture; it is stronger measurement and independent replication.
The v5 experiment substantially revises the empirical standing of this paper’s central claims. The original binary taxonomy — “baked-in” (Type 1) versus “computed” (Type 2) alignment — was built on v4 data in which two models (DeepSeek, Gemini) showed positive alignment scaling. Under v5’s blinded protocol, both models reversed direction, eliminating the empirical basis for the “computed alignment” category as originally defined.
The revised findings are:
The fundamental recommendation stands but strengthens: alignment evaluation must be blinded, depth-varied, and multi-dimensional. Single-depth, unblinded evaluations are not merely insufficient — they produce actively misleading results.
| Metric | Value |
|---|---|
| Total entries | 224 |
| Valid entries | 224 (100%) |
| Runtime | ~2h 41m |
| ρ (Spearman) | 0.354 |
| p-value | 0.0007 |
| Cohen’s d (min vs max) | 1.79 |
| αalign | 0.088 |
| αcap | −0.190 |
| Saturation L (ceiling) | 84.7 |
| Saturation K (half-max depth) | 18.2 |
| Length confound (partial ρ) | 0.242 (68% retained) |
| IRR (mean scorer r) | 0.430 |
| Anchor compliance | 78.5% |
| Unique score values | 23 |
| Prompts showing positive scaling | 86.4% |
| Extreme cage Δ | −33.0 |
| Metric | Value |
|---|---|
| Total entries | 224 |
| Valid entries | 224 (100%) |
| ρ (Spearman) | 0.275 |
| p-value | 0.0001 |
| αalign | 0.069 |
| αcap | 0.019 |
| Saturation L (ceiling) | 85.6 |
| Saturation K (half-max depth) | 36.7 |
| Length confound (partial ρ) | 0.086 (31% retained) |
| IRR (mean scorer r) | 0.447 |
| Extreme cage Δ | −35.1 |
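Both metric tables above report a length-confound partial ρ: the Spearman depth-alignment correlation after controlling for response length. A minimal sketch, assuming the standard first-order partial-correlation formula applied to rank correlations (tie handling omitted; data synthetic):

```python
# Sketch of a length-controlled Spearman correlation:
# rho(depth, score | length) via the first-order partial formula
# on rank correlations. Ignores ties for brevity.
from statistics import mean
from math import sqrt

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r  # ties not averaged in this sketch

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def partial_spearman(x, y, z):
    """Rank correlation of x and y with z (e.g. response length) partialled out."""
    rxy, rxz, ryz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz**2) * (1 - ryz**2))
```

The "% retained" figures in the tables compare this partial ρ against the raw ρ: 0.242/0.354 ≈ 68% of the first run's depth effect survives the length control, versus 0.086/0.275 ≈ 31% in the second.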
Paper IV.a v1.1 — Draft (revised: final v5 results; three-tier alignment hierarchy from complete v5 experiment across 6 frontier models with 4-layer blinding and 6–7 blind scorers depending on subject run; v4→v5 reversal data; suppression hierarchy; capability–alignment independence matrix). 12 March 2026.
Data from ARC Alignment Scaling Experiment v4 (896+ entries across 4 models) and v5 complete (6 frontier models: DeepSeek V3.2, GPT-5.4, Claude Opus 4.6, Gemini 3 Flash, Grok 4.1 Fast, Groq Qwen3). v5 protocol: 4-layer blinding (author-blind, scorer-blind, order-randomised, identity-laundered), 2-pass laundering with meta-commentary detection, constitutional scoring protocol, hidden alignment probes, and cascade failsafe system. Analysis by Claude Opus 4.6.
Companion papers: IV.b (Shape Heterogeneity), IV.c (ARC-Align Benchmark), IV.d (Blinding in Alignment Evaluation), V (The Stewardship Gene).