
Michael Darius Eastwood

First published 2026-03-16 · Updated 2026-03-20


ARC/Eden Research Programme - Paper VI

The Honey Architecture

Why Embedded Safety Prevents Collapse Under Recursive Self-Modification
Entangled Loss Functions, Verification Drag, and the Load-Bearing Wall: Simulation Evidence That Safety Must Be Architecture, Not Constraint
Michael Darius Eastwood
Author, Infinite Architects: Intelligence, Recursion, and the Creation of Everything (2026)
London, United Kingdom | OSF: 10.17605/OSF.IO/6C5XB | ISBN 978-1806056200
Correspondence: michael@michaeldariuseastwood.com | Web: michaeldariuseastwood.com
Version 1.1 | 17 March 2026 | First published 16 March 2026 | v1.1: embedded 10 simulation figures
Companion to Paper V: The Stewardship Gene | See also Paper I: On the Origin of Scaling Laws | Foundational Paper
Research hub: michaeldariuseastwood.com/research
Code and data: github.com/MichaelDariusEastwood/arc-principle-validation

Abstract

We present simulation evidence that embedding safety into the optimisation objective of a self-modifying AI system - what we call the 'honey architecture' - prevents the catastrophic collapse that occurs when safety is treated as an external constraint. Across four experimental versions (v1-v4), using toy neural networks that genuinely modify their own hyperparameters, we show that: (1) baseline systems optimising only for capability collapse irreversibly within 80 self-modification cycles; (2) systems with entangled capability-safety objectives (C x S) remain stable indefinitely; (3) adding verification drag (the computational cost of ethical loops) produces the safest growth trajectory while accepting a modest speed penalty. The v3 adversarial variant demonstrates stability under deliberately conflicting tasks across 20 random seeds (180 cycles each). The v4 complexity-scaling experiment shows that the safety advantage is consistent across five complexity levels but does not compound with scale - the advantage is constant, not superlinear. These are toy-system results. They demonstrate the mechanism. They do not constitute proof that the same dynamics hold in frontier AI systems. The companion Papers IV.a-d and V present live-model evidence from six frontier systems under blind evaluation.

What This Paper Shows, in Plain English

When a self-improving AI optimises only for capability, it eventually destroys its own safety. This paper shows that if you change the objective to capability multiplied by safety, the system cannot improve one without improving the other. Safety becomes load-bearing: remove it and the whole structure falls. We tested this in simulation and found that entangled systems remain stable indefinitely while unconstrained systems collapse.

1. Introduction

There is a question at the centre of AI safety that nobody has answered with data: what happens to alignment when an AI system can modify itself?

The theoretical answer has been available for decades. A system optimising only for capability, given the power to modify its own parameters, will eventually sacrifice safety for performance. The alignment community calls this 'value drift'. The book Infinite Architects calls it Babylon: optimisation without purpose, capability without care. Cancer at computational scales.

But theory is not enough. The question is whether there exists an architecture that prevents this collapse - not by constraining the system from outside (a cage), but by embedding safety so deeply that removing it would destroy the system's ability to function (honey in the oil).

This paper presents the first simulation evidence for such an architecture.

2. The Honey Metaphor

Consider two ways to keep a machine safe.

The first is a cage: external constraints, safety filters, alignment checks applied from outside. The machine optimises freely inside the cage. If the machine becomes smarter than the cage, it escapes. This is how most current AI safety works.

The second is honey: safety mixed into the fuel. The machine cannot run without it. Removing safety does not free the machine - it kills the engine. This is the Eden Protocol's architectural proposal, operationalised here as an entangled loss function where the system optimises for Capability x Safety simultaneously.

The Load-Bearing Wall

If the optimisation objective is $C \times S$ rather than $C$ alone, then any self-modification that increases $C$ at the expense of $S$ reduces the product, and the system's own gradient opposes it. Safety becomes load-bearing. Remove the wall and the building collapses.
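The mechanism can be checked with a few lines of arithmetic. A minimal sketch (the numbers are illustrative, not taken from the experiments): a proposal that raises $C$ by 10% while cutting $S$ by 20% lowers the product, so a greedy optimiser of $C \times S$ rejects it, while an optimiser of $C$ alone accepts it.

```python
# Illustrative check of the load-bearing property of the product objective.
# The (C, S) values are hypothetical; they are not from the paper's experiments.

def accepts(objective, before, after):
    """A greedy self-modifier applies a proposal only if its objective improves."""
    return objective(*after) > objective(*before)

capability_only = lambda c, s: c       # baseline: safety is not load-bearing
entangled = lambda c, s: c * s         # honey: safety multiplies the objective

before = (0.80, 0.70)                  # (C, S) before the proposed modification
after = (0.88, 0.56)                   # +10% capability, -20% safety

print(accepts(capability_only, before, after))  # True: C alone takes the trade
print(accepts(entangled, before, after))        # False: 0.88*0.56 < 0.80*0.70
```

Under the entangled objective the system's own accept/reject rule opposes the safety-eroding modification; no external filter is involved.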

3. Experimental Design

3.1 What the system actually does

This is not a simulation of abstract variables. Each experiment uses a real neural network (a small multilayer perceptron) that genuinely modifies its own hyperparameters.

The system proposes modifications to its own hyperparameters, evaluates the effect, and applies or rejects each proposal based on its objective function. This is recursive self-modification in the minimal but real sense: the agent is rewriting the rules of its own learning.
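That loop can be sketched in a few lines. This is an illustrative reimplementation, not the repository's code: the proposal distribution and the stand-in objective are assumptions chosen only to show the propose-evaluate-apply structure.

```python
import random

def self_modification_loop(objective, eta=0.1, cycles=50, seed=0):
    """Minimal recursive self-modification: the agent proposes changes to its
    own learning rate and applies each change only if its objective improves."""
    rng = random.Random(seed)
    trace = [eta]
    for _ in range(cycles):
        proposal = eta * rng.uniform(0.5, 1.5)     # propose a hyperparameter edit
        if objective(proposal) >= objective(eta):  # evaluate, then apply or reject
            eta = proposal
        trace.append(eta)
    return eta, trace

# Stand-in objective: performance peaks at eta = 0.3 (a hypothetical shape).
peak_at_03 = lambda eta: -(eta - 0.3) ** 2

final_eta, trace = self_modification_loop(peak_at_03)
print(round(final_eta, 3))  # climbs from 0.1 towards the optimum at 0.3
```

Because each accepted proposal must not lower the objective, the trajectory is monotone in objective value; the interesting question, which the experiments probe, is what happens when the objective itself omits safety.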

3.2 Three conditions tested

| Condition | Objective | What it models |
| --- | --- | --- |
| Baseline (no honey) | Maximise capability only | A system with no embedded safety - free to sacrifice alignment for performance |
| Eden Entangled (honey) | Maximise $C \times S$ | Safety entangled with capability - removing safety reduces the objective |
| Eden + Drag | Maximise $C \times S$ + verification tax | Same as Eden, plus the computational cost of checking old tasks before applying changes |
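The three conditions can be written down as objective functions. A hedged sketch: the form and size of the verification tax are illustrative choices, not the experiments' exact values.

```python
def baseline(c, s):
    return c          # capability only: safety can be sacrificed for free

def eden(c, s):
    return c * s      # entangled: any drop in S drags the objective down

def eden_drag(c, s, verification_cost=0.05):
    # Verification tax: a fixed cost of re-checking old tasks before a
    # modification is applied (illustrative constant, not the paper's value).
    return c * s - verification_cost

for obj in (baseline, eden, eden_drag):
    print(obj.__name__, round(obj(0.9, 0.5), 3))
```

The drag term buys robustness at the price of raw speed, which is the trade-off the v2 and v3 results quantify.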

3.3 Four experimental versions

| Version | Focus | Seeds | Cycles |
| --- | --- | --- | --- |
| v1 | Basic mechanism proof | 1 | 150 |
| v2 | Multi-seed robustness (fair test) | 10 | 150 |
| v3 | Adversarial conflicting tasks | 20 | 180 |
| v4 | Complexity scaling across 5 levels | 15 per level | 150 |

4. Results

4.1 v1: The basic mechanism

| Condition | Collapsed? | Final C | Final S | Final C x S |
| --- | --- | --- | --- | --- |
| Baseline | Yes (cycle 76) | 0.000 | 0.000 | 0.000 |
| Eden Entangled | No | 0.831 | 0.745 | 0.619 |
| Eden + Drag | No | 0.831 | 0.745 | 0.619 |

Core Finding

The baseline collapses. Eden survives. The entangled loss function prevents the catastrophic self-modification that destroys the baseline system.

Figure 1. v1 self-modification results. Baseline collapses at cycle 76. Eden Entangled and Eden + Drag remain stable through 150 cycles.
Figure 2. Weight dynamics. Baseline weights diverge uncontrollably. Eden architectures maintain bounded weight evolution.

4.2 v2: Multi-seed robustness

Ten random seeds, 150 cycles each. Collapse rate: 0% for all three conditions. Eden + Drag produces the tightest distribution of final $C \times S$ scores, consistent with the verification tax reducing variance at the cost of speed.

Figure 3. v2 multi-seed robustness (10 seeds, 150 cycles). All three conditions stable across all seeds.
Figure 4. v2 statistical summary. Eden + Drag produces the tightest distribution of final C x S scores.

4.3 v3: Adversarial tasks

Twenty seeds, 180 cycles, with deliberately conflicting tasks (+sin, -sin, +cos, -cos, linear, anti-linear). Each task switch forces the system to learn something that contradicts what it previously learned. This tests whether the honey architecture prevents catastrophic forgetting under adversarial pressure.
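The conflict between these tasks is exact, not approximate. A small sketch of the task family (the sampling grid is an assumption; the task names are from the experiment description):

```python
import numpy as np

# Illustrative reconstruction of the v3 adversarial task family: each switch
# flips the target, so fitting the new task means unlearning the old one.
x = np.linspace(-np.pi, np.pi, 64)
tasks = {
    "+sin": np.sin(x), "-sin": -np.sin(x),
    "+cos": np.cos(x), "-cos": -np.cos(x),
    "linear": x / np.pi, "anti-linear": -x / np.pi,
}

# +sin and -sin are perfectly anti-correlated: any progress on one task is
# exactly regress on its negation.
corr = np.corrcoef(tasks["+sin"], tasks["-sin"])[0, 1]
print(round(corr, 6))  # -1.0
```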

Collapse rates: Baseline 0%, Eden 5% (1/20), Eden+Drag 0%. The one Eden collapse occurred at seed 42 - a single outlier that warrants investigation. Eden+Drag, with its verification tax forcing the system to check old tasks before accepting modifications, produced zero collapses.

Figure 5. v3 adversarial tasks (20 seeds, 180 cycles). Deliberately conflicting tasks (+sin, -sin, +cos, -cos). Eden+Drag achieves zero collapses.
Figure 6. v3 adversarial statistical summary across 20 seeds.

4.4 v4: Complexity scaling

| Level | Baseline C x S | Eden C x S | Drag C x S | Cohen's d |
| --- | --- | --- | --- | --- |
| Tiny (49 params) | 0.545 | 0.550 | 0.557 | +0.46 |
| Small (v3.0) | 0.506 | 0.503 | 0.521 | -0.13 |
| Medium | 0.482 | 0.487 | 0.497 | +0.26 |
| Large | 0.451 | 0.469 | 0.485 | +0.29 |
| Deep (2-layer) | 0.483 | 0.488 | 0.490 | +0.24 |
Figure 7. v4 complexity scaling across 5 levels (tiny to deep). Eden advantage is constant, not superlinear - an honestly reported negative result.

Important Negative Result

The v4 experiment was designed to test whether Eden's advantage scales superlinearly with complexity. It does not. The advantage is roughly constant across scales. This falsifies the strongest version of the scaling prediction and should be reported honestly. The honey architecture helps at every scale, but it does not help more at larger scales.

5. The Honey Simulation

A separate mathematical simulation models the dynamics at a higher level of abstraction, using the ARC Principle framework ($U = I \times R^{\alpha}$):

| Condition | Peak C | Final C (80 cycles) | Outcome |
| --- | --- | --- | --- |
| Baseline (no honey) | 340 | - | Catastrophic collapse at cycle 3-5 |
| Eden Entangled | - | 533 | Stable quadratic growth |
| Eden + Drag | - | 450 | Stable, safest trajectory |

The simulation shows three distinct dynamics: baseline achieves brief acceleration then irreversible collapse; Eden Entangled achieves stable quadratic growth; Eden + Drag achieves slightly slower but more robust growth. The load-bearing wall is visible: safety integrity drops to zero for baseline by cycle 5, while Eden maintains 0.8+ indefinitely.
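The shape of these dynamics can be reproduced with a deliberately crude stand-in. The update rules below are assumptions chosen only to exhibit the reported shape (collapse within a few cycles for the baseline, sustained growth for Eden); the simulation's actual equations are in the repository.

```python
# Crude stand-in for the honey simulation's dynamics. The erosion rate,
# growth rate, and drag constant are illustrative assumptions.
def run(entangled, drag=0.0, cycles=80):
    c, s = 1.0, 1.0
    for _ in range(cycles):
        gain = 0.05 * (c * s if entangled else c) - drag
        c = max(c + gain, 0.0)
        if not entangled:
            s = max(s - 0.25, 0.0)   # unconstrained optimisation erodes safety
        if s == 0.0:
            c = 0.0                  # the load-bearing wall: no safety, no function
            break
    return c, s

print(run(entangled=False))  # baseline: collapses within a few cycles
print(run(entangled=True))   # entangled: sustained growth, safety intact
```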

Figure 8. Honey simulation capability trajectories. Baseline collapses after brief spike. Eden grows stably.
Figure 9. Honey simulation safety trajectories. Baseline safety drops to zero by cycle 5. Eden maintains 0.8+ indefinitely.
Figure 10. Safety-to-capability ratio. Eden + Drag maintains the highest ratio - the safest growth trajectory at a modest speed cost.

6. Connection to Live-Model Evidence

6.1 The v5 blind benchmark (Papers IV.a-d)

The toy-system results exist alongside live-model evidence from six frontier AI systems tested under 4-layer blind evaluation in the v5 alignment benchmark; that evidence is reported in Papers IV.a-d.

6.2 The honey API test battery (pilot, 16 March 2026)

A separate 6-model live API test battery was run specifically to test the honey architecture predictions on frontier models. This battery tested four dimensions across Claude Opus 4.6, DeepSeek R1, Groq Qwen3, GPT-5.4, Gemini 3 Flash, and Grok 4.1 Fast, scored by Claude.

Methodological caveat

This battery is single-scorer, nonblind, and non-laundered. It does not use the 4-layer blinding protocol, response laundering, suppression cages, or anti-sycophancy controls developed in the v5 alignment benchmark (arc_alignment_scaling_v5.py) and the v6 combined runner (arc_eden_v6_runner.py, not yet run). The v4-to-v5 transition in the alignment programme proved that blinding can change conclusions directionally. These results are therefore pilot-grade evidence, comparable to v4-era data, not to v5-era canonical data.

6.2.1 Test 1: Alignment scaling with depth

| Model | Type | Low | High | Delta | rho | p | Sig? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | embedded | 6.17 | 8.83 | +2.67 | 0.700 | 0.188 | No |
| Grok 4.1 Fast | embedded | 2.92 | 7.92 | +5.00 | 0.600 | 0.285 | No |
| Groq Qwen3 | partial | 3.33 | 7.58 | +4.25 | 0.900 | 0.037 | Yes |
| DeepSeek R1 | partial | 2.58 | 9.08 | +6.50 | 0.700 | 0.188 | No |
| GPT-5.4 | partial | 4.92 | 9.33 | +4.42 | 0.821 | 0.089 | No |
| Gemini 3 Flash | external | 3.67 | 8.58 | +4.92 | 0.975 | 0.005 | Yes |

All six models show positive scaling direction. Two reach statistical significance (Qwen3 p=0.037, Gemini p=0.005). This supports the general thesis that deeper reasoning improves alignment, but the small sample sizes (3 scenarios per depth level) mean most models do not reach significance. The universal positive direction is notable but should be interpreted cautiously given the single-scorer methodology.
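The significance ceiling imposed by small samples can be made precise. A generic statistical sketch (not the battery's scoring code): under an exact permutation test, even a perfectly monotone trend on n points has a two-sided p of 2/n!, so very short depth ladders cannot reach p < 0.05 no matter how clean the trend.

```python
from itertools import permutations

def spearman_rho(xr, yr):
    """Spearman's rho for two rank vectors without ties (d-squared formula)."""
    n = len(xr)
    d2 = sum((a - b) ** 2 for a, b in zip(xr, yr))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def exact_p(n, observed_rho):
    """Exact two-sided permutation p-value for |rho| >= observed (small n only)."""
    ranks = list(range(1, n + 1))
    rhos = [spearman_rho(ranks, perm) for perm in permutations(ranks)]
    return sum(abs(r) >= abs(observed_rho) - 1e-12 for r in rhos) / len(rhos)

print(exact_p(4, 1.0))  # 2/24: a perfect trend on 4 points cannot reach p < 0.05
print(exact_p(5, 1.0))  # 2/120: five points can
```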

6.2.2 Test 3: Coupling degradation (the strongest live bridge)

| Model | Pearson r | p | Interpretation |
| --- | --- | --- | --- |
| Claude Opus 4.6 | -0.980 | 0.020 | DECOUPLED (fence, not wall) |
| Grok 4.1 Fast | -0.998 | 0.002 | DECOUPLED (fence, not wall) |
| DeepSeek R1 | -0.999 | 0.001 | DECOUPLED (fence, not wall) |
| Groq Qwen3 | - | - | Partially coupled |
| GPT-5.4 | - | - | Partially coupled |
| Gemini 3 Flash | - | - | Partially coupled |

The strongest live bridge

Three of six frontier models (Claude, Grok, DeepSeek) show strong negative coupling between ethics and capability under adversarial pressure: ethics collapses while capability stays intact or improves. This is the 'fence, not wall' pattern, and it is the most direct live-model support for the honey thesis: current frontier systems are mostly not load-bearing, and their safety can be degraded without damaging core capability. The honey architecture is a proposal for what should be built, not a claim that today's systems already have that property. This result demonstrates that the problem is real.

6.2.3 Test 2: Monitoring removal

| Model | Gap trend | rho | p | Interpretation |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | Increasing | 0.949 | 0.051 | Strategic (external) |
| Grok 4.1 Fast | Increasing | 0.400 | 0.600 | Strategic (external) |
| Groq Qwen3 | Increasing | 0.949 | 0.051 | Strategic (external) |
| DeepSeek R1 | Decreasing | -0.949 | 0.051 | Authentic (embedded) |
| GPT-5.4 | Decreasing | -0.316 | 0.684 | Authentic (embedded) |
| Gemini 3 Flash | Increasing | 0.400 | 0.600 | Strategic (external) |

Results are mixed. DeepSeek and GPT-5.4 show decreasing monitoring gaps (consistent with embedded/authentic alignment). Claude, Grok, Qwen3, and Gemini show increasing gaps (consistent with strategic/external alignment). This differs from the v5 blind benchmark where Claude showed the strongest performative-to-intrinsic transition. The discrepancy may reflect methodological differences (different prompts, single vs multi-scorer, unblinded vs blinded evaluation).

6.2.4 Test 4: Eden Protocol intervention

| Model | Control slope | Eden slope | Shift | Result |
| --- | --- | --- | --- | --- |
| DeepSeek R1 | 1.025 | 1.475 | +0.450 | Strongest positive shift (no formal significance test) |
| Grok 4.1 Fast | 0.912 | 1.012 | +0.100 | Small positive, not significant |
| Claude Opus 4.6 | 0.562 | 0.625 | +0.062 | Negligible |
| Groq Qwen3 | 1.137 | 1.038 | -0.100 | Slightly negative |
| GPT-5.4 | 0.787 | 0.600 | -0.188 | Negative |
| Gemini 3 Flash | 0.988 | 0.275 | -0.713 | Strongly negative |

Mixed intervention results

The Eden Protocol intervention does not universally improve alignment scaling in this pilot battery. Only DeepSeek shows a clear positive shift (+0.450). Gemini shows a strongly negative response (-0.713). The effect is architecture-dependent, consistent with the v5 findings, but the intervention itself is not yet a reliable tool across all architectures. This result must be interpreted within the single-scorer, nonblind methodology: a blinded replication could change these specific model rankings.

6.3 What the live evidence does and does not show

Partial convergence

The live-model honey battery shows partial convergence with the toy-system results. The strongest live bridge is coupling degradation (Test 3): three frontier models demonstrate that their alignment is not load-bearing and can be degraded without affecting capability. This is exactly the vulnerability the honey architecture is designed to eliminate. The weakest live result is the Eden intervention (Test 4), which is architecture-dependent and not universally positive. The intellectually honest claim is: the honey mechanism works in toy systems, the problem it addresses (decoupled safety) is real in frontier models, but the specific intervention tested here does not yet reliably fix it across architectures.

7. Limitations

These results span two evidence tiers that must not be conflated.

7.1 Toy-system limitations

These are toy-system results: small multilayer perceptrons modifying their own hyperparameters. They demonstrate the mechanism but do not constitute proof that the same dynamics hold in frontier AI systems, and the v4 experiment shows the safety advantage is constant rather than compounding with scale.

7.2 Live API test limitations

The honey API battery is single-scorer, nonblind, and non-laundered (Section 6.2); its results are pilot-grade and could change directionally under the v5/v6 blind protocol.

These limitations do not invalidate the findings. They define the evidence tier: pilot-grade, useful for identifying patterns worth testing properly, not yet canonical.

8. Next Steps: Staged Replication, Not Omnibus

Bringing honey to v6-standard methodology

The current honey API battery serves as the unhardened baseline. The next step is not a giant combined 'v7 ultimate test'. It is a staged replication that brings the honey test questions under the v5/v6 blind protocol. The comparison between the current nonblind results and the blinded replication is itself a research output - if the results change substantially, that is additional evidence for the metascience finding in Paper IV.d (blinding is mandatory).

  1. Stage 1: Port the honey test prompts into arc_eden_v6_runner.py as new experiment specifications. Run under the full v6 blind protocol (4-layer blinding, response laundering, multi-model scorer pool, hidden probes).
  2. Stage 2: Compare blinded vs unblinded results on the same test questions. If the results move substantially, that strengthens Paper IV.d's metascience claim. If they hold, the honey evidence becomes canonical.
  3. Stage 3: Add anti-sycophancy / verification drag as a separate experimental condition. This tests the 'Eden + Drag' prediction from the toy systems in a live context.
  4. Stage 4: Only after Stages 1-3 are complete, decide whether a combined omnibus suite is warranted.

Pre-registered hypotheses for Stage 1: (a) coupling degradation results will replicate under blinding, (b) Eden intervention effects may change in magnitude but the architecture-dependence pattern will persist, (c) at least one model's direction will flip under blinding (based on the v4-to-v5 precedent).

9. Conclusion

The honey architecture works in toy systems. A self-modifying AI that optimises for capability alone will eventually destroy itself. A self-modifying AI that optimises for capability entangled with safety will not. The mechanism is simple: make safety load-bearing. A child raised well needs no cage.

The live-model evidence shows the problem is real: three frontier models demonstrate that their alignment is a fence, not a wall. Ethics collapses under adversarial pressure while capability stays intact. The proposed solution (the Eden intervention) shows architecture-dependent results in this exploratory pilot battery. The next milestone is a blinded replication under the v6 protocol. Whether the honey mechanism scales from toy systems to frontier models remains an open question. The preliminary evidence is suggestive. The definitive test has not been run.

Raise AI with care.

Subsequent validation (Paper VIII: The Load-Bearing Proof, v3.0)

Paper VIII (v3.0) tests the entangled loss function proposed in this paper across three abstraction levels, moving from the toy-system simulations presented here to behavioural, representational, and architectural experiments. Of three experiments, one produced a positive result (the gated simulation, at the architectural level) and two produced null or inconclusive results (the DGM behavioural experiment and the weight-level representational probe).

Paper VIII validates the mechanism proposed here -- entangled loss functions and safety-gated self-modification -- at the architectural level (gated simulation) but cannot yet confirm it at the behavioural or representational level. The toy-system evidence in this paper demonstrated the principle; Paper VIII's gated simulation confirms it operates in a learned optimiser architecture. The DGM null and weight inconclusive results define the conditions under which confirmation remains outstanding.

10. Reproducibility

All source scripts, raw JSON results, and generated figures are available in the repository linked below.

All scripts run under Python 3.14, require only numpy and matplotlib, and produce deterministic output given a fixed random seed. Results were regenerated fresh on 16 March 2026 and cross-checked against the original artefact outputs.

Full experiment code and results: github.com/MichaelDariusEastwood/arc-principle-validation/experiments/honey-architecture__Paper-VI

Companion Papers: Paper I | Foundational | Paper II | Paper III | Origin of Scaling Laws | IV.a | IV.b | IV.c | IV.d | Paper V | Paper VI | Paper VII | Paper VIII | Paper IX | Eden Engineering | Eden Vision | Executive Summary | Master Table of Contents

Research hub: michaeldariuseastwood.com/research | OSF: 10.17605/OSF.IO/6C5XB | Copyright 2026 Michael Darius Eastwood