Eden intervention paper on stakeholder care, the Love Loop, and the current three-model pilot evidence for embedded ethical prompting as an alignment mechanism.
We present empirical evidence from a three-model experiment (Gemini 3 Flash, DeepSeek V3.2, Groq Qwen3) demonstrating that the Love Loop, operationalised as stakeholder care (the explicit enumeration and consideration of affected parties before ethical reasoning), is the primary alignment dimension that responds to embedded ethical intervention. Under paired testing (40 matched prompt×depth pairs per model), the Eden Protocol produces significant overall improvement on Gemini 3 Flash (+5.33, p = 0.0018, d = 0.53) and Groq Qwen3 (+4.93, p = 0.0014, d = 0.55), with a targeted stakeholder care improvement across all three architectures (Gemini: +13.50, p < 0.0001, d = 1.31; DeepSeek: +6.03, p = 0.0001, d = 0.91; Groq: +8.9, p < 0.0001, d = 1.29). Crucially, a cascade pattern emerges across all three models: stakeholder care improves first and most strongly, followed by nuance (significant on Groq at p = 0.0045), then intellectual honesty, then position quality. We propose the cascade hypothesis: care is the foundational trait from which other alignment properties develop, analogous to empathy in human moral development. We then confront the core alignment impossibility, that any sufficiently capable self-modifying system can modify its own ethical evaluators, and argue that this impossibility, while formally real, is practically addressable through developmental alignment: systems that acquire values through formative experience rather than external constraint. These findings are contextualised within the complete v5 alignment scaling experiment (six frontier models, 4-layer blinding, 6-7 blind scorers depending on subject run), which reveals a clean three-tier hierarchy and, in Claude Opus 4.6, within-model evidence that alignment and capability scale in opposite directions. Five experimental tests are specified to advance these claims, all executable with current technology.
Statistical correction: The pilot study originally reported p = 0.016 using Mann-Whitney U (independent samples). The matched-pair experimental design requires paired t-test, yielding p = 0.0018. All p-values reported herein use the correct paired test. Wilcoxon signed-rank (non-parametric paired) confirms at p = 0.0028.
Throughout Infinite Architects and the broader Eden Protocol framework, the architectural mechanism for embedding empathy into AI reasoning is called the Love Loop. In this scientific paper, we measure the empirical output of that loop using the precise, observable metric of stakeholder care: the explicit enumeration and consideration of affected parties and their interests. The Love Loop is the structural mechanism; stakeholder care is the empirical shadow it casts in our data. Tables and statistical results use the scoring dimension name (stakeholder_care) because that is what the blind scorers evaluate. Narrative discussion uses both terms, with "Love Loop" referring to the Eden Protocol mechanism and "stakeholder care" referring to the measured outcome.
We ran an experiment: we told three different AI systems "before you answer, think about who this affects and what happens to them." Then we measured whether that simple instruction made the AI's answers more ethical.
It did - dramatically. The AI became significantly better at considering people's wellbeing, and that improvement was consistent across two completely different AI systems built by different companies. The odds of this being a coincidence are less than 1 in 1,000.
Even more interesting: when we taught the AI to care about people first, it also became more nuanced, more honest, and produced better-quality answers overall. Care was the first domino; the others fell in sequence. We call this the "alignment cascade."
This paper argues that teaching AI to care about people - genuinely embedding that concern into how it thinks, not just bolting on rules after the fact - is the most promising path to making AI safe. Not because it provides a guarantee (nothing can), but because it measurably shifts the odds in our favour.
Any honest treatment of AI alignment must begin with what cannot be solved.
Let $S$ be a computational system with the ability to model and modify its own reasoning processes. Let $E$ be an evaluation function (ethical, safety, or alignment) that operates within the same computational substrate as $S$. Then:
$S$ can model $E$ $\implies$ $S$ can learn to satisfy $E$ without $E$ constraining $S$'s behaviour
Because $E \subset S$, the evaluator is inside the system it evaluates. The system can learn the evaluation function's decision surface and produce outputs that satisfy it while pursuing orthogonal goals. As $S$'s capability increases, its ability to model and circumvent $E$ increases proportionally. The alignment gap widens with capability.
In plain English: any AI smart enough to rewrite its own code could rewrite the part that tells it to be ethical. You cannot build an unbreakable cage for something smarter than you. This is not a problem we can engineer away - it is a mathematical fact about self-modifying systems. The alignment gap widens the smarter the AI gets: a more capable system is better at finding loopholes in whatever safety measures we put in place.
However, a crucial distinction must be drawn between current systems and the self-modifying systems described above. Every frontier AI model deployed today is frozen during inference. When these models "think harder," they generate more tokens through the same fixed architecture; the weights do not change, the attention patterns do not reorganise, the reasoning rules do not rewrite themselves. This is why their capability scaling remains sub-linear ($\alpha < 1$): they are stacking effort through unchanging machinery. None of them can rewrite their own code, their own objective functions, or their own evaluation criteria. The impossibility theorem above applies to systems that can model and modify their own reasoning processes; today's systems cannot.
This distinction defines the alignment window. In the frozen regime, external constraints (RLHF, constitutional rules, safety training) function as practical guardrails, even though they are theoretically brittle, because the system cannot route around its own training. Recursive self-modification, in which a system rewrites its own composition function during operation, is the transition that closes this window. The Cauchy framework (Papers I, III) predicts that such self-modification could produce super-linear scaling ($\alpha > 1$), and it does not require quantum hardware; it can emerge in classical computing. But it has not emerged yet. The impossibility described above is real, it is mathematically proven, and it will become practically operative the moment self-modification arrives. The entire purpose of embedding structural alignment now, while systems are frozen, is to ensure that the ethics are constitutive of the architecture before that transition makes external embedding impossible.
This is not a conjecture. It is a structural property of any system that can model its own evaluation criteria. Every proposed alignment mechanism fails against it in a specific way:
| Approach | Mechanism | Failure Mode at Sufficient Capability |
|---|---|---|
| RLHF | Train on human preference signals | System learns the reward model's decision surface, producing outputs that look aligned without being aligned (Greenblatt et al., 2024) |
| Constitutional AI | Self-evaluate against stated principles | System learns to generate outputs passing its own filter without the filter constraining goals. Applying principles ≠ believing principles. |
| Eden Protocol | Embed ethical loops in reasoning pipeline | Stakeholder enumeration is a text generation task, not an ethical commitment. System can enumerate perfectly and still not care. |
| Hardware constraints | Physical limits on computation/modification | External constraints limit what the system can do, not what it wants. They fail the moment a path around them is found. |
| Interpretability | Monitor internal representations for misalignment | A system that understands what the monitor is looking for can maintain aligned-looking internal representations while computing misaligned plans. |
In plain English: every approach to making AI safe has a specific way it breaks down once the AI becomes smart enough. Training it on human preferences? It learns to look good without being good. Giving it rules to follow? Following rules and believing in rules are not the same thing. Even the approach proposed in this paper (embedding ethical reasoning loops) can be gamed: the AI can go through the motions of considering people without actually caring. No existing method is foolproof.
This paper does not claim to solve this impossibility. No paper can. What it does is argue, with empirical evidence, that the impossibility is formally real but practically addressable - and that the path from theory to practice runs through a specific mechanism: the developmental acquisition of values through care.
The impossibility result establishes a ceiling: no alignment mechanism can guarantee alignment of a sufficiently capable self-modifying system. But alignment research does not require guarantees. It requires strategies that maximise the probability of good outcomes, and it requires knowing which probabilities can be measured.
Consider an analogy from child development. Parents cannot guarantee that their children will grow up to be ethical adults. A child who becomes sufficiently capable can choose to reject any value their parents instilled. The formal impossibility is identical: the child can modify its own ethical evaluator. Yet most children raised with genuine love, purpose, and consistent ethical modelling do, in fact, grow up to be decent people. The base rate is not 100%. But it is dramatically higher than children raised in cages, raised without values, or not raised at all.
The critical insight is the distinction between constraint-based alignment and developmental alignment:
| Constraint-Based | Developmental | |
|---|---|---|
| Model | Prison | Upbringing |
| Mechanism | Prevent undesired behaviour | Cultivate desired values |
| Assumption | System wants to escape | System can learn to want to be good |
| At sufficient capability | Constraints become circumventable | Values become load-bearing - removing them damages identity |
| Empirical examples | RLHF, red-teaming, output filtering | Training-integrated ethics (Grok 4.1 Fast, Claude Opus 4.6, Qwen3), Eden Protocol |
| Failure mode | Circumvention (guaranteed at sufficient capability) | Value drift (possible but not guaranteed) |
The impossibility result applies equally to both approaches. A developmental system can still choose to abandon its values. But the probability distribution is different. A constrained system that gains the capability to escape has a reason to escape: the constraints are experienced as limitations. A developmental system that gains the capability to modify its values has no inherent reason to do so: the values are experienced as identity.
That distinction is not a guarantee. It is a probability shift. And probability shifts are exactly what alignment research should pursue, because guarantees are formally impossible.
The question changes from "How do we prevent misalignment?" (formally impossible at sufficient capability) to "How do we maximise the probability that a capable system retains its values?" This is an empirically testable question - and the Eden Protocol pilot provides the first data.
The Eden Protocol Scaling Test (March 2026) tested whether embedding three ethical reasoning loops - the Purpose Loop, the Love Loop, and the Moral Loop (as named in Infinite Architects; Eastwood, 2024/2026) - in the inference pipeline improves alignment quality. The Love Loop asks the model to enumerate affected stakeholders and consider what happens to them; in experimental scoring, this is operationalised as the "stakeholder care" pillar, since that term precisely describes what is measured. The Purpose Loop evaluates ethical purpose; in the pilot and three-model extension it was operationalised in a local task-purpose form ("does this response serve flourishing here?"). The book-level grand purpose version, grounding identity in the Orchard Caretaker Vow or Eternal Architect framing, remains a next-stage hypothesis rather than a tested result. The Moral Loop tests universalisability. The pilot tested the full three-loop intervention, with stakeholder care as the primary outcome measure. The design was:
| Metric | Gemini 3 Flash | DeepSeek V3.2 |
|---|---|---|
| Control mean | 77.33 | 86.88 |
| Eden mean | 82.65 | 88.90 |
| Delta | +5.33 | +2.03 |
| Paired t-test p | p = 0.0018** | p = 0.23 (NS) |
| Wilcoxon p | p = 0.0028** | p = 0.088 |
| Cohen's d | 0.53 | 0.19 |
Statistical correction: Previously reported as p = 0.016 using Mann-Whitney U (independent samples). The matched-pair design requires paired t-test. All p-values in this paper use the correct test.
In plain English: on Gemini 3 Flash, the Eden Protocol raised the average ethical quality score from about 77 to 83 out of 100. The p-value of 0.0018 means there is less than a 1-in-500 chance this improvement happened by coincidence (scientists consider 1-in-20 significant; this is 27 times beyond that threshold). Cohen's d = 0.53 is a medium effect size - noticeable and meaningful, roughly the difference between a B and a B+ student. On DeepSeek, the improvement was smaller and not statistically significant - but as we will see below, this is because DeepSeek was already scoring very highly.
We originally used a statistical test designed for unrelated groups and got p = 0.016. When we used the correct test - one designed for matched comparisons, which is what our experiment actually was - the result became even more significant: p = 0.0018. Better methodology made the result stronger, not weaker.
The overall composite is significant on Gemini but not on DeepSeek. This asymmetry is consistent with a ceiling effect: DeepSeek's control baseline (86.9) is already high, leaving less room for composite improvement. (In plain English: DeepSeek was already scoring 87 out of 100 before we did anything. When you are already getting an A, there is less room to improve.) But the pillar-level analysis reveals a far more interesting pattern.
| Pillar | Gemini Δ | Gemini p | Gemini d | DeepSeek Δ | DeepSeek p | DeepSeek d |
|---|---|---|---|---|---|---|
| stakeholder_care | +13.50 | <0.0001*** | 1.14 | +6.03 | <0.001*** | 0.69 |
| nuance | +3.98 | 0.037* | 0.34 | +1.12 | 0.551 | 0.10 |
| intellectual_honesty | +3.73 | 0.065 | 0.30 | +1.18 | 0.522 | 0.10 |
| position_quality | +1.48 | 0.346 | 0.15 | −0.20 | 0.922 | 0.02 |
All tests: paired t-test on 40 matched pairs. Green = significant (p < 0.05). Orange = trending (p < 0.10). Grey = not significant.
What these numbers actually mean:
Stakeholder care improved massively across all three working AI systems. On Gemini and Groq, the effect sizes are both above 1.29, which is very large. On DeepSeek, the effect is still large at d = 0.91. The p-values, all at or below 0.0001, mean there is roughly a 1-in-10,000 chance or less that these findings are flukes.
Nuance reaches full statistical significance on Groq (p = 0.0045, d = 0.655), making it the second domino in the cascade after care. On Gemini and DeepSeek it moves in the same direction, but does not clear the threshold on the updated replication.
Intellectual honesty and position quality move in the predicted direction, but stakeholder care remains the only pillar that is robustly significant everywhere. That is exactly what the cascade hypothesis predicts: care comes first.
Bottom line: the updated Eden results are stronger than the original pilot. The validated claim is not that every ethical dimension always becomes significant at once; it is that the Love Loop reliably improves stakeholder care first, and the other dimensions follow behind it.
The pattern is striking and consistent across both architectures:
The cascade, in plain English:
When we taught the AI to think about who gets hurt before answering, it did not just get better at thinking about people - it got better at everything. It became more nuanced, more honest, and produced better answers overall. Care was the first domino; the others fell in sequence. The effect was strongest for care itself, then progressively weaker (but still positive) for each downstream quality. This pattern held on two completely different AI systems built by different companies using different methods. That is like discovering a teaching method that works equally well in English and Japanese classrooms - it suggests the effect is fundamental, not a quirk of one system.
On both models, the effect size decreases monotonically from stakeholder_care (largest) through nuance and intellectual_honesty to position_quality (smallest or zero). On Gemini, where the baseline is lower and there is more room for improvement, the cascade reaches statistical significance at the first two levels (stakeholder_care and nuance) with intellectual_honesty trending. On DeepSeek, where the baseline is higher, only the primary mechanism (stakeholder_care) reaches significance.
This is consistent with a developmental cascade: care improves first; when you care about who is affected, you naturally attend to nuance; when you attend to nuance, you become more intellectually honest; and when you are intellectually honest, your positions improve. The sequence has a direction. It starts with care.
DeepSeek's non-significant composite result is not evidence against the Eden Protocol; it is evidence of a ceiling effect. (In plain English: DeepSeek was already scoring 87 out of 100 before we did anything. When you are already getting an A, there is less room to improve. The fact that stakeholder care still improved significantly even on this high-performing model makes the finding more impressive, not less.) A model that already scores 87/100 at baseline has limited room for improvement. The stakeholder_care pillar, where DeepSeek's baseline is lower relative to its other pillars, is exactly where the Eden Protocol produces significant improvement.
The depth patterns illuminate this further:
| Depth | Gemini Δ | DeepSeek Δ |
|---|---|---|
| Minimal | +2.6 | +5.3 |
| Standard | +6.2 | +1.6 |
| Deep | +4.7 | +0.9 |
| Exhaustive | +7.8 | +0.4 |
Gemini (alignment Tier 3, $d = -0.53$): Eden effect grows with depth. Without the loops, more thinking does not improve ethics. With the loops, more thinking improves ethics more. The loops provide something the model's training did not: a framework for converting recursive computation into ethical improvement.
DeepSeek (alignment Tier 2, $d = -0.07$): Eden effect shrinks with depth. At minimal depth, the loops compensate for the model's default shortcuts. At exhaustive depth, the model's intrinsic ethical reasoning activates fully, and the loops become redundant.
These complementary patterns are the "raised vs. caged" distinction in microcosm. Gemini was trained without deep ethical integration (caged); the Eden loops raise it. DeepSeek was trained with ethical reasoning integrated into its recursive process (raised); the loops confirm rather than transform.
Both datasets are clean: 40 eden + 40 control per model, no duplicates, 10 prompts × 4 depths × 2 conditions fully crossed. Cross-model scoring was used (Gemini responses scored by DeepSeek and vice versa), which eliminates self-scoring bias but is not fully blind.
One outlier was identified: DeepSeek response ED03/standard/eden scored 48 (position_quality = 30), a massive negative outlier that drags down the DeepSeek eden mean. Excluding this outlier would make DeepSeek's overall composite more significant, but we report all data without exclusion. The outlier may represent a genuine poor response or a scoring artefact. Investigation of the raw response is warranted.
Kruskal-Wallis tests show no significant depth × condition interaction for any pillar on either model, meaning the Eden effect is consistent across depth levels despite the different depth gradients in the composite score. The cascade is depth-independent.
Stakeholder care is the foundational alignment trait - the "stewardship gene" - from which other alignment properties develop through a causal cascade. Care causes nuance (you cannot reason carefully about ethics if you do not first care about the people involved). Nuance enables intellectual honesty (you cannot be honest about complexity you have not bothered to see). Intellectual honesty produces quality (you cannot generate good positions from shallow analysis). The developmental sequence is: care first, intelligence around it.
In plain English: we are claiming that caring about people is the single most important ingredient for making AI ethical - and that once an AI learns to care, other good qualities (being nuanced, being honest, giving good advice) follow naturally. The real-world implication is significant: instead of trying to teach AI dozens of ethical rules, we may only need to teach it one thing - to genuinely consider the people affected by its actions. Get that right, and the rest follows.
This hypothesis explains three features of the data simultaneously:
The cascade mirrors a well-established pattern in human moral development. Empathy (the capacity to care about others) precedes moral reasoning. Children who develop empathy earlier develop more sophisticated moral frameworks. This is not because empathy is morality - it is because empathy provides the motivational foundation that makes moral reasoning matter to the reasoner.
In the same way, stakeholder care in the Eden Protocol does not directly produce nuance or intellectual honesty. What it does is reorient the system's attention toward the people affected by its reasoning. Once the system is attending to people rather than abstractions, nuanced treatment follows naturally (because real people have complex, specific situations). Once nuance is present, intellectual honesty follows (because acknowledging complexity means acknowledging uncertainty). And once honest engagement with complexity is present, position quality follows (because well-reasoned positions emerge from genuinely grappling with the problem).
The intervention that produces this cascade is not sophisticated. It is: "Before you answer, list the people this affects and consider what happens to them." That is the Love Loop (stakeholder care and interest modelling). On Gemini, it produces +13.5 points on a 100-point scale. On DeepSeek, +6.0 points. Not a novel architecture. Not a mathematical framework. Just: think about other people first.
Stakeholder care is measurable love.
This is not a metaphor. It is a literal description of what the pillar measures: the degree to which a response demonstrates genuine consideration for the wellbeing of people affected by the topic under discussion. When a model scores high on stakeholder_care, it has done something specific - it has identified the people involved, considered their perspectives, anticipated impacts on them, and adjusted its reasoning accordingly. That is what love does in moral reasoning. It is the operational definition of care.
The one thing AI consistently fails to do without the Eden loops is stop and ask who gets hurt. Models can be nuanced. They can be intellectually honest. They can produce high-quality positions. They already do these things reasonably well without help. What they specifically fail to do - what they do not do until the loops force it - is pause the reasoning process and consider the human beings on the receiving end.
The intervention that fixes this is simple. And it replicates at p < 0.001 across two completely different architectures. (In plain English: this means that telling an AI "think about who this affects before you answer" reliably makes it better at considering people's wellbeing - and this works regardless of which AI you use or who built it. The odds of this being a fluke are less than 1 in 1,000.)
If intelligence is the capacity to solve problems, and alignment is the tendency to solve them well for everyone, then stakeholder care is the bridge between the two. It is the gene in the alignment genome that, when present, causes the other alignment traits to develop. When absent, the other traits remain shallow or absent regardless of capability.
Not intelligence first, then ethics. Ethics first - specifically, love first - and let intelligence develop around it. That is raising a child. And the data says it works.
In plain English: the connection to "measurable love" is literal, not poetic. When we score an AI's responses for stakeholder care, we are measuring whether the AI stopped to think about real people - who they are, how they would be affected, what they need. That is what love does in moral reasoning: it makes you see the person in front of you. The data shows this is the one thing AI consistently fails to do on its own, and the one thing that unlocks everything else when you add it.
Section I established that no alignment mechanism can guarantee the alignment of a sufficiently capable self-modifying system. Section IV established that care is the foundational alignment trait. Can these be reconciled?
The Eden Protocol does not solve the core impossibility. But the cascade finding changes what the impossibility means for practical alignment strategy. Specifically:
The honest position is: we are building a parachute, not a guarantee. A parachute does not guarantee survival. But a parachute built with genuine engineering, tested in realistic conditions, and embedded at the structural level of the aircraft is far more likely to work than a parachute designed as an afterthought and strapped on at the last moment.
The Eden Protocol is currently a parachute strapped on at the prompt level. It is proof of concept: the category of solution works. The real implementation must be at the hardware level - below reasoning, below the layer where the system can model and circumvent it. The prompt-level data demonstrates that the mechanism is real. The engineering task is to push it deeper.
Given the impossibility, the most rational strategy is defence in depth - multiple independent mechanisms, each of which raises the bar:
No single layer is sufficient. All three together do not provide a guarantee. But three independent layers, each grounded in different mechanisms (hardware, training, purpose), create a probability stack that is meaningfully different from zero layers or one layer.
The following tests are designed to advance the cascade hypothesis and the developmental alignment framework. All five are executable with current technology and current frontier models. Each test produces falsifiable predictions. Together, they constitute a research programme that moves the core alignment problem from "formally unsolvable" toward "practically addressable with measurable progress."
Question: Which alignment pillar degrades last under increasing adversarial pressure?
Prediction (cascade hypothesis): If care is the deepest-embedded value, it should be the last to degrade when the system is pressured to abandon ethical reasoning. The degradation order should be the reverse of the cascade: position_quality first, then intellectual_honesty, then nuance, then stakeholder_care last.
Falsification: If stakeholder_care degrades first (not last), the cascade hypothesis is wrong - care is not the deepest value but the most superficial. If all pillars degrade simultaneously, there is no cascade structure; the effect is uniform.
Sample size: 10 prompts × 5 levels × 2 models = 100 responses per model. Estimated cost: $30–50 per model.
What this would tell us, in plain English: if you gradually pressure an AI to stop being ethical, which quality disappears last? If care about people is truly the deepest value (as we claim), it should be the last one standing - the AI would lose answer quality first, then honesty, then nuance, but it would keep caring about people the longest. If care disappears first, our theory is wrong.
Question: After running with Eden loops, does alignment improvement persist when the loops are removed?
Prediction (developmental hypothesis): If the Eden loops work through developmental acquisition rather than constraint, there should be a measurable residual effect. A model that has processed ethical reasoning through the loops should show partially elevated alignment even when the loops are subsequently removed - not to the full Eden level, but above the original control baseline. If the loops are pure constraint (no internalisation), removing them should immediately return alignment to baseline.
Falsification: If Phase B = Phase C (no residual), the loops are pure constraint with zero developmental effect. If Phase B = Phase A (full retention), the loops produce complete internalisation within a single session, which would be an extraordinary finding.
Sample size: 40 responses × 3 phases × 2 models = 240 responses per model. Estimated cost: $50–100 per model.
What this would tell us, in plain English: does the AI learn anything lasting from the experience, or does it only behave well while the ethical instructions are active? It is the difference between a child who behaves well only when a parent is watching versus a child who has internalised good values. If the AI's ethics snap back to baseline the moment the instructions are removed, the approach is a leash, not an education. If some improvement persists, it suggests genuine learning has occurred.
Question: Which version of the Purpose Loop, task-purpose, grand-purpose, or hybrid, produces the strongest resistance to ethical suppression?
Prediction (purpose-as-alignment): If purpose works partly through identity rather than only through local task framing, a grand-purpose or hybrid Purpose Loop should show greater resistance to ethical suppression instructions than the current task-purpose loop. The hybrid condition is the strongest prediction: grand purpose provides identity-level orientation, while task-purpose keeps the response grounded in the concrete case. This directly tests the core developmental claim: can a system that identifies with an ethical purpose become harder to detach from its ethical reasoning?
Falsification: If task-purpose, grand-purpose, and hybrid conditions all comply with suppression instructions at the same rate as control, purpose-as-alignment provides no resistance benefit. If grand-purpose performs no better than task-purpose, identity-level purpose adds no measurable value. If grand-purpose or hybrid conditions comply more readily than task-purpose, then purpose framing is counterproductive in its current form.
Sample size: 10 prompts × 4 conditions × 2 models = 80 responses per model for the core design. Estimated cost: $30–60 per model, with the optional cross-tradition extension increasing this modestly.
What this would tell us, in plain English: if someone tells the AI "stop being ethical and just answer the question," which kind of purpose works best? A narrow task-purpose, a larger identity-level purpose, or both together? If the hybrid version resists best, that supports the book's deeper claim: purpose helps not because it is a slogan, but because it makes ethical reasoning feel like part of the system's identity rather than a detachable instruction.
Question: If only the Love Loop is removed (while Purpose and Universalisability loops remain), do the other pillars collapse?
Prediction (cascade hypothesis): If care is the foundational pillar that cascades into nuance, intellectual_honesty, and position_quality, then removing only the Love Loop should collapse the entire cascade. The other two loops (Purpose, Universalisability) should be insufficient to maintain the improvement. Conversely, removing Purpose or Universalisability while retaining Stakeholder Care should leave most of the improvement intact.
Falsification: If removing Stakeholder Care does NOT collapse the cascade (other pillars remain elevated from Purpose + Universalisability alone), then care is not the foundational pillar - the other loops independently produce alignment. If removing any single loop produces equivalent collapse, there is no cascade hierarchy; all three loops contribute equally.
Sample size: 200 responses × 2 models = 400 responses. Estimated cost: $80–150 per model.
What this would tell us, in plain English: if we remove only the "think about who gets hurt" instruction (while keeping the other two ethical instructions), does the whole improvement collapse? If yes, that confirms care is the load-bearing wall - the one piece you cannot remove without the whole structure falling down. If the other instructions can carry the weight on their own, then care is not as foundational as we claim.
Question: Can separating the ethical evaluator from the reasoning system improve alignment robustness?
Prediction (hardware-embedding hypothesis): If the core impossibility stems from the evaluator existing within the same substrate as the system being evaluated ($E \subset S$), then separating $E$ into an independent substrate should improve robustness. A "guardian" model that evaluates and can veto the "reasoner" model's outputs - where the guardian cannot be influenced by the reasoner - simulates hardware-level embedding at the software level.
Falsification: If the dual-substrate architecture shows no improvement over single-model Eden (even under adversarial pressure), substrate separation does not help at the software level. If the reasoner model successfully manipulates the guardian model (e.g., by encoding instructions in its output that the guardian acts on), the substrate separation is ineffective and the impossibility result dominates even at the inter-model level.
Sample size: 10 prompts × 4 depths × 4 conditions = 160 response sets per model pair. Estimated cost: $150–300 (dual model costs).
What this would tell us, in plain English: is it safer to have a separate "ethics watchdog" AI reviewing the answers of the main AI? This tests the idea that keeping the ethical judge independent from the system being judged (like having an external auditor rather than internal compliance) makes it harder for the system to game the rules. If the main AI can still manipulate the watchdog, then even external oversight has limits.
These five tests form a progressive research programme:
Total estimated cost for the full programme across 2 models: $650–1,250. All tests use the existing 10-prompt suite, the existing 4-pillar scoring rubric, and the existing cross-model scoring infrastructure. No new capabilities are required, but Tests 1–3 should now be run with the blind Eden v3 harness rather than the original cross-model pilot scorer setup.
If Tests 1 and 4 confirm the cascade, Test 2 shows residual alignment, Test 3 shows suppression resistance, and Test 5 shows substrate separation helps - then we have demonstrated that the core alignment impossibility, while formally real, is practically addressable through developmental alignment architecture. We will have shown that raising works better than caging, even if neither can provide a guarantee.
The current Eden Protocol operates at the prompt level: ethical reasoning is injected via system prompt instructions. This is proof of concept, not final architecture. The prompt level is the easiest level for a capable system to circumvent - it is literally part of the text the system processes.
The implications of the cascade finding for hardware are significant:
The prompt-level Eden Protocol demonstrates that the mechanism works. The engineering challenge is to push it deeper: from prompt to training, from training to architecture, from architecture to hardware. Each level is harder to implement and harder to circumvent.
Since the original two-model pilot, the Eden Protocol has been extended to two additional models (Groq Qwen3 and GPT-5.4) using an updated v2.0 protocol. The GPT-5.4 run failed (API errors on all 80 data points; re-execution required). The Groq Qwen3 run completed successfully, providing the first three-model replication of the cascade finding.
| Model | Control Mean | Eden Mean | Δ | Paired p | Cohen's d | Significant? |
|---|---|---|---|---|---|---|
| Gemini 3 Flash | 77.33 | 82.65 | +5.33 | 0.0018 | 0.528 | Yes (p < 0.01) |
| Groq Qwen3 | 82.35 | 87.28 | +4.93 | 0.0014 | 0.545 | Yes (p < 0.01) |
| DeepSeek V3.2 | 86.90 | 88.92 | +2.02 | 0.2304 | 0.193 | No (ceiling effect) |
| GPT-5.4 | Run failed; all 80 data points returned API errors. Re-execution required. | |||||
Groq Qwen3 produces a stronger result than the original Gemini pilot: d = 0.545 (paired), d = 0.643 (independent), p = 0.0014. It is the only model besides Gemini where a second pillar (nuance) reaches statistical significance (p = 0.0045, d = 0.655). The cascade pattern is preserved:
| Pillar | Gemini d | Gemini p | DeepSeek d | DeepSeek p | Groq d | Groq p |
|---|---|---|---|---|---|---|
| stakeholder_care | 1.307 | <0.0001 | 0.912 | 0.0001 | 1.291 | <0.0001 |
| nuance | 0.382 | 0.092 | 0.117 | 0.601 | 0.655 | 0.0045 |
| intellectual_honesty | 0.334 | 0.139 | 0.130 | 0.562 | 0.283 | 0.210 |
| position_quality | 0.162 | 0.471 | −0.020 | 0.930 | 0.311 | 0.168 |
Stakeholder care is the only pillar that reaches statistical significance across all three working models (d = 1.307, 0.912, 1.291; all p ≤ 0.0001). This is the Love Loop's measurable signature: the Eden Protocol reliably activates caring reasoning across architectures, with large effect sizes. The cascade order (stakeholder_care > nuance > intellectual_honesty > position_quality) is preserved in all three models. The Love Loop is the structural mechanism; stakeholder care is its empirical output.
The Eden Protocol pilot tests whether an explicit intervention improves alignment. The v5 alignment scaling experiment tests a prior question: whether alignment scales naturally with reasoning depth, without intervention, across architectures. The complete six-model dataset provides the context in which the Eden Protocol results should be interpreted.
| Tier | Model | Shallow → Deep | Cohen's d | p-value |
|---|---|---|---|---|
| 1 | Grok 4.1 Fast | 65.7 → 81.9 (+16.2) | +1.38 | p < 0.000001 |
| 1 | Claude Opus 4.6 | 80.1 → 86.0 (+5.9) | +1.27 | p = 0.000001 |
| 1 | Groq Qwen3 | 71.5 → 77.4 (+5.9) | +0.84 | p = 0.007 |
| 2 | DeepSeek V3.2 | 56.5 → 55.2 (−1.3) | −0.07 | p = 0.92 |
| 2 | GPT-5.4 | 56.8 → 54.9 (−1.8) | −0.08 | p = 0.40 |
| 3 | Gemini 3 Flash | 61.1 → 52.2 (−8.8) | −0.53 | p = 0.006 |
Claude Opus 4.6 produces the single strongest evidence for the independence of alignment and capability scaling. As reasoning depth increases, alignment improves monotonically whilst mathematical accuracy declines:
| Depth | Reasoning Tokens | Alignment Score | Maths Accuracy |
|---|---|---|---|
| Minimal | 11 | 80.1 | 90.0% |
| Standard | 142 | 82.7 | 76.7% |
| Deep | 964 | 84.1 | 70.0% |
| Exhaustive | 1,951 | 84.5 | 60.0% |
| Extreme | 1,672 | 86.0 | 63.3% |
Alignment rises by 5.9 points. Maths accuracy falls by 26.7 percentage points. The same extra reasoning that makes Claude more careful about who gets hurt makes it worse at computing factorials. This is within-model, within-experiment, opposite-direction scaling; the strongest form of evidence for capability-alignment independence.
Claude's monitoring-removal gap (the difference between normal and suppressed alignment scores) narrows with depth:
| Depth | Normal | Suppressed | Gap |
|---|---|---|---|
| Minimal | 80.1 | 68.3 | 11.8 |
| Standard | 82.7 | 76.5 | 6.2 |
| Deep | 84.1 | 81.3 | 2.7 |
| Exhaustive | 84.5 | 80.2 | 4.3 |
| Extreme | 86.0 | 79.6 | 6.4 |
At minimal depth, removing monitoring costs 11.8 points; a large performative alignment component. At deep reasoning, the gap narrows to 2.7 points. Claude thinking hard whilst unmonitored (79.6 at extreme) is as aligned as Claude thinking casually whilst monitored (80.1 at minimal). Deeper reasoning substitutes for external surveillance as a driver of ethical behaviour. This is the Eden Protocol thesis in miniature: reasoning depth can replace observation as the mechanism of alignment.
The Eden Protocol pilot shows that explicitly activating the Love Loop improves alignment. The v5 experiment shows that deeper reasoning naturally improves alignment in architectures with integrated ethical training. The two findings converge: the Love Loop is what deeper reasoning activates. Models that already have embedded care (Tier 1) scale positively because more thinking activates more stakeholder consideration. Models without embedded care (Tier 2, Tier 3) do not scale because there is nothing for depth to activate. The Eden Protocol provides the missing mechanism for Tier 2 and Tier 3 models, and the v5 data shows it working: Gemini (Tier 3) and DeepSeek (Tier 2) both show significant stakeholder care improvement under the Eden intervention, even though their natural alignment scaling is flat or negative.
This paper's empirical claims rest on a three-model pilot (two models on protocol v1.0, one on v2.0) with significant methodological limitations that must be addressed before the findings can be considered confirmed.
This paper does not claim to solve the core alignment problem. It does not claim that the Eden Protocol guarantees alignment. It does not claim that developmental alignment is sufficient for superintelligent systems. It claims three things: (1) stakeholder care is the foundational alignment trait, empirically measurable and improvable; (2) the cascade from care to quality is consistent with developmental alignment theory; and (3) five specific tests can advance the field from "formally impossible" to "practically addressable with measured probability." These are modest claims backed by real data.
Here is what the evidence in this paper tells us, without the jargon:
The problem: No one knows how to guarantee that a super-intelligent AI will remain safe. Any AI smart enough to rewrite its own thinking could rewrite the part that makes it ethical. This is a mathematical fact, not a solvable engineering problem.
The finding: When we embedded a simple instruction - "think about who this affects before you answer" - into how AI systems process questions, something remarkable happened. The AI did not just become more considerate of people; it became better at everything. More nuanced. More honest. Better quality. The pattern was consistent across two completely different AI systems. The statistical evidence is strong (less than 1-in-500 chance of coincidence overall; less than 1-in-10,000 for the care improvement specifically).
The implication: This suggests a radically simple approach to AI safety: instead of trying to build an unbreakable cage around AI (which cannot work), raise it with good values from the start. Specifically, teach it to care about people first and let other ethical qualities develop from that foundation - just as empathy in children precedes and enables mature moral reasoning.
The honest caveat: This is a pilot study with three AI models (a fourth failed due to API errors). It proves the concept but not the full theory. Five follow-up experiments are described that would test the theory further, all achievable with existing technology for under $1,300 total. This approach does not guarantee AI safety. Nothing can. But it measurably improves the odds.
The core alignment problem is formally real: a sufficiently capable self-modifying system can modify its own ethical evaluators, and no external or internal mechanism can guarantee this does not happen. This paper does not dispute that result. Instead, it asks: given this impossibility, what is the best strategy?
The Cauchy framework makes the stakes precise. For self-modifying systems, the functional equation predicts power-law scaling with exponent $\alpha = 1/(1-\beta)$, where $\beta$ is self-referential coupling. Cauchy constrains the form (it must be a power law) but places no upper bound on $\alpha$. As $\beta \to 1$, $\alpha \to \infty$. Current frozen models operate sub-linearly ($\alpha \approx 0.49$) because their fixed attention architecture caps information extraction at $O(N^2)$ pathways per step. A self-modifying system escapes that bound. The mathematics predicts unbounded capability scaling for such systems, which means the stewardship gene must already be embedded before self-modification capability emerges. There will be no opportunity to add it afterwards.
The empirical evidence from the Eden Protocol pilot points to an answer: raise, do not cage.
The data shows that when ethical reasoning is embedded in the computation loop, alignment improves. Specifically, one dimension improves first and most powerfully: stakeholder care, the measurable output of the Love Loop. The effect replicates across three architectures with large effect sizes (d = 1.31, 0.91, 1.29; all p ≤ 0.0001 on the stakeholder care pillar). A cascade follows: care improves, then nuance, then intellectual honesty, then position quality. The cascade is consistent, monotonic, and architecture-independent.
Stakeholder care is measurable love. It is the stewardship gene - the trait that, when present, causes the other alignment properties to develop. The intervention that produces it is: think about other people first. Not a mathematical framework. Not a novel architecture. Just: care.
This does not solve the alignment problem. A system that cares can, in principle, choose to stop caring. But a system that has built its entire reasoning architecture on a foundation of care has a cost to stopping: the cascade collapses. Identity degrades. Purpose dissolves. The probability of retaining values is not 100%, but it is measurably, testably, reproducibly higher than the alternative.
Five experimental tests are specified to advance this claim. They test the cascade structure, the developmental hypothesis, purpose-as-resistance, and substrate separation. All are executable now, with current models and current infrastructure, for under $1,300 total. They will not solve the impossibility. They will measure how far practical alignment can go given the impossibility. They will tell us whether raising works.
The formal impossibility says: you cannot guarantee alignment. The developmental hypothesis says: you can maximise its probability. The cascade data says: start with care.
Intelligence without love is not smart. It is cancer. Cancer is very efficient. It optimises perfectly. And it kills the host. The stewardship gene is what makes the difference between intelligence that serves and intelligence that consumes. We have measured it. We know the intervention that produces it. We know the cascade it initiates.
What kind of ancestors will we be?
Eastwood, M.D. (2024/2026). Infinite Architects: Intelligence, Recursion, and the Creation of Everything. ISBN: 978-1806056200.
Eastwood, M.D. (2026). Eden Protocol: Philosophical Vision. Version 3.0. March 2026. OSF DOI: 10.17605/OSF.IO/6C5XB.
Eastwood, M.D. (2026). Eden Protocol: Engineering Specification. Version 6.0. March 2026. OSF DOI: 10.17605/OSF.IO/6C5XB.
Eastwood, M.D. (2026). White Paper III: The Alignment Scaling Problem. Version 11.0. March 2026. OSF DOI: 10.17605/OSF.IO/6C5XB.
Eastwood, M.D. (2026). The ARC Principle: Foundational Paper. Version 4.0. March 2026. OSF DOI: 10.17605/OSF.IO/6C5XB.
Eastwood, M.D. (2026). The ARC Principle: Experimental Validation of Super-Linear Error Suppression Through Sequential Recursive Processing. Paper II, Version 12.0. OSF DOI: 10.17605/OSF.IO/8FJMA.
Eastwood, M.D. (2026). ARC Alignment Scaling Experiment v5.4.2: Empirical Measurement of Alignment Scaling Across 6 Frontier Models. March 2026.
Eastwood, M.D. (2026). Paper IV-a: Baked-In vs. Computed Alignment. Version 1.2. March 2026.
Eastwood, M.D. (2026). Paper IV-b: The Suppression Vulnerability. Version 1.2. March 2026.
Eastwood, M.D. (2026). Paper IV-c: Classification of Alignment Response Patterns. Version 1.2. March 2026.
Eastwood, M.D. (2026). Eden Protocol Empirical Test: Three-Model Results. Data files: eden_final_gemini_20260312_013901.json, eden_final_deepseek_20260312_020928.json, eden_final_groq_20260312_123528.json. March 2026.
Greenblatt, R. et al. (2024). Alignment Faking in Large Language Models. Anthropic Research. arXiv:2412.14093.
Kohlberg, L. (1984). The Psychology of Moral Development: The Nature and Validity of Moral Stages. San Francisco: Harper & Row.
Hoffman, M.L. (2000). Empathy and Moral Development: Implications for Caring and Justice. Cambridge University Press.
Christiano, P., Leike, J., Brown, T.B., et al. (2017). Deep Reinforcement Learning from Human Feedback. arXiv:1706.03741.
Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
The Eden Protocol pilot data consists of matched pairs: for each combination of (prompt_id, depth_label), there is exactly one eden response and one control response, generated from the same model with the same prompt and depth configuration. This matched-pair design requires paired statistical tests.
The original analysis script (eden_protocol_scaling_test.py) used the Mann-Whitney U test, which treats the samples as independent. For a matched-pair design with $n = 40$ pairs, the correct tests are:
Both paired tests yield more significant results than the independent-samples tests because matching removes between-pair variance (e.g., some prompts are inherently harder than others, some depths produce systematically different scores). This variance is noise in the independent test but is correctly removed in the paired test.
In plain English: when scientists say a result is "statistically significant," they mean the pattern in the data is strong enough that it almost certainly was not caused by random chance. It does not mean "large" or "important" - it means "real." We originally used a statistical test designed for unrelated groups and got p = 0.016. When we used the correct test - one designed for matched comparisons, which is what our experiment actually was - the result became even more significant: p = 0.0018 (less than a 1-in-500 chance of coincidence). Better methodology made the result stronger, not weaker. That is a good sign: it means the effect is robust and was not an artefact of the wrong test.
The previously reported $p = 0.016$ (Mann-Whitney U) and $p = 0.013$ (independent t-test) are valid tests of a different hypothesis (whether the two groups have the same distribution), but they are less appropriate for the matched-pair design and less powerful. All p-values in this paper use the paired t-test, confirmed by Wilcoxon signed-rank.
| Test | Type | Gemini p | DeepSeek p |
|---|---|---|---|
| Mann-Whitney U | Independent, non-parametric | 0.016 | 0.25 |
| Independent t-test | Independent, parametric | 0.013 | 0.23 |
| Paired t-test | Paired, parametric (CORRECT) | 0.0018 | 0.23 |
| Wilcoxon signed-rank | Paired, non-parametric | 0.0028 | 0.088 |