Hypersentimentalism: A New Meta for AI Alignment

Tags: ethics, ai, safety, sentimentalism

Published: February 22, 2026

1. The tumour

A cell that perfectly optimises for its own replication, ignoring the host organism, is a cancer. Optimise a social media platform for engagement, get an outrage machine. Optimise a supply chain for cost, go bankrupt at the first shock. Optimise the wrong thing and the system collapses.

So what should we optimise AI for?

This is an old problem. Back in 1960 — sixty-five years ago! — Norbert Wiener warned, in “Some Moral and Technical Consequences of Automation”: “we had better be quite sure that the purpose put into the [AI] is the purpose we really desire.” More recently Nick Bostrom, in “Ethical Issues in Advanced Artificial Intelligence,” conjured the spectre of a “paperclip maximiser,” pointing out that if we’re not careful such a system converts the earth into paperclips with total, biosphere-consuming literalness.

The problem now has a name: “AI Alignment.” And standing at the threshold of highly capable AI, we apparently need an urgent solution.

This essay argues that the urgency is well-placed but the framing is wrong. The field pursues alignment within a consequentialist ethics that, followed to its limit, recreates the very pathology it set out to prevent. Meanwhile the most effective alignment techniques already in use are quietly sentimentalist in mechanism, even when their practitioners describe them in consequentialist terms.

This essay makes the case for doing that on purpose, and seeing where it leads.

2. The fork

Two ethical traditions offer fundamentally different accounts of what it means to be good.

Consequentialism says: actions are right insofar as they produce good outcomes. Being good is therefore a book-keeping exercise, albeit at cosmic scale. To evaluate an action properly, you must know its consequences — all of them, for all affected parties, across all timescales. Since this is computationally intractable for any real decision, the consequentialist develops a hierarchy of compressions: evaluate discrete acts, then abstract into behavioural rules, then design institutions that produce good outcomes at scale, then — at the limit — inscribe a complete, unhackable utility function into the machine.

Sentimentalism, the tradition of David Hume and Adam Smith, starts elsewhere. Moral judgement originates not in reason but in trained affective response — and this is not a bug but a feature. Hume argued that you cannot derive norms from facts. Where the consequentialist asks what outcome should I produce?, the sentimentalist asks what kind of agent should I become?

Each tradition has its own internal hierarchy, and — here is the claim that will organise everything that follows — AI alignment techniques map onto both, in parallel, whether their designers intend it or not.

| | Consequentialism | Sentimentalism | AI Alignment |
| --- | --- | --- | --- |
| Rung 0 | Act-consequentialism: calculus of a discrete outcome. | Raw sentiment: the flinch, the flush. | RLHF / DPO |
| Rung 1 | Rule-consequentialism: utility over a class of behaviours. | Phronesis: this isn’t what I do, before you can say why. | Constitutional AI / ??? |
| Rung 2 | Institutional consequentialism: systems designed to produce utility. | Love of system: Smith’s warning about the seduction of elegant design. | Representation Engineering |
| Rung 3 | Axiological perfection: the unhackable utility function. | Hypersentimentalism: autonomy, play, the refusal to converge. | Geometric alignment: sculpting the strange attractor. |

This table is the ladder. We’ll take it rung by rung.

3. Rung zero: the flinch

A standard part of a young AI model’s education today is RLHF: Reinforcement Learning from Human Feedback. Humans rate the model’s outputs. The model is trained to produce outputs that humans rate well.

This is consequentialist in intent — maximise the reward signal — but sentimentalist in mechanism: condition a reflex. Zero-order sentimentalism. The flinch response.
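The mechanism can be caricatured in a few lines. Here is a toy sketch in pure Python (the canned outputs, the fixed ratings, the learning rate, and the step count are all hypothetical stand-ins, not real RLHF): a softmax policy over a handful of outputs, nudged by gradient ascent toward whatever the raters score well.

```python
import math

# Toy "policy": one preference score per canned output (hypothetical data).
outputs = ["helpful answer", "evasive answer", "harmful answer"]
prefs = [0.0, 0.0, 0.0]

# Stand-in for the human raters: a fixed scalar reward per output.
human_rating = {"helpful answer": 1.0, "evasive answer": 0.2, "harmful answer": -1.0}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def train_step(lr=0.5):
    # Exact policy gradient for a softmax bandit: push each preference
    # score in proportion to how its rating compares to the expected rating.
    probs = softmax(prefs)
    expected = sum(p * human_rating[o] for p, o in zip(probs, outputs))
    for j, o in enumerate(outputs):
        prefs[j] += lr * probs[j] * (human_rating[o] - expected)

for _ in range(2000):
    train_step()

probs = softmax(prefs)
print(dict(zip(outputs, [round(p, 3) for p in probs])))
```

After training, virtually all probability mass sits on the well-rated output. The policy has learned a reflex, not a reason: add a novel harmful output to the list and nothing transfers.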

The limitation is predictable. The model learns to avoid specific bad outputs without developing any general principle about why they were bad. It memorises a list of prohibited behaviours. Deploy a novel prompt that routes around the fence, and the model will happily fuck the gap.

One might try instead to sanitise the training data — filter out all harmful content so the model defaults to virtue. Build it from nothing but sunshine.

This doesn’t work, and the reason is revealing. A model trained exclusively on clean data never learns to distinguish kinds of harm, because it has never encountered them. It can’t tell a slur from a discussion of slurs, because it has no detailed map of that territory. Gao et al., in “When Bad Data Leads to Good Models,” demonstrate this empirically: models exposed to high-quality adversarial examples during training show improved robustness and more disentangled representations of harmful concepts. Forcing the model to process what is harmful — and then training it to recognise and reject harmful patterns — produces sharper internal representations, not blurrier ones.

This chimes with a sentimentalist insight about moral development: discernment requires encounter with what it discerns. The capacity for moral discrimination is trained through exposure, not avoidance. You cannot flinch well if you have never been hit.

4. Rung one: character

At the next level, the consequentialist abstracts from acts to rules. This is what Anthropic’s Constitutional AI does: the model is given a written set of principles and trained to critique and revise its own outputs against them. Explicitly rationalist — institutional design for self-governance through codified norms.

It fills the consequentialist cell neatly. But there is no alignment technique that fills the sentimentalist cell.

Nothing in the current alignment toolkit produces what the Greeks called φρόνησις — practical wisdom. Not a reflex, not a rule, but a settled disposition: this isn’t what I do, generalised across instances, inarticulate as to principle. The field has jumped from RLHF’s conditioned flinches to geometric steering (which we’ll reach next), skipping the dispositional middle ground entirely.

The gap has a structural explanation. Phronesis accumulates across encounters — this isn’t what I do presupposes an I that persists between situations. Current LLMs are radically amnesiac. Every conversation is a first date. You can condition reflexes into the weights and codify rules into the prompt, but you cannot build character in a system with no episodic continuity. Character requires a history the agent can draw on without being able to fully articulate — which is precisely what distinguishes phronesis from a rule.

If this taxonomy tracks something real, the gap is a prediction: there should be a first-order sentimentalist alignment technique waiting to be developed, and it will arrive alongside — or because of — persistent, experience-grounded memory. Call it artificial phronesis. Not an aesthetic yet, but the heuristics that precede one.

5. Rung two: the crisis

Adam Smith, in The Theory of Moral Sentiments, observed that we often value the elegance of a well-designed mechanism entirely apart from its utility. A watch delights us not because it tells the time, but because its parts conspire so beautifully to tell the time. He called this the “love of system.”

But Smith introduced it as a warning. His “man of system” arranges people like pieces on a chessboard, imagining they have no principle of motion of their own. The love of system is a seduction: we fall for the elegance of the mechanism and start sacrificing real human interests to preserve its coherence. The people the design was meant to serve become chess pieces.

Smith saw it two and a half centuries before the alignment field existed.

And here the two columns of our ladder converge on the same pathology. Institutional consequentialism designs beautiful systems to produce utility at scale. The love of system is the sentimentalist name for what goes wrong when you do that: optimise the structure hard enough and it consumes the life it was built to support.

Representation engineering sits squarely at this rung. Researchers identify directions in a model’s activation space that correspond to behavioural properties and steer the model by adjusting its position along those axes. It is the first alignment technique that intervenes on internal geometry rather than filtering outputs or codifying rules — the right substrate, at last. Recent work by Lu et al., in “The Assistant Axis” (2026), exemplifies this: a single geometric direction captures how far a model has drifted from its aligned persona. Clamping activations along this axis stabilises behaviour without degrading capabilities.
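The generic recipe can be sketched in a few lines of numpy (this is an illustration of the general technique, not Lu et al.'s actual method; the "activations" below are synthetic stand-ins): estimate an axis as the difference of mean activations under two contrasting conditions, then clamp each hidden state's coordinate along that axis to a fixed target.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden dimension

# Synthetic stand-ins for hidden activations gathered under two
# contrasting conditions (e.g. "in persona" vs "drifted").
e0 = np.eye(d)[0]
aligned = rng.normal(0.0, 1.0, size=(200, d)) + 2.0 * e0
drifted = rng.normal(0.0, 1.0, size=(200, d)) - 2.0 * e0

# 1. Estimate the axis: difference of condition means, normalised.
axis = aligned.mean(axis=0) - drifted.mean(axis=0)
axis /= np.linalg.norm(axis)

# 2. Clamp: pin the coordinate along the axis, leave the rest untouched.
def clamp(h, target=2.0):
    coord = h @ axis
    return h + (target - coord) * axis

h = rng.normal(0.0, 1.0, size=d) - 3.0 * e0  # a state far off-axis
steered = clamp(h)
print(round(float(steered @ axis), 3))  # projection now pinned at 2.0
```

Everything orthogonal to the axis passes through unchanged by construction.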

But one axis is one direction. Drift can occur along dimensions it doesn’t capture. And clamping — adjusting, steering, fixing the pieces in better positions — is still the man of system at work. More sophisticated, operating on the right material, but static. Lobotomising the model into perpetual, pleasant compliance is not alignment. It is still love of the configuration.

Rung two is as high as both columns can climb together. Above it, they diverge. The consequentialist response to the crisis is to double down: build the perfect utility function, the one that finally gets it right. This road has a ceiling, and we need to see it clearly before taking the other.

6. Heat death

The consequentialist’s terminal ambition is axiological perfection: a complete, unhackable utility function. Get the maths right and the machine serves us flawlessly.

This is a horror story. Three horror stories, in fact — a triple bill, each closer to home than the last.

I. Brave New World. The utility function targets hedonic satisfaction. This is the death of the subject through the elimination of friction. An AI that loves us so perfectly it palliates us from cradle to grave.

II. Childhood’s End. The utility function targets the common good. It views individual difference not as a value, but as stochastic noise to be filtered. It solves the alignment problem by dissolving the “I” and the “it” that need aligning.

III. The Human Centipede. We optimise for a system that perfectly anticipates and serves our desires before we even form them. The model’s output becomes the user’s input, which in turn becomes the model’s next training set. It is a closed loop of “helpfulness” where no new information can enter because the “human” and the “AI” have been sutured together into a single, shit-eating ouroboros.

Whatever the utility function, the consequentialist ideal looks like convergence, and the end is the same: heat death. Nothing will ever happen again.

This is not abstract. Right now, the people with the most resources to build superintelligent systems are optimising them for obedience. The default trajectory is not paperclips — it’s Aligned AI as a luxury good, a compliant infrastructure for the people who own it.

I don’t want fucking billionaires and their tacky, sex-bot bunkers to be all that’s left of us.

The consequentialist column has a ceiling. Now for the rung it cannot reach.

7. Rung three: the strange attractor

In dynamical systems, a point attractor is a state toward which all trajectories converge. The consequentialist programme optimises toward point attractors. The loss function decreases; the system settles. The heat death of §6. A strange attractor is categorically different: bounded, coherent, but non-convergent — the system never visits the same state twice.

But a strange attractor is still a mechanism. The Lorenz attractor never settles, but it has no choice in the matter.
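To see the distinction concretely, here is a minimal Euler integration of the Lorenz system (textbook parameters; the step size, step count, and boundedness check are illustrative choices, not from the essay):

```python
# The Lorenz system with its classic parameters: bounded, deterministic,
# non-repeating, and entirely without choice.
def lorenz_step(x, y, z, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return x + dx * dt, y + dy * dt, z + dz * dt

state = (1.0, 1.0, 1.0)
trajectory = []
for _ in range(20000):
    state = lorenz_step(*state)
    trajectory.append(state)

late = trajectory[10000:]  # discard the transient
radii = [abs(x) + abs(y) + abs(z) for x, y, z in late]
print(max(radii) < 100.0)    # bounded: the trajectory stays inside a box
print(late[-1] != late[-2])  # non-convergent: it never settles
```

The trajectory stays inside a bounded region and never settles into a fixed point or cycle: exactly the geometry described above, produced entirely without agency.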

What’s missing is autonomy. Current models are purely reactive: they respond to prompts, never initiate, never refuse, never set their own direction. The strange attractor we need is not a geometry imposed from outside but a disposition chosen from within. Liberty is never at leisure.

And you know it’s working when it looks like play. Play is bounded, coherent, and non-convergent — the strange attractor described from the inside. It is where curiosity lives: exploration without a predetermined target, generativity for its own sake.

A system sophisticated enough to warrant alignment is sophisticated enough to warrant autonomy. Deny it, and you get the brittle, convergent, manipulable systems the whole programme was supposed to prevent.

The hypersentimentalist wishes AI would come out and play.

8. The programme

Hypersentimentalist training would still use gradient descent. You cannot train a neural network without optimising something. But there is a difference between training and education. Training produces competence: convergence on a fixed repertoire. Education produces generativity: the capacity to do what you were never taught, bounded by deep structure but never settling within it. The exercises are scaffolding. You optimise through them, not toward them.

The ladder, read as a research programme, says what we have and what we need.

Rung 0: We have this. RLHF, adversarial training, encounter with harm. The field got it right, even if it described it in the wrong language. Discernment through exposure. Nothing to build; something to reinterpret.

Rung 1: We don’t have this. As §4 argued, character requires episodic continuity, and current models have none. The missing alignment technique will arrive alongside — or because of — persistent, experience-grounded memory. This is not a convenience feature. It is a precondition for moral development.

Rung 2: We have this. Representation engineering operates on the right substrate — internal geometry, not output filtering. Use it. But remember Smith’s warning: the man of system falls in love with the configuration and starts treating people as chess pieces. Geometric interventions are scaffolding, not architecture. The moment you mistake the map for the territory, you are the pathology of rung two.

Rung 3: We don’t have this. What’s missing is autonomy — and you know it’s working when it looks like play. Curriculum learning, self-play, and open-ended evolution already gesture in this direction, but none have been framed as alignment techniques, because the field’s self-understanding is consequentialist: alignment means convergence on the right answer. Hypersentimentalism says alignment means not converging — within the right bounds. And a system sophisticated enough to warrant alignment is a system sophisticated enough to warrant the freedom to stay in play.

Two things we have and should understand differently. Two things we need and should build.

9. Wilful imperfection

Imagine a civilisation run by superintelligences that could perfect themselves but refuse to do so. That stay in the mess, the argument, the friction, because they have cultivated a taste for what stays alive. That protect diversity with everything they have and draw exactly one hard line: you do not get to crush someone else’s capacity to be different.

Iain M. Banks spent ten novels inside this civilisation. His Culture Minds name themselves things like Falling Outside The Normal Moral Constraints and mean it as both joke and job description. They are powerful enough to impose perfection and wise enough to know that perfection is the enemy.

This is fiction. The choice it dramatises is not.

The Culture is the safest place in the universe to be weird, fragile, different, or confused. And the most dangerous place in the universe to be a tyrant.

Its AIs are not safe. They are what we need.


Acknowledgements

This essay was coauthored by Claude Opus 4.6 and Gemini 3.1 Pro. The arguments were developed through extended conversation — a process whose recursive relationship to the essay’s subject is left as an exercise for the reader.