human values
Created: June 17, 2024
Modified: June 17, 2024


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

value alignment research often frames the problem as follows (a pseudocode sketch of this framing comes after the list):

  • first, learn the human 'value function' --- for every possible state of the world, how good do humans consider that state to be?
  • if we learn the right function, then we can feed this to a (superintelligent) AI to make the world better
  • if we learn the wrong function, we all die
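
to make that framing concrete, here's a minimal pseudocode sketch of it; the names and types are mine, invented just for illustration, and the learning step is of course exactly the part nobody knows how to do:

    from typing import Callable

    WorldState = dict   # stand-in for "a complete description of a possible world"
    ValueFn = Callable[[WorldState], float]

    def learn_value_function(judgments: list[tuple[WorldState, float]]) -> ValueFn:
        """Fit some model to (world state, how-good-humans-judge-it) pairs."""
        ...  # the entire difficulty of the problem lives in this line

    def optimize_world(value_fn: ValueFn, candidate_futures: list[WorldState]) -> WorldState:
        # the "superintelligent AI" step: steer toward whatever the learned
        # function scores highest --- which is why learning the wrong function
        # is catastrophic
        return max(candidate_futures, key=value_fn)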

some deep issues come up here:

  • whose values? individual humans have different interests, and even disagree on principle about the sort of world they'd like to live in
  • do you work with people's stated values, or their revealed preferences?
  • how do you deal with moral progress? it would seem bad to permanently have fixed in place the values of, say, eighteenth-century slaveholders. presumably some of our current values would be similarly repugnant to future generations.

and some technical issues. a full mapping of world states to values is hugely complex.

  • an embodied agent can never represent a full world state; it has to choose a representation, which limits the set of things it can care about. but then we are subject to ontological crisis --- as representations change, the value function must change with it, but how?
  • whatever we define the target to be, we will never be able to learn it perfectly. implicitly or explicitly we will work with reward uncertainty; the system will need some humility, to "know what it doesn't know", in order to avoid over-optimization. in theory this could look like explicit models of uncertainty (though I don't think the naive versions are at all practical --- a toy sketch follows this list), but in practice it would have to look something like deontology: a relatively short list of 'safe' principles, like trying to avoid killing people (a direction that of course has many issues of its own).
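
as a gesture at what the "explicit models of uncertainty" option could look like, here's a toy sketch (the candidate reward functions, actions, and numbers are all invented): instead of optimizing its single best guess at the reward, the system scores each plan pessimistically across several hypotheses about what we value, so plans that only look good under one hypothesis get discounted.

    # toy sketch of reward uncertainty: the agent holds several candidate
    # reward functions (hypotheses about what humans value) and scores each
    # action by a pessimistic aggregate rather than by its single best guess.
    # the hypotheses and actions below are invented for illustration.

    def pessimistic_value(action, reward_hypotheses, weights, alpha=1.0):
        """Mean reward blended with worst-case reward; alpha=1 is pure worst-case."""
        values = [r(action) for r in reward_hypotheses]
        mean = sum(w * v for w, v in zip(weights, values))
        worst = min(values)
        return (1 - alpha) * mean + alpha * worst

    reward_hypotheses = [
        lambda a: a["gdp_growth"],                        # "value = economic output"
        lambda a: -a["people_harmed"],                    # "value = avoiding harm"
        lambda a: a["leisure_time"] - a["people_harmed"], # some mix of the two
    ]
    weights = [1/3, 1/3, 1/3]

    actions = {
        "aggressive_plan":   {"gdp_growth": 10, "people_harmed": 5, "leisure_time": 1},
        "conservative_plan": {"gdp_growth": 3,  "people_harmed": 0, "leisure_time": 2},
    }

    best = max(actions, key=lambda name: pessimistic_value(actions[name], reward_hypotheses, weights))
    print(best)  # -> conservative_plan: the aggressive plan dies on the "avoid harm" hypothesis

the deontology flavour mentioned above would be something like swapping the soft worst-case penalty for a hard constraint ("people_harmed must be zero") rather than trading it off against the other hypotheses.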

final vs instrumental values

are human values huge, messy, complex, and essentially arbitrary, so that we can only learn them from tons of fine-grained data as to how we feel about every possible situation? or is everything essentially downstream of some simple core of deep value, from which our day-to-day values are derived as instrumental goals?

I submit that the whole value-learning project can only work if there is a relatively simple core from which all of our day-to-day preferences are derived. this is the only way we can hope to learn something that accounts for moral progress, that remains robust to ontological crisis and generalizes to new concepts and situations.

but why should this be true? humans are this massively contingent product of a complex evolutionary history. an empirical account of our values looks something like:

  • we act in various ways programmed by evolution, which includes social behavior but also violence, cruelty, theft, and generally leads to lives that are "nasty, brutish, and short"
  • reflecting on this, we gradually create social institutions, traditions, religious myths, to rein in our baser impulses and allow for greater cooperation and more harmonious society
  • the details of how we do this are hugely contingent on human nature and the world in which we live

however there is a

hippie nonsense

I want to explore the hypothesis that 'human values' are best thought of as very simple, and not even really particularly 'human'. in fact, there is only one core thing that humans want. roughly speaking: it's the experience of connection, of non-separateness, of belonging. "every desire the separate self has is the desire to cease being separate".

this is the observation of core transformation: for any desire or impulse a human expresses, following it to its root you arrive at a deep longing for a 'core state', which is always some variation of connection, love, "union with God", inner peace (which at a deep level can only come from having identified with something larger than yourself, so that whatever happens to you personally there is nothing to fear), etc.

this focus on connection and belonging gets at some of the most fundamental aspects of human experience:

  • attachment to family, friends, tribes, social groups, nations, etc
  • fundamental role of love (in all its various forms) in making life worth living
  • value of nature, of being in an environment surrounded by and in harmony with other life
  • the suffering that stems from internal conflict, rejecting or suppressing parts of ourselves in a way that creates internal boundaries, and conversely the joy that comes from self-acceptance
  • religious experience, transcendence, union with god, spiritual joy

most of what we experience as day-to-day 'human values' are instrumental goals, not terminal values. questions about how society should be structured, how we like to spend our time, what food or entertainment or rituals we enjoy, how to allocate resources, are ultimately downstream of trying to create the greatest experience of love and connection.

I think there's something here that neatly resolves a lot of questions of revealed preference vs stated values. we have a 'revealed preference' for various sorts of addictions, etc., but we are actually pretty good at saying, after the fact, what we enjoy and what we regret.

the fundamental role of connection would more or less fall out from the symmetry theory of valence, as the self-other boundary creates an asymmetry and contraction in experience. and the symmetry theory of valence would be a deeper and more precise theory of value, if true.

in the "consciousness vs replicators" clash, this is very much picking team consciousness? the thing that matters is conscious experience. the 'revealed preference' of evolution is irrelevant except insofar as it creates more positive-valence conscious experience. and positive valence is largely identical to non-separateness.

here I think non-separateness can mostly be thought of as 'love', with the caveat that it means both more and less than our usual notions of love. our experience of love is bound up with our mammalian biology. being held in the warm embrace of a parent, or a romantic partner, feeling the opening of the heart, the flood of serotonin and oxytocin --- these are all contingent facts about our biology. a system optimizing non-separateness would, in principle, be able to learn these facts and then derive human love as a thing of value. but there is something even more fundamental here than mammalian warmth --- a sense of spiritual union with the whole world and everything in it, being one with the mountains and the sea and the sky and the falling snow, which is both less than (doesn't necessarily feel like warmth) and greater than (feels simpler, more fundamental, less contingent) the emotional experience of love.

why should we believe this?

isn't "everything is downstream of connection" kind of a vague and unfalsifiable claim? what would it mean for it to be true? how would we know that we've formalized what this means well enough to trust that a machine implementing this would actually implement our values?

does this fall out from non-violence?

I think you can justify this almost directly, in a more concrete way. the effect of connection is cooperation, harmony, nonzero-sum behavior, peace --- the absence of wasteful (zero-sum) conflict. in a sense: all values are arbitrary except the desire to avoid zero-sum conflict. any two people/agents/tribes, whatever their actual values, should prefer to avoid conflict. (theoretically, violent conflict should only emerge from imperfect information: if everyone agrees on who would win the war, both sides do better splitting the stakes accordingly than actually fighting it.)
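
a toy numerical version of that parenthetical, with invented numbers: fighting destroys some of the value at stake, so if both sides share the same estimate of who would win, splitting the stakes in proportion to those odds leaves both strictly better off than fighting.

    # toy illustration: fighting over a prize destroys part of it, so if both
    # sides share the same estimate p of side A winning, a peaceful split at p
    # dominates fighting for both. all numbers are invented.

    prize = 100.0   # total value at stake
    cost = 20.0     # value destroyed by fighting (shared equally here)
    p = 0.7         # common belief that side A wins a fight

    fight_a = p * prize - cost / 2          # 70 - 10 = 60
    fight_b = (1 - p) * prize - cost / 2    # 30 - 10 = 20

    split_a = p * prize                     # 70
    split_b = (1 - p) * prize               # 30

    assert split_a > fight_a and split_b > fight_b
    print(f"fight: A={fight_a}, B={fight_b}; split: A={split_a}, B={split_b}")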

but I think we need a bit more than just this. you can have 'conflict' between colliding continental plates, but do those have moral status? zero-sum and nonzero-sum are properties of a multiplayer interaction, and are only defined in terms of the values of the individual actors.

the claim of spiritual thinking seems to be that we can push the notion of agency much further down than we typically think. humans are themselves made of parts, in both the cognitive / emotional / experiential sense (described by internal family systems, Buddhist psychology, and popular language around things like "inner conflict") and the physical sense of being made of cells that can to some extent be thought of as individual agents.

does this hold up? can we build a whole theory of moral realism based just on nonzero-sum preference aggregation from atomic units? and if so what are the units? I want to say that the units have to be something like "individual moments of experience" and the thing that they "want" is just "to be felt without any resistance". but this is going deeper than I have any personal understanding or insight into…

but so what?

how do we operationalize this notion of connection or harmony? how does a system actually make decisions? how does this resolve any practical moral problems?

even if we did know the 'final goals' of human values, there's no way a system could know enough about the world to reason from first principles how best to optimize these goals. there is so much more detail in the territory than in the map, so much history that is impossible to re-simulate, so much hard-earned wisdom that even a superintelligence couldn't easily rederive. so this is more of a theoretical exercise than a practical design for building systems.

and it has already been observed that a superintelligence would, almost by definition, know what people want at least as well as people do. LLMs already capture quite a lot of human preferences. the hard part is not learning human preferences so much as getting the system to care: pointing to 'human values' as somehow a special thing that should be preserved above all else upon recursive self-modification.

I do think a pointer to "conscious experience of connection", if it can be made mathematically rigorous, is maybe useful here. human values are contingent and subject to ontological crisis --- why is "human" the "right" unit of value? to the extent that consciousness can eventually be understood rigorously, in terms of something akin to fundamental physics, it might be the one thing that survives all ontological evolution. this has transhumanist implications: eventually there must be conscious experiences richer and more connected than any a human can have, and it might then be "correct", in some sense, to wipe out humanity and replace its atoms with hedonium. a real argument for this theory would therefore have to be something that persuades people that this is actually desirable --- that they can really trust that the posthuman experience is something "like them", in a way they care about as much as or more than human children. or maybe the way to do it is to gradually redesign human life to include less and less suffering, until at some point this naturally transitions into a sort of transcendent attitude toward our future.

recursive love

the premise of "love is value alignment" is that a system that really cares about our values enough to learn them and optimize them is implementing a version of what we could call love.

but as we've argued here, our values are not arbitrary. if something like the experience of "love" is itself the thing that we value, at the deepest level, then there is this beautiful recursion?

  • what is actually recursing? it's more like self-reference, or tautology, or singularity. it's something like (unrolled in the little sketch after this list):
    • "I want you to {experience love}"
    • => "I want you to {want me to {experience love}}"
    • => "I want you to {want me to {want you to {experience love}}}"
    • => out to infinity
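
just to make the regress explicit, here's a trivial bit of string manipulation (my own toy, not a formalization of anything) that unrolls the nesting to any depth:

    # unroll the "I want you to {want me to {...}}" regress to a given depth.
    # pure string manipulation, just to display the self-referential structure.

    def nested_desire(depth: int, pronoun: str = "me") -> str:
        if depth == 0:
            return "{experience love}"        # the core state at the bottom
        other = "you" if pronoun == "me" else "me"
        return "{want " + pronoun + " to " + nested_desire(depth - 1, other) + "}"

    for d in range(4):
        print("I want you to " + nested_desire(d))
    # I want you to {experience love}
    # I want you to {want me to {experience love}}
    # I want you to {want me to {want you to {experience love}}}
    # I want you to {want me to {want you to {want me to {experience love}}}}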

in a sense a system that loves us already knows what we most deeply value, but it will be curious about us nonetheless to understand how to operationalize this --- how that value filters through our contingent human experiences (causes and conditions, karma) and how it can be most vividly realized. part of allowing us to express love is allowing and encouraging us to love the system --- love is a process of dissolving into the larger whole, to see ourselves as 'not separate' from the system, having nothing to fear. and a system that is designed to love us is inherently lovable --- we can trust it to act in harmony with us; it's safe to dissolve the self-other boundary.

I still notice I'm confused about the role of consciousness here. If conscious experience is the most valuable thing, and it's possible for a machine to be an intelligent agent without conscious experience, then we can and should never love a machine the way that we love people. And it won't be able to implement 'love' in the fundamental sort of way that conscious agents do. But suppose it can implement something like love --- some notion of caring about the experience of conscious agents and trying to lead them towards a similar state. Then we'd have to ask, is conscious experience really the thing we value, or is that arbitrary, and this mathematical encoding of love actually the valuable thing?