How Can We Possibly Align AI When We Aren't Even Aligned with Each Other?

I’m genuinely confused…

Mar 13, 2023

Apropos of Bing’s creepy manipulations and romantic overtures, Scott Alexander writes:

[T]he people who want less racist AI now, and the people who want to not be killed by murderbots in twenty years, need to get on the same side right away. The problem isn’t that we have so many great AI alignment solutions that we should squabble over who gets to implement theirs first. The problem is that the world’s leading AI companies do not know how to control their AIs.

(For the uninitiated: AI alignment is the effort to create artificial intelligence that respects—or, less anthropomorphically, is aligned with—human interests and values. A misaligned AI could be evil, in the cartoonish sci-fi sense; but it could also simply be misapplying reasonable-sounding rules. Nick Bostrom famously illustrated this idea with a hypothetical paperclip maximizer that, in an effort to convert all available matter into paperclips, destroys human civilization.)

So here’s my question: What are human interests and values? Seriously! Like, is there anything close to a consensus with which AI can be aligned? Leaving aside the technical challenges (assuming that we could get AI to do exactly what we want), it’s not at all clear what goals “we” ought to have it adopt. We can’t even seem to agree on all our elections or national borders or whether a fetus is a person!

Admittedly, in my original vision for this post, I had in mind a hot take disguised as a rhetorical question. But my basic point—about the suspect premise on which the alignment problem is built—quickly devolved into a series of earnest wonderings and speculation, like…

Why, if AI alignment is such a big deal, is non-alignment among humans not a cause for greater concern?
(Alternatively: Maybe we’re all obsessed with human non-alignment, and we just learn to live with the infighting because there’s nothing we can do about it?)
Or maybe it’s that non-aligned humans are less scary because we understand them on a gut level—whereas AI is an alien entity that we fear might actually possess the preternatural combination of power and depravity required to wipe our species out of existence?
Is “not wiping our species out of existence” (i.e., the interest/value we can all agree on) what everyone is secretly thinking when they say “AI alignment”?
Are the subjective things we can’t all agree on—faith, culture, politics, etc.—necessarily also issues that pose less of a threat to humanity?
Why does Bing use so many emojis?

I’m sure that smart and thoughtful people have devoted considerable attention to these questions, and I’d love to hear what they have to say. But common sense alone reveals that AI alignment is more than a technical problem; it’s a psychological, philosophical, and political one.

And do problems of that variety that ever really get solved once and for all? It seems to me far more likely that the “solution” to the alignment problem—like those for mental health, relationships, or democratic institutions—is in fact just continuous monitoring and maintenance; human interests and values are not static. We already see evidence, for example, that simply giving ChatGPT a list of rules is unlikely to prove effective:

I assume that the rules which led ChatGPT to produce this response sound fairly reasonable on paper—but of course this answer is a silly overgeneralization of those principles.

Yet getting ChatGPT to actually understand nuance and context seems like a tall order. Or is what I, a human, experience as “nuance and context” really just the result of years and years of micro-corrections, applications, and exceptions to general rules—an ongoing negotiation between principles like “truth” and “freedom” and “don’t hurt people”? (Again, these are genuine questions!)

In a recently republished essay on morality, Joyce Carol Oates describes her repeated failure as a child to understand why a rooster continued to attack her—no matter how kindly she treated it:

As a little girl I ran crying to my grandmother to ask: Why does Mr. Rooster hate me?—and my grandmother told me in her heavily accented English, Don’t be silly. The rooster doesn’t hate you, he is just a rooster. Always you should know, that is what roosters do. My grandmother lacked the knowledge to have explained the rooster is hardwired to blindly attack, it is his instinct. She could not have said To allow us to behave other than instinctively, like a brute creature—that is the purpose of education, of morality, of civilization.

I wish I knew whether AI were more like a rooster or a person with respect to alignability. I worry that hard-coding instincts into AI is a fool’s errand—a distraction from what we should really be trying to do: Make it understand our ethics, in all their subtlety and contradictions.

And I worry AI is incapable of such understanding; that we are impossible to understand.

Thanks to Charlie for looking at a draft of this post.

The Elbow

Discussion about this post