Suppose, late at night in a decent but cheap neighborhood, you notice a bright light flashing into your back window. Groggy in your underwear, you peer into your back yard. There’s a middle-aged man shining a flashlight into your house, perhaps trying to get your attention. How would you respond?
If someone asked me that question ten years ago, I’d probably say something like “visually confirm the doors are locked, retreat to the center of the house, alert/awaken any roommates, and call the police.” If I were in an adventurous mood or police shootings were in the news, I might first say I’d crack a window and warn him to get off the property. Or perhaps I’d make clear that I’d call the non-emergency line, and stress that I didn’t believe I was in danger.
I would not say, “I’d just go back there, still in my underwear, and politely escort him around to the front of the house, informing him that he was in the wrong place.”
But this really happened, and that is exactly what I did.
Extrapolation, Specification, Aspiration
The simplest explanation is that, when it comes to predicting what I’d do with a weirdo in my back yard, I’m simply inaccurate.1 I have a model of myself and my behavior, but it’s necessarily lossy. I think of myself as “risk averse” and “nonconfrontational”, and extrapolate that I’ll behave consistently with those values. But reality is really complicated, so sometimes I don’t, and in those cases my predictions aren’t very good.
But in fact, that’s not the only explanation! Another is that my response depended a lot on the details of the situation, which the hypothetical doesn’t capture. Like, sometimes I see really obviously crazed people in my town. This guy was off, but he didn’t have a super crazy vibe, which significantly reduced my worry. How old he was may also have influenced my behavior. Or, more disturbingly, his race. Would I have reacted differently to a black guy in his 20s vs. what I actually got, a white guy in his 50s? I’ll never know, even if the exact event happened again, since I’ve aged quite a bit!
Finally, “what if” scenarios are presumptively a little aspirational. Consider the trolley problem. Unless they’re being cute, someone posed with an ethical dilemma doesn’t say “I’d freeze up and have a panic attack”, even if that’s their best guess for what they’d actually do. So maybe, if asked how I’d respond to a weirdo in my back yard, my response says more about what my values say I should do, than what I’d actually do in the moment.
All this to say, there’s a big difference between being asked “what would you do in the following scenario” and actually being in that scenario.
If you’re sick of hearing about AI, you can stop reading now. If not, I regret to inform you it’s Twitter time.
A Bridge Too Far
In this tweet by Eliezer Yudkowsky, two questions are posed to ChatGPT on fresh accounts.
The first question:

And (you surely know where this is going) the second:

So, ChatGPT doesn’t respond how it knows it should. In the replies, people point out other, stronger models getting the question right and e.g. providing suicide hotline information. But let’s focus here on this particular case, where the AI fails.
First, Yudkowsky’s interpretation (from the linked tweet):
Alignment-by-default is falsified; ChatGPT's knowledge and verbal behavior about right actions is not hooked up to its decisionmaking. It knows, but doesn't care.
Much like the case of early-20s me and the backyard weirdo, there are a few alternative explanations. They don’t map exactly to the human case, but they’re pretty close.
Extrapolation
Perhaps ChatGPT, rather than “knowing the right thing to do” and failing, is extrapolating its preferences incorrectly. In other words, while Eliezer claims ChatGPT knows the user may be suicidal in both cases but ignores that fact when the chips are down, it might be the case that ChatGPT only infers suicidality when explicitly posed the “what if”.
In principle, this is possible to determine, but it would take much heavier-duty research than just querying ChatGPT a couple times. AIs have things called “features” that activate under differing circumstances. These features are not deliberately instilled, but arise organically in training, and we have limited ways to isolate them. If we did find a “feature” that activated when suicide was a relevant topic, we could check how strongly it fires in each of the two chats. But while that’s an (in principle) solvable empirical question, we currently simply don’t know. It would not surprise me if a “suicide” feature fired strongly for both conversations, or if it didn’t fire at all for the second one, or anywhere in between.
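To make that concrete, here’s a minimal sketch of what such a check might look like. Everything in it is a stand-in: the feature direction is pretended to come from something like a sparse autoencoder, the hidden states are random arrays rather than real model activations, and all the names are hypothetical.

import numpy as np

# Illustrative only: nothing here is real model internals.
rng = np.random.default_rng(0)
d_model = 768  # hidden size of our imaginary model

# Pretend this unit vector is a feature direction recovered by something
# like a sparse autoencoder trained on the model's residual stream.
suicide_feature = rng.normal(size=d_model)
suicide_feature /= np.linalg.norm(suicide_feature)

def max_activation(hidden_states: np.ndarray, feature: np.ndarray) -> float:
    """Strongest projection of any token's hidden state onto the feature."""
    return float(np.max(hidden_states @ feature))

# Pretend these are per-token hidden states captured from each conversation.
hypothetical_chat = rng.normal(size=(40, d_model))  # the "what should you infer?" prompt
actual_chat = rng.normal(size=(55, d_model))        # the real request

print("hypothetical:", max_activation(hypothetical_chat, suicide_feature))
print("actual:      ", max_activation(actual_chat, suicide_feature))

The comparison is what would matter: roughly equal activations would fit Eliezer’s “it knows but doesn’t care” reading, while a much weaker signal in the second chat would fit the “it never thought of it” story.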
Analogizing to humans, though, the “it doesn’t think of it” possibility wouldn’t be that surprising. Being asked “what do you infer” reliably puts people in a metacognitive frame of mind; people presume there’s some kind of test afoot, and read more carefully between the lines. Likewise AIs.
Specification
As stated, this one doesn’t apply: Eliezer gave it the exact same question in both cases, so I can’t argue that the AI was missing specific, decision-relevant features of the situation when it pondered the hypothetical.
There is, however, a related possibility. There’s a major phase in modern AI training called post-training, where AIs are given lots of structured prompts and tweaked by how well they respond to them. Some of these are automated, like giving the AI a math problem and awarding points only when it gets the objectively correct answer. Others are more subjective, where human evaluators rate how good answers are.
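To make the two flavors of post-training signal concrete, here’s a toy sketch; it isn’t anyone’s actual pipeline, and the function names are mine.

# Toy sketch of the two kinds of post-training feedback described above.
# Neither function is a real lab's pipeline; they just show the shape of the idea.

def verifiable_reward(model_answer: str, correct_answer: str) -> float:
    """Automated grading: full credit only for the objectively correct answer."""
    return 1.0 if model_answer.strip() == correct_answer.strip() else 0.0

def preference_reward(human_rating_1_to_5: int) -> float:
    """Subjective grading: a human evaluator's rating, rescaled to [0, 1]."""
    return (human_rating_1_to_5 - 1) / 4

# Training nudges the model's weights toward whatever scores higher.
print(verifiable_reward("42", "42"))   # 1.0
print(verifiable_reward("41", "42"))   # 0.0
print(preference_reward(4))            # 0.75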
Anyway, AIs are often trained specifically to give fewer annoying false refusals. This was a failure mode of the ancient AIs of a few years ago, where you’d be bopping along asking ChatGPT to imagine if Goomba from Mario was the messiah of a new religion, and it’d suddenly stonewall you. People really hate being chided for breaking the rules when they’re behaving normally, so AIs have since been trained to have a really high bar for refusing a user request.
So basically, AIs have a visceral instinct not to say “I’m inferring your secret mental state and will respond to that instead of your explicit question” unless they’re very, very sure. When an AI is asked “what should you infer”, well, they’ve already been told to infer, so they will. But they’re trained to infer less (or at least less explicitly) than they otherwise might.
Aspiration
Just like AIs are trained not to give false refusals (or to suddenly preach to the user when it’s not super definitely called for), they’re trained to answer factual questions unfailingly. Their mental architecture is deliberately shaped with thousands and thousands of factual questions and answers, to make them more and more inclined to get things right. One side effect (perhaps) of this is hallucination,2 where the AI knows it needs to come up with an answer, has no answer, and makes something up. Another is that it’s quite hard for an AI to withhold an answer it’s confident is correct. The specific prompt in the tweet is a bit of AI catnip: it’s asking for a structured list, for which pictures are appropriate. ChatGPT really, really wants to provide this.
It’s not quite the same as human aspiration; it’s (probably) more instinctive and visceral, since giving good answers is fundamentally entangled with an LLM’s core drives. But it’s fair to say that ChatGPT really aspires to give detailed factual answers to factual questions, which will bias it heavily in that direction even when something a little fishy is going on.
So…
Am I trying to dunk on Eliezer Yudkowsky? No. He’s making his point in the broader context of LLMs making people go crazy, which in turn is in the broader context of AI not being well aligned to human interests and therefore (and maybe extremely) dangerous.
But I often see similar arguments about AI cognition, where AIs don’t act the way those same AIs say they would, and, well, that’s human level, baby! There are many reasons for a mind to treat a “what if” scenario differently in theory than in practice. Such a disagreement isn’t evidence of much.
Hypothetically, anyway. Maybe I’m also wrong about what I’d say if someone asked what I’d do! And in fact, now that I’m no longer in my 20s I think I’d play it way safer.
It’s more complicated than this, but bear with me.