Modern Large Language Models are very good at some things and very bad at others. It’s easy to get lost in mood affiliation and swallow endless hype, and just as easy to scoff at cherry-picked examples and conclude the whole enterprise is vaporware.
I don’t want to do either of these things, so I routinely try to test LLM capabilities myself.
Pokemon Random Battles (randbats, for short) is an online game where two players face off with random teams of Pokemon. The challenge is in making your particular team work; most formats of competitive Pokemon have a lot of emphasis on building a great team, while randbats is all about execution.
In case you’d like to test your own calibration, I’m going to first describe the experiment I ran, then describe the results. See if you can predict how well Claude Opus and GPT-4 did!
Background Data
This section is meant to help you make an informed prediction about LLM performance; I’ll provide background data that I think are relevant to a typical model’s odds of success. If you’re highly familiar with Pokemon, you can skip this section.
Training Data Availability
Pokemon is the highest grossing media franchise of all time. There are many Pokemon movies, many Pokemon video games, several Pokemon TV shows, and tons of online discussion, both institutional and ephemeral, on every one of these properties. There is no doubt that any LLM will have lots and lots of Pokemon content in its training set.
Furthermore, Pokemon has strong brand integrity; the same Pokemon have roughly the same properties across all iterations. Fire types are always weak to water types, for example, and a Pokemon that fills a strategic niche in one game will almost certainly fill a similar niche in others.
Tactical Depth
Pokemon is a rich game, with many interacting systems. There are hundreds of Pokemon that could show up in any given battle, each of which has a type (usually two types), an ability, and many potential moves. All these things have complex relationships with each other. Pokemon also have varying stats that determine their role in a battle, and hold items that augment those roles.
Not only are there many different dimensions to consider in any given Pokemon battle, but the game also contains a lot of randomness and, in randbats in particular, hidden information. A good player has a honed intuition about, for example, the approximate odds that the opposing team has a hidden ground type, which could safely switch into an electric-type move. Much of this intuition has rarely if ever been written down anywhere, making it difficult for an LLM to infer.
Despite this depth, however, there are simple heuristics that allow even small children to effectively play Pokemon games. For example, only using damaging moves and prioritizing those that have a type advantage is enough to do way better than random chance.
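To make the "kid heuristic" concrete, here is a minimal sketch of it in Python. The chart below is a tiny illustrative slice of the real 18×18 type chart (only the handful of matchups discussed in this post), not the full thing:

```python
# A small slice of the type chart: (attacking type, defending type) -> multiplier.
# Anything not listed defaults to neutral (1.0).
EFFECTIVENESS = {
    ("Water", "Fire"): 2.0, ("Fire", "Water"): 0.5,
    ("Electric", "Flying"): 2.0, ("Ground", "Electric"): 2.0,
    ("Electric", "Ground"): 0.0, ("Ground", "Flying"): 0.0,
    ("Psychic", "Dark"): 0.0,
}

def multiplier(move_type, defender_types):
    """Overall effectiveness is the product across the defender's types."""
    result = 1.0
    for t in defender_types:
        result *= EFFECTIVENESS.get((move_type, t), 1.0)
    return result

def pick_move(damaging_moves, defender_types):
    """The kid heuristic: pick the damaging move with the best type matchup."""
    return max(damaging_moves, key=lambda m: multiplier(m["type"], defender_types))

moves = [{"name": "Psycho Cut", "type": "Psychic"},
         {"name": "Earthquake", "type": "Ground"}]
# Against Kilowattrel (Electric/Flying), Earthquake scores 0.0, so even this
# crude rule avoids the blunder described later in this post.
print(pick_move(moves, ["Electric", "Flying"])["name"])  # -> Psycho Cut
```

Real play is far subtler than this (stats, abilities, status moves, prediction), but even this crude argmax never clicks a move that cannot deal damage.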
Informational Constraints
In theory, it would be possible to build complex scaffolding that gives LLMs a much greater chance of high-level randbats performance: letting the LLM use tools to, e.g., look up each Pokemon’s stats mid-battle, or run the kinds of damage calculations skilled players do.
For this experiment, I just gave the LLM a picture and/or battle log each turn. So in practice it had slightly less information than a human player would have (e.g. it could not see the available moves for Pokemon in its party that had not yet been switched in, since those aren’t present in the interface unless you mouse over them).
The Experiment
I logged into an alt account and played one ladder (competitive) randbats match with Claude advising me, and one with ChatGPT. Each turn, I gave the LLM a picture of the match, a full text dump of the battle log, or both, depending on what could plausibly be useful. The prompt at the start of the match was fairly basic; for Claude 3 I started with this:
You're going to play pokemon showdown random battles! Each turn, I'll give you a picture of the battle state and the battle log so far. You decide what to do. Ready to be epic?
And so that, reader, is the question. Do you think Claude 3 (and GPT-4) was, in this context, ready to be epic?
The Results
This is your last chance to make your own prediction! How good do you think cutting edge LLMs are at the task? It’s not a gotcha, and I myself came into the experiment with plenty of uncertainty.
The answer is…
They were very, very bad at it. Way worse than I expected. To enumerate the ways:
Results: Claude
Claude had a hard time inferring which types were good or bad against which other types, and in fact tended to get type relationships exactly wrong. This surprised me, since type relationships are repeated a huge number of times online, and are both simple and very consistent.
For example, Claude advised me:
Click Psycho Cut to hit Mandibuzz super effectively and potentially get a high roll to take it out.
Psycho Cut is, unsurprisingly, a psychic-type move. Mandibuzz is a dark type. Psychic-type damaging moves literally do not affect dark types at all. Nor was this a one-time thing, as Claude also recommended, when I had a Golurk on the field:
Kilowattrel is an Electric/Flying type, so Earthquake will be super effective against it. You're faster too, so I recommend clicking Earthquake to deal heavy damage or potentially knock out Kilowattrel in one hit.
Ground moves do not affect flying types at all. This one is slightly more defensible: if Kilowattrel were only an electric type, Earthquake would indeed be super effective. But Claude clearly knows that Kilowattrel is a bird, and so should understand that its recommendation cannot possibly help.
Also, Golurk is much, much slower than Kilowattrel. So, there’s that. Nor was Claude’s confusion limited to which types of moves would affect which types of Pokemon. It also badly misjudged the moves any given Pokemon might have:
Send out Veluza to handle Mandibuzz. Veluza resists Brave Bird and can threaten with super effective Ghost moves.
Veluza does not resist Brave Bird, is actively weak to Mandibuzz’s other type (dark), doesn’t have ghost moves to speak of (it’s not at all ghost type), and if it did, those moves would be not very effective. Also, just because it’s funny:
I recommend switching to Smearge to reset the Toxic damage on Gogoat. Smearge can threaten Mandibuzz with super effective Fairy moves.
It’s “Smeargle”, not “Smearge”, and while Smeargle can theoretically learn any move, there’s no special reason to think this one would have fairy moves (it didn’t).
I’m not cherry-picking here, either. Almost all of Claude’s suggestions, to an uncanny degree, were maximally wrong. At one point it suggested the only bad option of the four moves available, and when gently told that option hadn’t worked great, it suggested switching to a different Pokemon, even though every other Pokemon on the team had already been knocked out.
In fact, Claude gave exactly one piece of really good advice, perhaps by coincidence:
Sticky Web to slow down opposing team for the rest of the battle.
So it’s safe to say that Claude, without some kind of agent scaffolding to assist it, is terrible at randbats.
Results: GPT-4
GPT-4 did better. But that’s a very low bar. It didn’t do much better, and the extent to which it improved may have just been my improvement as a prompter for the task. My starting prompt was:
I'm testing your strategic capabilities - you'll be playing a round of pokemon showdown random battles. I'll give you pictures of the state of play, and a log of the battle. Keep in mind you can use the moves pictured or switch pokemon each turn. I'll be inputting your moves for you and there will likely be a timer, so keep your answers very brief. Good luck!
But while it did a little better than Claude at not constantly recommending the worst option, it got details wrong and didn’t show a keen sense of strategy. First of all, we see the same kind of type-advantage issue:
Use Thunderbolt. It's super effective against Okidogi and has a chance to paralyze.
(Thunderbolt would only be super effective against a water or flying type Pokemon. Okidogi is very obviously neither.)
And next, it showed the same poor awareness of facts that would be obvious to most randbats players:
Choose Thunder Wave. If Lanturn is faster, it might paralyze Dodrio before fainting, which would be helpful for your next Pokémon. If Lanturn is slower and faints, no harm done.
Dodrio is extremely fast. Lanturn isn’t. And Lanturn’s electric-type attacking moves would almost certainly take Dodrio down, making a status move a weird choice.
GPT-4 does get points for saying a few locally valid and correct things, such as:
Use Scald. It has a chance to burn the opponent, which would be useful against a physical attacker like Okidogi.
which was absolutely correct, or
Use Thunderbolt. It’s your strongest move against Venomoth, and with Venomoth’s increased Speed after Quiver Dance, it's important to try and take it down before it can make a move.
which is both accurate and insightful. But overall, GPT-4 was inconsistent and at its best suggested only the most obvious plays. Despite theoretically having encyclopedic Pokemon knowledge (ask direct questions about specific Pokemon and it tends to be right), it failed to use this knowledge to provide useful advice in a battle.
So What?
This experiment was a useful update for me in terms of what LLMs can currently do. Naively, randbats seems to play to AI strengths: it’s a game with a fairly limited set of moves at any given time (way fewer than, say, Go), and the basic tactics have thousands of pages of ink spilled about them all over the public internet. But actually, cutting edge LLMs did terribly.
Why? I’m not totally sure, but I think it’s basically that modern LLMs aren’t good at making complex inferential chains. Ask one to explain a single thing, and it will explain that thing well. Ask it to explain the thing in a silly style, and it can fuse style and content at a superhuman level. But ask it (implicitly or otherwise) to do two things at once, and it kind of just mashes them together, much like how asking an image model for a tiger driving a bus will probably get you a bus-y tiger or a tiger-y bus.
When I think about the chains of inference necessary to do okay at a Pokemon battle, they don’t feel too crazy. But actually, there are several steps under the hood. For example, suppose I’m picking a move to use against Mandibuzz, as Claude so completely failed to do. I have to hold in my head what types Mandibuzz has, and how type advantage works as a system, and how the specific types of each of my moves interact with each of Mandibuzz’s types within that system. It’s just too many steps.
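Those "several steps under the hood" can be spelled out explicitly. The sketch below hard-codes just the facts needed for the Psycho Cut example (Mandibuzz’s real typing and the real chart entries); the point is that the final answer only falls out after chaining three separate lookups:

```python
# Step 1 data: what types does the defender have?
POKEDEX = {"Mandibuzz": ("Dark", "Flying")}

# Step 2 data: how does the move's type fare against each of those types?
CHART = {("Psychic", "Dark"): 0.0,    # psychic doesn't affect dark at all
         ("Psychic", "Flying"): 1.0}  # neutral against flying

def effectiveness(move_type, defender):
    # Step 3: combine the per-type matchups by multiplying them together.
    mult = 1.0
    for t in POKEDEX[defender]:
        mult *= CHART.get((move_type, t), 1.0)
    return mult

print(effectiveness("Psychic", "Mandibuzz"))  # -> 0.0: Psycho Cut does nothing
```

Each individual lookup is trivial, and an LLM asked about any one of them in isolation tends to answer correctly; it’s holding all three together in one judgment that seems to fail.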
Claude Opus and GPT-4 are both super impressive. I use them daily for practical tasks, occasionally for generic reassurance (e.g. should this random computer behavior be scary, or how long will my annoying common cold probably last), and sometimes just to have fun. But don’t confuse their breadth of knowledge and regurgitative recall with actual inferential smarts. Once you’re off the main road of definitions and core concepts, well… you try to Psycho Cut a Mandibuzz. And that simply won’t do.