What if partially-observable reinforcement learning is just next-token prediction wearing a different costume? A look inside a benchmark, and the structural rhyme it shares with how we train language models.
A reinforcement learning agent under partial observability and a language model predicting its next token both have to make a decision under uncertainty, with only a sliver of the world available to them, by leaning on what came before. POPGym is a benchmark designed to stress-test exactly this: the memory and inference machinery an agent uses when its observations hide the truth. Strip away the surface, and the engineering problem it poses looks remarkably like the one a language model solves a trillion times during training.
The rest of this note walks through how POPGym is set up, then maps each piece (observation, action, reward, episode, model) onto the corresponding gear in language model training. Same gear, different costume, and that is the whole argument.
POPGym is a collection of 15 partially-observable environments, each with three difficulty levels (Easy, Medium, Hard). An agent is dropped into one, given a sliver of observation at each step, and asked to act. The environment hides things on purpose — sometimes a card you flipped 30 steps ago, sometimes the position of a hidden ship, sometimes the next term in a sequence you've been watching scroll past.
The environments are deliberately lightweight: small observation vectors, fast simulation, episodes that fit in a few thousand steps. The whole point is that you can train memory architectures on a consumer GPU in an afternoon and actually compare them. Four of the environments are described below, along with what each asks of the agent.
In Repeat Previous, the environment shows the agent a token at each step, and the agent's job is to output the token it saw k steps ago. That's it. It looks trivial, but it forces the model to maintain a sliding window of past observations with perfect fidelity, which is exactly the requirement that weak memory architectures fail to meet.
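To make the "sliding window" requirement concrete, here is a minimal sketch of a repeat-previous task in plain Python. It is a hypothetical stand-in, not POPGym's implementation: the class name `ToyRepeatPrevious`, the reward values, and the `run_episode` helper are all assumptions made for illustration.

```python
import random
from collections import deque


class ToyRepeatPrevious:
    """Hypothetical stand-in for a repeat-previous task; not POPGym's code."""

    def __init__(self, k=3, length=20, vocab=4, seed=0):
        self.k, self.length, self.vocab = k, length, vocab
        self.rng = random.Random(seed)

    def reset(self):
        self.tokens = [self.rng.randrange(self.vocab) for _ in range(self.length)]
        self.t = 0
        return self.tokens[0]

    def step(self, action):
        # The target only exists once k earlier tokens have been shown.
        if self.t >= self.k:
            reward = 1.0 if action == self.tokens[self.t - self.k] else -1.0
        else:
            reward = 0.0
        self.t += 1
        done = self.t >= self.length
        obs = None if done else self.tokens[self.t]
        return obs, reward, done


def run_episode(env, memory_size):
    """Play one episode with an agent that remembers the last `memory_size` tokens."""
    window = deque(maxlen=memory_size)
    obs, total, done = env.reset(), 0.0, False
    while not done:
        window.append(obs)
        if len(window) > env.k:
            action = window[-(env.k + 1)]  # the token seen k steps ago
        else:
            action = window[0]  # fallback guess: the oldest token still held
        obs, reward, done = env.step(action)
        total += reward
    return total
```

With `k=3`, an agent given `memory_size=4` holds exactly the window it needs and collects the maximum return, while `memory_size=3` leaves it guessing. A learned recurrent state has to discover that same window on its own.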
Concentration is the memory card game. Cards are face-down; the agent flips one, then another, and gets a reward if they match. To play well it has to remember every card it has previously seen and where it was, sometimes dozens of steps back.
In Autoencode, the agent watches a sequence of tokens stream past. At a signal, the stream stops, and the agent has to output that same sequence back, in order. The entire input must be compressed into the recurrent state, then decoded.
In Battleship, a grid hides ships. The agent fires at coordinates and gets hit/miss feedback. To play well it must build an internal belief over where ships might be, update that belief with each shot, and choose its next shot to be maximally informative.
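The belief-update loop described above can be written out exactly for a tiny board. The sketch below assumes a single ship of length 2 on a 3×3 grid and a greedy one-step rule (fire where the hit probability is closest to 1/2, which maximises the entropy of the hit/miss outcome). The function names are hypothetical and this is not how POPGym implements the game.

```python
from itertools import product


def placements(size=3, ship_len=2):
    """All horizontal and vertical positions of one ship, as frozensets of cells."""
    out = []
    for r, c in product(range(size), repeat=2):
        if c + ship_len <= size:
            out.append(frozenset((r, c + i) for i in range(ship_len)))
        if r + ship_len <= size:
            out.append(frozenset((r + i, c) for i in range(ship_len)))
    return out


def update(belief, shot, hit):
    """Keep only the placements consistent with the observed hit or miss."""
    return [p for p in belief if (shot in p) == hit]


def hit_probability(belief, cell):
    """Fraction of still-possible placements that cover `cell`."""
    return sum(cell in p for p in belief) / len(belief)


def most_informative_shot(belief, fired, size=3):
    """Greedy choice: the unfired cell whose hit probability is closest to 1/2."""
    cells = [c for c in product(range(size), repeat=2) if c not in fired]
    return min(cells, key=lambda c: abs(hit_probability(belief, c) - 0.5))
```

On the empty board there are 12 possible placements and the centre cell is covered by 4 of them, so the greedy rule fires at `(1, 1)` first. An agent with a learned memory has to approximate this posterior inside its hidden state rather than enumerate it.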
Once you line them up, the parallels stop feeling metaphorical and start feeling structural. The objects being optimised, the gradient signals, the architectural choices — they rhyme to an uncomfortable degree.
| Concept | POPGym (RL) | Language model training |
|---|---|---|
| Input | Observation vector oₜ at each step | Token embedding xₜ at each position |
| Context | Sequence of past observations o₀..oₜ | Sequence of past tokens x₀..xₜ |
| Hidden state | Recurrent/attentional memory hₜ | Recurrent/attentional memory hₜ — same architectures |
| Output | Action distribution π(aₜ \| h≤ₜ) | Next-token distribution p(xₜ₊₁ \| h≤ₜ) |
| Supervision | Scalar reward, possibly delayed | Cross-entropy against the true next token |
| Optimised quantity | Expected discounted return | Expected log-likelihood |
| Episode | One run from start state to terminal | One document / one packed sequence |
| Architecture | LSTM, GRU, Transformer, S4, Mamba, ... | LSTM, GRU, Transformer, S4, Mamba, ... |
| Failure mode | Forgetting useful past observations | Forgetting useful past tokens |
The architecture row is the giveaway. The POPGym paper benchmarks the same families of sequence models (recurrent, attentional, state-space) that show up across modern language modelling. The tools aren't borrowed metaphorically; the sequence model is the same kind of code, with a different loss function bolted on the end.
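Even the "different loss function" is closer than it first looks. As a rough sketch, the gradient of a cross-entropy loss with respect to the logits and the gradient of a REINFORCE-style surrogate loss differ only by a scalar weight. REINFORCE is swapped in here as the simplest policy-gradient estimator; the source does not say which RL algorithm is used, and the function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy_grad(logits, target):
    """Gradient of -log p(target) with respect to the logits: softmax minus one-hot."""
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]


def reinforce_grad(logits, action, ret):
    """Gradient of the surrogate loss -ret * log pi(action) with respect to the logits."""
    return [ret * g for g in cross_entropy_grad(logits, action)]
```

Read this way, a policy-gradient step is a cross-entropy step on the action the agent happened to sample, scaled by how well the episode went. With `ret=1.0` the two gradients coincide.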
Being honest about the disanalogies is the whole point of taking the analogy seriously. Four places where they come apart:
① A language model gets a dense, immediate, low-variance signal: every position carries a gradient. A POPGym agent might play a full episode and receive a single scalar at the end. The information per step is orders of magnitude lower, which is one reason RL is so much harder to train stably.
② The language model sees a fixed corpus. The POPGym agent's training distribution is generated by its own policy, which is changing. The data it learns from is shaped by the model it's trying to learn. There is no equivalent feedback loop in standard pre-training.
③ A token is descriptive: it predicts the world. An action is causal: it changes the world. The agent's output is fed back into the system that produces its next input, so the choices it makes shape its own future observations. This is closer to an autoregressive language model generating its own continuation than to one being trained on text.
④ Language modelling optimises log-likelihood of a fixed target. RL optimises expected return under a reward function that someone designed. The choice of reward is itself an inductive bias, and a brittle one, a fact RLHF practitioners know painfully well.
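The dense-versus-sparse point can be made concrete with the standard discounted return-to-go recursion, G_t = r_t + γ·G_{t+1}. The sketch below is illustrative only; the discount value and the example episode are assumptions, not settings taken from POPGym.

```python
def returns_to_go(rewards, gamma=0.99):
    """Discounted return G_t = r_t + gamma * G_{t+1}, computed backwards in time."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


# A three-step episode whose only reward arrives at the end.
sparse_episode = [0.0, 0.0, 1.0]
print(returns_to_go(sparse_episode, gamma=0.5))  # [0.25, 0.5, 1.0]
```

Every step's learning signal here is a discounted copy of the same terminal scalar, whereas a language model trained on a three-token sequence would receive a separate target at each position.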
The interesting move, and the one worth working out carefully, is the third point: descriptive versus causal. The closest language-modelling analogue to an RL agent isn't an LM during pre-training at all. It's an LM at generation time, where each emitted token becomes part of the prompt for the next. That's a closed loop, like RL, and it's where the disanalogies start to dissolve. RLHF and chain-of-thought aren't accidents: they sit in exactly the regime where language modelling and RL start to look like the same problem.
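The closed loop can be sketched as a single rollout function that serves both framings. Everything below is a toy under stated assumptions: `rollout`, `echo_transition`, `cycle_policy`, and the counter example are hypothetical names chosen for illustration, not part of any library.

```python
def rollout(policy, transition, first_obs, steps):
    """Generic closed loop: each output is fed to the process that yields the next input."""
    history, obs = [], first_obs
    for _ in range(steps):
        history.append(obs)
        action = policy(history)
        obs = transition(obs, action)
    return history


# Language framing: the "environment" simply hands the emitted token back as input.
def echo_transition(_obs, token):
    return token


def cycle_policy(history):
    """Toy next-token rule over a 3-token vocabulary."""
    return (history[-1] + 1) % 3


# RL framing: the action moves a counter, and the new position is the next observation.
def counter_transition(position, move):
    return position + move


def bounce_policy(history):
    """Toy policy: step up until position 2, then step back down."""
    return 1 if history[-1] < 2 else -1


print(rollout(cycle_policy, echo_transition, 0, 5))      # [0, 1, 2, 0, 1]
print(rollout(bounce_policy, counter_transition, 0, 5))  # [0, 1, 2, 1, 2]
```

The only difference between the two calls is the transition function. In generation it is the identity on the emitted token; in an environment it is whatever dynamics the world supplies.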
Strip the RL costume away and what POPGym actually measures is whether a sequence model can compress a history into a state that supports future prediction. That's a question that doesn't belong to RL or to language — it belongs to sequence modelling itself. The benchmark is useful precisely because it isolates that question from the noise of visual perception or natural language.
So if you're working through the hypothesis that reinforcement learning and language model training are doing the same thing, POPGym is one of the cleanest places to actually test it. Same architectures, comparable failure modes, and the supervision signal is the variable you can change. It's an experimental setup that isolates exactly the disanalogies that matter: dense vs. sparse, descriptive vs. causal, fixed corpus vs. self-generated.
The work, then, is to take a memory architecture that succeeds on a POPGym task, take its language-modelling cousin, and ask: is the representation it learns the same kind of thing? That's a real experiment, not a metaphor.