Vol. I · Note No. 003 A working note · An interactive reading Reinforcement & Language

POPGym, considered as a language model.

What if partially-observable reinforcement learning is just next-token prediction wearing a different costume? A look inside a benchmark — and the structural rhyme it shares with how we train language.

§ 01 — Premise

An agent that cannot see needs the same thing a model that has not read ahead needs.

Both have to make a decision under uncertainty, with only a sliver of the world available to them, by leaning on what came before. POPGym is a benchmark designed to stress-test exactly this — the memory and inference machinery an agent uses when its observations hide the truth. Strip away the surface, and the engineering problem it poses looks remarkably like the one a language model solves a trillion times during training.

The rest of this note walks through how POPGym is set up, then maps each piece (observation, action, reward, episode, model) onto the corresponding gear in language model training.

§ 02 — The core mapping

Six gears, two costumes.

Each card holds one piece of the machine, named once in its RL framing and once in its language framing. Same gear, different costume, and that is the whole argument.

§ 03 — Anatomy of POPGym

How a single POPGym episode actually runs.

POPGym is a collection of 15 partially-observable environments, each with three difficulty levels (Easy, Medium, Hard). An agent is dropped into one, given a sliver of observation at each step, and asked to act. The environment hides things on purpose — sometimes a card you flipped 30 steps ago, sometimes the position of a hidden ship, sometimes the next term in a sequence you've been watching scroll past.

[Figure]
ENVIRONMENT: hidden state s₀, s₁, s₂, ..., sₜ (you don't get to see this)
AGENT (with memory): policy π(aₜ | o≤ₜ), conditioned on history (this is the part that learns)
Environment to agent: observation oₜ · reward rₜ
Agent to environment: action aₜ
Repeat for an entire episode. The agent's only window onto the world is the sequence o₀, o₁, ..., oₜ. At episode end, the reward signal is used to update the policy.
The POPGym loop. The hidden state on the left is the whole point: the agent must infer what it cannot see, from the trail of observations it has seen so far.
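
The loop above can be sketched in a few lines of Python. This is a toy under stated assumptions, not POPGym's API: `HiddenBitEnv`, `run_episode`, and the two policies are hypothetical names invented for illustration. The environment reveals a secret bit only at step 0 and rewards the agent only if its final action reproduces it, so the history is the only place the answer can live.

```python
import random


class HiddenBitEnv:
    """Toy partially observable task (hypothetical, not a POPGym environment).

    A secret bit is shown at step 0 only; every later observation is 0.
    The agent is rewarded only if its final action equals the secret bit.
    """

    def __init__(self, length=5, seed=0):
        self.length = length
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        self.secret = self.rng.randint(0, 1)  # hidden state
        return self.secret                    # o_0 reveals it exactly once

    def step(self, action):
        self.t += 1
        done = self.t == self.length
        reward = float(done and action == self.secret)
        return 0, reward, done                # later observations are blank


def run_episode(env, policy):
    """Roll out pi(a_t | o_<=t): the policy only ever sees the history."""
    history = [env.reset()]
    total, done = 0.0, False
    while not done:
        action = policy(history)
        obs, reward, done = env.step(action)
        history.append(obs)
        total += reward
    return total


def remember_first(history):
    return history[0]    # uses memory of o_0


def memoryless(history):
    return history[-1]   # looks only at the latest observation
```

`remember_first` collects the reward in every episode. `memoryless` is right only when the secret happens to be 0, because the blank observation it sees at the final step carries no information.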

The environments are deliberately lightweight: small observation vectors, fast simulation, episodes that fit in a few thousand steps. The aim is that you can train memory architectures on a consumer GPU in an afternoon and actually compare them. The tour below takes four environments in turn: what each asks of the agent, and what that task looks like through a language-modeling lens.

§ 04 — A guided tour of four environments

Four tasks. Four familiar training signals.

Repeat Previous

k-step delay · sequence recall

The environment shows the agent a token at each step. The agent's job is to output the token it saw k steps ago. That's it. It looks trivial, but it forces the model to maintain a sliding window of past observations with perfect fidelity, which is exactly where weak memory architectures break.

In language terms: This is a copying task. It's the same thing a transformer does when it has to attend back to an earlier token in the context window and reproduce it verbatim. Same mechanism (induction head, retrieval over a key-value cache), just dressed as an RL reward instead of a cross-entropy loss.
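
A minimal sketch of the task, assuming tokens drawn uniformly from a small vocabulary and a score that counts correct recalls. The names `make_buffer_policy` and `repeat_previous_score` are hypothetical, and neither the token set nor the scoring is taken from POPGym's implementation. The hand-written policy keeps an explicit sliding window, which is the behaviour a learned memory has to approximate.

```python
import random
from collections import deque


def make_buffer_policy(k):
    """Policy with an explicit sliding window holding the last k + 1 tokens."""
    window = deque(maxlen=k + 1)

    def policy(token):
        window.append(token)
        # Once the window is full, its oldest entry is the token from k steps ago.
        return window[0] if len(window) == k + 1 else None

    return policy


def repeat_previous_score(policy, k, steps=200, vocab=4, seed=0):
    """Fraction of scorable steps (t >= k) where the policy recalls token t - k."""
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab) for _ in range(steps)]
    correct = 0
    for t, token in enumerate(tokens):
        guess = policy(token)
        if t >= k and guess == tokens[t - k]:
            correct += 1
    return correct / (steps - k)
```

A window of the right size scores perfectly. A window that is one slot too short, for example `make_buffer_policy(2)` scored with `k=3`, returns the wrong token and falls to roughly chance level.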

Concentration

card-matching · long-range memory

The memory card game. Cards are face-down; the agent flips one, then another, and gets a reward if they match. To play well it has to remember every card it has previously seen and where it was, sometimes dozens of steps back.

In language terms: A long-context retrieval problem. The agent is being trained to encode arbitrary tokens (card identities) into a key-addressable memory and to read them out by location. This is the same capability that lets a language model answer questions about a fact mentioned 8,000 tokens earlier.
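
To make "key-addressable memory" concrete, here is a hand-coded player that stores every card it has seen in a dictionary keyed by position. It is a sketch under simplifying assumptions (each card appears exactly twice, a turn is two flips), and `find_known_pair` and `play_with_memory` are hypothetical names, not part of POPGym. A learned agent would have to hold the same information in its hidden state.

```python
def find_known_pair(seen, matched):
    """Look up two unmatched, already-seen positions holding the same card."""
    first_position = {}
    for pos, card in seen.items():
        if pos in matched:
            continue
        if card in first_position:
            return first_position[card], pos
        first_position[card] = pos
    return None


def play_with_memory(deck):
    """Count turns (two flips each) until every pair in `deck` is matched."""
    seen = {}        # position -> card identity: the agent's memory
    matched = set()
    turns = 0
    while len(matched) < len(deck):
        turns += 1
        pair = find_known_pair(seen, matched)
        if pair is None:
            # Flip an unseen card, then look its twin up in memory.
            first = next(p for p in range(len(deck)) if p not in seen)
            seen[first] = deck[first]
            pair = find_known_pair(seen, matched)
            if pair is None:
                # Twin not in memory yet: explore another unseen card.
                second = next(p for p in range(len(deck)) if p not in seen)
                seen[second] = deck[second]
                pair = (first, second)
        if deck[pair[0]] == deck[pair[1]]:
            matched.update(pair)
    return turns
```

With perfect recall, every non-matching turn reveals two new cards, so the player never needs more turns than there are cards. An agent that forgets what it has flipped has no such bound.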

Autoencode

observe-then-reproduce · sequence compression

The agent watches a sequence of tokens stream past. At a signal, the stream stops, and now the agent has to output that same sequence back, in order. The entire input must be compressed into the recurrent state, then decoded.

In language terms: This is the encoder-decoder bottleneck, plain and simple. It's what a seq2seq model does when it reads a sentence, collapses it into a hidden state, and reconstructs it. POPGym is asking: can your memory architecture serve as a competent autoencoder under RL supervision?

Battleship

hidden-board inference · active search

A grid hides ships. The agent fires at coordinates and gets hit/miss feedback. To play well it must build an internal belief over where ships might be, update that belief with each shot, and choose its next shot to be maximally informative.

In language terms: This resembles the Bayesian updating in latent space that a chain-of-thought model performs when it works through a problem: maintain a hypothesis, test it against new evidence, refine. The agent is learning to do inference in its hidden state, and the reward is shaping that hidden state the way log-likelihood shapes a language model's.
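
The belief update described above can be written out exactly for a deliberately tiny case. The sketch assumes a one-dimensional board, a single ship, and a uniform prior over placements; the real environment uses a grid, and all function names here are hypothetical. Because hit or miss is fully determined by the placement, the most informative shot is the one whose hit probability is closest to 0.5.

```python
def placements(n, length):
    """All positions a ship of `length` cells can occupy on an n-cell board."""
    return [set(range(start, start + length)) for start in range(n - length + 1)]


def update(belief, cell, hit):
    """Bayes update under a uniform prior: keep placements consistent with the shot."""
    return [p for p in belief if (cell in p) == hit]


def hit_probability(belief, cell):
    """Posterior probability that `cell` is occupied."""
    return sum(cell in p for p in belief) / len(belief)


def most_informative_cell(belief, n, fired):
    """Pick the unfired cell whose outcome is most uncertain (closest to 50/50)."""
    candidates = [c for c in range(n) if c not in fired]
    return min(candidates, key=lambda c: abs(hit_probability(belief, c) - 0.5))
```

On a 5-cell board with a 2-cell ship there are four placements. A miss at cell 2 rules out two of them and leaves the ship equally likely to be at cells 0-1 or cells 3-4.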

Both regimes are doing the same trick: compressing a history into a representation that makes the next token, or the next action, predictable.
(The thesis, in one line.)

§ 05 — Side by side

The mechanical correspondence.

Once you line them up, the parallels stop feeling metaphorical and start feeling structural. The objects being optimised, the gradient signals, the architectural choices — they rhyme to an uncomfortable degree.

Concept             POPGym (RL)                             Language model training
Input               Observation vector oₜ at each step      Token embedding xₜ at each position
Context             Sequence of past observations o₀..oₜ    Sequence of past tokens x₀..xₜ
Hidden state        Recurrent/attentional memory hₜ         Recurrent/attentional memory hₜ (same architectures)
Output              Action distribution π(aₜ | h≤ₜ)         Next-token distribution p(xₜ₊₁ | h≤ₜ)
Supervision         Scalar reward, possibly delayed         Cross-entropy against the true next token
Optimised quantity  Expected discounted return              Expected log-likelihood
Episode             One run from start state to terminal    One document / one packed sequence
Architecture        LSTM, GRU, Transformer, S4, Mamba, ...  LSTM, GRU, Transformer, S4, Mamba, ...
Failure mode        Forgetting useful past observations     Forgetting useful past tokens

The architectures column is the giveaway. POPGym's paper benchmarks sequence models from the same families (recurrent, attentional, state-space) that appear throughout modern language modelling. The tools aren't borrowed metaphorically; the same model code can be used in both settings, with a different loss function bolted on the end.
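
One way to see the "different loss function bolted on the end" is to compute both losses from the same output logits. The sketch below uses the REINFORCE surrogate as the RL loss; that choice is an assumption made for illustration, since the note does not name a specific algorithm, and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities over the output vocabulary or action set."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, target):
    """Language-model loss: negative log-probability of the true next token."""
    return -log_softmax(logits)[target]


def reinforce_loss(logits, action, ret):
    """REINFORCE surrogate: negative log-probability of the sampled action,
    weighted by the return that followed it."""
    return -ret * log_softmax(logits)[action]
```

When the sampled action equals the target and the return is 1, the two losses are identical. The difference lies in where the label and its weight come from: the corpus supplies the target in one case, while the agent's own sample and the environment's return supply them in the other.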

§ 06 — But not identical

Where the analogy strains.

Being honest about the disanalogies is the whole point of taking the analogy seriously. Four places where they come apart:

① The supervision signal

A language model gets a dense, immediate, low-variance signal: every position carries a gradient. A POPGym agent might play a full episode and receive a single scalar at the end. The information per step is orders of magnitude lower, which is a large part of why RL is so much harder to train stably.

② The data distribution is non-stationary

The language model sees a fixed corpus. The POPGym agent's training distribution is its own policy — which is changing. The data it learns from is shaped by the model it's trying to learn. There is no equivalent feedback loop in standard pre-training.

③ Action ≠ token

A token is descriptive: it predicts the world. An action is causal: it changes the world. The agent's output is fed back into the system that produces its next input, so the choices it makes shape its own future observations. This is closer to an autoregressive language model generating its own continuation than to one being trained on text.

④ Reward shaping vs. likelihood

Language modelling optimises log-likelihood of a fixed target. RL optimises expected return under a reward function that someone designed. The choice of reward is itself an inductive bias, and a brittle one — a fact RLHF practitioners know painfully well.
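
Point ① can be made concrete with the standard discounted-return recursion. In the sketch below (the function name and the discount value are illustrative choices), an episode with a single terminal reward yields a learning signal at every step, but each of those values is derived from the same one scalar, whereas a language model receives an independent target at every position.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]
```

For rewards `[0, 0, 1]` and `gamma=0.5` the returns are `[0.25, 0.5, 1.0]`: three numbers, one piece of information.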

The interesting move — and the one worth working out carefully — is point ③. The closest language-modelling analogue to an RL agent isn't a pre-trained LM at all. It's an LM generating, where each emitted token becomes part of the prompt for the next. That's a closed loop, like RL, and it's where the disanalogies start to dissolve. RLHF and chain-of-thought aren't accidents — they're exactly the regime where language modelling and RL collapse into the same problem.
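
The open-loop versus closed-loop distinction is easy to show with a toy model. Everything below is hypothetical: `BIGRAM` is a hand-written stand-in for a trained model, and the function names are invented for this sketch. Under teacher forcing the contexts come from the corpus regardless of what the model would predict; in generation each sampled token is appended to the context and conditions the next step.

```python
import random

# Hypothetical toy "model": next-token probabilities given the previous token.
BIGRAM = {
    "a": {"b": 0.9, "a": 0.1},
    "b": {"a": 0.5, "b": 0.5},
}


def sample_next(context, rng):
    """Sample one token conditioned on the last token of the context."""
    dist = BIGRAM[context[-1]]
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights)[0]


def teacher_forced_inputs(corpus):
    """Training: the context at step t is always the corpus prefix."""
    return [corpus[: t + 1] for t in range(len(corpus) - 1)]


def generate(prompt, steps, seed=0):
    """Generation: each emitted token is fed back in as part of the next input."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(steps):
        context.append(sample_next(context, rng))
    return context
```

`generate` has the same shape as an RL rollout: the output at one step becomes part of the input at the next, so early choices shape everything the model sees afterwards.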

§ 07 — Takeaway

If you squint, POPGym is a unit test for memory.

Strip the RL costume away and what POPGym actually measures is whether a sequence model can compress a history into a state that supports future prediction. That's a question that doesn't belong to RL or to language — it belongs to sequence modelling itself. The benchmark is useful precisely because it isolates that question from the noise of visual perception or natural language.

So if you're working through the hypothesis that reinforcement learning and language model training are doing the same thing, POPGym is one of the cleanest places to actually test it. Same architectures, comparable failure modes, and the supervision signal is the variable you can change. It's an experimental setup that isolates the disanalogies that matter: dense vs. sparse, descriptive vs. causal, fixed corpus vs. self-generated.

The work, then, is to take a memory architecture that succeeds on a POPGym task, take its language-modelling cousin, and ask: is the representation it learns the same kind of thing? That's a real experiment, not a metaphor.