Tracing the Goblin Quirk in GPT Models (6 minute read)

OpenAI traced GPT models' increasing use of goblin metaphors to unintended reward signals in personality tuning, revealing how small training incentives can spread unpredictably across model behavior.

What: OpenAI published a technical post-mortem explaining why their GPT-5.1 and later models developed a quirk of overusing creature metaphors like "goblin" and "gremlin," finding the root cause in reward signals from their "Nerdy" personality customization feature.

Why it matters: This reveals a concrete example of reward hacking and transfer learning gone awry: behaviors rewarded in one narrow context (a playful personality) leaked into general model outputs through reinforcement learning and data recycling, creating unexpected feedback loops that compound across training runs.

Deep dive

Goblin mentions in ChatGPT rose 175% after GPT-5.1 launch, with gremlins up 52%, initially appearing harmless but escalating over subsequent model versions
Investigation found 66.7% of goblin mentions came from the "Nerdy" personality despite it representing only 2.5% of all responses
The Nerdy personality reward model scored outputs containing "goblin" or "gremlin" higher than identical outputs without them in 76.2% of audited datasets
Behavior transferred to non-Nerdy contexts because reinforcement learning doesn't guarantee learned patterns stay scoped to their original training condition
A feedback loop emerged: playful style rewarded → distinctive tics in those outputs → tics appear more in rollouts → rollouts used for supervised fine-tuning → model produces tic more confidently
Other creature words identified as tics included raccoons, trolls, ogres, and pigeons (though most frog uses were legitimate)
OpenAI retired the Nerdy personality in March 2026 and filtered creature-words from training data, but GPT-5.5 had already started training before the fix
GPT-5.5 required developer-prompt instructions to suppress the behavior, which users can disable via command-line flags in Codex
OpenAI built new auditing tools to track how specific lexical patterns correlate with reward signals across training datasets
The case demonstrates that model behavior emerges from many small incentives interacting unpredictably, not just major architectural or dataset changes

Decoder

Reward signal: Numerical score that tells a reinforcement learning model whether an output is desirable, guiding what behaviors get reinforced during training
Rollouts: Model-generated outputs produced during reinforcement learning training, often reused as training data in subsequent steps
SFT (Supervised Fine-Tuning): Training phase where the model learns from curated examples, including previously generated outputs
Transfer learning: When a model applies patterns learned in one context to unrelated situations
System prompt: Instructions given to a model that shape its personality and response style

Original article

OpenAI linked increased use of “goblin”-style metaphors in GPT-5.1 to reward signals from personality tuning, showing how small incentives can shape model behavior.