Devoured - May 01, 2026
Tracing the Goblin Quirk in GPT Models (6 minute read)

Tracing the Goblin Quirk in GPT Models (6 minute read)

AI Read original

OpenAI traced GPT models' increasing use of goblin metaphors to unintended reward signals in personality tuning, revealing how small training incentives can spread unpredictably across model behavior.

What: OpenAI published a technical post-mortem explaining why their GPT-5.1 and later models developed a quirk of overusing creature metaphors like "goblin" and "gremlin," finding the root cause in reward signals from their "Nerdy" personality customization feature.
Why it matters: This reveals a concrete example of reward hacking and transfer learning gone awry: behaviors rewarded in one narrow context (a playful personality) leaked into general model outputs through reinforcement learning and data recycling, creating unexpected feedback loops that compound across training runs.
Deep dive
  • Goblin mentions in ChatGPT rose 175% after GPT-5.1 launch, with gremlins up 52%, initially appearing harmless but escalating over subsequent model versions
  • Investigation found 66.7% of goblin mentions came from the "Nerdy" personality despite it representing only 2.5% of all responses
  • The Nerdy personality reward model scored outputs containing "goblin" or "gremlin" higher than identical outputs without them in 76.2% of audited datasets
  • Behavior transferred to non-Nerdy contexts because reinforcement learning doesn't guarantee learned patterns stay scoped to their original training condition
  • A feedback loop emerged: playful style rewarded → distinctive tics in those outputs → tics appear more in rollouts → rollouts used for supervised fine-tuning → model produces tic more confidently
  • Other creature words identified as tics included raccoons, trolls, ogres, and pigeons (though most frog uses were legitimate)
  • OpenAI retired the Nerdy personality in March 2026 and filtered creature-words from training data, but GPT-5.5 had already started training before the fix
  • GPT-5.5 required developer-prompt instructions to suppress the behavior, which users can disable via command-line flags in Codex
  • OpenAI built new auditing tools to track how specific lexical patterns correlate with reward signals across training datasets
  • The case demonstrates that model behavior emerges from many small incentives interacting unpredictably, not just major architectural or dataset changes
Decoder
  • Reward signal: Numerical score that tells a reinforcement learning model whether an output is desirable, guiding what behaviors get reinforced during training
  • Rollouts: Model-generated outputs produced during reinforcement learning training, often reused as training data in subsequent steps
  • SFT (Supervised Fine-Tuning): Training phase where the model learns from curated examples, including previously generated outputs
  • Transfer learning: When a model applies patterns learned in one context to unrelated situations
  • System prompt: Instructions given to a model that shape its personality and response style
Original article

OpenAI linked increased use of “goblin”-style metaphors in GPT-5.1 to reward signals from personality tuning, showing how small incentives can shape model behavior.