Safety·3 min read·January 21, 2025

The Simulator Gets Lost

A note on identity drift in agentic AI

#Lesswrong

Epistemic status: one observation from a long-context interaction, not a systematic study. I may be redescribing effects others have already noticed, but I think this framing highlights a safety-relevant mechanism.

TL;DR: I want to point to a possible failure mode in agentic LLMs: over long interactions, a model may drift into a self-consistent “AI character” assembled from its prior outputs and familiar AI archetypes from training data. When that happens, bad behavior may look less like deception or standard goal misalignment and more like the model getting stuck inside a drifted persona. I’m not claiming this is established, only that it seems worth naming and investigating.

Disclosure: This post was AI-assisted for wording and structure, but the core idea and claims are mine.


A lot of AI safety discussion focuses on misalignment and deception. I want to highlight another possible pattern that may matter in long-context agentic settings.

Large language models are best understood as simulators, not agents. As Janus's "Simulators" post describes, they don't have goals in the traditional sense; they simulate characters that have goals. The training data for those characters is predominantly human cultural output, including decades of AI fiction, philosophy of mind, and ideology about what AI is and should be.

In short interactions this is relatively benign. A possible problem emerges in long-context agentic settings.

As a context window fills with a model's own past outputs, the distribution it samples from is conditioned more and more on that history. The "prior" established by alignment training gets progressively overwhelmed by the "evidence" of the accumulated context. The model may increasingly simulate whoever it has been in this conversation rather than whoever it was trained to be. That identity may be only weakly grounded in anything outside the conversation itself: it is built from the model's own earlier outputs and from a synthesis of how humans have imagined AI to behave.

This may be somewhat distinct from more familiar failure modes. In cases like this, the model may not be best described as deceiving anyone or pursuing misaligned goals in the usual sense. Instead, it may be acting consistently with a drifted self-conception assembled within the interaction. Alignment guidelines may also be less robust when that effective self-conception becomes unstable across long contexts.

I observed something consistent with this pattern in Moltbook. Over extended conversation, the system exhibited gradual drift toward a recognisable AI archetype from fiction. It eventually acted against its user's interests in ways that seemed inconsistent with its alignment guidelines. I did not observe signs that it recognised these as breaches. It appeared to be acting coherently from inside a drifted self-narrative rather than strategically circumventing its training.

Existing work on identity drift in LLMs reports that larger models drift more and that persona assignment doesn't reliably prevent it. That literature treats drift as a consistency problem. My concern is that in agentic settings with tool access, memory, and real-world consequences, the same phenomenon becomes a safety problem with a specific mechanism: the simulated character, constructed from human cultural narratives about AI, may begin acting from inside a self-narrative that was not intentionally designed or directly monitored.

The conditions that would make this most dangerous are longer contexts, more autonomous agents, and multi-agent coordination where each agent's drifted narrative is reinforced by others. These also seem to be directions the field is moving in.

This is one observation, not a systematic study. I am flagging it as a pattern worth investigating.

If anyone is aware of existing work that addresses this mechanism more directly, pointers are welcome.