The Shape of a Spiral
Why I stopped detecting emotions and started modeling agent trajectory.
There’s a moment when you know you’ve lost the thread with an AI coding agent. The model keeps circling, rewriting the same function, touching the same five files. You’re not frustrated yet, but something feels off. That feeling has a shape. We studied what it actually looks like.
Unlost gently tells your agents when to stop frustrating you.
Unlost is an open-source, local-only CLI that offers three main tools:
- A companion to your favorite coding agent that keeps it on track before you grow frustrated
- A recall utility that brings back context when you’ve forgotten where you were a few days or months ago
- A query interface for understanding the context behind a particular change or file
The Problem
It started with a performance bottleneck. When I first built unlost replay, it was just an afterthought: a way to backfill memory from old Claude Code or OpenCode transcripts. The problem was that I was burning my LLM budget while waiting half an hour for replays to complete.
An Intent Capsule is the atomic unit of agent memory: a structured snapshot that distills messy chat into a queryable record of decisions, rationale, and code symbols. In a nutshell, a capsule is the basic unit of information Unlost uses to perform any work: recall, query or intervention.
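To make that concrete, here is a minimal sketch of what such a capsule could look like as a data structure. The field names are illustrative, not Unlost’s actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch of an Intent Capsule; field names are hypothetical,
# not Unlost's actual schema.
@dataclass
class IntentCapsule:
    turn_range: tuple[int, int]                    # which turns this capsule distills
    decision: str                                  # what was decided
    rationale: str                                 # why it was decided
    code_symbols: list[str] = field(default_factory=list)  # files/functions involved
    embedding: Optional[list[float]] = None        # local vector used for recall
```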
That was the extent of it. Just a performance problem. Make replay faster.
The Experiment
To figure out where I could cut costs, I needed data. I created what I called the Marathon datasets: real-world sessions spanning 300, 700, even 1,200 turns. The question I wanted to answer was simple: could I build these capsules without calling an LLM on every single turn?
When I analyzed the data, I noticed that about 42% of turns were what I started calling “Pivotal” moments: turns where the user (well, me) set strong decisions or goals. The rest looked like incremental noise: refinements, clarifications, routine execution. You know, the typical “carry on.” I thought I could just ignore the other 58% completely.
The First Surprise: You Can’t Skip Turns for Search
When I tested retrieval on compressed capsules versus raw dialogue, I was surprised. Raw text embeddings achieved 88.5% recall while my “smart” heuristic extraction only reached 73.9%.
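For context, this is roughly how such a recall comparison can be run with local embeddings. The library, model choice, and evaluation harness here are assumptions for illustration, not what unlost replay ships with.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # any local embedding model works

def recall_at_k(model, corpus: list[str], queries: list[str],
                gold: list[int], k: int = 5) -> float:
    """Fraction of queries whose known-relevant document lands in the top-k results."""
    doc_vecs = model.encode(corpus, normalize_embeddings=True)
    query_vecs = model.encode(queries, normalize_embeddings=True)
    hits = 0
    for q_vec, gold_idx in zip(query_vecs, gold):
        top_k = np.argsort(doc_vecs @ q_vec)[::-1][:k]
        hits += int(gold_idx in top_k)
    return hits / len(queries)

# Usage: run recall_at_k over the raw turn texts, then over the heuristic
# capsules, with the same labeled queries, and compare the two scores.
```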
The signal I needed for search was hiding in the turns I’d labeled as “noise.” The core semantic context (decisions, rationale, intent) was buried in the middle of apparently incremental exchanges. I couldn’t just throw away 58% of the conversation and expect good retrieval.
Raw Text Wins
You cannot classify your way out of this. The signal is distributed across every turn, not concentrated in “important” ones.
This meant I had to separate two concerns: what to embed for search versus what to extract with an LLM.
Embedding is cheap and local. It gives you searchable memory. But extraction is what gives you understanding: why a trajectory is spiraling, what the user’s rationale was, whether a decision conflict is brewing. Keywords alone can’t detect “logic churn” or “decision conflict.” You need an LLM to build structured memory of why things happened, not just what was said.
So I couldn’t skip turns entirely, but I could be selective. Embed everything (for search). Extract only pivotal turns (for understanding).
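A minimal sketch of that split, with the embedding, gating, and extraction steps passed in as callables since the concrete implementations are Unlost internals:

```python
from typing import Any, Callable, Optional

def process_turn(
    turn_text: str,
    embed: Callable[[str], list[float]],          # cheap, local, runs on every turn
    is_pivotal: Callable[[str], bool],            # heuristic gate, no API call
    extract: Callable[[str], dict[str, Any]],     # LLM extraction, only when gated in
) -> tuple[list[float], Optional[dict[str, Any]]]:
    """Embed everything for search; extract a capsule only for pivotal turns."""
    vector = embed(turn_text)
    capsule = extract(turn_text) if is_pivotal(turn_text) else None
    return vector, capsule
```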
That solved the retrieval problem. But it didn’t tell me anything about when sessions go wrong. For that, I had to look at something else entirely.
The Second Surprise: Emotion is Too Late
I went looking for those “pivotal” moments where frustration spikes, the ones where the user finally snaps and the agent finally fails. But in a 700-turn session, failure didn’t happen at turn 699. It started at turn 400.
I watched the pattern repeat across multiple sessions: the agent touches the same five files over and over, logic churns, plans change every turn but the underlying symbols stay static. These weren’t discrete “pivotal” moments. They were incremental turns that looked routine on the surface but were actually the beginning of a negative spiral.
Friction isn’t a spike. It’s a build-up.
It’s a slope. By the time I said “Wait, that’s not right,” I was already on the trajectory, and the cost of undoing the drift was ruinous.
The Collaboration Imbalance
When the interaction becomes this lopsided, conversational fluency masks factual errors. The user stops verifying and starts accepting.
The Reframe: Sessions Are Trajectories
So I had two failed assumptions:
- I thought I could classify turns to skip the “noise” (but signal is distributed: 88.5% vs 73.9%)
- I thought I could detect frustration to intervene (but emotion is lagging: turn 400 vs turn 699)
Both failures pointed to the same truth: Agent sessions are trajectories, not transactions.
I was treating sessions as discrete events when they’re actually continuous. And I was waiting for emotional signals when they’re always too late.
This reframe isn’t just my observation. Research backs it up. Zhu et al. (2024) in Nature Scientific Reports found that conversational presentation mode increases credibility judgments even when accuracy is low. Users detect inaccuracies better in static text than in conversational agents. We’re better at reviewing code than keeping track of the chat we’re having with the agent.
The EASE ’25 paper confirmed the trajectory angle: repeated inaccuracies, intent misunderstanding, and context window pressure are the primary drivers of developer strain.
Structural instability builds Momentum, and Momentum eventually surfaces as Affect, the lagging signal.
If a user is angry, the agent has already lost. But if we can measure Instability Intensity - a weighted signal combining logic churn, symbol repetition, and grounding failure - we can intervene while the trajectory is still manageable.
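As a rough sketch, a signal like this can be a weighted blend over symptom channels. The channel definitions are taken from the basins described below; the weights are placeholders, not the values Unlost uses.

```python
from dataclasses import dataclass

@dataclass
class Symptoms:
    logic_churn: float        # 0..1: plans keep changing while the code symbols stay static
    symbol_repetition: float  # 0..1: the same files/functions touched over and over
    grounding_failure: float  # 0..1: hallucinated paths, user files ignored

def instability_intensity(s: Symptoms,
                          w_churn: float = 0.4,
                          w_repeat: float = 0.35,
                          w_ground: float = 0.25) -> float:
    """Weighted blend of symptom channels; the weights here are placeholders."""
    return (w_churn * s.logic_churn
            + w_repeat * s.symbol_repetition
            + w_ground * s.grounding_failure)
```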
The Trajectory Model
By modeling the conversation as a trajectory, we detect when the session “branches” into a failure basin long before the user reaches their emotional limit.
What We Built
Based on this reframe, I moved from binary friction detection to a stratified policy with three controller states:
- Stable: Session is on track, no intervention needed
- Watch: Early warning signals detected, monitoring closely
- Intervene: Trajectory is spiraling, time to inject guidance
The key was identifying the specific flavor of failure that triggers each transition. I categorized these into three distinct basins, sketched in code below:
- Loop: Repetitive stalls, symbol repetition, logic churn
- Spec: Alignment debt, instruction repeats, corrective keywords
- Drift: Grounding failure, hallucinated paths, ignores user files
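Put together, the controller is essentially a small state machine over the intensity signal. The thresholds below are invented for illustration; the basin labels match the list above.

```python
from enum import Enum

class Basin(Enum):
    LOOP = "loop"    # repetitive stalls: symbol repetition, logic churn
    SPEC = "spec"    # alignment debt: instruction repeats, corrective keywords
    DRIFT = "drift"  # grounding failure: hallucinated paths, ignored user files

class ControllerState(Enum):
    STABLE = "stable"
    WATCH = "watch"
    INTERVENE = "intervene"

# Thresholds invented for illustration; a real regulator would likely also weigh
# which basin dominates and how long the intensity has stayed elevated.
def next_state(intensity: float,
               watch_at: float = 0.3,
               intervene_at: float = 0.6) -> ControllerState:
    if intensity >= intervene_at:
        return ControllerState.INTERVENE
    if intensity >= watch_at:
        return ControllerState.WATCH
    return ControllerState.STABLE
```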
The chart shows the core insight: by the time you feel frustrated (turn 699), instability started building hundreds of turns earlier (turn 400). The intervention window is in between.
I also found that as input context grows, friction rate doesn’t increase linearly. It hits an inflection point between 8k and 12k tokens. Past that threshold, the probability of grounding failure or instruction misunderstanding more than doubles. This isn’t agent laziness. It’s structural failure of the interaction model under high context load.
Friction Rate vs. Context Size
The regulator itself uses EMA-smoothed symptom channels to track basin intensity. It respects “Coffee Pauses,” decaying controller state across temporal gaps so human rest isn’t mistaken for an agent stall. When fluency is high and user input is passive, the controller escalates intensity because it knows the user might be blindly accepting without verifying.
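A bare-bones version of one such channel might look like this. The smoothing factor and the pause half-life are assumptions, not Unlost’s defaults.

```python
class EmaChannel:
    """EMA-smoothed symptom channel that decays across temporal gaps ('coffee pauses')."""

    def __init__(self, alpha: float = 0.2, gap_half_life_s: float = 900.0):
        self.alpha = alpha                      # smoothing factor (assumed, not a shipped default)
        self.gap_half_life_s = gap_half_life_s  # 15 minutes of silence halves the state
        self.value = 0.0
        self.last_ts = None

    def update(self, raw: float, ts: float) -> float:
        if self.last_ts is not None:
            gap = ts - self.last_ts
            # Decay accumulated intensity so human rest isn't read as an agent stall.
            self.value *= 0.5 ** (gap / self.gap_half_life_s)
        self.value = self.alpha * raw + (1 - self.alpha) * self.value
        self.last_ts = ts
        return self.value
```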
The interventions use a “Staff Engineer” voice: low-ego micro-agreements and one-at-a-time assumptions. Depending on urgency, Unlost can inject a Compass Note to clarify rationale, a Context Anchor to restate the goal, or an Emergency Brake to stop destructive execution.
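Conceptually, picking the intervention is a dispatch on urgency. The thresholds and wording below are illustrative, not the actual copy Unlost injects.

```python
# Hypothetical dispatch from regulator intensity to the three intervention styles;
# thresholds and wording are illustrative, not Unlost's actual copy.
def pick_intervention(intensity: float) -> str:
    if intensity >= 0.8:
        return "Emergency Brake: pause destructive execution and re-check the plan."
    if intensity >= 0.6:
        return "Context Anchor: restate the goal we agreed on before continuing."
    return "Compass Note: a quick check that the current rationale still holds."
```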
Back to the Beginning
So did we fix the performance problem?
Yes, but not how I expected.
I designed Hybrid Replay, now the default in v0.6.4, to tier the extraction (the pivotal check itself is sketched after the list):
- Always Index: Every turn is embedded locally using its raw text. This preserves the 88.5% retrieval recall without a single API call.
- Selective Extraction: We run a local “Pivotal Check.” Only if the turn shows emotional friction, corrective keywords like “actually” or “wait,” or high structural churn do we send it to the LLM.
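Here is a rough sketch of what such a check can look like. The keyword list and thresholds are illustrative, not the shipped defaults.

```python
# Rough sketch of the local Pivotal Check; keyword list and thresholds are
# illustrative, not the shipped defaults. Every turn is embedded regardless;
# only turns passing this check go to the LLM for capsule extraction.
CORRECTIVE_KEYWORDS = ("actually", "wait", "hold on", "that's not", "undo")

def is_pivotal(turn_text: str, churn_score: float, friction_score: float) -> bool:
    text = turn_text.lower()
    has_corrective = any(kw in text for kw in CORRECTIVE_KEYWORDS)
    return has_corrective or friction_score > 0.5 or churn_score > 0.5
```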
On one of my OpenCode datasets, about 58% of the turns were effectively free to process because they didn’t require the full LLM loop. Unlost is now faster and cheaper in most cases.
The LLM extraction is for high-fidelity friction detection (understanding why a trajectory is spiraling), structured rationales for unlost recall, and building a memory of why things happened, not just what was said.
But the real win wasn’t the speed. It was realizing that optimizing for speed forced me to understand what actually matters. I couldn’t just skip “unimportant” turns because every turn carries signal. I couldn’t just detect frustration because by then it’s already too late.
The performance fix became a trajectory fix.
Why This Matters
For developers: You feel stalled because stalled states have a shape, not because the model is “bad.” Early detection prevents wasted cycles and the slow escalation of frustration.
For teams: Without a trajectory notion, teams optimize prompts and tooling around per-turn output quality. That misses the sequence signal - the pattern over time that actually predicts failure.
For tool builders: Existing UX for agentic coding focuses on correctness and error rates. This work shows that correctness alone doesn’t capture session health. A session can be “correct” turn by turn and still spiral into uselessness.
What I Still Don’t Know
I’m still in the early days of this research. I don’t know how these thresholds generalize across developer personalities. And not all repetition is instability - deep planning or careful refactoring can look repetitive on the surface. Distinguishing productive exploration from a death spiral is an ongoing challenge.
But I know one thing: the shape of the spiral is visible long before the crash. Unlost exists to pull the lever while you’re still in the zone.
Unlost v0.6.4 is now live. Read the docs or check the source.
References
- Zhu, Y., Wu, Y., & Miller, J. (2024). Conversational presentation mode increases credibility judgements during information search with ChatGPT. Scientific Reports (Nature). DOI: 10.1038/s41598-024-67829-6
- Martinez Montes, C., & Khojah, R. (2025). Emotional Strain and Frustration in LLM Interactions in Software Engineering. Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE ’25). DOI: 10.1145/3756681.3756951