Loop Engineering

The Loop That Picked Its Own Bedtime

The agent had finished a round of work and now had to answer one small question before going dark: how long to wait before waking up again. It picked 270 seconds. Not because the next task would be ready in 270 seconds (nobody knew when the next task would be ready), but because the prompt cache holds for about five minutes and 270 stays inside that window. The delay was chosen against a cache TTL. The actual work had no opinion about it at all.

That gap is the whole job. The loop keyword that scheduled the wakeup is one line. Everything that made 270 the right number instead of 300 lived in the space between two iterations, and that space is what I want to name. Call it Loop Engineering. The loop is the thing that repeats; the engineering is what it carries across the gap, and the decision about when it should stop repeating at all.

The Loop Keyword Is Trivial

Starting a loop costs nothing. /loop 5m /foo runs a slash command every five minutes until you kill it. Drop the interval and the model self-paces, scheduling its own next wakeup. A workflow can fan out subagents on every pass with a parallel() call. None of these are hard to write, and that is exactly the problem, because the writing is where people think the work is.

The work is between the wakeups. What state survives from one iteration to the next, what makes the thing stop, what each pass costs you. A loop you can write in one line can still run forever, double-act on every pass, or quietly spend a fortune confirming something it already knew three iterations ago.

Karpathy called Claude Code the first convincing demonstration of what an LLM agent looks like, something that in a loopy way strings together tool use and reasoning for extended problem solving, and he is right that the loop is the shape of the whole thing. The part he names as the primitive is the part you write in one line. The part you do not write is everything the loop has to carry across the gap.

In Matt Pocock’s reading of Anthropic’s Building Effective Agents, loops are already classified, by orchestration topology and by where the evaluator-optimizer feedback runs, and that is settled ground I am not relitigating. I sort them on a different axis: what triggers the next iteration. That single axis splits cleanly into four kinds, though the four buckets are not the interesting part, since anyone can enumerate clock, delay, discovery, money. The traps stack in any real loop, but the load-bearing claim is that each trigger drags one decision behind it that you do not get to design around. A clock forces idempotency, a chosen delay forces cache-aware pacing, discovery forces a convergence test, money forces a ceiling. The trigger picks the trap, and the rest of this is those four traps.

Interval Loops Force Idempotency

An interval loop wakes on a clock. Poll a queue every thirty seconds, babysit a deploy until it goes green, re-run a cron job nightly. The trigger is dumb and external, which is exactly the problem, because the clock does not know or care what your last iteration did.

Every wakeup re-reads the world from scratch. That is the whole hazard. If the body is not idempotent, the second pass sees the same condition the first pass saw and acts on it again. It posts the comment twice. It commits the same change twice. It re-fixes a thing that was already fixed, because nothing told it the fix already landed. The clock will happily fire your non-idempotent body forty times before you notice, and you inherit forty laps of compounded side effects.

The fix is not to remember what you did last time. Memory across wakeups is fragile, and the appeal of an interval loop is that each pass is stateless and cheap. The fix is to make every iteration start from a known-clean snapshot, and to refuse to run when that precondition does not hold.

The refactor-loop I run encodes exactly this. It refuses to start on main or on a dirty tree, and on any red signal it throws the working tree away before the next pass.

# precondition: never run on main, never run dirty
branch=$(git symbolic-ref --short HEAD)
[ "$branch" = "main" ] && { echo "refuse: on main"; exit 1; }
[ -n "$(git status --porcelain)" ] && { echo "refuse: dirty tree"; exit 1; }

# inside the loop body: each iteration's mechanical gate; red rolls the world back
if ! run_live_tests; then
  git reset --hard
  git clean -fd
  continue   # next wakeup starts from a known-clean snapshot
fi

That refusal is idempotency enforced as a precondition rather than a memory. By guaranteeing every iteration begins from the same clean snapshot and rolls back to it on failure, you remove the possibility that pass two acts on pass one’s half-finished mess. The loop can fire as many times as the clock wants, and the worst case is wasted compute rather than a corrupted tree. For more on how that harness draws the line around an agent, I wrote up the agent harness separately.
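The same precondition-and-rollback shape can be sketched without git at all. What follows is a minimal in-memory model, not the refactor-loop itself: `runPass`, the `repo` object, and its `branch`, `dirty`, and `files` fields are hypothetical names invented for illustration.

```javascript
// Hypothetical in-memory stand-in for a working tree; not a real git API.
function runPass(repo, applyStep, testsPass) {
  // Precondition: refuse to run on main or on a dirty tree.
  if (repo.branch === "main") return { ran: false, reason: "on main" };
  if (repo.dirty) return { ran: false, reason: "dirty tree" };

  const snapshot = { ...repo.files };   // known-clean state to roll back to
  applyStep(repo.files);                // the pass mutates the tree in place

  if (!testsPass(repo.files)) {
    repo.files = snapshot;              // red signal: throw the changes away
    return { ran: true, kept: false };
  }
  return { ran: true, kept: true };
}

// A failing step can fire any number of times without compounding.
const repo = { branch: "refactor/loop", dirty: false, files: { "a.js": "v1" } };
const badStep = (files) => { files["a.js"] += "+broken"; };
const neverGreen = () => false;
for (let i = 0; i < 40; i++) runPass(repo, badStep, neverGreen);
console.log(repo.files["a.js"]);        // prints "v1"
```

Forty laps of a broken step leave the tree exactly where it started, which is the property the clock cannot give you on its own.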

Dynamic Loops Force Cache Windows

A dynamic loop lets the model pick its own next delay. It finishes a pass, decides how long to sleep, emits a short reason, and schedules its own wakeup somewhere in a clamped window of sixty seconds to an hour. This feels like the smart version of an interval loop, and it is, but the engineering decision it forces is one almost nobody sees coming, because the right delay has nothing to do with the task.

It is a function of the prompt cache. The cache has a roughly five-minute TTL. Your context, the expensive part you pay to re-read every time the model wakes cold, stays warm for about that long and then evaporates. So the delay you pick is really a bet on whether your context will still be resident when you come back.

That makes three regimes, and the middle one is a trap.

# next-wakeup delay vs the ~5-min (300s) prompt-cache TTL
delay = 270    # inside the warm window: context still cached, cheap resume
delay = 300    # worst case: cache just expired, you eat the miss AND
               #   waited barely long enough to amortize nothing
delay = 1200   # 20+ min idle: one cold re-read amortized across a long wait

Pick 270 seconds and you land inside the warm window, so the next pass resumes against a live cache and costs almost nothing to start. Pick exactly 300 and you get the worst of both: the cache has just expired, so you eat the full cold re-read, and the wait was too short to amortize anything. Pick twenty minutes or more and you are deliberately in cold-read territory, but now one cache miss amortizes across a genuinely long idle, which is the right call when there is nothing to do soon anyway.
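A rough way to see the three regimes is to count cold re-reads per hour at a fixed delay. This is a sketch under loud assumptions: the TTL is treated as a hard 300-second edge, a wakeup inside the window is assumed to hit the cache and keep it warm, and `clampDelay` and `coldReadsPerHour` are hypothetical helpers, not part of any real API.

```javascript
const CACHE_TTL_S = 300;   // ~5 min, treated here as a hard edge (assumption)
const MIN_DELAY_S = 60;    // clamp bounds taken from the text: 60s to 1h
const MAX_DELAY_S = 3600;

function clampDelay(seconds) {
  return Math.min(MAX_DELAY_S, Math.max(MIN_DELAY_S, seconds));
}

// Cold re-reads per hour of steady looping at one fixed delay.
// Assumes every wakeup inside the TTL hits the cache and keeps it warm.
function coldReadsPerHour(delaySeconds) {
  const delay = clampDelay(delaySeconds);
  if (delay < CACHE_TTL_S) return 0;
  return Math.floor(3600 / delay);
}

for (const d of [270, 300, 1200]) {
  console.log(d, coldReadsPerHour(d));
}
// prints:
// 270 0
// 300 12
// 1200 3
```

Under those assumptions a 300-second delay pays twelve misses an hour, while a twenty-minute delay pays three and a 270-second delay pays none.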

This compounds the moment a wakeup does real work. Each pass can fan out a whole workflow of subagents, like a multi-model review council, and every one of them reads from the same context you are deciding whether to keep warm. So the delay is not just “when do I wake up,” it is “how often do I pay to reconstitute my entire working set.” Pacing a dynamic loop is cost control disguised as a sleep timer, and the model emitting “sleeping 270s, staying in cache” as its reason is the tell that it understood the real question.

Until-Dry Loops Force Convergence

An until-dry loop does not wake on a clock or a chosen delay. It wakes because the last pass found something, and it keeps going until enough consecutive passes find nothing new. The trigger is discovery itself, and this is the loop where the body stops being a prompt and becomes an orchestration in its own right.

Each round fans out in parallel. A spread of diverse-lens finders go looking for issues from different angles, and behind them a wave of refute-by-default verifiers try to knock down everything the finders claimed, because a finding nobody can defend is noise. The adversarial-review loop I run is built exactly this way: discover, skeptically verify, repeat, and call it done only after a stretch of empty rounds. It mutates nothing and emits confirmed findings as data, which is what lets you run it hot without fear. This is the generation-verification loop Karpathy keeps pointing at, where the discipline is to make verification fast to win and keep the generator on a tight leash. The finders generate, the verifiers leash, and the parallel fan-out is what makes verification cheap enough to run that hot.

The leash was never my problem. The convergence test was, and I got it wrong the first time in a way that is worth owning. The rule is to dedup each round’s new findings against everything you have ever seen, not against everything you confirmed.

const seen = new Set();        // every finding ever surfaced, confirmed or not
const report = [];             // confirmed findings, emitted as data
let dry = 0;                   // consecutive rounds that added nothing new

while (dry < K) {              // K: empty rounds required before calling it done
  const found = await parallel(
    lenses.map(l => () => discover(l))   // diverse-lens finders, in parallel
  );

  let addedSomething = false;
  const fresh = [];                      // findings first surfaced this round
  for (const f of found.flat()) {
    const key = fingerprint(f);
    if (seen.has(key)) continue;         // check seen, NOT confirmed
    seen.add(key);                       // record BEFORE the judge rules
    fresh.push(f);                       // only never-seen findings get verified
    addedSomething = true;
  }

  const verdicts = await parallel(
    fresh.map(f => () => verify(f))      // refute-by-default skeptic
  );
  report.push(...verdicts.filter(r => r.survives));

  dry = addedSomething ? 0 : dry + 1;
}

My first version deduped against the confirmed list. It seemed obviously right: you only care about real findings, so dedup against the real ones. But the verifiers reject plenty of findings, and a rejected finding never enters the confirmed list, so the very next round the finders rediscover it, the verifiers reject it again, and addedSomething flips true on a finding that adds nothing. The dry counter never climbs. The loop runs forever, busy and convergent-looking and completely stuck, burning a parallel fan-out per round to relearn the same rejection. Recording into seen the instant a finding surfaces, before the judge ever rules, is what closes the trap. That one-word difference, seen versus confirmed, decides whether the loop terminates.
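The difference is easy to reproduce in a toy. The sketch below is a synchronous simulation, not the adversarial-review loop: the finders always surface the same three made-up findings, the verifier always rejects "b", and `runUntilDry`, `dedupAgainst`, and `maxRounds` are hypothetical names added so the broken variant can be stopped.

```javascript
// Toy simulation: same finders, same verifier, two dedup strategies.
function runUntilDry(dedupAgainst, K = 2, maxRounds = 50) {
  const seen = new Set();        // everything ever surfaced
  const confirmed = new Set();   // only what survived verification
  const discover = () => ["a", "b", "c"];   // finders rediscover the same things
  const verify = (f) => f !== "b";          // "b" is always rejected
  let dry = 0;
  let rounds = 0;

  while (dry < K && rounds < maxRounds) {   // maxRounds is a safety cap for the demo
    rounds++;
    const known = dedupAgainst === "seen" ? seen : confirmed;
    const fresh = discover().filter((f) => !known.has(f));
    for (const f of fresh) {
      seen.add(f);
      if (verify(f)) confirmed.add(f);
    }
    dry = fresh.length > 0 ? 0 : dry + 1;
  }
  return { rounds, converged: dry >= K };
}

console.log(runUntilDry("seen"));       // converges after 3 rounds
console.log(runUntilDry("confirmed"));  // hits the 50-round cap, never dry
```

With dedup against `confirmed`, the rejected "b" counts as new on every round, so the dry counter resets forever and only the demo's cap ends the run.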

The last piece is a completeness critic that runs before you trust an empty round, asking what lens or modality never ran at all. Zero findings because the work is clean and zero findings because you only ever looked one way are indistinguishable from inside the loop, and the critic is the thing that tells them apart. Going dry has to mean the search space is exhausted, not just the searchers you happened to run.
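One minimal form of that critic is a coverage check over the rounds that actually ran. This is only a sketch: the lens names, the `roundsLog` shape, and the `missingLenses` and `trustDry` helpers are all hypothetical, and a real critic would likely ask broader questions than set membership.

```javascript
// Hypothetical critic: which required lenses never ran in any round?
function missingLenses(requiredLenses, roundsLog) {
  const ran = new Set(roundsLog.flatMap((round) => round.lensesRun));
  return requiredLenses.filter((lens) => !ran.has(lens));
}

// An empty round is only trustworthy if nothing was left unlooked-at.
function trustDry(requiredLenses, roundsLog) {
  return missingLenses(requiredLenses, roundsLog).length === 0;
}

const required = ["security", "concurrency", "api-contract"];
const log = [
  { lensesRun: ["security"], findings: 0 },
  { lensesRun: ["security", "api-contract"], findings: 0 },
];
console.log(missingLenses(required, log).join(", "));  // prints "concurrency"
console.log(trustDry(required, log));                  // prints false
```

Two empty rounds here prove nothing about concurrency, because no finder ever looked.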

Budget Loops Force a Ceiling

A budget loop runs while there is money left to spend. The trigger is the wallet. A workflow exposes a budget object with a total, what it has spent, and what remains, and the natural loop is while (budget.remaining() > threshold). It reads clean and it has a hole in it.

// guard on budget.total: an unset budget must NOT run to the agent cap
if (!Number.isFinite(budget.total) || budget.total <= 0) {
  throw new Error("budget loop needs an explicit total");
}

while (budget.remaining() > threshold) {
  const round = await runImprovementPass();
  if (round.plateaued) break;   // converged: stop paying to re-confirm it
}

The guard on budget.total is load-bearing. With no budget set, remaining() returns Infinity, the comparison is always true, and your budget loop quietly runs to the hard agent cap instead, which is a thousand-agent backstop that exists to catch runaways and not to be your stopping condition. An unset budget should run zero iterations, not all of them.
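To see the hole concretely, here is a toy budget object. It is an assumption about the shape, not the real workflow API: `makeBudget`, `spend`, `countIterations`, and the way an unset total is modeled as `Infinity` are invented to match the behavior described above.

```javascript
// Hypothetical budget object; the real workflow API may differ.
function makeBudget(total) {
  let spent = 0;
  return {
    total: total === undefined ? Infinity : total,  // unset budget modeled as Infinity
    spend(amount) { spent += amount; },
    remaining() { return this.total - spent; },
  };
}

// Unguarded loop: only the budget and the hard agent cap can stop it.
function countIterations(budget, threshold, costPerPass, agentCap) {
  let passes = 0;
  while (budget.remaining() > threshold && passes < agentCap) {
    budget.spend(costPerPass);
    passes++;
  }
  return passes;
}

console.log(countIterations(makeBudget(10), 0, 2, 1000));  // prints 5
console.log(countIterations(makeBudget(), 0, 2, 1000));    // prints 1000
```

The explicit budget stops after five passes; the unset one runs straight to the cap, which is why the `Number.isFinite` guard has to come first.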

The subtler failure is a loop that keeps paying after it has nothing left to buy, and this is the one that bit me. I ran an overnight self-improvement loop, the kind that scores a thing, applies a patch, re-scores, and keeps or reverts. It converged early: the score flattened and stayed flat. The loop did not know that, because its only stopping condition was the budget, and the budget had not run out. So it kept going for hours, applying patches that got reverted, re-confirming a plateau it had already reached. Every dollar after that point bought nothing. The plateau detection that a proper self-improvement loop uses (stop when the score stops moving) is the thing I should have wired in from the start. Budget is one way to denominate a ceiling; “the score stopped moving three rounds ago” is a better one, and the loop construct gives you neither for free.
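A plateau check of the "stopped moving three rounds ago" kind can be a few lines. This is one possible sketch, not the detector any particular loop uses: `hasPlateaued`, the `window` and `epsilon` parameters, and the example scores are all made up for illustration.

```javascript
// True when none of the last `window` scores beat the best earlier score
// by more than `epsilon`. Needs more than `window` scores to say anything.
function hasPlateaued(scores, window = 3, epsilon = 1e-6) {
  if (scores.length <= window) return false;
  const bestBefore = Math.max(...scores.slice(0, -window));
  const recent = scores.slice(-window);
  return recent.every((s) => s <= bestBefore + epsilon);
}

console.log(hasPlateaued([0.61, 0.70, 0.74, 0.74, 0.74, 0.74]));  // prints true
console.log(hasPlateaued([0.61, 0.70, 0.74, 0.78]));              // prints false
```

Comparing against the best earlier score rather than the previous one means a reverted patch that dips and recovers still counts as no progress.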

The Gap Was the Job

That self-paced agent from the start, the one weighing 270 against 300 seconds, was doing the only thing any of these loops ever do. It was deciding when not to wake up again. The while stayed one line the whole way through, from the refactor-loop’s clean snapshot to the overnight loop that should have quit hours before it did.

Idempotency, pacing, convergence, ceiling. Four sections, four names, and underneath them the same move every time: choosing the stop. The refusal to run dirty, the cache window, the dry counter, the plateau: every one of them is a place the loop is told to quit. None of it is the loop; all of it is the part the loop does not hand you. Starting one costs nothing, which is exactly why people think the work is in the body. You can fire off a recurring agent or a swarm before you finish reading this sentence, the same way you can drop a one-line tool into a config and call it shipped, which is a thing I argued about when I said skills are the new dotfiles. Stopping one well is where all the engineering actually went, and most of mine went there late, after the loop had already been running too long.
