Loop Engineering Sounds Smart. The Bill Says Otherwise

It’s tough to fight the massive hype pushing everyone towards building with Loops. All the cool developers are doing it, so why shouldn’t you? The thing is, the cool devs often work at companies that benefit from unlimited token consumption, which makes me suspicious. I don’t think they’re malicious, but their frame of reference is different.

If you have effectively unlimited token consumption available, autonomous loops feel like magic. You can spin up one agent to implement, another to review, another to verify, and another to decide what happens next. That workflow is powerful and it works.

But is that the best way for everyone?

Loops and You are Nothing New

You and I have been running loops ourselves for quite a while: CI/CD, GitFlow. Most of the concepts and harnesses we’ve built around producing results are built on loops… with humans in them.

But removing the human has serious implications. It does not make sense when the work changes product direction, architecture, user experience, or anything else where “better” requires real judgment.

A loop can look productive while drifting from the real goal. That is the problem with objective drift: locally plausible outputs can move away from the actual task. I thought this was just something that happened to me, but models do this at scale and it can be frustrating.

When agents can change tools, memory, data, or business systems, the question becomes governance: who approves the change, checks the work, and decides when the loop stops? That should be you. Automate only when the stop condition is clear. Keep a human in the loop when “better” still needs judgment.
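
One way to make that governance concrete is to write the stop conditions down before the loop starts. The sketch below is a minimal illustration, not a real harness: `run_step`, `approve`, the `StepResult` fields, and the default budget numbers are all hypothetical names and values I made up for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    tokens_used: int      # tokens this step consumed (hypothetical accounting)
    done: bool            # True when the objective check passed
    needs_judgment: bool  # True when the change touches direction, architecture, or UX


def governed_loop(
    run_step: Callable[[int], StepResult],
    approve: Callable[[int, StepResult], bool],
    max_steps: int = 5,
    token_budget: int = 50_000,
) -> str:
    """Run an agent loop that always has an explicit reason to stop."""
    spent = 0
    for step in range(1, max_steps + 1):
        result = run_step(step)
        spent += result.tokens_used
        if spent > token_budget:
            return "stopped: token budget exceeded"
        # Anything that needs judgment goes to a human before the loop continues.
        if result.needs_judgment and not approve(step, result):
            return "stopped: human rejected the change"
        if result.done:
            return "done"
    return "stopped: max steps reached"
```

Every exit path is named, so a run that ends tells you why it ended. In practice `approve` could be as simple as a prompt in the terminal; the point is that the loop cannot skip it.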

Loops aren’t the problem, bad AI is

Without constraints, a loop is just a polite way to let an agent nuke tokens by rewriting and expanding the codebase until the meter runs out.

We don’t need better loops as much as we need better models and agents that use fewer tokens to reach a good result.

I do spend my own cash on tokens, so although I believe there is a place for loop harnessing, I have to temper it with common sense and my budget. The reason loops come up at all is that models still can’t solve certain patterns, get lazy, and tell you they’ve fixed things they haven’t. Models that fix those issues would do more for results than another layer of looping.

Cost is not a side issue

The token problem is not just that agents write a lot. It is that agentic workflows repeatedly reconsume context.

A 2026 paper on AI-agent token consumption found that agentic coding tasks can consume about 1000x more tokens than code reasoning or code chat. The paper found that input tokens, not output tokens, drive most of the cost because agents repeatedly re-feed accumulated context during multi-step work.
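
To see why input tokens dominate, here is a back-of-the-envelope sketch. It assumes every step re-sends the full context accumulated so far, with no caching or summarization. The token counts are made-up numbers for illustration, not figures from the paper.

```python
def cumulative_input_tokens(base_context: int, tokens_added_per_step: int, steps: int) -> int:
    """Total input tokens when each step re-sends everything accumulated so far."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context                  # the whole context is fed in again
        context += tokens_added_per_step  # tool output and replies grow the context
    return total


# Hypothetical numbers: a 2,000-token starting context that grows by 1,500 tokens per step.
one_shot = cumulative_input_tokens(2_000, 1_500, steps=1)
agentic = cumulative_input_tokens(2_000, 1_500, steps=40)
print(one_shot, agentic)  # 2000 1250000
```

Under these assumptions the total grows roughly with the square of the step count: doubling the run to 80 steps gives 4,900,000 input tokens, close to four times the 40-step figure.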

It also found that token usage varies widely across runs, with the same task differing by up to 30x in total tokens. More token use did not reliably improve accuracy, and models often underestimated their own token costs before execution.

At the same time, Anthropic has found that multi-agent systems can outperform single agents for broad research tasks, but token usage is much higher. Anthropic says agents typically use about 4x as many tokens as chat, and their multi-agent research system used about 15x more tokens than chat.

Ouch.

Does it produce better code?

The simple answer is: maybe, but not automatically.

AI coding loops can improve code when each pass is forced through a real constraint: a failing test, a compiler error, a static analyzer, or a reviewer checking the diff.
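
A minimal sketch of what “forced through a real constraint” can look like: the loop accepts a pass only when an external command exits cleanly, never because the agent says it is finished. `propose_fix` is a hypothetical stand-in for whatever calls your agent, and the check command is an assumption; substitute your own test runner, compiler, or analyzer.

```python
import subprocess
from typing import Callable, Sequence


def check_passes(command: Sequence[str]) -> tuple[bool, str]:
    """Run an external check and report (passed, combined output)."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def constrained_loop(
    propose_fix: Callable[[str], None],
    command: Sequence[str],
    max_attempts: int = 3,
) -> bool:
    """Loop until the external check passes or the attempts run out."""
    for _ in range(max_attempts):
        passed, output = check_passes(command)
        if passed:
            return True
        propose_fix(output)  # the agent only ever sees real failure output
    # One last check after the final attempt; the verdict is still external.
    return check_passes(command)[0]
```

A call might look like `constrained_loop(my_agent_call, ["python", "-m", "pytest", "-q"])`, where `my_agent_call` is again hypothetical. The design choice is that the agent never gets a vote on whether the work is done.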

They do not reliably improve code when the agent is only judging itself. Recent work found that LLMs correct others more reliably than they correct their own reasoning.

That matters because loops can create more surface area: more files, more tests, and more code to review. A 2026 study of real GitHub repositories found that AI-authored commits introduced technical debt, bugs, and security issues.

Loops also make cost harder to predict. A 2026 paper on agentic coding found that runs on the same task can vary by up to 30x in total tokens, and that more tokens do not reliably mean better results. The core issue is token variance.

The counter-movement: Caveman and Ponytail

A couple of new projects, Caveman and Ponytail, point in the opposite direction from wild token consumption, and both have a lot of merit.

Caveman tries to reduce output tokens by forcing the assistant to communicate tersely with short words and sentences. The project claims large output-token savings while keeping technical accuracy. Verbosity is not free; I am often fatigued just reading the feedback from models.

Ponytail is more interesting for coding quality (with a winning logo too). It pushes the agent to behave like a senior developer who asks whether the code needs to exist at all. Check the standard library. Check platform features. Check existing dependencies. Prefer the smallest working change. It’s something I’m implementing in my coding runs.
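
As a small illustration of the “check the standard library” step (my own example, not taken from the Ponytail project), here is the kind of helper an agent tends to write from scratch, next to a version that leans on what Python already ships.

```python
from collections import Counter


def top_words_handwritten(text: str, n: int) -> list[tuple[str, int]]:
    """What an unconstrained agent often produces: a hand-rolled tally and sort."""
    counts: dict[str, int] = {}
    for word in text.lower().split():
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


def top_words_stdlib(text: str, n: int) -> list[tuple[str, int]]:
    """The smaller change: Counter already does the tally and the ranking."""
    return Counter(text.lower().split()).most_common(n)
```

Both functions return the same result for the same input, but the second one is a single line to review and has no counting logic of its own to get wrong.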

The headline claims around Ponytail are aggressive, often “80-94% less code,” but the defensible lesson is not the exact benchmark number. The lesson is that we should be engineering agents to avoid unnecessary code, not just engineering loops that generate more of it.

What to do

So, before you chase the next wave of “fully autonomous” loops, ask yourself: is this truly making things better, or just making more? The real innovation in AI isn’t just in creating agents that can generate endless code, but in building systems that know when to stop, when to question, and when to defer to human insight. It’s about discerning development, where every token counts and every line of code serves a clear, valuable purpose.