Spec-driven development is dead. Long live plan mode
In my colleague Chris's recent post, The AI pricing honeymoon is over, he made the case that token spend is now something engineers have to design around rather than treat as someone else's finance problem.
He covered four techniques that can help: prompt caching, context discipline, in-task routing via the Advisor tool, and a gateway like LiteLLM.
In this post we’re going to look at how Spec-Driven Development (SDD), one of many similar approaches where you crank out a long-form spec, decompose it into a hierarchy of artefacts and then start to implement, makes it more difficult to use those cost-reduction techniques.
We’ve used SDD and others on a handful of projects of different shapes and the feedback has always been fairly consistent:
It used a lot of tokens, made a lot of files but didn't necessarily deliver the outcome expected.
SDD is a workflow built for a world with unlimited tokens. Unfortunately, that world is being replaced by usage-based billing.
What SDD actually promises
It would be unfair to dismiss this without naming the case for it. Sean Grove, who works on the Model Spec at OpenAI, gave a talk at the AI Engineer World's Fair last year called The New Code in which he argued that the code you write represents only ten to twenty percent of your value as an engineer. The rest is "structured communication": pinning down the problem, capturing constraints, planning the solution, defining how you'll know it works.
The various SDD tools sit on top of that idea. Birgitta Böckeler at Thoughtworks has done the cleanest survey I've seen in Understanding Spec-Driven-Development: Kiro, spec-kit, and Tessl, where she breaks the space into three flavours: spec-first, spec-anchored and spec-as-source. GitHub's spec-kit sits firmly at the spec-first end. You scaffold a spec, a plan, a sub-plan and a task list, then drive an AI coding agent through the leaves.
The pitch is reasonable. Specs force you to make the implicit explicit, a planned hierarchy means the model has something concrete to drive towards rather than improvising its way through the work, and the various artefacts are meant to act as scaffolding around that.
I wanted that. The work I was doing had enough moving parts that "go and implement it" wasn't going to fly, and I'd previously had decent results from breaking work into phases by hand and feeding them in one at a time. So I gave SDD a fair run. Multiple runs. Different team members. I came away with three observations that turned up on every project.
Three things kept happening
We used a lot of tokens
Each artefact gets reasoned about in isolation and in the context of its parents. By the time the agent is at task level, it's processing the whole hierarchy on every step. On the spec-kit projects I ran, the cost ran to several multiples of the equivalent plan-mode flow for the same scope of work. I'm not the only one seeing this. The GitHub spec-kit issue tracker has a long-running report of excessive token usage where Pro-tier users were burning through limits in under an hour. Scott Logic ran the same exercise and reported their spec-kit run was around ten times slower than iterative prompting for the same work.
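To put rough numbers on that, here is a minimal sketch of the arithmetic. The artefact sizes and step count are made up for illustration, not measurements from any of the projects above, and the model deliberately ignores output tokens, prompt caching and the cost of generating the artefacts in the first place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    """A toy model: every implementation step re-reads every artefact."""

    name: str
    artefact_tokens: tuple[int, ...]  # assumed size of each document, in tokens
    steps: int                        # assumed number of implementation steps

    def input_tokens(self) -> int:
        # Total input tokens spent re-reading the documents across all steps.
        return sum(self.artefact_tokens) * self.steps


# Hypothetical sizes: spec, plan, sub-plan and task list versus one plan.
hierarchy = Workflow("hierarchy", (6000, 4000, 3000, 2000), steps=20)
single_plan = Workflow("single plan", (3000,), steps=20)

ratio = hierarchy.input_tokens() / single_plan.input_tokens()
print(f"{hierarchy.name}: {hierarchy.input_tokens()} input tokens")      # 300000
print(f"{single_plan.name}: {single_plan.input_tokens()} input tokens")  # 60000
print(f"ratio: {ratio:.1f}x")                                            # 5.0x
```

With these assumed sizes the hierarchy reads five times as many input tokens as the single plan for the same number of steps.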
There's a deeper reason for this beyond "more files cost more tokens", and it's the point Andrej Karpathy keeps making. Karpathy argues that the real skill in industrial LLM apps isn't prompt engineering, it's "context engineering": choosing precisely what to put in the window. SDD does almost the opposite. It generates lots of artefacts and then hauls the whole tree back through the model on each step. Drew Breunig has documented what happens when you do that under the heading of context rot: models get distracted by stale plans, lean on accumulated context rather than the task in front of them and start repeating themselves. So you're paying twice. Once on the input bill and once in the form of degraded output.
The artefacts drifted from the code
Once implementation started, the lower-level docs (tasks, acceptance criteria) drifted from the higher-level ones (spec, plan), and both drifted from the code. Nothing kept them in sync. By the end of each project I had documents that disagreed with each other and with what shipped.
This is another known issue. Böckeler captures it in passing in her SDD piece, noting that the spec-kit workflow felt like "overkill for the size of the problem" and that in the same time it took to run and review the spec-kit pipeline she could have done the work with plain AI-assisted coding. Once the spec is no longer the source of truth, it's just another file to maintain.
The output wasn't better
This is the killer. The whole pitch is that you spend more tokens to get a more structured result. I didn't. I got a roughly similar quality of output, in more files, with more drift, for more money. Prezi's engineering team reached the same conclusion in We Tried Spec-Driven Development So You Don't Have To. Scott Logic's review landed in the same place: the code spec-kit produced was fine, but no better than iterative prompting, and the workflow shipped a simple bug that a tighter loop would probably have caught.
What I do instead
Plan mode in Claude Code, every time:
State the goal and the constraints. Let it read the relevant code.
Iterate on the plan inline. Push back, ask for alternatives, narrow the scope.
When the plan is solid, exit plan mode and implement.
If the work is chunky I ask for the plan to be split into phases, then use the Atlassian MCP to fan the phases out as Jira sub-tasks. But the fundamental loop is small: one document, one round of iteration, then code.
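As a rough sketch of the phase split, assuming the plan is plain text with headings of the form "Phase N: title" (that format, the function name and the sample plan are all hypothetical), the document can be cut into one payload per phase. The hand-off to Jira through the Atlassian MCP is not shown here.

```python
import re

# Assumed heading convention: a line such as "Phase 1: Add the middleware".
PHASE_HEADING = re.compile(r"^Phase (\d+): (.+)$", re.MULTILINE)


def split_phases(plan_text: str) -> list[dict]:
    """Split a plan into one dict per phase: number, summary, description."""
    matches = list(PHASE_HEADING.finditer(plan_text))
    phases = []
    for i, match in enumerate(matches):
        # A phase body runs from its heading to the next heading or the end.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(plan_text)
        phases.append(
            {
                "number": int(match.group(1)),
                "summary": match.group(2).strip(),
                "description": plan_text[match.end():end].strip(),
            }
        )
    return phases


# Hypothetical plan text, for illustration only.
PLAN = """Goal: add rate limiting to the public API.

Phase 1: Add the limiter middleware
Wire a token bucket into the request pipeline.

Phase 2: Expose configuration
Read limits from settings and document the defaults.
"""

phases = split_phases(PLAN)
for phase in phases:
    print(phase["number"], phase["summary"])
```

Each dict is small enough to paste or pass along as a single sub-task, which keeps the plan itself as the one document that matters.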
The reason this works is unglamorous. Most of the value of "specs" comes from forcing you to think about the problem before you start, and plan mode forces that. The rest of SDD's structure (sub-plans, task lists, acceptance criteria) is artefact-shaped overhead. It looks like rigour on the page but isn't doing the work the structure suggests it should.
Simon Willison draws a useful line between "vibe coding", where you accept whatever the model produces, and using an LLM as a typing assistant on code you've reviewed and understood. Plan mode is the second of those by design. You sign off on a plan you've actually read.
The lean lens
SDD pays for structure with reasoning passes. Every artefact is generated, every artefact is read back into context to inform the next and the whole hierarchy gets dragged through the model on each implementation step.
In a world where inference is being sold below cost you can afford that. In the world we’ve moved into, the cost greatly outweighs the benefits.
A surgical plan is the smallest scaffolding that gets the job done. In practice that's one document, one round of iteration, enough constraint to keep the implementation honest and nothing extra. That's the workflow shape that survives a lean budget.
There's a recurring pattern in AI tooling right now where the more elaborate workflow loses to the simpler one. Multi-agent orchestration with five role-specialised agents is, for most jobs, beaten by one capable agent in plan mode. Hierarchical spec-and-plan workflows are beaten by one good plan and a clear next step. Token budgets get spent on the meta-work rather than the work.
Is SDD the new yak shaving? I’ll let you decide…
The pattern shows up elsewhere
BMAD and frameworks like it lean on a cast of role-specialised subagents (architect, developer, scrum master) running an agile-shaped loop. It looks reassuringly familiar to anyone who's worked in a real team. But the agile ceremony is a coordination protocol for humans who can't read each other's minds. A single capable model with the whole context already in its window is not paying that coordination tax, so reintroducing it just to make the workflow look like a sprint is forcing a human shape onto something that doesn't need it.
TDD-as-prompt-pattern is the other one. Telling the agent to write a failing test first, watch it fail, then write the implementation to make it pass is the same red-green-refactor loop we run as humans, encoded as extra turns. The loop exists for humans because we don't otherwise know if the test is exercising what we think it is. A predictive model isn't fooled by the same things and the extra round trips are bought at full token price. Tests still matter as the safety net under what shipped. Forcing the model to invent them in a specific order, ahead of the code, is the bit that doesn't pay rent. The same pattern keeps showing up: a workflow that earns its keep with humans, ported to a model that isn't paying the cost it was designed to amortise.
Where SDD might still earn its keep
I'm not arguing against specifications. I'm arguing against artefact hierarchies generated by AI on the way to writing code. The spec-anchored and spec-as-source models that Böckeler describes are a different conversation. If your spec is a long-lived artefact that genuinely outlives any single implementation (a public API, a regulated domain, an ADR that captures a decision you'll need to defend in three years' time, a product surface that multiple teams will reimplement) then the calculus changes. The spec is doing real work, paid for once, used many times.
The case I'm pushing back on is generating spec, plan, sub-plan and tasks files as scaffolding for a single feature.
The takeaway
If you find yourself impressed by the architecture diagram of your AI workflow, treat that as a flag. Most of what we used to call the workflow is plumbing around the model now, and the bill for that plumbing tends to land on the line item nobody was looking at last year.
If you've had a different experience with SDD I'd love to hear it. The above is one engineer across a handful of projects, but I've had this same conversation with three other senior engineers in the last fortnight and we've all landed in the same place. Plan mode does the job. Once tokens stop being someone else's problem to pay for, the cheapest workflow that still ships wins by default. That isn't SDD.