The race is on. Agentic engineering is about to get lean.
The first wave of adoption has been driven by speed, not cost. That works while adoption is the constraint. It breaks once organisations move into agentic engineering, where these systems are embedded in delivery and cost starts to bite.
The first signs of cost pressure in agentic engineering
Last July, Anthropic introduced weekly usage caps on Claude Pro and Max after a small number of power users essentially ran Claude Code around the clock. A month earlier, Cursor overhauled its pricing and had to apologise publicly when users discovered they could burn through a month's credit in a morning. In September, GitHub announced it was removing the $0 default budget that had been quietly blocking premium-model requests on Copilot enterprise accounts, shifting admins from opt-in overage billing to opt-out. None of this reads as greedy vendors. It reads as an economic model for agentic AI that was always a subsidy in disguise.
The cost pivot hasn’t landed yet
This is where agentic engineering stops being about adoption speed and starts being about cost, control and system design.
Right now, we are helping organisations across financial services drive agentic AI adoption at pace. The engineering playbook for the last two years has been simple - give developers unfettered access to the best tools available and close the productivity gap with whoever moved first. That logic holds while the gap is open. It does not survive the moment the gap closes and attention shifts from who can adopt fastest to what the adoption actually costs.
Most of the customers we work with are running Claude Code, Codex CLI or equivalents directly against the model APIs via AWS, Azure or GCP, so vendor-side subscription caps don't really touch them. What they have instead is unlimited developer AI spend. Today, that reads as an investment in productivity. A year from now, when someone from the finance team opens the Q1 invoice and finds a seven-figure line item from a single platform team, it will read rather differently.
Changes to vendor subscription plans are the first tremor. The real shift comes when the adoption race stops being the binding constraint and enterprise finance teams meet their first seriously sized AI bill. We are already having early versions of that conversation with CTOs who love what agentic coding has done for their engineering org but can't defend the burn rate to a board. The question stops being "how much AI do we want?" and becomes "how cheaply can we buy the outcome we've come to depend on?".
Routing is the architectural answer for agentic AI systems
The answer quietly forming inside modern agentic AI systems - and increasingly within agentic engineering - is routing. Rather than sending every prompt to the most capable model available, a harness dispatches each task to the cheapest model that can credibly handle it. Cursor's Auto mode already does a version of this, picking a premium model per prompt and failing over when a provider degrades. GitHub has gone further, shipping an Auto mode plus an explicit model picker across Copilot that rotates between a tiered pool of Anthropic, OpenAI and Google models.
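The request-level version of this can be sketched in a few lines: a cheap check decides which tier gets the prompt. Everything below is an illustrative assumption - the model names, the prices and the keyword heuristic are stand-ins, and a production router would use a trained classifier rather than keywords.

```python
# Minimal per-request routing sketch. Model names, prices and the
# keyword heuristic are illustrative assumptions, not real quotes.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    usd_per_1m_tokens: float

CHEAP = Model("small-coder", 0.30)       # hypothetical small model
FRONTIER = Model("frontier-xl", 15.00)   # hypothetical frontier model

# Crude signals that a prompt likely needs frontier capability.
HARD_SIGNALS = ("refactor across", "design", "race condition", "prove")

def route(prompt: str) -> Model:
    """Send easy prompts to the cheap tier, hard ones to the frontier."""
    text = prompt.lower()
    if len(prompt) > 2000 or any(s in text for s in HARD_SIGNALS):
        return FRONTIER
    return CHEAP

print(route("Rename this variable in one file.").name)                     # small-coder
print(route("Design a migration plan and refactor across modules.").name)  # frontier-xl
```

The heuristic is deliberately crude; the point is the shape - the dispatch decision costs almost nothing compared with a frontier call.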
Anthropic has taken the pattern a step further with its Advisor tool, now available via the Claude API. A cheaper executor model (Haiku or Sonnet) runs the bulk of the generation while a higher-intelligence advisor (Opus) is consulted mid-task for strategic guidance. The executor pays for the routine tokens at its lower rate, and only the planning moments draw on the frontier model's capability. That's routing not just across requests but within them - the same shape of argument, applied at a finer grain. It's not a stretch to assume that the same pattern will land inside Claude Code itself, with a smaller model driving the loop and Opus consulted only on the moments that genuinely call for it.
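The executor/advisor split has a simple shape, sketched below. This is a hypothetical illustration of the pattern, not Anthropic's actual API: call_model is a placeholder, the model names are invented, and the consultation cadence is arbitrary.

```python
# Hypothetical executor/advisor sketch: a cheap model drives the loop,
# and a frontier model is consulted only at planning checkpoints.
# call_model is a placeholder, not a real SDK call.

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real API call via an SDK or HTTP client.
    return f"[{model}] response to: {prompt[:40]}"

def run_task(task: str, steps: int = 4, advise_every: int = 3) -> list[str]:
    # One frontier call to set the plan, then cheap execution.
    plan = call_model("advisor-frontier", f"Outline a plan for: {task}")
    transcript = []
    for step in range(1, steps + 1):
        if step % advise_every == 0:
            # Only the planning moments pay frontier prices.
            plan = call_model("advisor-frontier", f"Revise the plan: {plan}")
        # Routine generation goes to the cheap executor.
        transcript.append(call_model("executor-small", f"Step {step} of: {task}"))
    return transcript

for line in run_task("migrate the billing module"):
    print(line)
```

In this sketch a four-step task pays for two frontier calls and four cheap ones; the routine tokens, which dominate, never touch the frontier rate.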
The academic groundwork for this has been in place for years. FrugalGPT demonstrated that well-designed cascades can deliver frontier-comparable quality at an order-of-magnitude lower inference cost. RouteLLM showed that preference-trained routers can match top-tier quality on the majority of traffic while reserving the frontier models for the hard minority. The shape of the argument is simple - most prompts don't need the smartest model in the world. If you can cheaply tell the easy ones apart from the hard ones, you save a lot of money without anyone noticing. The harness becomes the dispatcher. The model becomes the worker.
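A FrugalGPT-style cascade makes the same point mechanically: try the cheapest tier first and escalate only when a verifier isn't convinced. The tiers, prices and confidence scores below are illustrative stand-ins, not measurements.

```python
# FrugalGPT-style cascade sketch: try cheapest-first and accept the
# first answer whose confidence clears a threshold. Tiers, prices and
# the confidence scorer are illustrative stand-ins.

TIERS = [("tiny", 0.2), ("mid", 2.0), ("frontier", 15.0)]  # (model, $ per call)

def answer_with_confidence(model: str, prompt: str) -> tuple[str, float]:
    # Stand-in: a real cascade would call the model and score the answer
    # with a small verifier; here larger tiers simply report more confidence.
    conf = {"tiny": 0.55, "mid": 0.80, "frontier": 0.99}[model]
    return f"{model} answer", conf

def cascade(prompt: str, threshold: float = 0.7) -> tuple[str, float]:
    spent = 0.0
    for model, price in TIERS:
        answer, conf = answer_with_confidence(model, prompt)
        spent += price
        if conf >= threshold:
            break  # good enough; stop escalating
    return answer, spent

print(cascade("fix this off-by-one"))  # mid tier answers; frontier is never paid for
```

The escalation path wastes a little money on failed cheap attempts, but as long as most traffic stops at the bottom tiers the blended cost stays far below all-frontier pricing.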
Small models are reshaping the economics of agentic AI
Routing only works if the bottom tiers are genuinely useful, and the last twelve months have marked a turning point. Alibaba's Qwen3-Coder-Next landed in February with 3B active parameters over an 80B hybrid-MoE base. It reportedly scores over 70% on SWE-Bench Verified, comparable to models with ten to twenty times its active parameter count, and is comfortably runnable on a developer laptop. Microsoft shipped Phi-4-reasoning-vision-15B in March. Google followed in April with Gemma 4, a family that ranges from on-device 2B and 4B variants up to a 31B dense model sitting third on the Arena open-model leaderboard at the time of writing.
NVIDIA's research team put the intellectual argument together last June: agent work is overwhelmingly repetitive, narrow and non-conversational, which makes small language models the correct default for it rather than a fallback. Twelve months on, that claim has gone from speculative to obvious. The bottom tier of a routed harness no longer has to live in someone else's data centre at all.
The next step in agentic product engineering: your model, on your code
Once an agentic AI system is comfortable routing, the next phase follows naturally. The most valuable model on the local tier is not a generalist. It is a model that knows your codebase, your internal APIs, your review patterns and your deployment conventions better than any public model ever will.
Three signals say this is tractable today, not a ten-year research bet:
Open weights have caught up to the frontier. Z.ai's MIT-licensed GLM-5.1 topped SWE-Bench Pro this month, the first credible signal that "open weight" and "frontier-class" are no longer mutually exclusive. The foundation you fine-tune on your own code can now compete with anything the frontier labs will sell you access to.
Customer-specific training is productising. Poolside on AWS Bedrock packages the full training pipeline inside the enterprise perimeter, so fine-tuning on proprietary code stops being a one-off research project and becomes a managed service.
The proprietary-corpus playbook is already proven. BloombergGPT remains the canonical example that training on a private corpus can preserve general competence - the exact property you need the moment you try it on your own code.
Put the three together and the direction of travel is clear. Fine-tune open weights on your code today. Distil your own agent-sized models tomorrow. End state: a harness that routes the boring 80% of internal work to a model you own and reserves the frontier call for the genuinely hard 20%.
The honest counter-argument and the longer horizon
The sceptic's answer to all of this is "wait." Quality-adjusted inference prices are falling dramatically year on year, with some capability tiers dropping by multiple orders of magnitude in a single year. If the curve is that steep, the argument goes, any sophisticated routing infrastructure you build now will be over-engineered by the time it's finished. Just keep using the best model and ride the curve down.
We don't find this persuasive for two reasons:
Firstly, the ratio between tiers persists even as every tier cheapens. A frontier call will probably always be ten or more times the price of a small model call, so an organisation that routes compounds its savings regardless of where the absolute price sits.
Secondly, today's curve sits on top of today's hardware. Neuromorphic processors like IBM's NorthPole and Intel's Loihi 2, photonic inference chips and novel memory architectures all point at further order-of-magnitude efficiency gains in the next wave, and any one of them could reshape the economics again. But those deployments are still years from production at enterprise scale. Between now and then, organisations that route compound the savings. Organisations that don't pay for the wait.
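The first of those reasons reduces to back-of-envelope arithmetic: the saving from routing depends only on the price ratio between tiers and the fraction of traffic routed down, not on where the absolute prices sit. A minimal sketch, with the 80/10 figures chosen purely for illustration:

```python
# If a frontier call costs r times a small-model call, routing a
# fraction f of traffic to the small tier cuts spend by f * (1 - 1/r),
# independent of the absolute price level.

def savings_fraction(f: float, r: float) -> float:
    routed = (1 - f) * 1.0 + f * (1 / r)  # spend relative to all-frontier
    return 1 - routed

# 80% of traffic on a model 10x cheaper saves ~72%, at any price point.
print(round(savings_fraction(0.8, 10.0), 2))  # 0.72
```

Halve every price on the curve and that 72% saving is unchanged, which is why riding the curve down and routing are complements, not substitutes.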
What we're telling our customers
The harnesses that matter in 2027 are unlikely to be distinguished by the underlying model's context windows. They're more likely to be distinguished by how cleverly they route and, over time, by whether at least one of those routes leads to a model the organisation trained itself. That future isn't evenly distributed yet.
Most harnesses in production still send every request to the same frontier model, and most of the customers we talk to are only beginning to ask the routing question. But the pieces are all on the board - the vendor pricing signals, the research, the small-model wave, the early distillation stories. The organisations that start thinking about it before the first awkward invoice will be better placed than those that have to retrofit it under pressure.
For a more detailed view of how organisations are approaching agentic engineering in practice - from adoption to cost control - see our 2026 Gen AI Playbook.