The AI pricing honeymoon is over. How to tame your next token bill.

1 June 2026

The era of unlimited developer AI is ending. GitHub’s move to usage-based billing is only the latest sign that token costs are becoming impossible to ignore. The good news is that most teams already have the tools they need to control spend.


The era of unlimited developer AI is ending. GitHub made it official this week, moving every Copilot plan to usage-based billing, but the direction of travel was clear long before that. As agents do more per turn, someone has to pay for the tokens they burn. Intelligent routing across a variety of models is ultimately where we are headed, but we're not quite there yet. So what can you do about token spend today, with the tools already on your desk?

For now, the damage is contained by the human in the loop. An engineer who prompts and approves each agent action is the rate limiter, so the meter only moves as fast as they do. That brake disappears the moment work shifts to agent teams and swarms, where dozens of agents run in parallel with nobody watching the meter.

Manage the context, not just the prompt

Context, not the prompt, is the real cost driver and the lever most teams misjudge. Every turn re-sends the entire conversation history, which sounds ruinous until you account for prompt caching. Providers cache a stable prefix and bill the re-read at a fraction of the normal rate (a tenth of the input price on Anthropic), and modern coding agents set the cache breakpoints automatically (Claude Code does it for you), so you get the saving by default.
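To make the arithmetic concrete, here is a minimal sketch of what re-sending history costs with and without caching. The function name, the price per million tokens and the cache-write multiplier are illustrative assumptions, and the model ignores output tokens for simplicity; only the one-tenth read rate comes from the figures above.

```python
def conversation_input_cost(turn_tokens, price_per_mtok, cached=True,
                            cache_read_mult=0.1, cache_write_mult=1.25):
    """Total input cost of a conversation where each turn re-sends all history.

    turn_tokens: new input tokens added on each turn.
    price_per_mtok: hypothetical base input price per million tokens.
    cache_write_mult is an assumption; check your provider's price list.
    """
    total = 0.0
    history = 0  # tokens already sent on earlier turns
    for new_tokens in turn_tokens:
        if cached:
            # Earlier history is read from the cache at a discount;
            # only the new tokens are written to it at a premium.
            billable = history * cache_read_mult + new_tokens * cache_write_mult
        else:
            # Without caching, the whole history is billed at full price.
            billable = history + new_tokens
        total += billable * price_per_mtok / 1_000_000
        history += new_tokens
    return total


turns = [10_000] * 20  # twenty turns, each adding 10k tokens of context
uncached = conversation_input_cost(turns, price_per_mtok=3.0, cached=False)
cached = conversation_input_cost(turns, price_per_mtok=3.0, cached=True)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

Under these assumed prices the twenty-turn session costs about $6.30 in input tokens uncached and about $1.32 cached, and the gap widens as the session grows because the uncached cost rises with the square of the turn count.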

What breaks it is churn. Switching models mid-task or editing earlier turns invalidates the cached prefix and forces a full-price rebuild, while letting the window fill with dead ends inflates what every later turn has to re-read. The hygiene follows from that: clear the context when you start a new task, compact when you need continuity, push file-by-file exploration into sub-agents that report back a summary, and keep always-on instructions like CLAUDE.md lean, because anything resident in the window is taxed on every single message.

Use what your harness already gives you

Most harnesses already include cost levers you may not be using. On Claude, the Advisor tool brings routing inside a single task, letting a cheaper model such as Haiku or Sonnet do the bulk of the work while Opus is consulted only at the strategic moments.

For teams that can use subscriptions over raw API access, Claude's Team plan swaps per-token metering for a fixed monthly cost, though as the Copilot change shows, no subscription's pricing is guaranteed to survive tomorrow's compute bill. And on Copilot itself, auto model selection is a win: it matches each request to an appropriate model, follows cache boundaries and provides a 10% discount on token costs in auto mode.

Put a gateway between your team and the bill

In enterprise settings where subscriptions aren't an option and teams call the model APIs directly, the missing piece is a budget you can actually enforce. An AI gateway such as LiteLLM sits in front of every provider and tracks spend per key, user and team with hard caps and soft alerts, turning "unlimited developer AI spend" into a number someone owns.
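The core idea of an enforceable budget fits in a few lines. The sketch below is not LiteLLM's API; the class and method names are hypothetical, and a real gateway would also estimate a request's cost before forwarding it rather than only recording it afterwards.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a request would push an owner past its hard cap."""


@dataclass
class Budget:
    hard_cap_usd: float
    soft_alert_usd: float
    spent_usd: float = 0.0


class SpendTracker:
    """Hypothetical per-owner spend ledger with a hard cap and a soft alert."""

    def __init__(self):
        self.budgets = {}
        self.alerts = []  # owners that have crossed their soft threshold

    def set_budget(self, owner, hard_cap_usd, soft_alert_usd):
        self.budgets[owner] = Budget(hard_cap_usd, soft_alert_usd)

    def record(self, owner, cost_usd):
        budget = self.budgets[owner]
        # Hard cap: refuse the spend instead of reporting it after the fact.
        if budget.spent_usd + cost_usd > budget.hard_cap_usd:
            raise BudgetExceeded(f"{owner} would exceed ${budget.hard_cap_usd:.2f}")
        budget.spent_usd += cost_usd
        # Soft alert: flag the owner once, but let the request through.
        if budget.spent_usd >= budget.soft_alert_usd and owner not in self.alerts:
            self.alerts.append(owner)
        return budget.spent_usd


tracker = SpendTracker()
tracker.set_budget("team-payments", hard_cap_usd=100.0, soft_alert_usd=80.0)
tracker.record("team-payments", 50.0)  # under both thresholds
tracker.record("team-payments", 35.0)  # crosses the soft alert at $85
```

The owner key can be an API key, a user or a team; what matters is that every request is attributed to exactly one budget that someone is accountable for.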

Because it also puts frontier and open-weight models behind a single endpoint, it doubles as the practical on-ramp to routing, where choosing between cheap and capable models per task becomes a configuration change rather than an architecture project.
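As a rough illustration of routing as configuration, the sketch below maps task types to model aliases in a plain dictionary. The task labels and alias names are hypothetical placeholders, not real model identifiers or any gateway's configuration format.

```python
# Hypothetical routing table: editing this dictionary is the whole "deployment".
ROUTES = {
    "boilerplate": "small-open-weight",
    "refactor": "mid-tier",
    "architecture": "frontier",
}
DEFAULT_MODEL = "mid-tier"


def pick_model(task_type, routes=ROUTES, default=DEFAULT_MODEL):
    """Return the model alias for a task type, falling back to the default."""
    return routes.get(task_type, default)


print(pick_model("boilerplate"))   # small-open-weight
print(pick_model("unknown-task"))  # mid-tier
```

Because callers only ever see the alias, swapping the model behind "mid-tier" or adding a new task type does not touch application code.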

One thing to verify rather than assume is caching: support and cache lifetime vary by model and region on Bedrock and Vertex, so watch your cache-read counts to confirm you aren't paying full price on every turn. These providers also default to the cheaper five-minute cache, which expires over any real break, so if your engineers step away mid-session, opting into the one-hour cache (the ENABLE_PROMPT_CACHING_1H setting in Claude Code) keeps the prefix warm rather than rebuilding it on every return.
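One way to watch cache-read counts is to compute the share of input tokens served from cache across a session. The sketch below assumes usage records shaped like the usage object in Anthropic's Messages API (input_tokens, cache_creation_input_tokens, cache_read_input_tokens); Bedrock and Vertex may expose these counts under different names, so treat the keys as an assumption to adapt. The function name is hypothetical.

```python
def cache_read_share(usages):
    """Fraction of all input tokens that were served from the prompt cache.

    usages: iterable of dicts, one per response, with the assumed keys
    input_tokens, cache_creation_input_tokens and cache_read_input_tokens.
    Missing or null counts are treated as zero.
    """
    read = written = fresh = 0
    for usage in usages:
        read += usage.get("cache_read_input_tokens") or 0
        written += usage.get("cache_creation_input_tokens") or 0
        fresh += usage.get("input_tokens") or 0
    total = read + written + fresh
    return read / total if total else 0.0


session = [
    # First turn: the prefix is written to the cache, nothing is read yet.
    {"input_tokens": 100, "cache_creation_input_tokens": 900, "cache_read_input_tokens": 0},
    # Second turn: the same prefix is read back at the discounted rate.
    {"input_tokens": 100, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 900},
]
print(f"{cache_read_share(session):.0%} of input tokens came from cache")
```

A share that stays near zero over a long session suggests caching is not active for that model or region, or that the prefix is being invalidated between turns.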

The point

None of this is the future we wrote about. The future is harnesses that route intelligently across owned, open and frontier models without anyone thinking about it. The pieces are arriving faster than expected. DigitalOcean has just wired its Inference Router into OpenCode, bringing per-request routing to an open-source coding agent rather than locking it inside one vendor's tool. 

But that future isn't here just yet. Until routing is native everywhere, the win is hygiene: manage your context, pick the cheapest model that can do the job and put a guardrail on what gets spent. All three are cheap and available today, and the organisations that build the habit now will meet the routing era already disciplined rather than scrambling to retrofit it under pressure.

If you are staring at a developer AI bill that is climbing faster than you can explain to a board, that is precisely the conversation we are having with CTOs across financial services right now. We are happy to have it with you too.

Chris van Es

Head of AI
