When the algorithm gets it wrong - why testing AI is not optional
29 July 2025
Today, we place huge trust in the systems and algorithms that guide our daily lives. But what happens when that trust is misplaced - when a system behaves confidently, yet gets it dangerously wrong?
This isn't a hypothetical. It’s a growing reality as ML and AI move from novelty to critical infrastructure, and as we become more reliant on the solutions they offer.
As engineers, we therefore need to ask harder questions - not just about what works, but what fails, and how silently it might be doing so.

In pump we trust
A few weeks ago, I finally caved. After 15 years of avoiding the tech, I started on a closed-loop insulin pump. This thing's pretty clever: an algorithm monitors my blood sugar, learns my patterns and adjusts insulin doses automatically. Almost like a pancreas. Neat.
First 48 hours? Dreamy.
By hour 72? Nightmare.
Blood sugar crept up for no obvious reason. The pump claimed it had delivered hefty doses. Yet nothing happened. I flipped back to old-school manual injections, and one dose brought it under control almost instantly. Twelve hours later? Same again. The diabetes nurses calmly reassured me: "This happens sometimes. You need to trust the algorithm. It'll settle."
Spoiler: it happened again a week later.
Parallel lines
Now, I'm an engineer. My instinct isn't blind trust… it's systems thinking. Was it hardware? Cannula? Placement? Insulin? All ruled out. What remained was software: learning, but sometimes learning the wrong thing. And yet, apparently, "these blips happen".
The parallel to our world of software engineering is uncomfortable. Because we do this too.
The promise of machine learning often sounds like an excuse to reduce the burden of rigorous testing. After all, it'll improve itself, right? But self-correcting doesn't mean self-validating. In my case, the software thought it was doing a great job while very nearly landing me in DKA (diabetic ketoacidosis).
Testing isn't optional… even for ML
In traditional software, bad behaviour tends to be obvious: a test fails, an exception is thrown, something breaks loudly. In ML systems, the model might silently learn a distorted reality from edge cases, incomplete data or rare patterns it wasn't trained on. And in critical systems (medical, financial, legal), that's terrifying.
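To make that concrete, here's a minimal sketch of the kind of check that catches silent degradation: comparing a model's headline accuracy against its accuracy on a rare slice of the data. The predict function, the cases and the thresholds are all illustrative stand-ins, not a prescription.

# A minimal sketch: the headline metric can look healthy while a rare slice
# of the data is badly mis-modelled. The predict function, the cases and the
# thresholds are all illustrative stand-ins.

def accuracy(predict, cases):
    """Fraction of (features, expected_label) pairs the model gets right."""
    correct = sum(1 for features, expected in cases if predict(features) == expected)
    return correct / len(cases)

def check_rare_slice(predict, overall_cases, rare_slice_cases,
                     min_overall=0.90, max_gap=0.05):
    overall = accuracy(predict, overall_cases)
    rare = accuracy(predict, rare_slice_cases)
    assert overall >= min_overall, f"overall accuracy too low: {overall:.2f}"
    # Fail loudly if the rare slice lags the headline score by more than the
    # agreed margin, instead of letting the average hide it.
    assert rare >= overall - max_gap, (
        f"rare slice degraded: {rare:.2f} vs overall {overall:.2f}"
    )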
It's easy to pretend this is a "healthcare tech" problem. But we've seen it closer to home. Think of the Post Office Horizon scandal - faulty software, combined with unchallenged assumptions, ruined hundreds of lives.
As Cathy O'Neil demonstrates in Weapons of Math Destruction, algorithmic systems can perpetuate and amplify bias precisely because we assume they're objective. The Google research on "Hidden Technical Debt in Machine Learning Systems" shows how ML complexity creates testing blind spots we barely recognise.
Of course, some like James Coplien argue that we can over-engineer our safety nets. But as an industry, can we honestly say we test rigorously enough to hold software that accountable when lives hang in the balance?
The death of "the tester's job"
The old model was easy: developers write, testers break. But modern teams - especially those doing CI/CD, DevOps or AI - need everyone owning quality. TDD (Test Driven Development) gives us guardrails while writing code. BDD (Behaviour Driven Development) helps capture intent early, with Gherkin scenarios (run by tools like Cucumber) that both humans and machines can validate.
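To show what "capturing intent" can look like in practice, here's a BDD-flavoured scenario written as a plain pytest test, with the Given/When/Then steps as comments. The DosingController class is a hypothetical stand-in I've invented for illustration; a real team might express the same scenario in Gherkin and run it with Cucumber or behave.

# Illustrative only: a BDD-style scenario as a plain pytest test.
# DosingController is a hypothetical stand-in, not real pump software.

class DosingController:
    def __init__(self, max_minutes_without_reading=30):
        self.max_minutes_without_reading = max_minutes_without_reading
        self.suspended = False

    def on_reading_gap(self, minutes_since_last_reading):
        # Behaviour under test: stale sensor data must suspend automatic dosing.
        if minutes_since_last_reading > self.max_minutes_without_reading:
            self.suspended = True

def test_dosing_is_suspended_when_sensor_data_goes_stale():
    # Given a controller that tolerates at most 30 minutes without a reading
    controller = DosingController(max_minutes_without_reading=30)
    # When 45 minutes pass with no sensor reading
    controller.on_reading_gap(minutes_since_last_reading=45)
    # Then automatic dosing must be suspended rather than guessed
    assert controller.suspended is True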
The problem is, not enough teams really adopt them properly. We half-do TDD ("I wrote a unit test… somewhere."). We treat BDD as paperwork. And we still quietly hope QA will catch it if we miss it.
Steve Freeman and Nat Pryce's Growing Object-Oriented Software, Guided by Tests remains the gold standard for why this matters. But Dan North reminds us that behaviour-driven development is really about shared understanding, not just test automation. The famous "TDD is Dead" debate between DHH, Kent Beck and Martin Fowler shows there's no consensus on how far to take it.
The industry has made quality a team sport, but we haven't fully trained the team.
Testing AI itself: A different layer of risk
And then there's the layer we're increasingly adding on top: AI coding assistants.
Yes, Copilot and others can write tests. But as Feng et al. show, AI-generated tests can create a feedback loop of false confidence - tests that check code written by the same AI may simply confirm that AI's own assumptions.
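A tiny illustration of that loop, with invented function names: the first test merely mirrors whatever the implementation happens to return, so it passes even though the code is wrong; the second encodes the requirement independently and catches the bug.

# Illustrative only: a deliberately buggy implementation and two tests.
def apply_discount(price, percent):
    # Bug: divides by 10 instead of 100.
    return price - (price * percent / 10)

def test_mirrors_the_implementation():
    # Proves nothing - the "expected" value comes from the code under test.
    assert apply_discount(100, 20) == apply_discount(100, 20)

def test_encodes_the_requirement():
    # Fails, exposing the bug: 20% off 100 must be 80.
    assert apply_discount(100, 20) == 80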
As Stack Overflow suggests in their analysis of how we use AI coding tools, if we don't fully understand the code or the tests it spits out, we're automating both our productivity and our blind spots. The paper "On the Dangers of Stochastic Parrots" by Bender and colleagues warns about deploying AI systems in critical contexts without understanding their fundamental limitations.
Yet Elite Brain's research suggests a more optimistic view: that AI tools can augment human judgement rather than replace it. The question is whether we're disciplined enough to use them that way.
Code generation isn't a substitute for engineering judgement. It's an accelerant - for our understanding or our misunderstanding. The tests we write reflect our understanding. Weak tests = weak understanding.
Drawing the line: When testing becomes theatre
The uncomfortable truth: there's a point where more testing becomes security theatre… ritualistic activity that feels productive but adds little real protection. Testing every possible combination of user inputs might catch edge cases, but it won't catch the systemic thinking errors that create real failures.
The key is distinguishing between comprehensive coverage and comprehensible risk. A payment system needs exhaustive boundary testing around currency calculations. A content management system probably doesn't need a test for a file whose name is 10,000 characters long, but it should definitely test what happens when the database goes down mid-save.
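For the payment example, here's a sketch of what those boundary tests might look like, using Python's decimal module. The rounding rule (half up, to the nearest penny) is an assumption for illustration, not anyone's real policy.

# A minimal sketch of boundary tests around currency rounding, using Decimal
# to avoid binary floating-point surprises. The half-up rounding rule is an
# assumed policy for illustration.

from decimal import Decimal, ROUND_HALF_UP

def to_pennies(amount):
    """Round a monetary amount (given as a string) to two decimal places, half up."""
    return Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

def test_rounding_boundaries():
    assert to_pennies("10.005") == Decimal("10.01")   # exact midpoint rounds up
    assert to_pennies("10.004") == Decimal("10.00")   # just below the midpoint
    assert to_pennies("0.00") == Decimal("0.00")      # zero is still a valid amount
    assert to_pennies("-0.005") == Decimal("-0.01")   # refunds round away from zero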
Michael Nygard's Release It! offers a masterclass in thinking about failure modes that actually matter. Meanwhile, Diane Vaughan's work on "Normalization of Deviance" shows how organisations gradually accept increasing risk by rationalising away warning signs. The Zero Defect Software movement pushes back, arguing for maximum testing rigour, but even they acknowledge resource constraints.
Smart testing focuses on impact, not coverage metrics. Ask: what failure modes would actually matter to users? What assumptions, if wrong, would cause the most damage? Sometimes the most important test is the one that challenges the fundamental premise of what you're building.
Reasonable boundaries, relentless curiosity
Of course, we can't test everything. The combinatorial explosion of scenarios would consume infinite time. But that's where good engineers still matter:
We think about what could go wrong.
We question the edge cases.
We model the failure modes.
We simulate chaos (yes, even in non-distributed systems - see the sketch at the end of this section).
We don't assume "happy path is fine" means "ship it".
We also question the design itself: should this system be fully autonomous? Should there always be manual override? Should the user even trust it?
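As promised above, here's a sketch of what simulating failure can look like outside of distributed systems: a test double that dies mid-save, so we can assert the surrounding code refuses to report success. Every name here is invented for illustration.

# Illustrative fault injection in a non-distributed system: a fake repository
# that fails mid-save, so the test can assert the service fails safely.

import pytest

class FlakyRepository:
    """Test double that mimics the database going away mid-save."""
    def save(self, record):
        raise ConnectionError("database went away mid-save")

class DraftService:
    def __init__(self, repository):
        self.repository = repository

    def save_draft(self, record):
        # The behaviour we care about: never swallow the failure and claim success.
        self.repository.save(record)
        return "saved"

def test_mid_save_failure_is_not_reported_as_success():
    service = DraftService(FlakyRepository())
    with pytest.raises(ConnectionError):
        service.save_draft({"title": "draft"})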
The hard questions we avoid
These challenges aren't abstract. They demand answers from every team shipping software:
Do we really test for failure as much as success?
Are our acceptance criteria actually written from the user's risk perspective - or just from feature sign-off?
Are we documenting what "the system must never do"?
Should ML/AI systems have stricter test regimes than deterministic code?
Are we creating sufficient observability for when (not if) things go wrong?
The answers matter more than we'd like to admit. Because increasingly, our software isn't just moving pixels… it's making decisions that affect real lives.
And maybe most importantly: would we trust our software with someone's health, money or freedom? Because increasingly, that's where our industry is headed.
Until next time: test boldly, question assumptions and don't blindly trust the algorithm.

Andrew Paul
Software Engineering Trainer