Overview
Testing is the bedrock upon which we build software. Without tests, you cannot know that your code works, and you cannot change it with any confidence.
But how do you test a system that gives different answers each time? How do you write assertions for outputs that vary? How do you ensure regression testing works when the AI model itself evolves? Traditional testing approaches struggle with the non-deterministic nature of LLM-powered applications.
This intensive two-day workshop brings a testing mindset to LLM applications through practical approaches that hold up in production. You'll learn to validate AI-powered features using evaluation frameworks, automated testing strategies and observability patterns that verify applications behave correctly both during development and in production.
Through hands-on exercises with real AI systems, you'll develop techniques for testing everything from simple text generation to complex multi-modal interactions like voice assistants. We'll cover the full testing pyramid for AI applications: from unit tests that mock LLM behaviour, through integration tests against real APIs, to production monitoring that catches issues before users do.
By the end of this workshop, you'll have practical frameworks and working code examples for testing AI applications confidently, knowing they continue to behave correctly as models, prompts and data evolve.
Outline
Foundations of AI application testing
- Why traditional testing approaches struggle with LLMs
- Understanding non-determinism and its implications
- The testing pyramid for AI-powered applications
- When to test behaviour vs exact outputs
- Building confidence without perfect reproducibility
Unit testing LLM integrations
- Mocking LLM APIs for fast, reliable tests (see the example below)
- Testing prompt templates and construction logic
- Validating input sanitisation and output parsing
- Testing error handling and fallback behaviour
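As a taste of the exercises, here's a minimal sketch in Python with pytest (the workshop supports other languages too). The `summarise_ticket` function and its `client.complete` call are hypothetical stand-ins for your own integration code; the tests cover prompt construction, output parsing and error handling without touching a real API.

```python
# A minimal sketch, assuming a hypothetical `summarise_ticket` function that
# wraps an injected LLM client. Mocking the client keeps the tests fast and deterministic.
from unittest.mock import Mock

import pytest


def summarise_ticket(client, ticket_text: str) -> str:
    """Builds the prompt, calls the LLM and parses the reply."""
    if not ticket_text.strip():
        raise ValueError("ticket_text must not be empty")
    prompt = f"Summarise this support ticket in one sentence:\n{ticket_text}"
    reply = client.complete(prompt)  # the only non-deterministic call
    return reply.strip()


def test_prompt_includes_the_ticket_text():
    client = Mock()
    client.complete.return_value = "  Customer cannot log in.  "
    summary = summarise_ticket(client, "User reports login failures since Monday.")
    sent_prompt = client.complete.call_args.args[0]
    assert "login failures" in sent_prompt        # prompt construction
    assert summary == "Customer cannot log in."   # output parsing


def test_empty_input_is_rejected_before_calling_the_api():
    client = Mock()
    with pytest.raises(ValueError):
        summarise_ticket(client, "   ")
    client.complete.assert_not_called()           # error handling, no API cost
```

Because the client is injected and mocked, these tests run in milliseconds and never fail because of model non-determinism.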
Evaluation frameworks and metrics
- Introduction to LLM evaluation frameworks
- Defining success criteria for AI outputs
- Automated evaluation metrics: relevance, coherence, factuality
- Building custom evaluators for domain-specific requirements
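As one illustration of a custom evaluator, here's a minimal sketch that scores an answer by how many required facts it mentions and applies a pass threshold. The facts, threshold and result shape are illustrative rather than taken from any particular framework.

```python
# Minimal sketch of a domain-specific evaluator: keyword coverage with a
# pass/fail threshold. All values here are illustrative.
from dataclasses import dataclass


@dataclass
class EvalResult:
    score: float        # 0.0 to 1.0
    passed: bool
    missing: list[str]  # required facts the answer failed to mention


def keyword_coverage_eval(answer: str, required_facts: list[str],
                          threshold: float = 0.8) -> EvalResult:
    answer_lower = answer.lower()
    missing = [fact for fact in required_facts if fact.lower() not in answer_lower]
    score = 1.0 - len(missing) / len(required_facts)
    return EvalResult(score=score, passed=score >= threshold, missing=missing)


if __name__ == "__main__":
    result = keyword_coverage_eval(
        "Refunds are processed within 14 days via the original payment method.",
        required_facts=["14 days", "original payment method"],
    )
    print(result)  # EvalResult(score=1.0, passed=True, missing=[])
```

In practice you'd combine deterministic checks like this with model-graded evaluators for qualities such as coherence and tone.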
Integration testing with real LLMs
- Testing against actual LLM APIs (see the example below)
- Managing API costs during testing
- Snapshot testing for LLM outputs
- Testing conversation flows and multi-turn interactions
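The sketch below shows one pattern for tests that call a real model: make them opt-in via an environment variable to keep API costs out of the default suite, and assert on behavioural properties rather than exact strings. The `RUN_LLM_TESTS` flag and the `myapp.release` import are hypothetical placeholders for your own conventions and code.

```python
import os

import pytest

# Run these only when explicitly requested, so the default suite stays free.
requires_llm = pytest.mark.skipif(
    not os.environ.get("RUN_LLM_TESTS"),
    reason="set RUN_LLM_TESTS=1 to run tests that call the real API",
)


@requires_llm
def test_release_notes_mention_every_ticket_id():
    from myapp.release import generate_release_notes  # hypothetical import

    tickets = ["JIRA-101", "JIRA-102"]
    notes = generate_release_notes(tickets)
    # Behavioural assertions: properties that should hold across runs,
    # rather than an exact string that non-determinism would break.
    assert all(ticket in notes for ticket in tickets)
    assert len(notes.split()) < 300
```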
Testing RAG systems
- Evaluating retrieval quality and relevance (see the example below)
- Testing embedding generation and vector search
- Validating context injection and prompt construction
- End-to-end RAG system testing
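For retrieval quality, a small labelled query set plus a metric such as recall@k goes a long way. In the sketch below the retrieved IDs are hard-coded; in your own tests they'd come from the retriever under test.

```python
# Recall@k over a labelled query set; the retrieved IDs are hard-coded here and
# would come from your retriever (e.g. a vector search) in a real test.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


labelled_queries = {
    "how do I reset my password": {"kb-17", "kb-42"},
    "what is the refund window": {"kb-08"},
}

retrieved = {  # stand-in for retriever.search(query, k=5)
    "how do I reset my password": ["kb-17", "kb-03", "kb-42", "kb-99", "kb-11"],
    "what is the refund window": ["kb-23", "kb-08", "kb-40", "kb-02", "kb-56"],
}

scores = [recall_at_k(retrieved[q], relevant) for q, relevant in labelled_queries.items()]
print(f"mean recall@5: {sum(scores) / len(scores):.2f}")  # 1.00 for this toy data
```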
Testing tool calling and agentic systems
- Validating tool selection and parameter extraction (see the example below)
- Testing multi-step reasoning and planning
- Evaluating agent decision-making
- Testing error recovery behaviour
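Here's a minimal sketch of checking tool selection and parameter extraction. The `plan_tool_call` function is a hypothetical stand-in for whatever layer asks the model for a tool call and parses it; it returns a canned response here so the assertions are the focus.

```python
import json


def plan_tool_call(user_message: str) -> tuple[str, dict]:
    """Hypothetical wrapper: asks the model which tool to call and parses it.
    The canned reply below stands in for the model's response."""
    canned = {
        "name": "get_weather",
        "arguments": json.dumps({"city": "Berlin", "unit": "celsius"}),
    }
    return canned["name"], json.loads(canned["arguments"])


def test_weather_question_selects_weather_tool_with_valid_arguments():
    name, args = plan_tool_call("What's the weather like in Berlin today?")
    assert name == "get_weather"                     # tool selection
    assert set(args) == {"city", "unit"}             # no missing or extra params
    assert args["city"] == "Berlin"                  # parameter extraction
    assert args["unit"] in {"celsius", "fahrenheit"}
```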
Testing complex AI interactions
- Automated testing strategies for voice assistants
- Testing multi-modal AI systems
- Validating conversation context and memory
- Testing personalisation and user-specific behaviour
Regression testing for evolving AI
- Detecting breaking changes in AI behaviour
- Testing across model versions and updates
- Prompt evolution and backwards compatibility
- Building regression test suites for AI features
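One simple way to build such a suite is to parametrise behavioural checks over the model versions you support, so an upgrade that changes behaviour fails loudly in CI. The model names and the `myapp.qa` import below are hypothetical placeholders.

```python
import pytest

MODEL_VERSIONS = ["vendor-model-2024-06", "vendor-model-2025-01"]  # placeholders

REGRESSION_CASES = [
    ("What is our refund window?", "14 days"),
    ("Which plans include SSO?", "Enterprise"),
]


@pytest.mark.parametrize("model", MODEL_VERSIONS)
@pytest.mark.parametrize("question,expected_fact", REGRESSION_CASES)
def test_known_answers_survive_model_upgrades(model, question, expected_fact):
    from myapp.qa import answer_question  # hypothetical import

    answer = answer_question(question, model=model)
    # Assert on stable facts, not exact wording, so harmless rephrasing passes.
    assert expected_fact.lower() in answer.lower()
```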
Production observability and monitoring
- What to monitor in production AI systems
- Logging strategies for LLM applications (see the example below)
- Building evaluation pipelines for production data
- Real-time quality monitoring and alerting
- User feedback integration
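A structured, queryable log of every LLM call underpins most of the above. The sketch below emits one JSON line per call; the field names are illustrative rather than a standard schema.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("llm")


def log_llm_call(prompt: str, response: str, model: str, latency_ms: float,
                 prompt_tokens: int, completion_tokens: int) -> None:
    """Emit one JSON line per LLM call for later querying and evaluation."""
    logger.info(json.dumps({
        "event": "llm_call",
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "latency_ms": round(latency_ms, 1),
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        # Log full text only where your data-retention policy allows it.
        "prompt": prompt,
        "response": response,
    }))


log_llm_call("Summarise ticket #123 ...", "Customer cannot log in.",
             model="example-model", latency_ms=842.3,
             prompt_tokens=412, completion_tokens=18)
```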
Continuous evaluation in production
- Running evals against production logs and data
- Detecting degradation and drift (see the example below)
- A/B testing AI features and prompts
- Synthetic monitoring and canary testing
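Drift detection can start very simply: compare the latest evaluation pass rate from production data against a trailing baseline and alert on a meaningful drop. The numbers and tolerance below are illustrative.

```python
def detect_drift(weekly_pass_rates: list[float], tolerance: float = 0.05) -> bool:
    """True if the latest week dropped more than `tolerance` below the
    average of the preceding weeks."""
    *history, latest = weekly_pass_rates
    baseline = sum(history) / len(history)
    return (baseline - latest) > tolerance


if __name__ == "__main__":
    rates = [0.92, 0.91, 0.93, 0.84]  # weekly pass rates from your eval pipeline
    print(detect_drift(rates))         # True: baseline 0.92 vs latest 0.84
```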
CI/CD integration for AI features
- Automating LLM tests in CI pipelines
- Managing test data and fixtures
- Quality gates for AI deployments (see the example below)
- Pre-deployment validation strategies
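A quality gate can be as small as a script that CI runs after the eval suite, failing the pipeline when aggregate scores fall below agreed thresholds. The results-file format and thresholds below are assumptions made for the sake of the sketch.

```python
import json
import sys

THRESHOLDS = {"relevance": 0.85, "factuality": 0.90}  # illustrative gates


def main(results_path: str) -> int:
    with open(results_path) as f:
        scores = json.load(f)  # e.g. {"relevance": 0.88, "factuality": 0.86}
    failures = {metric: score for metric, score in scores.items()
                if metric in THRESHOLDS and score < THRESHOLDS[metric]}
    for metric, score in failures.items():
        print(f"FAIL {metric}: {score:.2f} < {THRESHOLDS[metric]:.2f}")
    return 1 if failures else 0  # non-zero exit code fails the CI job


if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```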
Performance and load testing
- Load testing LLM-powered features
- Measuring and optimising latency
- Testing under rate limits and quotas
- Graceful degradation testing
Building test infrastructure
- Test data generation for AI applications
- Synthetic data and scenario creation (see the example below)
- Fixture management for LLM tests
- Test environment configuration
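Synthetic test data doesn't have to be elaborate: templated paraphrases of seed utterances widen coverage cheaply, and a fixed random seed keeps the fixture reproducible. The intents and templates below are illustrative; in practice you might also use an LLM to generate paraphrases.

```python
import itertools
import random

SEED_INTENTS = {  # illustrative intents and seed phrasings
    "reset_password": ["reset my password", "recover my account password"],
    "refund_status": ["where is my refund", "check my refund status"],
}

TEMPLATES = [
    "How do I {phrase}?",
    "I need to {phrase}, can you help?",
    "{phrase} please",
]


def generate_cases(n_per_intent: int = 3, seed: int = 0) -> list[tuple[str, str]]:
    rng = random.Random(seed)  # fixed seed keeps the fixture reproducible
    cases = []
    for intent, phrases in SEED_INTENTS.items():
        combos = list(itertools.product(TEMPLATES, phrases))
        for template, phrase in rng.sample(combos, k=min(n_per_intent, len(combos))):
            cases.append((template.format(phrase=phrase), intent))
    return cases


for utterance, expected_intent in generate_cases():
    print(expected_intent, "->", utterance)
```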
Quality assurance workflows
- Manual testing approaches for AI features
- Exploratory testing techniques for LLMs
- User acceptance testing for AI systems
- Red teaming and adversarial testing
- Building feedback loops from testing to development
Requirements
This intermediate two-day course is designed for software engineers, QA engineers and DevOps professionals working with AI-powered applications. Participants should have at least 6 months of experience in software testing and be comfortable with basic testing concepts.
Familiarity with LLM applications is beneficial: ideally, participants will have taken our Building GenAI Applications course or have equivalent experience building or working with AI features. We also expect an understanding of CI/CD concepts and experience with testing frameworks in any language.
Participants must bring laptops with development environments configured for their preferred language (Python, TypeScript or Java). We'll provide instructions for setting up evaluation frameworks and testing tools prior to the course.
Some exercises will require API access to LLM providers. We can provide limited API credits for the course, or participants can use their own accounts.
Bringing examples of AI features from your own applications significantly enhances the practical value of this course, as you'll be able to apply testing techniques to real systems you're working with.