Overview
Testing is the bedrock upon which we build software. Without tests, you cannot know that your code works, and you cannot change it with any confidence.
But how do you test a system that gives different answers each time? How do you write assertions for outputs that vary? How do you ensure regression testing works when the AI model itself evolves? Traditional testing approaches struggle with the non-deterministic nature of LLM-powered applications.
This intensive two-day workshop brings a testing mindset to LLM applications through practical approaches that hold up in production. You'll learn to validate AI-powered features using evaluation frameworks, automated testing strategies and observability patterns that verify applications behave correctly both during development and in production.
Through hands-on exercises with real AI systems, you'll develop techniques for testing everything from simple text generation to complex multi-modal interactions like voice assistants. We'll cover the full testing pyramid for AI applications: from unit tests that mock LLM behaviour, through integration tests against real APIs, to production monitoring that catches issues before users do.
By the end of this workshop, you'll have practical frameworks and working code examples for testing AI applications confidently, knowing they continue to behave correctly as models, prompts and data evolve.
Outline
Foundations of AI application testing
- Why traditional testing approaches struggle with LLMs
- Understanding non-determinism and its implications
- The testing pyramid for AI-powered applications
- When to test behaviour vs exact outputs
- Building confidence without perfect reproducibility
Unit testing LLM integrations
- Mocking LLM APIs for fast, reliable tests (see the example below)
- Testing prompt templates and construction logic
- Validating input sanitisation and output parsing
- Testing error handling and fallback behaviour
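As a taste of the exercises, here's a minimal sketch in Python with pytest (the workshop supports other languages too). The `summarise_ticket` function and its `client.complete` call are hypothetical stand-ins for your own integration code; the tests cover prompt construction, output parsing and error handling without touching a real API.

```python
# A minimal sketch, assuming a hypothetical `summarise_ticket` function that
# wraps an injected LLM client. Mocking the client keeps the tests fast and deterministic.
from unittest.mock import Mock

import pytest


def summarise_ticket(client, ticket_text: str) -> str:
    """Builds the prompt, calls the LLM and parses the reply."""
    if not ticket_text.strip():
        raise ValueError("ticket_text must not be empty")
    prompt = f"Summarise this support ticket in one sentence:\n{ticket_text}"
    reply = client.complete(prompt)  # the only non-deterministic call
    return reply.strip()


def test_prompt_includes_the_ticket_text():
    client = Mock()
    client.complete.return_value = "  Customer cannot log in.  "
    summary = summarise_ticket(client, "User reports login failures since Monday.")
    sent_prompt = client.complete.call_args.args[0]
    assert "login failures" in sent_prompt        # prompt construction
    assert summary == "Customer cannot log in."   # output parsing


def test_empty_input_is_rejected_before_calling_the_api():
    client = Mock()
    with pytest.raises(ValueError):
        summarise_ticket(client, "   ")
    client.complete.assert_not_called()           # error handling, no API cost
```

Because the client is injected and mocked, these tests run in milliseconds and never fail because of model non-determinism.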
Evaluation frameworks and metrics
- Introduction to LLM evaluation frameworks
- Defining success criteria for AI outputs
- Automated evaluation metrics: relevance, coherence, factuality
- Building custom evaluators for domain-specific requirements
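As one illustration of a custom evaluator, here's a minimal sketch that scores an answer by how many required facts it mentions and applies a pass threshold. The facts, threshold and result shape are illustrative rather than taken from any particular framework.

```python
# Minimal sketch of a domain-specific evaluator: keyword coverage with a
# pass/fail threshold. All values here are illustrative.
from dataclasses import dataclass


@dataclass
class EvalResult:
    score: float        # 0.0 to 1.0
    passed: bool
    missing: list[str]  # required facts the answer failed to mention


def keyword_coverage_eval(answer: str, required_facts: list[str],
                          threshold: float = 0.8) -> EvalResult:
    answer_lower = answer.lower()
    missing = [fact for fact in required_facts if fact.lower() not in answer_lower]
    score = 1.0 - len(missing) / len(required_facts)
    return EvalResult(score=score, passed=score >= threshold, missing=missing)


if __name__ == "__main__":
    result = keyword_coverage_eval(
        "Refunds are processed within 14 days via the original payment method.",
        required_facts=["14 days", "original payment method"],
    )
    print(result)  # EvalResult(score=1.0, passed=True, missing=[])
```

In practice you'd combine deterministic checks like this with model-graded evaluators for qualities such as coherence and tone.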
Integration testing with real LLMs
- Testing against actual LLM APIs (see the example below)
- Managing API costs during testing
- Snapshot testing for LLM outputs
- Testing conversation flows and multi-turn interactions
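The sketch below shows one pattern for tests that call a real model: make them opt-in via an environment variable to keep API costs out of the default suite, and assert on behavioural properties rather than exact strings. The `RUN_LLM_TESTS` flag and the `myapp.release` import are hypothetical placeholders for your own conventions and code.

```python
import os

import pytest

# Run these only when explicitly requested, so the default suite stays free.
requires_llm = pytest.mark.skipif(
    not os.environ.get("RUN_LLM_TESTS"),
    reason="set RUN_LLM_TESTS=1 to run tests that call the real API",
)


@requires_llm
def test_release_notes_mention_every_ticket_id():
    from myapp.release import generate_release_notes  # hypothetical import

    tickets = ["JIRA-101", "JIRA-102"]
    notes = generate_release_notes(tickets)
    # Behavioural assertions: properties that should hold across runs,
    # rather than an exact string that non-determinism would break.
    assert all(ticket in notes for ticket in tickets)
    assert len(notes.split()) < 300
```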
Testing RAG systems
- Evaluating retrieval quality and relevance (see the example below)
- Testing embedding generation and vector search
- Validating context injection and prompt construction
- End-to-end RAG system testing
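For retrieval quality, a small labelled query set plus a metric such as recall@k goes a long way. In the sketch below the retrieved IDs are hard-coded; in your own tests they'd come from the retriever under test.

```python
# Recall@k over a labelled query set; the retrieved IDs are hard-coded here and
# would come from your retriever (e.g. a vector search) in a real test.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


labelled_queries = {
    "how do I reset my password": {"kb-17", "kb-42"},
    "what is the refund window": {"kb-08"},
}

retrieved = {  # stand-in for retriever.search(query, k=5)
    "how do I reset my password": ["kb-17", "kb-03", "kb-42", "kb-99", "kb-11"],
    "what is the refund window": ["kb-23", "kb-08", "kb-40", "kb-02", "kb-56"],
}

scores = [recall_at_k(retrieved[q], relevant) for q, relevant in labelled_queries.items()]
print(f"mean recall@5: {sum(scores) / len(scores):.2f}")  # 1.00 for this toy data
```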
Testing tool calling and agentic systems
- Validating tool selection and parameter extraction (see the example below)
- Testing multi-step reasoning and planning
- Evaluating agent decision-making
- Testing error recovery behaviour
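Here's a minimal sketch of checking tool selection and parameter extraction. The `plan_tool_call` function is a hypothetical stand-in for whatever layer asks the model for a tool call and parses it; it returns a canned response here so the assertions are the focus.

```python
import json


def plan_tool_call(user_message: str) -> tuple[str, dict]:
    """Hypothetical wrapper: asks the model which tool to call and parses it.
    The canned reply below stands in for the model's response."""
    canned = {
        "name": "get_weather",
        "arguments": json.dumps({"city": "Berlin", "unit": "celsius"}),
    }
    return canned["name"], json.loads(canned["arguments"])


def test_weather_question_selects_weather_tool_with_valid_arguments():
    name, args = plan_tool_call("What's the weather like in Berlin today?")
    assert name == "get_weather"                     # tool selection
    assert set(args) == {"city", "unit"}             # no missing or extra params
    assert args["city"] == "Berlin"                  # parameter extraction
    assert args["unit"] in {"celsius", "fahrenheit"}
```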
Testing complex AI interactions
- Automated testing strategies for voice assistants
- Testing multi-modal AI systems
- Validating conversation context and memory
- Testing personalisation and user-specific behaviour
Regression testing for evolving AI
- Detecting breaking changes in AI behaviour
- Testing across model versions and updates
- Prompt evolution and backwards compatibility
- Building regression test suites for AI features
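One simple way to build such a suite is to parametrise behavioural checks over the model versions you support, so an upgrade that changes behaviour fails loudly in CI. The model names and the `myapp.qa` import below are hypothetical placeholders.

```python
import pytest

MODEL_VERSIONS = ["vendor-model-2024-06", "vendor-model-2025-01"]  # placeholders

REGRESSION_CASES = [
    ("What is our refund window?", "14 days"),
    ("Which plans include SSO?", "Enterprise"),
]


@pytest.mark.parametrize("model", MODEL_VERSIONS)
@pytest.mark.parametrize("question,expected_fact", REGRESSION_CASES)
def test_known_answers_survive_model_upgrades(model, question, expected_fact):
    from myapp.qa import answer_question  # hypothetical import

    answer = answer_question(question, model=model)
    # Assert on stable facts, not exact wording, so harmless rephrasing passes.
    assert expected_fact.lower() in answer.lower()
```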
Production observability and monitoring
- What to monitor in production AI systems
- Logging strategies for LLM applications (see the example below)
- Building evaluation pipelines for production data
- Real-time quality monitoring and alerting
- User feedback integration
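A structured, queryable log of every LLM call underpins most of the above. The sketch below emits one JSON line per call; the field names are illustrative rather than a standard schema.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("llm")


def log_llm_call(prompt: str, response: str, model: str, latency_ms: float,
                 prompt_tokens: int, completion_tokens: int) -> None:
    """Emit one JSON line per LLM call for later querying and evaluation."""
    logger.info(json.dumps({
        "event": "llm_call",
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "latency_ms": round(latency_ms, 1),
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        # Log full text only where your data-retention policy allows it.
        "prompt": prompt,
        "response": response,
    }))


log_llm_call("Summarise ticket #123 ...", "Customer cannot log in.",
             model="example-model", latency_ms=842.3,
             prompt_tokens=412, completion_tokens=18)
```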
Continuous evaluation in production
- Running evals against production logs and data
- Detecting degradation and drift (see the example below)
- A/B testing AI features and prompts
- Synthetic monitoring and canary testing
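Drift detection can start very simply: compare the latest evaluation pass rate from production data against a trailing baseline and alert on a meaningful drop. The numbers and tolerance below are illustrative.

```python
def detect_drift(weekly_pass_rates: list[float], tolerance: float = 0.05) -> bool:
    """True if the latest week dropped more than `tolerance` below the
    average of the preceding weeks."""
    *history, latest = weekly_pass_rates
    baseline = sum(history) / len(history)
    return (baseline - latest) > tolerance


if __name__ == "__main__":
    rates = [0.92, 0.91, 0.93, 0.84]  # weekly pass rates from your eval pipeline
    print(detect_drift(rates))         # True: baseline 0.92 vs latest 0.84
```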
CI/CD integration for AI features
- Automating LLM tests in CI pipelines
- Managing test data and fixtures
- Quality gates for AI deployments (see the example below)
- Pre-deployment validation strategies
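A quality gate can be as small as a script that CI runs after the eval suite, failing the pipeline when aggregate scores fall below agreed thresholds. The results-file format and thresholds below are assumptions made for the sake of the sketch.

```python
import json
import sys

THRESHOLDS = {"relevance": 0.85, "factuality": 0.90}  # illustrative gates


def main(results_path: str) -> int:
    with open(results_path) as f:
        scores = json.load(f)  # e.g. {"relevance": 0.88, "factuality": 0.86}
    failures = {metric: score for metric, score in scores.items()
                if metric in THRESHOLDS and score < THRESHOLDS[metric]}
    for metric, score in failures.items():
        print(f"FAIL {metric}: {score:.2f} < {THRESHOLDS[metric]:.2f}")
    return 1 if failures else 0  # non-zero exit code fails the CI job


if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```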
Performance and load testing
- Load testing LLM-powered features
- Measuring and optimising latency
- Testing under rate limits and quotas
- Graceful degradation testing
Building test infrastructure
- Test data generation for AI applications
- Synthetic data and scenario creation (see the example below)
- Fixture management for LLM tests
- Test environment configuration
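Synthetic test data doesn't have to be elaborate: templated paraphrases of seed utterances widen coverage cheaply, and a fixed random seed keeps the fixture reproducible. The intents and templates below are illustrative; in practice you might also use an LLM to generate paraphrases.

```python
import itertools
import random

SEED_INTENTS = {  # illustrative intents and seed phrasings
    "reset_password": ["reset my password", "recover my account password"],
    "refund_status": ["where is my refund", "check my refund status"],
}

TEMPLATES = [
    "How do I {phrase}?",
    "I need to {phrase}, can you help?",
    "{phrase} please",
]


def generate_cases(n_per_intent: int = 3, seed: int = 0) -> list[tuple[str, str]]:
    rng = random.Random(seed)  # fixed seed keeps the fixture reproducible
    cases = []
    for intent, phrases in SEED_INTENTS.items():
        combos = list(itertools.product(TEMPLATES, phrases))
        for template, phrase in rng.sample(combos, k=min(n_per_intent, len(combos))):
            cases.append((template.format(phrase=phrase), intent))
    return cases


for utterance, expected_intent in generate_cases():
    print(expected_intent, "->", utterance)
```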
Quality assurance workflows
- Manual testing approaches for AI features
- Exploratory testing techniques for LLMs
- User acceptance testing for AI systems
- Red teaming and adversarial testing
- Building feedback loops from testing to development
Requirements
This intermediate two-day course is designed for software engineers, QA engineers and DevOps professionals working with AI-powered applications. Participants should have at least 6 months of experience in software testing and be comfortable with basic testing concepts.
Familiarity with LLM applications is beneficial: ideally, participants will have taken our Building GenAI Applications course or have equivalent experience building or working with AI features. We also expect an understanding of CI/CD concepts and experience with testing frameworks in any language.
Participants must bring laptops with development environments configured for their preferred language (Python, TypeScript or Java). We'll provide instructions for setting up evaluation frameworks and testing tools prior to the course.
Some exercises will require API access to LLM providers. We can provide limited API credits for the course, or participants can use their own accounts.
Bringing examples of AI features from your own applications significantly enhances the practical value of this course, as you'll be able to apply testing techniques to real systems you're working with.