Testing AI-based Systems is Difficult
Test strategies must be built according to each system's specific needs. Sometimes a helping hand can go a long way.
AI Test Guide
(Guidelines on Testing Machine Learning Systems)
ML (machine learning)-based AI systems are typically complex (e.g. deep neural nets), often based on big data, poorly specified and non-deterministic, which creates many new challenges and opportunities for testing them.
This AI Test Guide covers testing of the AI model, the input data, and the development framework. For each of these, it focuses on the test types that most effectively find the related defect types, and it suggests related test completion criteria.
Contact us if you and your organization are seeking to establish a customized AI Test Strategy (Guidelines on Testing Machine Learning Systems) tailored to your specific needs.
How this Guide is Organized
Introduction to AI and ML
Introduction to Machine Learning
Introduction to Machine Learning Testing
3.1 Introduction to Testing ML-Based Systems
3.2 Risk-Based Testing
3.3 ML Test Levels
3.4 ML Test Environments
4 Input Data Testing
4.2 Defect Types
4.3 Test Types
4.4 Test Completion Criteria
5.2 Defect Types
5.3 Test Types
5.4 Test Completion Criteria
Development Framework Testing
6.2 Defect Types
6.3 Test Types
6.4 Test Completion Criteria
Annex – Example of the Testing of an ML System
Annex – Introduction to Neural Networks
Annex – Characteristics of ML Systems
Functional Characteristics / Non-Functional Characteristics
Annex – Example ML Systems
Annex – Machine Learning Performance Metrics
Confusion Matrix / Accuracy / Precision / Recall / F1-Score / Aggregate Metrics / Other Supervised Classification Metrics / Supervised Regression Metrics / Unsupervised Clustering Metrics / Limitations of ML Functional Performance Metrics / Selection of Performance Metrics
Annex – Benchmarks for Various ML Domains
Annex – Documentation of an MLS
Typical Documentation Content / Example ML Model Documentation / Available Documentation Schemes
Annex – ML System Testing Checklists
This guide is focused on individuals with an interest in, or a need to perform, the testing of ML-Based systems, especially those working in areas such as autonomous systems, big data, retail, finance, engineering and IT services. This includes people in roles such as system testers, test analysts, test engineers, test consultants, test managers, user acceptance testers, business analysts and systems developers.
Failures and the Importance of Testing for Machine Learning Systems
There have already been a number of widely publicized failures of ML. According to a 2019 IDC Survey, “Most organizations reported some failures among their AI projects with a quarter of them reporting up to 50% failure rate; lack of skilled staff and unrealistic expectations were identified as the top reasons for failure.” [IDC 2019].
Example ML failures include:
IBM’s “Watson for Oncology” cancelled after $62 million spent due to “unsafe treatment” recommendations [IEEE 2019]
Microsoft’s AI Chatbot, Tay, was corrupted by Twitter trolls [Forbes 2016]
Joshua Brown died in a Tesla Model S on a bright day, when his car failed to spot a white 18-wheel truck/trailer [Reuters 2017]
Elaine Herzberg was killed crossing the street at 10pm with her bicycle in Arizona by an Uber self-driving car travelling at 38 mph [DF 2019]
Google searches showing high-paying jobs only to male users [WP 2015]
COMPAS AI-Based sentencing system in the US biased against African Americans [NS 2018]
Anti-Jaywalking system in Ningbo, China recognized a photo of a billionaire on a bus as a jaywalker [BBC 2018]
Failures have historically provided one of the most convincing drivers for performing adequate software testing. Industry surveys show a perception that ML is an important trend for software testing:
AI was rated the number one new technology that will be important to the testing world in the next 3 to 5 years. [SoTR 2019]
AI was rated second (by 49.9% of respondents) of all technologies that will be important to the software testing industry in the following 5 years [ISTQB 2018]
The most popular trends in software testing were AI, CI/CD, and Security (equal first). [LogiGear 2018]
Testing is already being performed on ML-based systems:
19% of respondents are already testing AI / Machine Learning [SoTR 2019]
57% of companies are experimenting with new testing approaches [WQR 2019]
The preferred term (or terms) for a given concept are written in bold type. Alternative, less preferred synonyms are written below the preferred terms in regular type, where applicable. Where a definition of a term applies to a specific domain, the domain precedes the definition (e.g. <machine learning>).
statistical testing approach that allows testers to determine which of two systems or components performs better
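As a rough illustration of the statistical comparison described above (commonly called A/B testing), the following sketch applies a two-proportion z-test to the success counts of two system variants. The function name, the choice of a z-test and the counts are assumptions made for this example only; other statistical tests may suit a given system better.

```python
import math

def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for the difference between two success rates."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Pooled success rate under the null hypothesis that A and B perform equally.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # All outcomes identical in both variants: no evidence of a difference.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal CDF, written with erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Illustrative counts only: successful outcomes out of 1000 trials per variant.
z, p_value = two_proportion_z_test(120, 1000, 160, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests that the observed difference between the two variants is unlikely to be due to chance alone; the significance level to apply is a project decision.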
<neural network> output of an activation function of a node in a neural network
<neural network> the formula associated with a node in a neural network that determines the output of the node (activation value) from the inputs to the neuron
deliberate use of adversarial examples to cause an ML model to fail
Note 1: Typically targets ML models in the form of a neural network.
input to an ML model created by applying small perturbations to a working example that results in the model outputting an incorrect result with high confidence
Note 1: Typically applies to ML models in the form of a neural network.
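The idea of a small perturbation that flips a model's output can be sketched with a toy linear classifier. This is a simplified assumption for illustration: real adversarial examples usually target neural networks, and this sketch does not model prediction confidence. The weights, input values and function names are hypothetical.

```python
def score(weights, bias, features):
    """Linear decision score: positive class if the score is above zero."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def predict(weights, bias, features):
    return "positive" if score(weights, bias, features) > 0 else "negative"

def sign(value):
    return (value > 0) - (value < 0)

def perturb(weights, features, epsilon, current_label):
    """Shift each feature by at most epsilon in the direction that
    pushes the score away from the currently predicted class."""
    direction = -1.0 if current_label == "positive" else 1.0
    return [x + direction * epsilon * sign(w) for w, x in zip(weights, features)]

# Hypothetical model and working example.
weights, bias = [1.0, -2.0], 0.0
original = [0.5, 0.125]
original_label = predict(weights, bias, original)

# A perturbation of 0.125 per feature is enough to change the prediction here.
adversarial = perturb(weights, original, 0.125, original_label)
adversarial_label = predict(weights, bias, adversarial)
print(original_label, "->", adversarial_label)
```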
testing approach based on the attempted creation and execution of adversarial examples to identify defects in an ML model
Note 1: Typically applied to ML models in the form of a neural network.
situation when a previously labelled AI system is no longer considered to be AI as technology advances
artificial intelligence (AI)
capability of a system to perform tasks that are generally associated with intelligent beings
[ISO/IEC 2382 – removed second option on AI as a discipline]
application specific integrated circuit
<machine learning> algorithm used to create an ML model from the training data
EXAMPLE: ML algorithms are used to generate models for Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, kNN, K-Means and Random Forest.
automated exploratory testing
form of exploratory testing supported by tools
system capable of working without human intervention for sustained periods
ability of a system to work for sustained periods without human intervention
approach to testing whereby an alternative version of the system is used as a pseudo-oracle to generate expected results for comparison from the same test inputs
EXAMPLE: The pseudo-oracle may be a system that already exists, a system developed by an independent team or a system implemented using a different programming language.
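A minimal sketch of this approach, assuming a hypothetical test item (`insertion_sort`) and using Python's built-in `sorted` as the independently implemented pseudo-oracle. Both are fed the same test inputs and any differences in output are reported.

```python
import random

def insertion_sort(values):
    """Hypothetical test item: returns a new list sorted in ascending order."""
    result = []
    for value in values:
        position = len(result)
        while position > 0 and result[position - 1] > value:
            position -= 1
        result.insert(position, value)
    return result

def back_to_back(test_item, pseudo_oracle, test_inputs):
    """Run both versions on the same inputs and collect any mismatches."""
    mismatches = []
    for test_input in test_inputs:
        actual = test_item(list(test_input))
        expected = pseudo_oracle(list(test_input))
        if actual != expected:
            mismatches.append((test_input, actual, expected))
    return mismatches

# Randomly generated test inputs (seeded so that runs are repeatable).
rng = random.Random(0)
test_inputs = [
    [rng.randint(-50, 50) for _ in range(rng.randint(0, 8))]
    for _ in range(100)
]
mismatches = back_to_back(insertion_sort, sorted, test_inputs)
print(f"{len(mismatches)} mismatches found")
```

A mismatch does not show which version is wrong, only that the two disagree and the case needs investigation.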
<neural network> method used in artificial neural networks to determine the weights to be used on the network connections based on the computed error at the output of the network
Note 1: It is used to train deep neural networks.
collection of benchmarks, where a benchmark is a set of tests used to compare the performance of alternatives
<machine learning> measure of the distance between the predicted value provided by the machine learning (ML) model and a desired fair prediction
<machine learning> machine learning function that predicts the output class for a given input
<machine learning> ML model used for classification
grouping of a set of objects such that objects in the same group (i.e. a cluster) are more similar to each other than to those in other clusters
black-box test design technique in which test cases are designed to execute specific combinations of values of several parameters
EXAMPLE: Pairwise testing, all combinations testing, each choice testing, base choice testing.
table used to describe the performance of a classifier on a set of test data for which the true and false values are known
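For a binary classifier, the four cells of such a table can be counted directly from the actual and predicted labels. The sketch below assumes hypothetical "spam" and "ham" labels and a helper name chosen for this example.

```python
from collections import Counter

def confusion_matrix(actual, predicted, positive_label):
    """Count true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for actual_label, predicted_label in zip(actual, predicted):
        if predicted_label == positive_label:
            counts["TP" if actual_label == positive_label else "FP"] += 1
        else:
            counts["FN" if actual_label == positive_label else "TN"] += 1
    return {cell: counts[cell] for cell in ("TP", "FP", "FN", "TN")}

# Illustrative test data with known true labels.
actual = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "spam", "spam", "ham"]
matrix = confusion_matrix(actual, predicted, positive_label="spam")
print(matrix)
```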
<machine learning> part of the ML workflow that transforms raw data into a state ready for use by the ML algorithm to create the ML model
Note 1: Pre-processing can include analysis, normalization, filtering, reformatting, imputation, removal of outliers and duplicates, and ensuring the completeness of the data set.
<machine learning> supervised-learning model for which inference can be represented by traversing one or more tree-like structures
approach to creating rich hierarchical representations through the training of neural networks with one or more hidden layers
Note 1: Deep learning uses multi-layered networks of simple computing units (or “neurons”). In these neural networks each unit combines a set of input values to produce an output value, which in turn is passed on to other neurons downstream.
deep neural net
neural network with more than two layers
system which, given a particular set of inputs and starting state, will always produce the same set of outputs and final state
<machine learning> changes to ML model behaviour that occur over time
Note 1: These changes typically make predictions less accurate and may require the model to be re-trained with new data.
<AI> level of understanding how the AI-Based system came up with a given result
experience-based testing in which the tester spontaneously designs and executes tests based on the tester's existing relevant knowledge, prior exploration of the test item (including the results of previous tests), and heuristic "rules of thumb" regarding common software behaviours and types of failure
Note 1: Exploratory testing hunts for hidden properties (including hidden behaviours) that, while quite possibly benign by themselves, could interfere with other properties of the software under test, and so constitute a risk that the software will fail.
<machine learning> performance metric used to evaluate a classifier, which provides a balance (the harmonic average) between recall and precision
incorrect reporting of a failure when in reality it is a pass
Note 1: This is also known as a Type II error.
EXAMPLE: The referee awards an offside when it was a goal and so reports a failure to score a goal when a goal was scored.
incorrect reporting of a pass when in reality it is a failure
Note 1: This is also known as a Type I error.
EXAMPLE: The referee awards a goal that was offside and so should not have been awarded.
<machine learning> activity in which those attributes in the raw data that best represent the underlying relationships that should appear in the model are identified for use in the training data
synonym: feature selection
<neural network> process of a neural network accepting an input and using the activation functions to pass a succession of values through the network layers to generate a predicted output
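The process described above can be sketched for a tiny fully connected network. The layer sizes, weights, biases and the use of ReLU as the activation function are assumptions chosen so that the arithmetic is easy to follow by hand.

```python
def relu(value):
    """Example activation function: passes positive values, clamps the rest to zero."""
    return max(0.0, value)

def forward(layers, inputs, activation=relu):
    """Pass the inputs through each layer in turn and return the final outputs.

    Each layer is a (weights, biases) pair, with one weight list per neuron.
    """
    values = list(inputs)
    for weights, biases in layers:
        values = [
            activation(sum(w * v for w, v in zip(neuron_weights, values)) + bias)
            for neuron_weights, bias in zip(weights, biases)
        ]
    return values

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[2.0, 1.0]], [0.5]),
]
output = forward(layers, [1.0, 2.0])
print(output)  # [2.0]
```

Here the hidden neurons produce activation values 0.0 and 1.5, and the output neuron then computes 2.0 * 0.0 + 1.0 * 1.5 + 0.5 = 2.0.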
software testing approach in which high volumes of random (or near random) data, called fuzz, are used to generate inputs to the test item
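A minimal sketch of this approach, assuming a hypothetical test item (`parse_percentage`) with a seeded defect. Random strings are generated as fuzz, documented rejections (`ValueError`) are ignored, and any other exception is recorded as a crash. The alphabet, input lengths and trial count are arbitrary choices for this example.

```python
import random

def parse_percentage(text):
    """Hypothetical test item: parse '42' or '42%' into an integer."""
    if text[-1] == "%":  # Seeded defect: raises IndexError for an empty string.
        text = text[:-1]
    return int(text)

def fuzz(test_item, trials, seed=0):
    """Feed random strings to the test item and record unexpected exceptions."""
    rng = random.Random(seed)
    alphabet = "0123456789-%"
    crashes = []
    for _ in range(trials):
        length = rng.randint(0, 4)
        fuzz_input = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            test_item(fuzz_input)
        except ValueError:
            pass  # Malformed input rejected as documented: not a failure.
        except Exception as exc:
            crashes.append((fuzz_input, type(exc).__name__))
    return crashes

crashes = fuzz(parse_percentage, trials=500)
print(f"{len(crashes)} crashing inputs, distinct: {sorted(set(crashes))}")
```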
graphics processing unit (GPU)
application-specific integrated circuit (ASIC) specialized for display functions such as rendering images
Note 1: GPUs are designed for parallel data processing of images with a single function, but this parallel processing is also useful for executing AI-Based software, such as neural networks.
<neural network> variables used to define the structure of a neural network and how it is trained
Note 1: Typically, hyperparameters are set by the developer of the model and may also be referred to as a tuning parameter.
<AI> level of understanding how the underlying (AI) technology works
process using computational techniques to enable systems to learn from data or experience
describes how a change in the test inputs from the source test case to the follow-up test case affects a change (or not) in the expected outputs from the source test case to the follow-up test case
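As an illustration, the sketch below checks one such relation for a hypothetical test item (`average`): if every value in the source test case is shifted by a constant, the expected output of the follow-up test case shifts by the same constant. No exact expected result is needed for either test case. The function names and input values are assumptions for this example.

```python
import math

def average(values):
    """Hypothetical test item: arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def check_shift_relation(test_item, source_input, shift):
    """Relation: adding `shift` to every input adds `shift` to the output."""
    source_output = test_item(source_input)
    follow_up_input = [value + shift for value in source_input]
    follow_up_output = test_item(follow_up_input)
    return math.isclose(
        follow_up_output, source_output + shift, rel_tol=1e-9, abs_tol=1e-9
    )

source_input = [3, 8, 10, 15]
relation_holds = check_shift_relation(average, source_input, shift=5)
print("relation holds:", relation_holds)

# A defective implementation violates the relation and is detected.
defective_average = lambda values: max(values) / 2
defect_detected = not check_shift_relation(defective_average, source_input, shift=5)
print("defect detected:", defect_detected)
```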
testing where the expected results are not based on the specification but are instead extrapolated from previous actual results
<machine learning> output of a machine learning algorithm trained with a training data set that generates predictions using patterns in the input data
performance metric used to evaluate a classifier, which measures the proportion of classification predictions that were correct
artificial neural network
network of primitive processing elements connected by weighted links with adjustable weights, in which each element produces a value by applying a nonlinear function to its input values, and transmits it to other elements or presents it as an output value
Note 1: Whereas some neural networks are intended to simulate the functioning of neurons in the nervous system, most neural networks are used in artificial intelligence as realizations of the connectionist model.
Note 2: Examples of nonlinear functions are a threshold function, a sigmoid function, and a polynomial function.
proportion of activated neurons divided by the total number of neurons in the neural network (normally expressed as a percentage) for a set of tests
Note 1: A neuron is considered to be activated if its activation value exceeds zero.
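A minimal sketch of how this measure could be computed, assuming the activation values of every neuron have already been recorded for each test (one list per test, indexed by neuron). How the values are captured from a real network depends on the framework and is not shown.

```python
def neuron_coverage(activations_per_test):
    """Percentage of neurons with an activation value above zero in at least one test."""
    total_neurons = len(activations_per_test[0])
    activated = set()
    for activations in activations_per_test:
        for index, value in enumerate(activations):
            if value > 0:
                activated.add(index)
    return 100.0 * len(activated) / total_neurons

# Illustrative recordings: two tests on a network with four neurons.
recorded = [
    [0.0, 0.7, 0.0, 0.2],
    [0.3, 0.0, 0.0, 0.9],
]
coverage = neuron_coverage(recorded)
print(f"{coverage:.1f}%")  # 75.0%
```

In this example neuron 2 is never activated, so three of the four neurons are covered.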
system which, given a particular set of inputs and starting state, will NOT always produce the same set of outputs and final state
<machine learning> generation of an ML model that corresponds too closely to the training data, resulting in a model that finds it difficult to generalize to new data
black-box test design technique in which test cases are designed to execute all possible discrete combinations of each pair of input parameters
Note 1: Pairwise testing is the most popular form of combinatorial testing.
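The sketch below checks whether a set of test cases covers every discrete combination of each pair of parameters. The parameters, their values and the helper name are hypothetical. It verifies a given test set rather than generating one.

```python
from itertools import combinations, product

def uncovered_pairs(parameters, test_cases):
    """Return every pair of parameter values not exercised by any test case."""
    missing = []
    for first, second in combinations(list(parameters), 2):
        covered = {(case[first], case[second]) for case in test_cases}
        for pair in product(parameters[first], parameters[second]):
            if pair not in covered:
                missing.append((first, second, pair))
    return missing

# Hypothetical parameters of a test scenario, each with two values.
parameters = {
    "lighting": ["day", "night"],
    "weather": ["dry", "rain"],
    "road": ["urban", "highway"],
}
# Four test cases instead of the eight needed for all combinations.
test_cases = [
    {"lighting": "day", "weather": "dry", "road": "urban"},
    {"lighting": "day", "weather": "rain", "road": "highway"},
    {"lighting": "night", "weather": "dry", "road": "highway"},
    {"lighting": "night", "weather": "rain", "road": "urban"},
]
missing = uncovered_pairs(parameters, test_cases)
print("uncovered pairs:", missing)  # []
```

With three two-valued parameters, four well-chosen test cases cover all twelve value pairs, whereas testing all combinations would need eight.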
<machine learning> parts of the model that are learnt from applying the training data to the algorithm
EXAMPLE: Learnt weights in a neural net.
Note 1: Typically, parameters are not set by the developer of the model.
parameterized test scenario
test scenario defined with one or more attributes that can be changed within given constraints
<machine learning> metrics used to evaluate ML models that are used for classification
EXAMPLE: Typical metrics include accuracy, precision, recall and F1-Score.
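The four metrics named in the example can be computed from the counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). The sketch below uses illustrative counts, and returning 0.0 when a denominator is zero is a convention assumed here, not something the guide prescribes.

```python
def safe_divide(numerator, denominator):
    """Assumed convention: report 0.0 when the metric is undefined."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score for a binary classifier."""
    accuracy = safe_divide(tp + tn, tp + fp + fn + tn)
    precision = safe_divide(tp, tp + fp)  # Proportion of predicted positives that were correct.
    recall = safe_divide(tp, tp + fn)     # Proportion of actual positives predicted correctly.
    # F1-score: harmonic average of precision and recall.
    f1 = safe_divide(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```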
<machine learning> performance metric used to evaluate a classifier, which measures the proportion of predicted positives that were correct
<machine learning> machine learning function that results in a predicted target value for a given input
EXAMPLE: Includes classification and regression functions.
probabilistic software engineering
software engineering concerned with the solution of fuzzy and probabilistic problems
system whose behaviour is described in terms of probabilities, such that its outputs cannot be perfectly predicted
derived test oracle
independently derived variant of the test item used to generate results, which are compared with the results of the original test item based on the same test inputs
Note 1: Pseudo-oracles are a useful alternative when traditional test oracles are not available.
<AI> form of AI that generates conclusions from available information using logical techniques, such as deduction and induction
<machine learning> performance metric used to evaluate a classifier, which measures the proportion of actual positives that were predicted correctly
<machine learning> machine learning function that results in a numerical or continuous output value for a given input
standard promulgated by a regulatory agency
<machine learning> task of building an ML model using a process of trial and reward to achieve an objective
Note 1: A reinforcement learning task can include the training of a machine learning model in a way similar to supervised learning plus training on unlabelled inputs gathered during the operation phase of the AI system. Each time the model makes a prediction, a reward is calculated, and further trials are run to optimize the reward.
Note 2: In reinforcement learning, the objective, or definition of success, can be defined by the system designer.
Note 3: In reinforcement learning, the reward can be a calculated number that represents how close the AI system is to achieving the objective for a given trial.
activity performed by an agent to maximise its reward function to the detriment of meeting the original objective
programmed actuated mechanism with a degree of autonomy, moving within its environment, to perform intended tasks
Note 1: A robot includes the control system and interface of the control system.
Note 2: The classification of robot into industrial robot or service robot is done according to its intended application.
expectation that a system does not, under defined conditions, lead to a state in which human life, health, property, or the environment is endangered
Safety of the Intended Functionality (SOTIF)
ISO/PAS 21448: Safety of the Intended Functionality
<AI> algorithm that systematically visits a subset of all possible states (or structures) until the goal state (or structure) is reached
search based software engineering
software engineering that applies search techniques, such as genetic algorithms and simulated annealing to solve problems
adaptive system that changes its behaviour based on learning from the practice of trial and error
sign change coverage
proportion of neurons activated with both positive and negative activation values divided by the total number of neurons in the neural network (normally expressed as a percentage) for a set of tests
Note 1: An activation value of zero is considered to be a negative activation value.
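A minimal sketch of this measure, assuming activation values have been recorded for every neuron in each test (one list per test, indexed by neuron). Following the note above, a value of zero is counted as negative. The helper name and recordings are illustrative.

```python
def sign_change_coverage(activations_per_test):
    """Percentage of neurons seen with both a positive and a negative
    (or zero) activation value across the set of tests."""
    total_neurons = len(activations_per_test[0])
    seen_positive = [False] * total_neurons
    seen_negative = [False] * total_neurons
    for activations in activations_per_test:
        for index, value in enumerate(activations):
            if value > 0:
                seen_positive[index] = True
            else:
                seen_negative[index] = True  # Zero counts as negative.
    covered = sum(
        1 for positive, negative in zip(seen_positive, seen_negative)
        if positive and negative
    )
    return 100.0 * covered / total_neurons

# Illustrative recordings: two tests on a network with four neurons.
recorded_activations = [
    [0.5, -0.2, 0.0, 0.9],
    [-0.1, -0.4, 0.3, 0.8],
]
sign_coverage = sign_change_coverage(recorded_activations)
print(f"{sign_coverage:.1f}%")  # 50.0%
```

Neurons 0 and 2 are seen with both signs, while neuron 1 is always negative and neuron 3 always positive, giving 50% coverage.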
coverage level achieved if by changing the sign of each neuron it can be shown to individually cause one neuron in the next layer to change sign while all other neurons in the next layer stay the same (i.e. they do not change sign)
<testing> device, computer program or system used during testing, which behaves or operates like a given system when provided with a set of controlled inputs.
digital entity that perceives its environment and takes actions that maximize its chance of successfully achieving its goals
<machine learning> task of learning a function that maps an input to an output based on labelled example input-output pairs
point in the future when technological advances are no longer controllable by humans
tensor processing unit (TPU)
application-specific integrated circuit designed by Google for neural network machine learning
<machine learning> independent dataset used to provide an unbiased evaluation of the final, tuned ML model
source of information for determining whether a test has passed or failed
Note 1: The test oracle is often a specification used to generate expected results for individual test cases, but other sources may be used, such as comparing actual results with those of another similar program or system or asking a human expert.
test oracle problem
challenge of determining whether a test has passed or failed for a given set of test inputs and state
<neural network> proportion of neurons exceeding a threshold activation value divided by the total number of neurons in the neural network (normally expressed as a percentage) for a set of tests
Note 1: The threshold activation value must be chosen between 0 and 1.
<machine learning> dataset used to train an ML model
<AI> level of accessibility to the algorithm and data used by the AI-Based system
correct reporting of a failure when it is a failure
EXAMPLE: The referee correctly awards an offside and so reports a failure to score a goal.
correct reporting of a pass when it is a pass
EXAMPLE: The referee correctly awards a goal.
test by a human of a machine's ability to exhibit intelligent behaviour that is indistinguishable from human behaviour
<machine learning> generation of an ML model that does not reflect the underlying trend of the training data, resulting in a model that finds it difficult to make accurate predictions
<machine learning> task of learning a function that maps unlabelled input data to a latent representation
<machine learning> dataset used to evaluate a candidate ML model while tuning it
value change coverage
proportion of neurons activated where their activation values differ by more than a change amount divided by the total number of neurons in the neural network (normally expressed as a percentage) for a set of tests
virtual test environment
test environment where one or more parts are digitally simulated