Testing a chatbot or conversational AI is different from traditional software UAT. Here are 7 rules to keep in mind.
Before deployment, conversational AI solutions must be tested to check that the bot responds appropriately to user queries. This is essential to ensure that the bot will meet its intended goals as per the scope and that all stakeholders involved are aware of what to expect.
Testing can be done by running statistical cross-validation techniques, a batch “blind” test, a randomised manual log review, or any combination of the three. In a batch test, the bot is tested with real questions outside the training sample to see if the responses are accurate and helpful. In a randomised log review, human evaluators go through a chat log and mark whether the bot responses were correct and helpful in addressing the user queries.
Here, we outline some key guiding principles and best practices to follow, in order to ensure that the results are satisfactory and to meet stakeholder expectations.
Most organisations will have some familiarity with implementing traditional software. This involves manually programming a rule-based algorithm that produces a result when users input certain parameters. The same input always gives the same output. Take, for example, a simple shopping cart application that lets users pick products from an e-commerce website, save them, and check out. The features of the software have clearly defined paths to perform under certain specific scenarios, and those scenarios only. No one reasonably expects the user to take any action outside the usual browse, compare, add-to-cart and check-out processes.
Users are also generally aware of how to navigate and interact with the interfaces, or else they are onboarded or trained. Moreover, the cases and scenarios used to test the software are the exact same ones that users go through in real life. Failures in testing are often due to bugs in the code, which engineers can correct before deployment.
This means that edge cases, situations that occur only under extreme conditions, are rare. However, given that the interface of many conversational AI systems is typically a free text box or a microphone for voice input, users are free to enter whatever they want. As entrepreneurs Martin Casado and Matt Bornstein at the venture capital firm Andreessen Horowitz put it clearly, “Users can – and will – enter just about anything into an AI app.”
If not given the right guidance, users commonly input questions in domains that the AI system has not been trained for. They then feel disappointed by the seeming lack of "smartness" of the system. In this sense, the very term ‘edge cases’ takes on a different meaning, since they can occur just about anytime and anywhere.
Unlike traditional software, conversational AI systems employ a probabilistic model that learns patterns from the data used for training.
Hence, it is important to note the crucial difference with artificial intelligence: the algorithm is not manually defined in specific lines of code by software engineers but learnt from the data. Conversational AI in particular employs Natural Language Processing (NLP) techniques, with a probabilistic model learning patterns from its training data.
These systems are designed to handle untrained users, and the input can vary within the scope the bot is trained for. In fact, you can expect a certain degree of variance between the training data and real-life data, as it is not possible to exhaustively collect the infinite number of ways users can phrase their queries. This difference should also be mirrored in the testing, where the bot is tested with real user queries.
Lastly, the concept of bugs does not directly apply to AI software. Just because an AI conversational agent gets one question wrong does not mean there is a bug. Instead, the goal is to optimise the probabilistic model to improve its confidence level on an ongoing basis. In layperson's terms, this means increasing the model's ability to provide more certain (i.e. the model is more certain whether it knows the answer or not) and more correct predictions over time.
Because they differ from traditional software in these ways, conversational AI solutions are best implemented as an ongoing process. In traditional software, once deployment and testing are done, the project is generally considered closed and handed over to the client. If the testing shows that all parts of the software work perfectly and the right inputs give the expected outputs, there is nothing more to be done.
In AI-based software, the aim is for the chatbot to accurately answer all of the user's questions in the future. But since it is impossible to know how user queries will change over time, conversational AI testing and implementation needs to be an ongoing process, possibly in decreasing intensity if the base knowledge does not change or expand.
Specifically, the goal is to iteratively improve with each round in a “Test-Analyse-Action” approach. Upon launch or during the initial phase, we recommend implementing a “hyper-care” phase. Depending on your target user group, determine a statistically significant sample size, and have trainers go through a sample of that size daily for one to two weeks. Subsequently, they can reduce the frequency, depending on how often the knowledge base evolves.
As the model stabilises, a recommended frequency is once a month. Factor in added time if the review process involves users outside the project team, such as an extended group of Subject Matter Experts, usually members of the business teams. Keep in mind that while some actions produce more improvement than others, what matters most is keeping to the testing schedule. In short, done is better than perfect.
|PRO TIP: Conduct repeated rounds of “Test-Analyse-Action” every week and observe the improvements in every round.|
There are many techniques that can be used, depending on who is running the tests and the tools at their disposal.
At the very least, one way to test is to ensure that your model correctly predicts all of its own training data. For example, if the utterance “How do I file a claim for my medical insurance” is labelled with the intent “Claim_Medical_Insurance”, the model should point to that intent.
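This sanity check can be sketched as a short script. Both `predict_intent` and the training data here are illustrative placeholders; in practice you would call your platform's classification API.

```python
def predict_intent(utterance):
    # Toy keyword model, standing in for a real NLP classifier.
    if "claim" in utterance.lower():
        return "Claim_Medical_Insurance"
    return "Unknown"

# Training utterances mapped to their labelled intents (illustrative).
training_data = {
    "How do I file a claim for my medical insurance": "Claim_Medical_Insurance",
}

def misclassified_training_examples(data, predict):
    """Return utterances whose predicted intent differs from their label."""
    return [u for u, intent in data.items() if predict(u) != intent]

errors = misclassified_training_examples(training_data, predict_intent)
print(errors)  # an empty list means the model reproduces its own training data
```

If this list is not empty, fix the training data or model before moving on to any held-out testing.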
Next, it is important to understand the two main types of tests:
Machine learning models are tested using a statistical method called cross-validation. This involves testing the model’s ability to predict new data different from that which was used to train it. In conversational AI systems, this means testing the bot with queries outside the set of examples that was used for its training.
Common methodologies include K-fold and leave-one-out cross-validation (LOOCV). The K-fold method splits the data set into k groups, with one group used for testing the model while the remaining k-1 groups are used for training. This is repeated k times, with each of the split sets taking a turn as the one used for testing.
LOOCV is a more exhaustive method, in which the model is tested on all possible ways the original sample can be split into a single test observation and the rest for training. It is computationally more expensive, and is therefore more suitable for small training data sets. It is recommended to use cross-validation techniques before moving on to conducting blind tests.
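The splitting logic behind K-fold can be sketched in a few lines of pure Python (libraries such as scikit-learn provide production-grade versions). The utterance list is illustrative.

```python
def k_fold_splits(data, k):
    """Yield (train, test) pairs: each of the k folds takes a turn as the test set."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

utterances = ["u1", "u2", "u3", "u4", "u5", "u6"]
for train, test in k_fold_splits(utterances, k=3):
    print(len(train), len(test))  # 4 training and 2 test utterances per round

# LOOCV is simply the special case k = len(data): every single utterance
# takes a turn as the entire test set, hence the higher computational cost.
```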
Blind tests involve test data with utterances or questions that users may ask, alongside the corresponding correct answers. When these questions are run through the model via a batch test, each one is marked according to whether the model prediction was correct. In an ideal world, the test set should comprise questions that reflect the “ground truth” (see below for the definition). But in the absence of such data, we have developed guidelines that can help you leverage your users to create meaningful test data.
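A minimal batch (“blind”) test can be sketched as follows: run each held-out question through the model, mark it right or wrong, and compute overall accuracy. The `predict_intent` stub and the test set are illustrative placeholders, not a real model.

```python
def predict_intent(utterance):
    # Toy stand-in for a real NLP classifier.
    return "Claim_Medical_Insurance" if "claim" in utterance.lower() else "Unknown"

# Held-out questions paired with the intent we expect the bot to match.
blind_test_set = [
    ("How can I make an insurance claim?", "Claim_Medical_Insurance"),
    ("What is the weather today?", "Unknown"),
    ("I want to claim for my hospital bill", "Claim_Medical_Insurance"),
]

def batch_test(test_set, predict):
    """Mark each question correct/incorrect and return the results with accuracy."""
    results = [(q, expected, predict(q) == expected) for q, expected in test_set]
    accuracy = sum(ok for _, _, ok in results) / len(results)
    return results, accuracy

results, accuracy = batch_test(blind_test_set, predict_intent)
print(f"accuracy: {accuracy:.0%}")  # all three toy predictions match, so 100%
```

The per-question results, not just the headline accuracy, are what feed the analysis step of each Test-Analyse-Action round.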
Regardless of the techniques used, it is important to identify the action steps depending on the outcome. Data visualisation tools can help users better understand how similarly or differently the model interprets their data, by plotting utterances as points that sit close together or far apart.
A confusion matrix can also be helpful in showing which intents the model predicts, so that the NLP trainer can identify patterns and combine or retrain the intents if needed.
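A confusion matrix can be built from nothing more than (expected, predicted) pairs; the intent labels below are illustrative. Off-diagonal counts reveal which intents the model systematically confuses, which are candidates for combining or retraining.

```python
from collections import Counter

# (expected intent, predicted intent) pairs from a batch test (illustrative).
pairs = [
    ("Book_Appointment", "Book_Appointment"),
    ("Book_Appointment", "Reschedule_Appointment"),
    ("Reschedule_Appointment", "Reschedule_Appointment"),
    ("Book_Appointment", "Book_Appointment"),
]

# Each key is a cell of the confusion matrix; the value is its count.
confusion = Counter(pairs)

for (expected, predicted), count in sorted(confusion.items()):
    marker = "" if expected == predicted else "  <-- confused"
    print(f"{expected} -> {predicted}: {count}{marker}")
```

Here the trainer would notice that “Book_Appointment” utterances are sometimes pulled into “Reschedule_Appointment”, and decide whether the two intents need clearer, more distinct training examples.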
Not all projects run both test types. The choice depends on the chatbot developer's understanding and ability to conduct the tests. At the end of the day, a review of chat logs is still a process that can help you understand the conversational AI agent's performance.
The testing of the conversational AI, and therefore the success of the solution, depends heavily on the data set chosen. It is critical to keep the following guidelines in mind when preparing the AI blind test set. These steps will help you develop test data that covers a wide range of scenarios, yet stays within the scope of the knowledge base.
A helpful way of visualising this is in a table as follows.
User goal: User wants to book an appointment for a health screening
Sample user utterance: “Want to make health check up appointment”
Bot response: Sure, let’s find a time slot for you. What is a good day and time for you?
We often find that the test data set is not up to par when it comes to testing the bot. This happens because of some common mistakes like these:
It may also be the case that certain business-facing teams prepare refined queries based on what they think a user might ask, instead of how users ask questions in the real world. If you must create such queries at all, think of how someone would actually ask the question via chat.
The testing should commence only with a common understanding of what the targeted goal ought to be. If the goal is to achieve a suitable level of “smartness”, look for improvements with each iteration.
We have all experienced a call centre agent who is unable to help us resolve the issue at hand. However, it is reasonable to expect that, having received training, they would get an answer correct at least 70% of the time. So the aim for the test set should be to reach this benchmark and go beyond. It is common to get there in three to four rounds of iteration, with accuracy rising through roughly 30%, 40-50%, 50-60%, and 60-70% and above in successive rounds.
No software deployment project is complete without constant monitoring of the analytics. But which specific metrics will you track? This is extremely important in the ongoing iterative review and testing of the chatbot's performance.
Set up your analytics to track the Coverage Ratio: of the questions users ask, how many was the AI virtual assistant trained for, and hence able to answer correctly? If the coverage ratio is high, say 70-80% and above, it means the questions you chose to train the chatbot on are well selected and representative of what real users ask. If it is significantly lower than that, some of the questions you have trained the chatbot for may not be what real users ask. In that case, you should remove some and add other, more relevant questions that people need help with.
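Computing the Coverage Ratio is a simple aggregation over the chat log. The intent names and log entries below are illustrative; a real log would come from your analytics export.

```python
# Intents the assistant was trained for (illustrative).
trained_intents = {"Claim_Medical_Insurance", "Book_Appointment"}

# (user question, intent the model matched, or None if nothing matched).
chat_log = [
    ("How do I file a claim?", "Claim_Medical_Insurance"),
    ("Book me a health screening", "Book_Appointment"),
    ("What's your refund policy?", None),
    ("I want to make a claim", "Claim_Medical_Insurance"),
]

def coverage_ratio(log, intents):
    """Share of user questions that fall within the trained intents."""
    covered = sum(1 for _, intent in log if intent in intents)
    return covered / len(log)

ratio = coverage_ratio(chat_log, trained_intents)
print(f"coverage: {ratio:.0%}")  # 3 of 4 questions covered -> 75%
```

The uncovered questions (here, the refund query) are exactly the candidates for new intents in the next training round.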
Having too few examples per intent and having similar utterances classified into different intents are the two most common culprits of poor predictions. To train the bot appropriately within the scope, it is important to collect good quality examples for each intent. As a rule, aim for 10 to 20 examples per intent.
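A quick way to enforce the 10-to-20 rule is to scan the per-intent example counts and flag any intent below the floor. The counts here are illustrative.

```python
# Number of training examples per intent (illustrative counts).
examples_per_intent = {
    "Claim_Medical_Insurance": 15,
    "Book_Appointment": 4,
    "Check_Policy_Status": 11,
}

def underpopulated_intents(counts, minimum=10):
    """Return intents with fewer than `minimum` training examples, sorted by name."""
    return sorted(intent for intent, n in counts.items() if n < minimum)

print(underpopulated_intents(examples_per_intent))  # ['Book_Appointment']
```

Running a check like this before each training round catches thin intents early, before they show up as poor predictions in the blind test.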
At the end of the day, testing is the means by which the essential desired qualities of the conversational AI can be improved continuously, be it smartness, personality, or lead generation. Getting the fundamentals right before starting the process can orient the bot in the right direction.
For more tips on how to set up your organisation for conversational AI implementation success, get in touch today.