WA-Testing-Tool

Scripts that run against Watson Assistant for:

KFOLD: k-fold cross validation on the training set,
BLIND: evaluating a blind test set, and
TEST: testing Watson Assistant against a list of utterances.

For a k-fold cross validation or a blind test, the tool outputs a precision curve, per-intent precision and recall rates, and a confusion matrix.
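To make the reported metrics concrete, here is a minimal sketch of how per-intent precision, per-intent recall, and a confusion matrix can be derived from golden and predicted intents. It is an illustration only, not the tool's own implementation; the function names and the sample labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(golden, predicted):
    """Count how often each (golden intent, predicted intent) pair occurs."""
    return Counter(zip(golden, predicted))


def per_intent_metrics(golden, predicted):
    """Return {intent: (precision, recall)} computed from the two label lists."""
    matrix = confusion_matrix(golden, predicted)
    metrics = {}
    for intent in sorted(set(golden) | set(predicted)):
        true_pos = matrix[(intent, intent)]
        # Everything predicted as this intent (column total).
        predicted_total = sum(n for (g, p), n in matrix.items() if p == intent)
        # Everything whose golden label is this intent (row total).
        golden_total = sum(n for (g, p), n in matrix.items() if g == intent)
        precision = true_pos / predicted_total if predicted_total else 0.0
        recall = true_pos / golden_total if golden_total else 0.0
        metrics[intent] = (precision, recall)
    return metrics


# Hypothetical labels for illustration, not output from the tool.
golden = ["intent 0", "intent 0", "intent 1", "intent 1"]
predicted = ["intent 0", "intent 1", "intent 1", "intent 1"]
metrics = per_intent_metrics(golden, predicted)
```

In this sample, "intent 0" is never predicted wrongly for another intent's utterance, so its precision is high, but one of its two utterances is missed, which lowers its recall.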

Features

Easy to set up in one configuration file.
Saves its state if the Assistant service goes down in the middle of processing.
Can resume from where it stopped, using modularized scripts.

Prerequisite

Python 3.6.4 or later
Mac users: you may need to initialize Python's SSL certificate store by running Install Certificates.command, found in /Applications/Python.
Git client
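A quick way to confirm the Python prerequisite is to ask the interpreter itself. The sketch below is a hypothetical helper, not part of this repository; it prints which interpreter is running and whether it meets the 3.6.4 minimum stated above.

```python
import sys

MINIMUM = (3, 6, 4)  # minimum version stated in the prerequisites


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM):
    """Return True if the given version is at least the minimum."""
    return tuple(version_info[:3]) >= minimum


# Show which interpreter is in use; pip3 should install into the same one.
print("Interpreter:", sys.executable)
print("Version OK:", meets_minimum())
```

Run it with the same command you plan to use for the tool (for example python3), so that the check reflects the interpreter that will actually execute run.py.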

Quick Start

Pre-work: cd into the projects folder where you will clone this GitHub repo. After cloning, cd into the WA-Testing-Tool folder.

1. Install code: git clone https://github.com/cognitive-catalyst/WA-Testing-Tool.git
2. Install dependencies: pip3 install --upgrade -r requirements.txt
3. Set up parameters in the configuration file (ex: config.ini). Use config.ini.sample to bootstrap your configuration.
   a. In your terminal, copy the sample config file into a new one: cp config.ini.sample config.ini
   b. Open the config.ini file in your favorite text editor, then edit and save the following information with your actual credentials:
      API Key
      url
      workspace_id (Watson Assistant v1) or environment_id (Watson Assistant v2)
   c. Set the mode and the mode-specific parameters.
4. Run the process: python3 run.py -c config.ini or python3 run.py -c <path to your config file>
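Before running the process, it can help to confirm that the configuration file contains the settings named in the steps above. The following is a rough sketch, not part of the tool: the helper names are hypothetical, it searches every section because the section names used by config.ini.sample are not listed here, and it skips the API key because its exact key name is not given in this README.

```python
import configparser


def find_settings(path):
    """Flatten every section of an INI file into one {key: value} dict."""
    # interpolation=None keeps literal '%' characters in values from raising.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(path)
    settings = {}
    for section in parser.sections():
        settings.update(parser[section])
    return settings


def missing_settings(settings):
    """List the settings named in the Quick Start that are absent or blank."""
    missing = [key for key in ("url", "mode") if not settings.get(key, "").strip()]
    # Either workspace_id (v1) or environment_id (v2) identifies the assistant.
    if not any(settings.get(key, "").strip()
               for key in ("workspace_id", "environment_id")):
        missing.append("workspace_id or environment_id")
    return missing
```

For example, missing_settings(find_settings("config.ini")) returns an empty list when all of these settings are present, and otherwise names what still needs to be filled in.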

Quick Update

If you have already installed this utility, use these steps to get the latest code.

1. Upgrade dependencies: pip3 install --upgrade -r requirements.txt
2. Update to the latest code level: git pull

Input Files

config.ini - Configuration file for run.py. This is formatted differently for each mode. Review the Examples below to explore the possible modes and how each is configured.

test_input_file.csv - Test set for blind testing and standard test.

For blind test with golden intent used for comparison:

utterance	golden intent
utterance 0	intent 0
utterance 1	intent 0
utterance 2	intent 1

For a standard test, the input must have exactly one column, or an error will be thrown:

utterance
utterance 0
utterance 1
utterance 2
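The column rules above can be checked before a run. This is a minimal sketch under stated assumptions: the file is comma-separated (as the .csv extension suggests), every row including any header row has the same number of columns, and the helper names are hypothetical rather than part of the tool.

```python
import csv


def column_counts(path):
    """Return the set of column counts found across non-empty rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {len(row) for row in csv.reader(handle) if row}


def check_test_file(path, expect_golden_intent):
    """Raise ValueError unless every row has the expected number of columns."""
    # Blind test: utterance plus golden intent. Standard test: utterance only.
    expected = 2 if expect_golden_intent else 1
    counts = column_counts(path)
    if counts != {expected}:
        raise ValueError(
            f"{path}: expected {expected} column(s) per row, found {sorted(counts)}"
        )
```

Note that an utterance containing a comma must be quoted in the file; otherwise it is read as two columns and a standard test file will fail this check.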

Examples

There are a variety of ways to use this tool. Primarily you will execute a k-folds, blind, or standard test.
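For readers new to k-fold testing, the idea is to split the training examples into k folds, then train on k-1 folds and test on the remaining one, once per fold. The sketch below illustrates only that partitioning; it is unstratified and is not the tool's own code. The function name is hypothetical, and fold_num simply mirrors the configuration parameter of the same name.

```python
import random


def k_fold_splits(examples, fold_num, seed=0):
    """Yield (train, test) lists so that each example is tested exactly once."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    # Deal the shuffled examples round-robin into fold_num folds.
    folds = [shuffled[i::fold_num] for i in range(fold_num)]
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test


# Hypothetical data: ten numbered training examples split into three folds.
splits = list(k_fold_splits(range(10), fold_num=3))
```

Each of the three (train, test) pairs covers all ten examples between them, and no example appears in both halves of the same pair.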

More examples

Long-form resources available in Article and Video form:

Title	Article	Video
Testing a Chatbot with k-folds Cross Validation	https://medium.com/ibm-watson/testing-a-chatbot-with-k-folds-cross-validation-68dab111a6b	https://www.youtube.com/watch?v=FrhK68WyOK4
Analyze chatbot classifier performance from logs	https://medium.com/ibm-watson/analyze-chatbot-classifier-performance-from-logs-e9cf2c7ca8fd	https://www.youtube.com/watch?v=yd89DKyf6hc
Improve a chatbot classifier with production data	https://medium.com/ibm-watson/improve-a-chatbot-classifier-with-production-data-22a437f419b4	https://www.youtube.com/watch?v=ftFIQtHiQY8

Related projects

Watson Assistant is commonly paired with IBM Speech services to build voice-driven Conversational AI solutions. Check out these tools to assess and tune your speech models!

STT-WER-Python: Utilities for testing IBM Speech to Text
TTS-Python: Utilities for testing IBM Text to Speech

Testing Natural Language Understanding Classifier

This tool can also be used to test a trained Natural Language Understanding (NLU) Classifier. The configuration is similar to testing Watson Assistant, with these differences:

Use the NLU URL in the url parameter (ex: https://api.us-south.natural-language-understanding.watson.cloud.ibm.com)
Specify the <model_id> in the workspace_id parameter in the configuration
Because the NLU classifier does not support downloading training data, the original training data must be provided (using the train_input_file parameter) when running in 'kfold' mode

General Caveats and Troubleshooting

Because service plans differ in their limits, you may need to adjust max_test_rate to avoid network connection errors.
Users on Lite plans can create only 5 workspaces. They should set fold_num=3 in their k-fold configuration file.
In case of interrupted execution, the tool may not be able to clean up the workspaces it creates. In this case you will need to manually delete the extra workspaces.
Workspace ID is not the Skill ID. In the Watson Assistant user interface, the Workspace ID can be found on the Skills tab, clicking the three dots (top-right of skill), and choosing View API Details.
SSL: [CERTIFICATE_VERIFY_FAILED] on Mac means you may need to initialize Python's SSL certificate store by running Install Certificates.command, found in /Applications/Python.
"This utility used to work and now it doesn't." Upgrade to latest dependencies with pip3 install --upgrade -r requirements.txt and latest code with git pull.
If you get a Python module loading error, confirm that you are using matching pip and python versions, i.e., pip3 with python3 or pip with python.
Watson Assistant v2 configuration does not support k-folds mode. Watson Assistant v2 is tested "in-place" rather than creating temporary skills for this tool. Actions users may prefer to use Dialog Skill Analysis notebooks - these notebooks have additional capabilities for analyzing Dialog or Action skills.
