Types of data sets used to train and test the cognitive service


AI Service Management (Categorization and Classification) provides auto-categorization, auto-assignment, and chatbot capabilities. To utilize these capabilities in an application, you must train the IBM Watson Assistant Service to work with your data. A developer might create a sample training data set for your application. An administrator updates the sample data set or creates new training data sets. The administrator then uses the training data sets in existing or new processes or rules for the application. 

In addition to training the cognitive service, you can test whether the cognitive service is trained correctly so that it predicts the desired categories. You can evaluate the cognitive service against the following test metrics: 

  • Accuracy—Accuracy is the ratio of number of correct predictions to the total number of input samples.
    For example, if the test results indicate that 9 out of 10 variations of increase RAM request are correctly predicted, the accuracy is 9/10 = 0.9.
  • Recall—Recall is the number of correct positive results divided by the total number of relevant samples, that is, all samples that actually belong to the category.
    For example, for a search query that contains increase RAM, the system returns 10 results that contain both increase and RAM, and 8 of those results include the phrase increase RAM, so the precision is 8 out of 10. If 30 instances in total are related to increase RAM, the recall is 8 out of 30.
    Higher recall indicates higher viability of the data sets.
  • Precision—Precision is the number of correct positive results divided by the number of positive results predicted by the cognitive service.
    For example, for a search query that contains increase RAM, the system returns 10 results that contain both increase and RAM, and 8 of those results include the phrase increase RAM. In this case, the precision is 8/10 = 0.8.
    Higher precision indicates higher viability of the data sets.
  • F-score—F-score is the harmonic mean of precision and recall. F-score reaches its best value at 1 (indicating perfect precision and recall) and its worst value at 0.
    Traditionally, F-score is calculated as F = 2 × (Precision × Recall) / (Precision + Recall)

Higher precision and recall indicate higher viability of the data sets. For more information about how test metrics are calculated, see FAQ.
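
The four metrics described above can be illustrated with a short Python sketch. It is not the product's implementation; the function names (accuracy, category_metrics) and the sample labels are assumptions made for illustration, and the sketch computes precision and recall for a single category using the standard definitions.

```python
def accuracy(expected, predicted):
    """Ratio of correct predictions to the total number of input samples."""
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    return correct / len(expected)


def category_metrics(expected, predicted, category):
    """Return (precision, recall, f_score) for one category."""
    pairs = list(zip(expected, predicted))
    # Correct positive results: predicted as the category and actually in it.
    true_positives = sum(1 for e, p in pairs if e == category and p == category)
    predicted_positives = sum(1 for _, p in pairs if p == category)
    relevant_samples = sum(1 for e, _ in pairs if e == category)

    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / relevant_samples if relevant_samples else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_score


# Hypothetical test run: 10 variations of an "increase RAM" request,
# 9 of which are predicted correctly.
expected = ["Memory"] * 10
predicted = ["Memory"] * 9 + ["Remote Access Server"]

print(accuracy(expected, predicted))                      # 0.9
print(category_metrics(expected, predicted, "Memory"))
```

In this hypothetical run, every sample predicted as Memory really is Memory, so precision is 1.0, while one Memory sample is missed, so recall is 0.9.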

Additional information about machine learning metrics

The following blogs provide more information about machine learning metrics and macro versus micro averaging of precision. BMC does not endorse the information in these external links. The information provided in these links should be used for reference purposes only.

For more information about testing the chatbot application, see BMC Helix Virtual Agent documentation.

Training data set and test data set

After creating a CSV file, administrators can specify the percentage by which the rows are split into training data and test data. For new data sets, by default, a random 80% of the rows are used as training data and the remaining 20% are used as test data. For existing data sets, by default, 100% of the rows are used as training data.
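
As a rough illustration of the percentage split described above, the following Python sketch shuffles the rows and cuts them at the requested percentage. The function name split_rows, the seed parameter, and the rounding rule are assumptions; the product's actual splitting logic is not documented here.

```python
import random


def split_rows(rows, training_percent=80, seed=None):
    """Randomly split rows into (training_rows, test_rows).

    training_percent is the share of rows used for training; the rest
    become test data. Rounding to the nearest row is an assumption.
    """
    if not 0 <= training_percent <= 100:
        raise ValueError("training_percent must be between 0 and 100")
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seed only makes the sketch repeatable
    cut = round(len(shuffled) * training_percent / 100)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set with 10 rows of (input text, category).
rows = [(f"increase RAM variation {i}", "Memory") for i in range(10)]
training, test = split_rows(rows, training_percent=80, seed=1)
print(len(training), len(test))  # 8 2
```

Passing training_percent=100 reproduces the default for existing data sets, where every row is used for training and the test set is empty.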

The following table explains the difference between training data sets and test data sets:

| Training data set | Test data set |
| --- | --- |
| Used to train the cognitive service so that the input text is correctly categorized. | Used to test whether the cognitive service correctly categorizes the input text based on the training data. |
| Can be a CSV file or runtime application data. | Is always a CSV file. The test results are generated in a CSV file or can be viewed from BMC Helix Innovation Studio. |

Benefits

BMC Helix Innovation Studio provides a tool to test the cognitive service. These tests are particularly important when you are implementing a new training data set or if you have made major changes to the data sets. Testing the cognitive service and chatbot has the following benefits:

  • You do not require prior knowledge of data science to use this tool. 
  • Helps evaluate the cognitive service by using standard machine learning metrics.
  • Helps identify the exact area of the problem so that you can rectify the data sets to improve the performance of the cognitive service.
  • Provides a history of the test results. 

CSV data set and application data set

You can create a CSV file of the data set or specify application data for training and testing AI Service Management (Categorization and Classification) capabilities:

| Type of data set | Description | Reference |
| --- | --- | --- |
| CSV data set | A CSV file is used to train and test the cognitive service. The CSV file contains an input text to be categorized and the category for the input text. The CSV file must have at least 5 rows and not more than 25,000 input texts associated with 2,000 categories. For applications that do not require continuous cognitive service training, you can train and test the cognitive service by using data from CSV files. | |
| BMC Helix Innovation Studio data set | The application record definitions and record fields are used to train and test the cognitive service. For applications that are used in a continuously changing business environment that requires frequent training of the cognitive service, you can use application data to train the cognitive service. This approach helps the cognitive service get updated data and provide suggestions according to the business changes. | Not applicable |

 CSV data set structure

 A CSV file that is used for training and testing the cognitive service should have the following structure:  

  • A row represents a record.
  • The first column represents the input text to categorize.
  • The second column represents the category that applies to the input text to categorize:
    • If you have hierarchical categories, you must provide the leaf node as the category.
    • If you have multiple categories, you must separate them by using the pipe character. 
      For example, Hardware | Component | Memory
    • The length of the data must not exceed 1024 characters. The data can contain all special characters.
    •  If you are using a backslash (\) within the entered data, you must use four backslashes (\\\\) instead of one backslash (\). 

      Example: <Text1>\\\\<Text2>\\\\<Text3>. Enter hardware\\\\software instead of hardware\software. 

    • Each row must contain an input text value to categorize and a category that applies to the input text.
    • Each record is terminated by an end-of-line character.
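
The structure rules above can be sketched in Python as follows. The helper names (escape_backslashes, build_row, to_csv) are hypothetical, and two points are assumptions rather than documented behavior: the 1024-character limit is checked for each column after escaping, and categories are joined with " | " as in the example above.

```python
import csv
import io

MAX_LENGTH = 1024  # maximum length of the data, per the rules above


def escape_backslashes(text):
    """Replace every single backslash with four backslashes."""
    return text.replace("\\", "\\" * 4)


def build_row(input_text, categories):
    """Return [input text, category] for one CSV record."""
    text = escape_backslashes(input_text)
    category = " | ".join(categories)  # pipe-separated categories
    if not text.strip() or not category.strip():
        raise ValueError("each row needs an input text and a category")
    if len(text) > MAX_LENGTH or len(category) > MAX_LENGTH:
        raise ValueError("data must not exceed 1024 characters")
    return [text, category]


def to_csv(rows):
    """Serialize rows so that each record ends with an end-of-line character."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


row = build_row("increase RAM for hardware\\software lab",
                ["Hardware", "Component", "Memory"])
print(to_csv([row]), end="")
```

The csv module quotes a value only when it contains a comma, a quote character, or a line break, so input texts with commas are still written as a single first column.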

Examples of data sets

The following images provide an example of CSV data sets after they are split into training data and test data:

  • Example of CSV training data

    22_1_sample training data.png
  • Example of CSV test data

    22_1_test data.png

Limit of CSV data sets

The CSV data file must have at least 5 rows and not more than 25,000 input texts (intent examples) associated with 2,000 categories (intents). The maximum number of training data sets that you can create depends on your Watson Assistant Service license:

  • Watson Assistant Standard Service—40,000 CSV training data sets in shared mode. 
  • Watson Assistant Premium Service—600 CSV training data sets in dedicated mode.

For more information about Watson Assistant Service licensing, see IBM Watson Conversation Service documentation.
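
Before uploading a file, the row and category limits above could be checked with a small Python sketch such as the following. The function name check_limits is hypothetical, and counting each distinct category string as one category is an assumption about how the limit is applied.

```python
MIN_ROWS = 5
MAX_ROWS = 25_000
MAX_CATEGORIES = 2_000


def check_limits(rows):
    """Return a list of problems found in rows of (input text, category)."""
    problems = []
    if len(rows) < MIN_ROWS:
        problems.append(f"only {len(rows)} rows; at least {MIN_ROWS} are required")
    if len(rows) > MAX_ROWS:
        problems.append(f"{len(rows)} rows; at most {MAX_ROWS} are allowed")
    # Assumption: each distinct category string counts as one category.
    categories = {category for _, category in rows}
    if len(categories) > MAX_CATEGORIES:
        problems.append(
            f"{len(categories)} categories; at most {MAX_CATEGORIES} are allowed"
        )
    return problems


too_small = [("increase RAM", "Memory")] * 4
print(check_limits(too_small))
```

An empty list means that the data set is within the documented limits.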

Guidelines to create a training data CSV file 

 When you create a training data CSV file, follow these guidelines: 

  • Eliminate empty rows with blank text or blank categories.
  • Remove rows with exact duplicate examples.
  • Create balanced data sets with a roughly similar number of examples for all the categories (roughly 20 to 30 examples for each category). If a category has fewer input texts, the test metrics for that category tend to be lower.
  • Limit the length of input text to fewer than 60 words (1,024 characters).
    Text that exceeds 1,024 characters is truncated.
  • Limit the number of categories to several hundred classes.
  • Include multiple classes only if the text is vague and identifying a single class is not clear.
  • Limit the number of records to 25,000 associated with 2,000 classes.
    The minimum number of records allowed is 5 and the maximum number of records allowed per workspace is 25,000.

For more information about the training data guidelines, see Using your own data in IBM documentation.
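
Several of the guidelines above lend themselves to an automated pre-check. The following Python sketch removes empty rows and exact duplicates, truncates long text, and reports categories with few examples. The function names and the choice to treat a duplicate as an identical (input text, category) pair are assumptions made for illustration.

```python
from collections import Counter


def clean_rows(rows, max_chars=1024):
    """Drop blank rows and exact duplicates; truncate overly long text."""
    seen = set()
    cleaned = []
    for text, category in rows:
        text, category = text.strip(), category.strip()
        if not text or not category:
            continue                 # empty text or empty category
        text = text[:max_chars]      # mirrors the truncation described above
        if (text, category) in seen:
            continue                 # exact duplicate example
        seen.add((text, category))
        cleaned.append((text, category))
    return cleaned


def underrepresented(rows, minimum=20):
    """Return {category: count} for categories with fewer than minimum examples."""
    counts = Counter(category for _, category in rows)
    return {category: n for category, n in counts.items() if n < minimum}


raw = [
    ("increase RAM", "Memory"),
    ("increase RAM", "Memory"),   # duplicate
    ("", "Memory"),               # blank text
    ("reset password", " "),      # blank category
    ("add disk", "Storage"),
]
cleaned = clean_rows(raw)
print(cleaned)
print(underrepresented(cleaned))
```

The default minimum of 20 follows the lower end of the 20 to 30 examples suggested for each category; adjust it to match your own balance target.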

Scenario of testing the cognitive service data sets by using a CSV file

An organization uses AI Service Management (Categorization and Classification) to automatically route end user service requests for increasing RAM as Hardware | Component | Memory. As an administrator, you create a CSV data set to train the cognitive service to auto-categorize this service request to the correct support group. Before implementing, you also test whether the data set correctly categorizes the service requests.

Example of training data

According to the training data, when an end user requests a RAM increase, the ticket is categorized as Hardware | Component | Memory. You want to test the data set to check whether variations of the request, such as increase laptop RAM and increase RAM on network file server, are also categorized as Hardware | Component | Memory. You also specify the percentage of data that must be used for training the cognitive service, for example, 70%. The CSV file is randomly split into training data and test data according to the specified percentage.

Example of test results

The test results CSV file shows that the variation Increase RAM on network file server is categorized incorrectly as Network | Router | Remote Access Server. The administrator adds more examples of this variation to the data set so that the cognitive service categorizes it correctly.
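
The review step in this scenario can be sketched in Python. The layout of the test results CSV file is not described on this page, so the sketch assumes hypothetical records of (input text, expected category, predicted category); the function names are also illustrative.

```python
from collections import Counter


def misclassified(results):
    """Return the records whose predicted category differs from the expected one."""
    return [r for r in results if r[1] != r[2]]


def confusion_pairs(results):
    """Count how often each expected category is mistaken for another."""
    return Counter((expected, predicted)
                   for _, expected, predicted in misclassified(results))


# Hypothetical test results based on the scenario above.
results = [
    ("increase laptop RAM",
     "Hardware | Component | Memory", "Hardware | Component | Memory"),
    ("Increase RAM on network file server",
     "Hardware | Component | Memory", "Network | Router | Remote Access Server"),
]

for text, expected, predicted in misclassified(results):
    print(f"{text!r}: expected {expected}, got {predicted}")
```

Grouping the errors by (expected, predicted) pair shows which categories need more training examples.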

After the system automatically splits the CSV file into a training data set and a test data set, the test data is limited to 10,000 rows.

Scenario of testing the cognitive service data sets by using application data

Continuing with the above example, an organization uses AI Service Management (Categorization and Classification) capabilities to automatically categorize the end user requests for increasing RAM as Hardware | Component | Memory. As an administrator, you want to use application data to train the cognitive service. The Request Service record definition in your application is used to raise service requests.

Example of training data

You select the Request Service record definition and, in that record definition, the Requester, Summary, Description, Categorization Tier 1, Categorization Tier 2, and Categorization Tier 3 fields from which data is used to train and test the cognitive service. You also specify the percentage of data that must be used for training the cognitive service, for example, 70%. The application data is randomly split into training data and test data according to the specified percentage. After the application data is tested, a CSV file of the test results is generated.

Example of test results

The test results CSV file shows that the variation Increase RAM on network file server is incorrectly categorized as Network | Router | Remote Access Server. The administrator adds more examples of this variation to the data set so that the cognitive service categorizes it correctly.

Where to go from here

To understand how to train and test the cognitive service for a custom application, see Training and testing the cognitive service for a custom application.

 
