Types of data sets used to train and test the cognitive service

AI Service Management (Categorization and Classification) provides auto-categorization, auto-assignment, and chatbot capabilities. To utilize these capabilities in an application, you must train the IBM Watson Assistant Service to work with your data. A developer might create a sample training data set for your application. An administrator updates the sample data set or creates new training data sets. The administrator then uses the training data sets in existing or new processes or rules for the application.

Training data set and test data set

After creating a CSV file, administrators can specify the percentage used to split the rows into training data and test data. For new data sets, a random 80% of the rows are used as training data by default, and the remaining 20% are used as test data. For existing data sets, 100% of the rows are used as training data by default.
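The default 80/20 split can be pictured with a short Python sketch. This is only an illustration of a random percentage split; it is not how the product performs the split internally, and the function name `split_rows` and its parameters are hypothetical.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=None):
    """Randomly split rows into (training, test) lists.

    Hypothetical helper: the product's own sampling method is not
    documented here, so this only illustrates the idea.
    """
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # seeded for repeatable splits
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten sample rows of (input text, category).
rows = [(f"sample text {i}", "Hardware") for i in range(10)]
training, test = split_rows(rows, train_fraction=0.8, seed=42)
print(len(training), len(test))
```

Setting `train_fraction=1.0` corresponds to the default for existing data sets, where every row is used as training data.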

The following table explains the difference between training data sets and test data sets:

Training data set	Test data set
Used to train the cognitive service so that the input text is correctly categorized.	Used to test whether the cognitive service correctly categorizes the input text based on the training data.
Can be a CSV file or runtime application data.	Is always a CSV file. The test results are generated in a CSV file or can be viewed in BMC Helix Innovation Studio.

For more information about testing the cognitive service, see Leveraging machine learning metrics to improve cognitive service data sets.

CSV data set and application data set

You can create a CSV file of the data set or specify application data for training and testing AI Service Management (Categorization and Classification) capabilities:

Type of data set	Description	Reference
CSV data set	A CSV file is used to train and test the cognitive service. The CSV file contains an input text to be categorized and the category for the input text. The CSV file must have at least 5 rows and not more than 25,000 input texts associated with 2,000 categories. For applications that do not require continuous cognitive service training, you can train and test the cognitive service by using data from CSV files.	CSV data set structure; Limit of CSV data sets; Guidelines to create a training data CSV file
BMC Helix Innovation Studio data set	The application record definitions and record fields are used to train and test the cognitive service. For applications that are used in a continuously changing business environment that requires frequent training of the cognitive service, you can use application data to train the cognitive service. This approach helps the cognitive service get updated data and provide suggestions according to the business changes.	Not applicable

CSV data set structure

A CSV file that is used for training and testing the cognitive service should have the following structure:

A row represents a record.
The first column represents the input text to categorize.
The second column represents the category that applies to the input text to categorize:
- If you have hierarchical categories, you must provide the leaf node as the category.
- If you have multiple categories, you must separate them by using the pipe character.
  For example, Hardware | Component | Memory
- The length of the data must not exceed 1024 characters. The data can contain all special characters.
- If you are using a backslash (\) within the entered data, you must use four backslashes (\\\\) instead of one backslash (\).
  Example: <Text1>\\\\<Text2>\\\\<Text3>. Enter hardware\\\\software instead of hardware\software.
- Each row must contain an input text value to categorize and a category that applies to the input text.
- Each record is terminated by an end-of-line character.
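
The structure rules above can be sketched in Python. This is a minimal illustration, not a tool provided by the product: the helper names `escape_backslashes` and `build_row` are hypothetical, the " | " separator simply mirrors the example above, and checking the 1,024-character limit before the backslashes are expanded is an assumption.

```python
import csv
import io

MAX_LENGTH = 1024  # character limit stated in the structure rules


def escape_backslashes(value):
    """Replace every single backslash with four backslashes."""
    return value.replace("\\", "\\" * 4)


def build_row(input_text, categories):
    """Return [input text, category] for the two-column CSV layout.

    Assumption: the length limit is checked on the text as entered,
    before backslashes are expanded.
    """
    if not input_text.strip() or not categories:
        raise ValueError("each row needs an input text and a category")
    if len(input_text) > MAX_LENGTH:
        raise ValueError("input text exceeds 1024 characters")
    # Multiple categories are joined with the pipe character.
    category = " | ".join(escape_backslashes(c) for c in categories)
    return [escape_backslashes(input_text), category]


buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")  # one record per line
writer.writerow(build_row("Laptop needs more RAM", ["Hardware", "Component", "Memory"]))
writer.writerow(build_row(r"Cannot open hardware\software share", ["Software"]))
print(buffer.getvalue())
```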

Examples of data sets

The following images provide an example of CSV data sets after they are split into training data and test data:

Example of CSV training data
Example of CSV test data

Limit of CSV data sets

The CSV data file must have at least 5 rows and not more than 25,000 input texts (intent examples) associated with 2,000 categories (intents). The maximum number of training data sets that you can create depends on your Watson Assistant Service license:

Watson Assistant Standard Service: 40,000 CSV training data sets in shared mode.
Watson Assistant Premium Service: 600 CSV training data sets in dedicated mode.

For more information about Watson Assistant Service licensing, see IBM Watson Conversation Service documentation.
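
Before uploading a file, the row and category limits in this section could be checked with a small script. The following Python sketch assumes the rows are already loaded as (input text, category) pairs and that multiple categories in one cell are separated by the pipe character; `check_limits` is a hypothetical helper, not part of the product.

```python
MIN_ROWS = 5             # minimum number of rows
MAX_ROWS = 25_000        # maximum number of input texts
MAX_CATEGORIES = 2_000   # maximum number of categories


def check_limits(rows):
    """Return a list of limit violations for (input_text, category) rows."""
    problems = []
    if len(rows) < MIN_ROWS:
        problems.append(f"only {len(rows)} rows; at least {MIN_ROWS} are required")
    if len(rows) > MAX_ROWS:
        problems.append(f"{len(rows)} rows; at most {MAX_ROWS} are allowed")
    # Count distinct categories, splitting pipe-separated cells.
    categories = {
        part.strip()
        for _, category in rows
        for part in category.split("|")
        if part.strip()
    }
    if len(categories) > MAX_CATEGORIES:
        problems.append(
            f"{len(categories)} categories; at most {MAX_CATEGORIES} are allowed"
        )
    return problems


sample = [("Printer is offline", "Hardware | Printer")] * 4
print(check_limits(sample))
```

An empty list means that the data set is within the limits checked by this sketch.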

Guidelines to create a training data CSV file

When you create a training data CSV file, follow these guidelines:

Eliminate empty rows with blank text or blank categories.
Remove rows with exact duplicate examples.
Create balanced data sets with a similar number of examples for every category (roughly 20-30 examples for each category). If a category has fewer input texts, the test metrics for that category tend to be lower.
Limit the length of input text to fewer than 60 words (1,024 characters).
Text that exceeds 1,024 characters is truncated.
Limit the number of categories to several hundred classes.
Include multiple classes only if the text is vague and identifying a single class is not clear.
Limit the number of records to 25,000 associated with 2,000 classes.
The minimum number of records allowed is 5 and the maximum number of records allowed per workspace is 25,000.

For more information about the training data guidelines, see Using your own data in IBM documentation.
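
Several of these guidelines lend themselves to a quick pre-upload cleanup. The Python sketch below drops empty rows and exact duplicates, truncates long text to 1,024 characters, and reports categories that fall below the lower end of the suggested 20-30 examples. The helper names `clean_rows` and `underrepresented` are hypothetical, and the threshold of 20 is an assumption taken from that range.

```python
from collections import Counter

MAX_CHARS = 1024    # longer input text is truncated
MIN_EXAMPLES = 20   # assumed threshold: lower end of the 20-30 guideline


def clean_rows(rows):
    """Drop blank rows and exact duplicates; truncate overly long text."""
    cleaned, seen = [], set()
    for text, category in rows:
        text, category = text.strip(), category.strip()
        if not text or not category:
            continue                      # blank text or blank category
        text = text[:MAX_CHARS]
        if (text, category) in seen:
            continue                      # exact duplicate example
        seen.add((text, category))
        cleaned.append((text, category))
    return cleaned


def underrepresented(rows, minimum=MIN_EXAMPLES):
    """Return {category: count} for categories with fewer than `minimum` rows."""
    counts = Counter(category for _, category in rows)
    return {name: n for name, n in counts.items() if n < minimum}


raw = [("Reset my password", "Access"), ("Reset my password", "Access"),
       ("", "Access"), ("Screen flickers", "")]
rows = clean_rows(raw)
print(rows)
print(underrepresented(rows))
```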

Where to go from here

To understand how to train and test the cognitive service for a custom application, see Training and testing the cognitive service for a custom application.