Types of data sets used to train and test the cognitive service
BMC Helix Platform Cognitive Service provides auto-categorization, auto-assignment, and chatbot capabilities. To utilize these capabilities in an application, you must train the IBM Watson Assistant Service to work with your data. A developer might create a sample training data set for your application. An administrator updates the sample data set or creates new training data sets. The administrator then uses the training data sets in existing or new processes or rules for the application.
In addition to training the cognitive service, you can test whether the cognitive service is trained correctly so that it predicts the desired categories. You can evaluate the cognitive service against a set of test metrics.
For more information about the test metrics and how to use them to improve the training data set, see Leveraging machine learning metrics to improve cognitive service data sets. For more information about testing the chatbot application, see .
Training data set and test data set
After creating a CSV file, administrators can specify the percentages used to split the rows into training data and test data. For new data sets, by default, a random 80% of the rows are used as training data and the remaining 20% are used as test data. For existing data sets, by default, 100% of the rows are used as training data.
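The default split described above can be pictured with a minimal Python sketch. It only illustrates what a random 80/20 split of rows looks like; it is not the platform's own implementation, and the function name `split_rows` and its parameters are hypothetical.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=None):
    """Randomly split rows into (training, test) lists.

    Hypothetical helper for illustration only; the platform performs
    its own split when you set the percentages.
    """
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ten made-up rows of (input text, category).
rows = [(f"sample text {i}", "Hardware") for i in range(10)]
training, test = split_rows(rows, train_fraction=0.8, seed=42)
print(len(training), len(test))  # 8 2
```

Passing `train_fraction=1.0` mirrors the default for existing data sets, where every row is used as training data and the test list is empty.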
The following table explains the difference between training data sets and test data sets:
|Training data set||Test data set|
|Used to train the cognitive service so that the input text is correctly categorized.|
|Can be a CSV file or runtime application data.|
For more information about testing the cognitive service, see Leveraging machine learning metrics to improve cognitive service data sets.
CSV data set and application data set
You can create a CSV data set file or specify application data for training and testing BMC Helix Platform Cognitive Service:
|Type of data set||Description||Reference|
|CSV data set|
A CSV file is used to train and test the cognitive service. The CSV file contains an input text to be categorized and the category for the input text.
The CSV file must have at least 5 rows and not more than 25,000 input texts associated with 2,000 categories.
For applications that do not require continuous cognitive service training, you can train and test the cognitive service by using data from CSV files.
BMC Helix Platform data set
The application record definitions and record fields are used to train and test the cognitive service.
For applications that are used in a continuously changing business environment that requires frequent training of the cognitive service, you can use application data to train the cognitive service. This approach helps the cognitive service get updated data and provide suggestions according to the business changes.
CSV data set structure
A CSV file that is used for training and testing the cognitive service should have the following structure:
- A row represents a record.
- The first column represents the input text to categorize.
- The second column represents the category that applies to the input text to categorize:
  - If you have hierarchical categories, you must provide the leaf node as the category.
  - If you have multiple categories, you must separate them by using the pipe character.
    For example, Hardware | Component | Memory
- The length of the data must not exceed 1,024 characters. The data can contain all special characters.
- If you are using a backslash (\) within the entered data, you must use four backslashes (\\\\) instead of one backslash (\).
  Example: <Text1>\\\\<Text2>\\\\<Text3>. Enter hardware\\\\software instead of hardware\software.
- Each row must contain an input text value to categorize and a category that applies to the input text.
- Each record is terminated by an end-of-line character.
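As a rough illustration of the structure rules above, the following Python sketch builds one two-column row and writes it with the standard `csv` module. The helper names `escape_backslashes` and `make_row` are hypothetical, and two points are assumptions rather than documented behavior: that the 1,024-character limit is checked on the input text column, and that it is checked after the backslashes are expanded.

```python
import csv
import io

MAX_LENGTH = 1024  # character limit from the structure rules

def escape_backslashes(text):
    # Replace every single backslash with four backslashes.
    return text.replace("\\", "\\" * 4)

def make_row(input_text, categories):
    """Return [input text, category] for one record (hypothetical helper)."""
    text = escape_backslashes(input_text)
    category = " | ".join(categories)  # pipe-separated multiple categories
    if not text.strip() or not category.strip():
        raise ValueError("each row needs an input text and a category")
    if len(text) > MAX_LENGTH:  # assumption: limit applies to the input text
        raise ValueError("input text exceeds 1,024 characters")
    return [text, category]

buffer = io.StringIO()
# Each record ends with an end-of-line character.
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(make_row("Laptop shows hardware\\software error",
                         ["Hardware", "Component", "Memory"]))
csv_text = buffer.getvalue()
print(csv_text, end="")
```

In the Python source, `"hardware\\software"` is a string with a single backslash, so the written file contains four backslashes between the two words.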
Examples of data sets
The following images provide an example of CSV data sets after they are split into training data and test data:
- Example of CSV training data
- Example of CSV test data
Limit of CSV data sets
The CSV data file must have at least 5 rows and not more than 25,000 input texts (intent examples) associated with 2,000 categories (intents). The maximum number of training data sets that you can create depends on your Watson Assistant Service license:
- Watson Assistant Standard Service—40,000 CSV training data sets in shared mode.
- Watson Assistant Premium Service—600 CSV training data sets in dedicated mode.
For more information about Watson Assistant Service licensing, see .
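Before uploading a file, the row and category limits stated above could be checked locally with a short Python sketch like the one below. The function `check_limits` is hypothetical, and counting distinct categories by splitting the category column on the pipe character is an assumption about how categories are counted.

```python
MIN_ROWS = 5            # minimum number of rows
MAX_ROWS = 25_000       # maximum number of input texts
MAX_CATEGORIES = 2_000  # maximum number of categories

def check_limits(rows):
    """Return a list of problems; an empty list means the limits are met."""
    problems = []
    categories = set()
    for _text, category_field in rows:
        # Assumption: pipe-separated names count as separate categories.
        for name in category_field.split("|"):
            name = name.strip()
            if name:
                categories.add(name)
    if len(rows) < MIN_ROWS:
        problems.append(f"only {len(rows)} rows; at least {MIN_ROWS} are required")
    if len(rows) > MAX_ROWS:
        problems.append(f"{len(rows)} rows; at most {MAX_ROWS} are allowed")
    if len(categories) > MAX_CATEGORIES:
        problems.append(
            f"{len(categories)} categories; at most {MAX_CATEGORIES} are allowed")
    return problems

sample = [("Screen flickers", "Hardware"), ("Cannot log in", "Access")]
print(check_limits(sample))
```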
When you create a training data CSV file, follow these guidelines:
- Eliminate empty rows with blank text or blank categories.
- Remove rows with exact duplicate examples.
- Create balanced data sets with a roughly similar number of examples for each category (roughly 20-30 examples per category). If a category has fewer input texts, the test metrics for that category tend to be lower.
- Limit the length of input text to fewer than 60 words (1,024 characters).
Text that exceeds 1,024 characters is truncated.
- Limit the number of categories to several hundred classes.
- Include multiple classes only if the text is vague and identifying a single class is not clear.
- Limit the number of records to 25,000 associated with 2,000 classes.
The minimum number of records allowed is 5 and the maximum number of records allowed per workspace is 25,000.
For more information about the training data guidelines, see in IBM documentation.
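Several of the guidelines above lend themselves to a quick local clean-up pass before you upload a file. The following Python sketch is one possible way to do that, not a BMC or IBM utility: the names `clean_rows` and `underrepresented` are hypothetical, and the threshold of 20 examples is taken from the lower end of the suggested 20-30 range.

```python
from collections import Counter

MAX_CHARS = 1024   # longer input text is truncated
MIN_EXAMPLES = 20  # lower end of the suggested 20-30 examples per category

def clean_rows(rows):
    """Drop empty rows and exact duplicates, and truncate long input text."""
    seen = set()
    cleaned = []
    for text, category in rows:
        text, category = text.strip(), category.strip()
        if not text or not category:
            continue  # blank text or blank category
        text = text[:MAX_CHARS]  # truncate here so the cut is visible up front
        key = (text, category)
        if key in seen:
            continue  # exact duplicate example
        seen.add(key)
        cleaned.append(key)
    return cleaned

def underrepresented(rows, minimum=MIN_EXAMPLES):
    """Map each category with fewer than `minimum` examples to its count."""
    counts = Counter(category for _text, category in rows)
    return {name: n for name, n in counts.items() if n < minimum}

rows = [
    ("Printer is offline", "Hardware"),
    ("Printer is offline", "Hardware"),  # duplicate, removed
    ("", "Software"),                    # blank text, removed
    ("Cannot log in", "Access"),
]
cleaned = clean_rows(rows)
print(cleaned)
print(underrepresented(cleaned))
```

Categories reported by `underrepresented` are candidates for more examples, since sparse categories tend to score lower on the test metrics.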