Types of data sets used to train and test the cognitive service
In addition to training the cognitive service, you can test whether the cognitive service is trained correctly so that it predicts the desired categories. You can evaluate the cognitive service against the following test metrics:
- Accuracy
- Precision
- Recall
- F-score
For more information about the test metrics and how to use them to improve the training data set, see Leveraging-machine-learning-metrics-to-improve-cognitive-service-data-sets.
For more information about testing the chatbot application, see BMC Helix Virtual Agent documentation.
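The following Python sketch illustrates how the four metrics listed above relate to one another. It uses the standard textbook definitions and computes precision, recall, and F-score per category from a list of expected and predicted categories. The function name `per_category_metrics` and the sample data are hypothetical, and the way the product itself aggregates these metrics might differ.

```python
from collections import Counter


def per_category_metrics(actual, predicted):
    """Return overall accuracy and per-category precision, recall, and F-score."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1      # correctly predicted category
        else:
            fp[p] += 1      # category p was predicted but is wrong
            fn[a] += 1      # category a was expected but missed

    accuracy = sum(tp.values()) / len(actual) if actual else 0.0

    report = {}
    for cat in sorted(set(actual) | set(predicted)):
        predicted_total = tp[cat] + fp[cat]
        actual_total = tp[cat] + fn[cat]
        precision = tp[cat] / predicted_total if predicted_total else 0.0
        recall = tp[cat] / actual_total if actual_total else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        report[cat] = {"precision": precision, "recall": recall, "f_score": f_score}
    return accuracy, report


# Hypothetical test results: expected category versus predicted category.
actual = ["Memory", "Memory", "Printer", "Printer", "Network"]
predicted = ["Memory", "Printer", "Printer", "Printer", "Network"]
accuracy, report = per_category_metrics(actual, predicted)
```

In this sample, one Memory text is predicted as Printer, which lowers the recall for Memory and the precision for Printer while the overall accuracy stays at 4 out of 5.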
Training data set and test data set
After creating a CSV file, administrators can specify the percentage by which the rows are split into training data and test data. For new data sets, by default, a random 80% of the rows are used as training data and the remaining 20% are used as test data. For existing data sets, by default, 100% of the rows are used as training data.
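To make the split concrete, the following Python sketch shuffles the rows and divides them by a percentage. It is only an illustration of a random percentage split, not the product's implementation; the function name `split_rows`, the `seed` parameter, and the sample rows are assumptions.

```python
import random


def split_rows(rows, train_percent=80, seed=None):
    """Randomly split rows into (training, test) lists by percentage."""
    if not 0 <= train_percent <= 100:
        raise ValueError("train_percent must be between 0 and 100")
    shuffled = list(rows)                    # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)    # seed only makes the sketch repeatable
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical rows: (input text, category).
rows = [(f"sample text {i}", "Memory") for i in range(10)]

train, test = split_rows(rows, train_percent=80, seed=42)    # default for new data sets
all_train, no_test = split_rows(rows, train_percent=100)     # default for existing data sets
```

With 10 rows, an 80% split yields 8 training rows and 2 test rows, and a 100% split leaves the test set empty.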
The following table explains the difference between training data sets and test data sets:
Training data set | Test data set |
---|---|
Used to train the cognitive service so that the input text is correctly categorized. | Used to test whether the cognitive service correctly categorizes the input text based on the training data. |
Can be a CSV file or runtime application data. | Can be a CSV file or runtime application data. |
For more information about testing the cognitive service, see Leveraging-machine-learning-metrics-to-improve-cognitive-service-data-sets.
CSV data set and application data set
You can create a CSV file of the data set or specify application data for training and testing AI Service Management (Categorization and Classification) capabilities:
Type of data set | Description | Reference |
---|---|---|
CSV data set | A CSV file is used to train and test the cognitive service. The CSV file contains an input text to be categorized and the category for the input text. The CSV file must have at least 5 rows and not more than 25,000 input texts associated with 2,000 categories. For applications that do not require continuous cognitive service training, you can train and test the cognitive service by using data from CSV files. | |
BMC Helix Innovation Studio data set | The application record definitions and record fields are used to train and test the cognitive service. For applications that are used in a continuously changing business environment that requires frequent training of the cognitive service, you can use application data to train the cognitive service. This approach helps the cognitive service get updated data and provide suggestions according to the business changes. | Not applicable |
CSV data set structure
A CSV file that is used for training and testing the cognitive service should have the following structure:
- A row represents a record.
- The first column represents the input text to categorize.
- The second column represents the category that applies to the input text to categorize:
  - If you have hierarchical categories, you must provide the leaf node as the category.
  - If you have multiple categories, you must separate them by using the pipe character.
    For example, Hardware | Component | Memory
- The length of the data must not exceed 1,024 characters. The data can contain all special characters.
  If you are using a backslash (\) within the entered data, you must use four backslashes (\\\\) instead of one backslash (\).
  Example: <Text1>\\\\<Text2>\\\\<Text3>. Enter hardware\\\\software instead of hardware\software.
- Each row must contain an input text value to categorize and a category that applies to the input text.
- Each record is terminated by an end-of-line character.
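The structure rules above can be sketched in Python as a small helper that builds rows and writes them with the standard `csv` module. This is a minimal sketch under stated assumptions: the helper names (`escape_backslashes`, `build_row`, `write_data_set`) are hypothetical, the file is assumed to have no header row, the 1,024-character limit is assumed to apply to the input text before backslashes are escaped, and multiple categories are joined with spaces around the pipe to match the example.

```python
import csv
import io

MAX_TEXT_LENGTH = 1024  # limit stated in the structure rules


def escape_backslashes(text):
    """Replace every single backslash with four backslashes."""
    return text.replace("\\", "\\" * 4)


def build_row(input_text, categories):
    """Return [input text, category column] or raise ValueError if invalid."""
    text = input_text.strip()
    cats = [c.strip() for c in categories if c.strip()]
    if not text or not cats:
        raise ValueError("each row needs an input text and at least one category")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError("input text exceeds 1,024 characters")
    # Multiple categories are separated by the pipe character.
    return [escape_backslashes(text), " | ".join(cats)]


def write_data_set(records, stream):
    """Write one record per line, each terminated by an end-of-line character."""
    writer = csv.writer(stream, lineterminator="\n")
    for input_text, categories in records:
        writer.writerow(build_row(input_text, categories))


# Hypothetical records written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_data_set(
    [
        ("My laptop shows a low memory warning", ["Memory"]),
        ("Install hardware\\software bundle", ["Hardware", "Software"]),
    ],
    buffer,
)
csv_text = buffer.getvalue()
```

To write a real file, open it with `open(path, "w", newline="", encoding="utf-8")` and pass the file object as `stream`.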
Examples of data sets
The following images provide an example of CSV data sets after they are split into training data and test data:
- Example of CSV training data
- Example of CSV test data
Limit of CSV data sets
The CSV data file must have at least 5 rows and not more than 25,000 input texts (intent examples) associated with 2,000 categories (intents). The maximum number of training data sets that you can create depends on your Watson Assistant Service license:
- Watson Assistant Standard Service—40,000 CSV training data sets in shared mode.
- Watson Assistant Premium Service—600 CSV training data sets in dedicated mode.
For more information about Watson Assistant Service licensing, see IBM Watson Conversation Service documentation.
Guidelines to create a training data CSV file
When you create a training data CSV file, follow these guidelines:
- Eliminate empty rows with blank text or blank categories.
- Remove rows with exact duplicate examples.
- Create balanced data sets with a roughly similar number of examples for all the categories (roughly 20-30 examples for each category). If a category has fewer input texts, the test metrics for that category tend to be lower.
- Limit the length of input text to fewer than 60 words (1,024 characters).
  Text that exceeds 1,024 characters is truncated.
- Limit the number of categories to several hundred classes.
- Include multiple classes only if the text is vague and identifying a single class is not clear.
- Limit the number of records to 25,000 associated with 2,000 classes.
The minimum number of records allowed is 5 and the maximum number of records allowed per workspace is 25,000.
For more information about the training data guidelines, see Using your own data in IBM documentation.
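Several of the guidelines above can be checked before you upload a file. The following Python sketch drops rows with blank text or blank categories, removes exact duplicates, truncates text to 1,024 characters, and reports categories that have fewer examples than a chosen minimum. The function names and the threshold of 20 (the lower end of the suggested 20-30 range) are assumptions made for illustration, and the local truncation only mirrors the truncation behavior described above.

```python
from collections import Counter

MAX_CHARS = 1024     # longer text is truncated, per the guidelines
MIN_EXAMPLES = 20    # assumed threshold: lower end of the 20-30 range


def clean_rows(rows):
    """Drop blank rows and exact duplicates; truncate overlong input text."""
    seen = set()
    cleaned = []
    for text, category in rows:
        text, category = text.strip(), category.strip()
        if not text or not category:
            continue                      # blank text or blank category
        text = text[:MAX_CHARS]
        if (text, category) in seen:
            continue                      # exact duplicate example
        seen.add((text, category))
        cleaned.append((text, category))
    return cleaned


def underrepresented(rows, minimum=MIN_EXAMPLES):
    """Return {category: count} for categories with fewer than `minimum` examples."""
    counts = Counter(category for _, category in rows)
    return {cat: n for cat, n in counts.items() if n < minimum}


# Hypothetical raw rows: (input text, category).
raw = [
    ("Printer is jammed", "Printer"),
    ("Printer is jammed", "Printer"),   # duplicate
    ("", "Printer"),                    # blank text
    ("VPN drops every hour", ""),       # blank category
    ("x" * 2000, "Network"),            # longer than 1,024 characters
]
cleaned = clean_rows(raw)
sparse = underrepresented(cleaned)
```

In this sample, only two rows survive the cleanup, and both categories are reported as underrepresented because each has a single example.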
Where to go from here
To understand how to train and test the cognitive service for a custom application, see Training-and-testing-the-cognitive-service-for-a-custom-application.