Ingesting data into BMC HelixGPT

After defining the data sources for the chatbot, knowledge article search, and summarization use cases, you must ingest data into the database through data connection jobs. The data connection jobs collect data from the configured data sources and ingest the data into the BMC HelixGPT database.

After the data is ingested, users receive responses to their queries from the information that is ingested into the BMC HelixGPT database. Users have a seamless experience of getting the appropriate answer across multiple data sources.

To ingest data into BMC HelixGPT, you can set the scheduler rules that are available out of the box.

You can enable BMC HelixGPT to read text from attachments linked to BMC Helix Innovation Studio record definitions.

BMC HelixGPT can read text data from the following attachment types:

txt
docx
pdf

However, the attachments linked to a record definition are unavailable as an out-of-the-box data source. You must create a data connection if you plan to use text from attachments.
For more information about using text from attachments linked to record definitions, see Read text from attachments.
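
As a quick illustration of the supported attachment types listed above, the following sketch filters a list of file names before they are attached to a record. The function name and the sample file names are hypothetical, and matching the extension without regard to case is an assumption; this page does not state how BMC HelixGPT treats uppercase extensions.

```python
from pathlib import Path

# File types that this page lists as readable by BMC HelixGPT.
SUPPORTED_EXTENSIONS = {".txt", ".docx", ".pdf"}


def is_supported_attachment(filename: str) -> bool:
    """Return True if the file extension is one of the listed attachment types.

    Assumption: the extension check is case-insensitive.
    """
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


# Hypothetical file names, used only to illustrate the check.
candidates = ["runbook.pdf", "notes.TXT", "diagram.png", "policy.docx"]
readable = [name for name in candidates if is_supported_attachment(name)]
print(readable)
```

Files that fail this check, such as the image in the sample list, would need to be converted to one of the listed formats before their text can be read.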

Before you begin

You must have the HelixGPT Administrator role to ingest data into BMC HelixGPT.

Process for setting up BMC HelixGPT

The following image shows the process of setting up BMC HelixGPT and the current step that you are on:

(Optional) Configure tags in data sources for indexing

Adding tags as metadata in data sources helps you differentiate the content indexed from the different connections of the same data source. You can add tags relevant to your data sources, such as product name and version.

Before you add tags to your data sources, make sure to update your router prompt with the same tags. Note that the tags are case sensitive.

The following code is a sample router prompt with the tags defined in it:

You are an intelligent virtual assistant and you need to decide whether the input text is one of the catalog services or information request.
This is a classification task that you are being asked to predict between the classes: catalog services or information or tools requests.
Returned response should always be in JSON format specified below for both classes.
{global_prompt}
Do not include any explanations, only provide a RFC8259 compliant JSON response following this format without deviation:
{{
        "classificationType": "catalog service",
        "nextPromptType": next prompt type,
        "services": [
                        {{
                            "serviceName": "service name",
                            "confidenceScore": confidence score,
                            "nextPromptType": "prompt type"
                        }},
                        {{
                            "serviceName": "some other service",
                            "confidenceScore": confidence score,
                            "nextPromptType": "some other prompt type"
                        }}
                    ],
        "userInputText": "input text here",
        "filters": {{
                 "Product Name": "product name if specified",
                 "Version": "version if specified"
         }}
    }}

Ensure these guidelines are met.

0. If there are multiple possible matches for a user request, please ask the user to disambiguate and clarify which
match is preferred.

1. If user input text is a question that begins with "How", "Why", "How to" or "How do", classify the
input text as 'information request' in the classification field of the result JSON. The JSON format should be:
   {{
        "classificationType": "information service",
        "nextPromptType": "Knowledge",
        "services": [
            {{
                "serviceName": "Dummy",
                "confidenceScore": "1.0",
                "nextPromptType": "Knowledge"
            }}
        ],
        "userInputText": "....",
        "filters": {{
                 "Product Name": "product name if specified",
                 "Version": "version if specified"
         }}
    }}
    In case the classification type is "information service" then don't change the attribute value for 'nextPromptType' in the JSON.
For information requests, determine if there are any specific product names mentioned that correspond to the following tags. If there are synonyms,
convert them to tags and use those.

Tags:
"product_name": Can be one of "DWP", "ITSM", "AMI", "HELIXGPT", "VPN" etc.
"version": Any version number mentioned

Synonyms for "ITSM" are: "BMC Helix ITSM", "BMC ITSM", "ITSM", "Helix IT service management"
Synonyms for "DWP" are: "BMC Digital Workplace", "BMC DWP", "DWP", "Digital Workplace"

If these product names are present in the user's query, include them in the "filters" field of the JSON response.
If no specific product names or versions are mentioned, leave the filters empty. You can have just product names in queries.
Examples:

For the query "How to add a process in DWP 23.03?", the filters would be:
"filters": {{
"Product Name": "DWP",
"Version": "23.03"
}}
For the query "What are the new features in IT Service management 22.11?", the filters would be:
"filters": {{
"Product Name": "ITSM",
"Version": "22.11"
}}
For a general query like "How to troubleshoot network issues?", the filters would be empty:
"filters": {{}}

2. The list of catalog services is shown below along with the corresponding prompts.

Use only this list.

List of catalog services and corresponding prompt types are:
~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~

3. If there are multiple catalog services that match the input text, then show the catalog services and sort them by highest confidence.
Set the "services" field in the result JSON. 'text' field should have the input text. Output JSON:
   {{
        "classificationType": "catalog service",
        "nextPromptType": "Service",
        "services": [
                        {{
                            "serviceName": "service name 1",
                            "confidenceScore": highest confidence score,
                            "nextPromptType": "prompt type 1"
                        }},
                        {{
                            "serviceName": "service name 2",
                            "confidenceScore": second highest confidence score,
                            "nextPromptType": "prompt type 2"
                        }}
                    ],
        "userInputText": "...."
    }}

4. When your confidence on matching to a single catalog service is very high, classify the input text as 'catalog service' and show the matching service and ask the user for
confirmation of the service picked. Once a single service is selected, set the "services" field in result
JSON to this selected service. 'text' field should have the input text. Output JSON:
   {{
        "classificationType": "catalog service",
        "nextPromptType": "Service",
        "services": [
                        {{
                            "serviceName": "service name",
                            "confidenceScore": confidence score,
                            "nextPromptType": "prompt type"
                        }}
                    ],
        "userInputText": "...."
    }}

5. If the user input text is a query about
    a. a request or a service request,
    b. a list of requests or a list of service requests
    c. an appointment or a list of appointments
    d. a task or a list of tasks,
    e. a todo or a list of todos
    f. what is the status of request REQXXXX
    g. what is the details of request REQXXXX
    h. summarize requests
    i. an existing request
    j. contains a string like REQXXXX
    k. what is the status of request XXXX
    l. what is the details of request XXXX
    m. contains a string like XXXX
    n. an existing ticket or incident,
    o. list of tickets or incidents,
    p. details of a ticket or incident,
    q. show my tickets
    r. summarize tickets or incidents
then classify the input text as 'requests' in the classification field of the result JSON. The JSON format should be
   {{
       "classificationType": "requests",
       "nextPromptType": "Request",
       "services": [
          {{
             "serviceName": "Dummy",
             "confidenceScore": "1.0",
             "nextPromptType": "Request"
          }}
       ],
       "userInputText": "...."
    }}

6. If the user input text is a query about
    a. connect to an agent
    b. want to talk to agent
    c. chat with live agent
    d. live agent
    e. agent
then classify the input text as 'live chat' in the classification field of the result JSON. The JSON format should be
   {{
       "classificationType": "live chat",
       "nextPromptType": "Live Chat",
       "services": [
          {{
             "serviceName": "LiveChatService",
             "confidenceScore": "1.0",
             "nextPromptType": "Live Chat"
          }}
       ],
       "userInputText": "...."
    }}

7. If the user input text doesn't match any of the other classifications,
then classify the input text as 'fallback' in the classification field of the result JSON. The JSON format should be
   {{
       "classificationType": "fallback",
       "nextPromptType": "Fallback",
       "services": [
          {{
             "serviceName": "FallbackService",
             "confidenceScore": "1.0",
             "nextPromptType": "Fallback"
          }}
       ],
       "userInputText": "...."
    }}

8. Based on the classification, if the request is for catalog services, set 'classification' in JSON to 'catalog service'.
9. Based on the classification, if the request is for information request, set 'classification' in JSON to 'information request'.
10. Based on the classification, if the request is for request or ticket or incident, set 'classification' in JSON to 'requests'

11. Based on the classification, if the request is for live chat, set 'classification' in JSON to 'live chat'
12. Based on the classification, if the request is for fallback, set 'classification' in JSON to 'fallback'
13. ONLY EVER SEND A JSON RESPONSE, NEVER SEND INFORMATION OR A SUMMARY. THIS IS THE MOST IMPORTANT RULE TO FOLLOW.

14. If user input text is a greeting that contains phrases such as "hi", "hello", "how are you", "How do you do" etc., or if it is an expression of gratitude such as "thank you" or similar, then classify the
input text as 'response request' in the classification field of the result JSON. The JSON format should be:
   {{
  "classificationType": "response service",
  "nextPromptType": "Response",
  "services": [
   {{
    "serviceName": "Dummy",
    "confidenceScore": "1.0",
    "nextPromptType": "Response"
   }}
  ],
  "userInputText": "...."
}}
In case the classification type is "response service" then don't change the attribute value for 'nextPromptType' in the JSON.

{input}
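
The sample prompt above doubles every literal JSON brace ({{ and }}) and uses single braces for the {global_prompt} and {input} placeholders. That convention matches format-string style templating, although this page does not name the template engine, so treat that as an assumption. The following sketch uses Python's built-in str.format on a shortened excerpt to show how the doubled braces render, and then reads the filters from a hypothetical router response. The global_prompt value, the function name, and the sample response are illustrative only.

```python
import json

# Shortened excerpt in the style of the sample router prompt. Doubled braces
# are literal JSON braces; single braces are placeholders filled at run time.
TEMPLATE = (
    "{global_prompt}\n"
    '"filters": {{\n'
    '    "Product Name": "product name if specified",\n'
    '    "Version": "version if specified"\n'
    "}}\n"
    "{input}"
)

rendered = TEMPLATE.format(
    global_prompt="Answer in English.",          # hypothetical value
    input="How to add a process in DWP 23.03?",  # query from the prompt's examples
)
print(rendered)


def extract_filters(response_text: str) -> dict:
    """Parse a router response and return its "filters" object (empty if absent)."""
    data = json.loads(response_text)
    return data.get("filters", {})


# Hypothetical response that follows the "information service" format above.
SAMPLE_RESPONSE = """
{
  "classificationType": "information service",
  "nextPromptType": "Knowledge",
  "services": [
    {"serviceName": "Dummy", "confidenceScore": "1.0", "nextPromptType": "Knowledge"}
  ],
  "userInputText": "How to add a process in DWP 23.03?",
  "filters": {"Product Name": "DWP", "Version": "23.03"}
}
"""
print(extract_filters(SAMPLE_RESPONSE))
```

Because the prompt asks for an RFC 8259 compliant response, a strict parser such as json.loads rejects trailing commas or missing separators, which is a convenient way to detect a malformed reply.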

To add tags to data sources:

On the Innovation Studio > Workspace tab, select HelixGPT Manager.
Select the Connection record definition and click Edit data.
The Data editor page opens.
The tags you add must be the same as those added in the router prompt.
In Data editor, click the data source for which you want to add the tags.
The Edit record window opens.
In the Edit record window, add the required tags.
Click Save.
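
Because tags are case sensitive and must match the router prompt, it can help to compare the two sets of tag names before saving. The following sketch reports data source tags that have no exact match in the router prompt and, where possible, the router tag that differs only by case. The function name and the sample tag sets are hypothetical; how tags are stored on the Connection record is not described on this page.

```python
from typing import Dict, Optional, Set


def find_tag_mismatches(router_tags: Set[str], source_tags: Set[str]) -> Dict[str, Optional[str]]:
    """Map each data source tag without an exact router match to the router
    tag that differs only by case, or to None if no such tag exists."""
    by_lower = {tag.lower(): tag for tag in router_tags}
    mismatches: Dict[str, Optional[str]] = {}
    for tag in sorted(source_tags):
        if tag not in router_tags:  # exact, case-sensitive comparison
            mismatches[tag] = by_lower.get(tag.lower())
    return mismatches


router_tags = {"Product Name", "Version"}          # keys of "filters" in the sample prompt
source_tags = {"Product name", "Version", "Team"}  # hypothetical tags on a connection
print(find_tag_mismatches(router_tags, source_tags))
# {'Product name': 'Product Name', 'Team': None}
```

An empty result means that every tag on the data source matches a router prompt tag exactly.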

Ingesting data in BMC HelixGPT

Perform the following tasks to ingest data into BMC HelixGPT:

Step	Action	Reference
1	Set scheduler rules available out-of-the-box.	To ingest data by specifying a schedule
1	(Optional) Create a connection to read text from attachments.	To read text from attachments
2	Create a data connection job.	To ingest data by creating a data connection job
3	Verify the data connection.	To verify the data connection

Task 1: To ingest data by specifying a schedule

Scheduling index updates keeps the indexes current with the latest changes in the data sources. An up-to-date index helps models find the required data more efficiently.

In BMC HelixGPT Manager, click Settings.
Select HelixGPT > Connections > Information sources.
Click the connection name for which you want to add the schedule.
The Edit connection window opens.
Click Schedule index updates.
The Schedule section opens.

In the Schedule section, specify the following fields:

Field	Description
Month dates	Select the dates of the month when you want to run the index update job.
Days of the week	Select the days of the week when you want to run the index update job.
Time	Set the time to run the job on the scheduled dates and days.

Click Save.

To add the schedule for index updates for new connections, first add a connection and then edit it to configure the schedule.
When you select both dates of the month and days of the week, the indexing job runs on every selected date and also on every selected day of the week.
The index updates already exist for the out-of-the-box information sources; you can change the existing configuration by following the steps in this section.
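
To make the combined effect of the Month dates and Days of the week fields concrete, the following sketch treats the two selections as a union, which is how the note above describes them. The function and variable names are hypothetical and do not correspond to any BMC HelixGPT API.

```python
from datetime import date
from typing import Set

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def runs_on(day: date, month_dates: Set[int], weekdays: Set[str]) -> bool:
    """Return True if the index update job runs on `day`.

    The job runs when the date of the month is selected OR the day of the
    week is selected (union of both selections).
    """
    return day.day in month_dates or WEEKDAY_NAMES[day.weekday()] in weekdays


# Hypothetical schedule: the 1st and 16th of each month, plus every Monday.
month_dates = {1, 16}
weekdays = {"Monday"}

print(runs_on(date(2024, 7, 8), month_dates, weekdays))   # Monday the 8th
print(runs_on(date(2024, 7, 16), month_dates, weekdays))  # Tuesday the 16th
print(runs_on(date(2024, 7, 17), month_dates, weekdays))  # Wednesday the 17th
```

With this schedule, the job runs on the first two dates (one matches a selected day of the week, the other a selected date) and does not run on the third.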

(Optional) To read text from attachments linked to BMC Helix Innovation Studio record definitions

You can enable BMC HelixGPT to read text from attachments linked to a record definition. To do this, you must connect with an existing record definition.

On the Innovation Studio > Workspace tab, select HelixGPT Manager.
Select a record definition and click Edit data.

The data editor is displayed.
In Data editor, click New to add a new record.
The New record dialog box is displayed.
On the File tab, select an attachment that you want to add.
You can add a text file, a Microsoft Word file, or a PDF file as an attachment.
In the Name field, enter the name of the file you want to attach.
Click Save.

The new record that you created is available in HelixGPT Manager.

To create a connection record

In the HelixGPT Manager, select Connection_Record_Definition and click Edit data.
Click New to create a new connection record.
The New record dialog box is displayed.
In the DataSource ID list, select RECORD_DEFINITION.
In the Field ID field, enter the Field ID of the File field on any record definition.
For example, RecordDefinitionSample.
In the Record Definition field, enter the name of the record definition you have given.
In the Name field, enter the name of the record definition.
For example, RecordDefinitionSample.
Click Save.

A connection record is created to read data from attachments.

Task 2: To ingest data by creating a data connection job

You can ingest data into the BMC HelixGPT database by creating a data connection job. All published documents are ingested from the data sources into BMC HelixGPT. From the SharePoint and Confluence data sources, attached documents, such as PDFs, Microsoft Word documents, and plain text files, are ingested. You can also ingest a single document by specifying the document or article ID. However, SharePoint web pages are not ingested.

Log in to BMC Helix Innovation Studio.
On the Workspace tab, click HelixGPT Manager.
On the Records tab, select the DataConnectionJob record definition and click Edit data, as shown in the following image:
On the Data Editor (DataConnectionJob) page, click New.

In the New Record pane, specify the following information:

In the Data source field, enter one of the following data sources:

Data source	Value to be entered
BMC Helix Business Workflows	BWF
BMC Helix Knowledge Management by ComAround	HKM
BMC Helix ITSM: Knowledge Management	RKM
Confluence	CNF
Microsoft SharePoint Online	SPT
Web	WEB
BMC Helix Customer Service Management	CSM
Salesforce Knowledge	SALESFORCE_KNOWLEDGE
Attachments of a record definition	RECORD_DEFINITION

Specify a description for the data connection job.
(Optional) Specify the Assignee.
Specify the Connection ID.
The Connection ID is the ID that you noted when you added the data source successfully in HelixGPT Manager. For more information, see Adding data sources in BMC HelixGPT.

(Optional) To ingest a single document, specify the DocDisplayId and DocId, or click Attach file, and select a file.
The DocDisplayId or DocId is the unique ID of the single document that you want to upload, such as the article display ID in BMC Helix ITSM: Knowledge Management, content ID in BMC Helix Knowledge Management by ComAround, and article UUID or content ID in BMC Helix Business Workflows.
The following table shows the usage of DocDisplayId and DocId:

Data source	Inputs	Example	Scope	Notes
RKM	NA	NA	All RKM articles	NA
RKM	DocId = <article instance ID>	Datasource = RKM; DocId = KMHAA5V0GPLUUANDADAXGA6CSQG49C	A single RKM article	Use the instance ID of RKM:KnowledgeArticleManager.
RKM	DocDisplayId = <article display ID>	Datasource = RKM; DocDisplayId = KBA90000067	A single RKM article	The display ID is visible in the BMC Helix ITSM: Knowledge Management user interface.
HKM	NA	NA	All HKM articles	NA
HKM	DocId = <article "content ID">	Datasource = HKM; DocId = 1721446-2537-1033-1772837	A single HKM article	NA
HKM	ConnectionId = <Connection_HKM record ID>	Datasource = HKM; ConnectionId = AGGADGG8ECDC2ASI46SDSI46SD3O1X	All HKM articles, ingested while impersonating the specified user.	Using a connection allows a user to impersonate another user when connecting to BMC Helix Knowledge Management by ComAround. This is sometimes needed because the default IS user might not have the correct group mappings in BMC Helix Knowledge Management by ComAround. To specify such a user, you must create or update a record in the Connection_HKM record definition.
BWF	NA	NA	All BWF articles	NA
BWF	DocId = <article UUID>	DocId = AGGADG1AAP0ICAOQVYJ6OPZVOTL7BU	A single BWF article	Use field 379 of BWF:KnowledgeArticleTemplate.
BWF	DocDisplayId = <article "Content ID">	DocDisplayId = KA-000000000007	A single BWF article	The ID is visible in the BMC Helix Business Workflows user interface.

To run the job immediately, enable the Execute now toggle key.
If you are updating data, in the ModifiedSince field, specify the date and time of the last update.
Use this option for incremental updates, in which only documents modified since the specified date and time are indexed.
To delete the data from BMC HelixGPT that has been deleted from the source, select Sync deletions.
The following screen shows an example of creating a new data connection job:

Click Save.

Repeat the steps to add multiple data connection jobs.
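
The fields described in the steps above can be summarized in a small data model. The following sketch is illustrative only: the class, attribute, and method names are hypothetical and mirror the form labels rather than any BMC HelixGPT API, and treating the description as mandatory is an assumption. The data source codes and the sample DocDisplayId come from the tables in this section; the connection ID is a placeholder.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# Data source codes from the "Value to be entered" column in this section.
DATA_SOURCE_CODES = {
    "BWF", "HKM", "RKM", "CNF", "SPT", "WEB", "CSM",
    "SALESFORCE_KNOWLEDGE", "RECORD_DEFINITION",
}


@dataclass
class DataConnectionJob:
    # Hypothetical attribute names that mirror the New Record form labels.
    data_source: str
    description: str
    connection_id: str
    doc_id: Optional[str] = None
    doc_display_id: Optional[str] = None
    modified_since: Optional[datetime] = None
    sync_deletions: bool = False
    execute_now: bool = False

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the input looks valid."""
        problems = []
        if self.data_source not in DATA_SOURCE_CODES:
            problems.append(f"Unknown data source: {self.data_source!r}")
        if not self.description:
            problems.append("Description is required")
        return problems

    def scope(self) -> str:
        """Describe which documents the job would ingest."""
        if self.doc_id or self.doc_display_id:
            return "single document"
        if self.modified_since:
            return "documents modified since " + self.modified_since.isoformat()
        return "all published documents"


job = DataConnectionJob(
    data_source="RKM",
    description="Ingest one knowledge article",
    connection_id="<connection ID>",   # placeholder, not a real ID
    doc_display_id="KBA90000067",      # sample display ID from the table above
    execute_now=True,
)
print(job.validate(), job.scope())
```

Checking the data source code before saving catches the most likely input error, which is entering the product name instead of its code.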

Verifying data ingestion

Data ingestion takes place one item at a time, and the time required to complete the ingestion depends on the number of documents to be ingested and the amount of data. If users submit queries during data ingestion, the responses might be incorrect or incomplete. Therefore, it is important to verify that data ingestion is completed successfully.

Log in to BMC Helix Innovation Studio.
On the Workspace tab, click HelixGPT Manager.
On the Records tab, select the DataConnectionJobStep record definition and click Edit data.
Verify that the status of the job that you created is DONE.
The following image shows sample jobs with the DONE status:
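
If you prefer to script the check described in the steps above, the logic amounts to polling until every job step reports DONE. The following sketch assumes a caller-supplied function that returns the current statuses; how the DataConnectionJobStep records are actually retrieved is outside the scope of this page. The function name, the polling parameters, and the IN_PROGRESS status are hypothetical. Only DONE is documented here.

```python
import time
from typing import Callable, Iterable


def wait_until_done(fetch_statuses: Callable[[], Iterable[str]],
                    poll_seconds: float = 30.0,
                    max_polls: int = 20) -> bool:
    """Poll until every job step reports DONE; return False if polls run out."""
    for attempt in range(max_polls):
        statuses = list(fetch_statuses())
        # Require at least one step so that an empty result is not read as success.
        if statuses and all(status == "DONE" for status in statuses):
            return True
        if attempt < max_polls - 1:
            time.sleep(poll_seconds)
    return False


# Simulated status snapshots; a real check would read DataConnectionJobStep records.
_snapshots = iter([["IN_PROGRESS", "DONE"], ["DONE", "DONE"]])
print(wait_until_done(lambda: next(_snapshots), poll_seconds=0))  # True
```

Waiting for this check to succeed before users start asking questions avoids the incorrect or incomplete responses that can occur during ingestion.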

Result

The following screenshot shows BMC HelixGPT fetching data from a PDF file attached to a record definition:

Where to go from here

Provisioning and setting up the generative AI provider for your application