Creating file system and database collections for searching external data


To include file system documents and database tables that are outside of BMC Helix ITSM: Knowledge Management, you must create an IBM Watson Discovery collection. After you create the collection, use the IBM Watson Data Crawler to upload this data to the IBM Watson Discovery collection. You can then map the collection to an external search data set in BMC Helix Innovation Studio.

Important

If you want to include BMC Helix ITSM: Knowledge Management articles in cognitive search, you must install and configure the BMC crawler. For more information, see Installing-and-configuring-the-cognitive-search-data-crawler-for-BMC-Helix-ITSM-Knowledge-Management-articles.

The following image outlines the tasks that you must perform to create the collection in IBM Watson Discovery. After you create the collection, you can upload your external data into this collection.

OverviewImage.png

Before you begin

  • You must have the IBM Watson Discovery account credentials (API key). 
  • You must have access to a Linux virtual machine on which the data crawler is set up. 
  • You must have the username and password of the database for which you want to crawl data.

To create a project and collection

Perform the following steps to create the project required for defining the IBM Watson Discovery collection.

  1. Log in to the IBM Watson Discovery V2 instance.
  2. Open the existing project in which you want to create the collection.
  3. If no project exists, create a new one in the IBM Watson Discovery V2 instance:
    1. Click New Project.
    2. On the Select project type page, enter Project name.
    3. From the Project type dropdown, select Document Retrieval.

      Important

      BMC supports only the Document Retrieval project type.

      For more information, see Create Project in IBM documentation.

  4. In the project, go to the Manage collections pane.
    ManageCollections.png

  5. Click New Collection.
    For more information, see Create Collections in IBM documentation.

  6. Upload data to your collection.
    For more information, see Upload data in IBM documentation.

To upload a stop word list

  1. Open an existing project.
  2. Open the Improve and customize pane.
    ImprovizeAndCustomize.png

  3. Go to Improve Relevance and select Stopwords.
    ImproveRelevance_Stopwords.png

  4. Click Upload stopwords to upload the stop word file.
    File format: <filename>.json
    IBM Watson Discovery ignores the words listed in the stop word file when searching.
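
    For example, a stop word file can be as simple as a JSON array of words (a minimal illustration of the <filename>.json format noted above; see the IBM documentation for the authoritative schema):

    Example stop word file (stopwords.json)
    ["a", "an", "and", "of", "the", "to"]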

To install the data crawler

You must use the data crawler to upload documents into the IBM Watson Discovery collection. Perform the following steps to install the data crawler.

Important

You must have the data crawler executable file for Linux in either DEB, RPM, or ZIP format. To get the executable file, contact BMC Support.

  1. Depending on the operating system, run the corresponding command to install the crawler:

    • On Red Hat and CentOS virtual machines that use rpm packages:
      rpm -i /full/path/to/rpm/package/rpm-file-name

    • On Ubuntu and Debian virtual machines that use deb packages:
      dpkg -i /full/path/to/deb/package/deb-file-name

    The crawler scripts are installed into the installation_directory/bin directory, such as /opt/ibm/crawler/bin, and also into the /usr/local/bin directory.

  2. Create a working directory, such as /home/config, and copy the contents of the installation_directory/share/examples/config folder into it.
  3. On the virtual machine, run the following commands to set the environment variables:

    Commands to set environment variables
    export JAVA_HOME=/opt/jdk
    export PATH=/opt/jdk/jre/bin:$PATH
    export PATH={installation_directory}/bin:$PATH

    With the data crawler installed, you can upload the file system documents and data from the database tables into the IBM Watson Discovery collection.
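
    Before crawling, you can optionally confirm that the crawler script and the JDK resolve from the updated PATH (a generic shell check; the crawler script name matches the command used in the upload steps below):

    Commands to verify the setup
    command -v crawler
    java -version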

To upload file system documents by using the crawler

After you have installed the crawler on the virtual machine, perform the following steps to configure the crawler for uploading file system-based documents to the IBM Watson Discovery collection.

Important

The following steps assume that you copied the contents of the installation_directory/share/examples/config folder to /home/config when you installed the data crawler. If you copied the contents to a different directory, replace /home/config with that path.

  1. On the virtual machine, go to the /home/config/connectors directory, open the filesystem.conf file, and specify the following values (a sample snippet follows this list):

    • protocol
      Action: Enter the name of the connector protocol used for the crawler.
      Value: sdk-fs

    • collection
      Action: Enter the attribute used to unpack temporary files.
      Value: crawler-fs

    • logging-config
      Action: Enter the file name that is used for configuring the logging options.
      Value: Must be formatted as a log4j XML string.

    • classname
      Action: Enter the Java class name for the connector.
      Value: plugin:filesystem.plugin@filesystem
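
    A minimal sketch of the resulting filesystem.conf entries, based on the values in the preceding list (the sample file copied from installation_directory/share/examples/config defines the full layout; the logging file name here is a hypothetical placeholder):

    Sample filesystem.conf values
    protocol = "sdk-fs"
    collection = "crawler-fs"
    logging-config = "logging.xml"   # hypothetical log4j XML file name
    classname = "plugin:filesystem.plugin@filesystem"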

  2. Go to the /home/config/seeds directory, open the filesystem-seed.conf file, and specify the following values (a sample value follows this list):

    • url
      Action: Enter the list of files and folders to upload. Use a newline character to separate each list entry.
      Value: For example, to crawl the /home/watson/mydocs folder, set this value to sdk-fs:///home/watson/mydocs.
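
    In the key = value form used by the shipped sample seed files, that entry might look as follows (a sketch; add further files or folders as additional newline-separated entries):

    Sample filesystem-seed.conf value
    url = "sdk-fs:///home/watson/mydocs"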

  3. Go to the /home/config/ directory, open the crawler.conf file, and specify the following values (a combined sketch follows this list):

    • crawl_config_file
      Action: Enter the path of the configuration file that you updated in step 1.
      Value: connectors/filesystem.conf

    • crawl_seed_file
      Action: Enter the path of the seed configuration file that you updated in step 2.
      Value: seeds/filesystem-seed.conf

    • output_adapter
      Action: Enter the following value:
        class = "com.ibm.watson.crawler.discoveryserviceoutputadapter.DiscoveryServiceOutputAdapter",
        config = "discovery_service",
        discovery_service {
          include "discovery/discovery_service.conf"
        },

    For other parameters, see Configuring crawler options.
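
    Taken together, the relevant crawler.conf fragment might look like this (a sketch that assumes the output_adapter block structure shown in the preceding list; the sample file defines the surrounding options):

    Sample crawler.conf fragment
    crawl_config_file = "connectors/filesystem.conf"
    crawl_seed_file = "seeds/filesystem-seed.conf"
    output_adapter {
      class = "com.ibm.watson.crawler.discoveryserviceoutputadapter.DiscoveryServiceOutputAdapter",
      config = "discovery_service",
      discovery_service {
        include "discovery/discovery_service.conf"
      },
    }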

  4. Go to the /home/config/discovery directory, open the discovery_service.conf file, and specify the following values (a sample snippet follows this list).

    To obtain the values for this step, log in to your IBM Watson Discovery account and navigate to the file system collection that you created.

    • project_id
      Action: Enter the Project ID of the collection.

    • collection_id
      Action: Enter the Collection ID of the collection.

    • configuration_id
      Action: Enter the Configuration ID of the collection.

    • configuration
      Action: Enter the complete path of this discovery_service.conf file, such as /home/config/discovery/discovery_service.conf.

    • apikey
      Action: Enter the API key of your IBM Watson Discovery service instance.

    Important

    The base URL changes based on the region of the IBM Watson Discovery service instance.
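
    A minimal sketch of discovery_service.conf with placeholder values (assuming the key = value form used by the shipped sample files; replace each placeholder with the value from your Discovery project):

    Sample discovery_service.conf values
    project_id = "<your-project-id>"
    collection_id = "<your-collection-id>"
    configuration_id = "<your-configuration-id>"
    configuration = "/home/config/discovery/discovery_service.conf"
    apikey = "<your-api-key>"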

  5. To upload the documents from the virtual machine to the IBM Watson Discovery collection, run the following command from the installation_directory/bin directory:

    Command to upload the documents
    ./crawler crawl --config /home/config/crawler.conf
  6. To verify that the documents were successfully uploaded to IBM Watson Discovery, check the console logs:

    [crawler-output-adapter-41] INFO: HikariPool-1 - Shutdown initiated...
    [crawler-output-adapter-41] INFO: HikariPool-1 - Shutdown completed.
    [crawler-io-13] INFO: The service for the Connector Framework Input Adapter was signaled to halt.

    You can also log in to IBM Watson Discovery and view the number of uploaded documents. 

To upload database data by using the crawler

After you have installed the crawler on the virtual machine, perform the following steps to configure the crawler for uploading the database tables to the IBM Watson Discovery collection.

Important

The following steps assume that you copied the contents of the installation_directory/share/examples/config folder to /home/config when you installed the data crawler. If you copied the contents to a different directory, replace /home/config with that path.

  1. On the virtual machine, go to the /home/config/connectors directory, open the database.conf file, and specify the following values (a sample snippet follows this list):

    • protocol
      Action: Enter the name of the connector protocol used for the crawler.
      Value: sqlserver

    • collection
      Action: Enter the attribute used to unpack temporary files.
      Value: tempcollection

    • logging-config
      Action: Enter the file name that is used for configuring the logging options.
      Value: Must be formatted as a log4j XML string.

    • classname
      Action: Enter the Java class name for the connector.
      Value: plugin:database.plugin@database
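
    A minimal sketch of the resulting database.conf entries, based on the values in the preceding list (the logging file name here is a hypothetical placeholder):

    Sample database.conf values
    protocol = "sqlserver"
    collection = "tempcollection"
    logging-config = "logging.xml"   # hypothetical log4j XML file name
    classname = "plugin:database.plugin@database"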

  2. Go to the /home/config/seeds directory, open the database-seed.conf file, and specify the following values (a sample snippet follows this list):

    • url
      Action: Enter the seed URL for your custom SQL database. The structure of the URL is as follows: database-system://host:port/database?[per=number of records]&[sql=SQL]
      Example: sqlserver://mydbserver.test.com:5000/countries/street_view?per=1000

    • user-password
      Action: Enter the credentials for the database system. Separate the username and password by using a colon, and encrypt the password by using the vcrypt utility that is available with the data crawler:
        vcrypt --encrypt --keyfile /home/config/id_vcrypt -- "myPassw0rd" > /home/config/db_pwd.txt
      Example: None

    • jdbc-class
      Action: Enter the name of the JDBC driver.
      Example: com.microsoft.sqlserver.jdbc.SQLServerDriver

    • connection-string
      Action: If you enter a value, this string overrides the automatically generated JDBC connection string. Enter a value if you want to provide a more detailed configuration for the database connection, such as load-balancing or SSL connections.

    Important

    The third-party JDBC drivers are located in the connectorFramework/crawler-connector-framework-#.#.#/lib/java/database folder within the crawler installation directory. You can use the extra_jars_dir parameter in the crawler.conf file to specify another location.
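
    A minimal sketch of the resulting database-seed.conf entries, based on the example values above (the user-password value is a hypothetical placeholder made of the username, a colon, and the vcrypt-encrypted password written to /home/config/db_pwd.txt):

    Sample database-seed.conf values
    url = "sqlserver://mydbserver.test.com:5000/countries/street_view?per=1000"
    user-password = "dbuser:<encrypted-password>"   # hypothetical placeholder
    jdbc-class = "com.microsoft.sqlserver.jdbc.SQLServerDriver"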

  3. Go to the /home/config/ directory, open the crawler.conf file, and specify the following values (a combined sketch follows this list).

    For other parameters, see Configuring crawler options.

    • crawl_config_file
      Action: Enter the path of the configuration file that you updated in step 1.
      Value: connectors/database.conf

    • crawl_seed_file
      Action: Enter the path of the seed configuration file that you updated in step 2.
      Value: seeds/database-seed.conf

    • output_adapter
      Action: Enter the following value:
        class = "com.ibm.watson.crawler.discoveryserviceoutputadapter.DiscoveryServiceOutputAdapter",
        config = "discovery_service",
        discovery_service {
          include "discovery/discovery_service.conf"
        },

  4. Go to the /home/config/discovery directory, open the discovery_service.conf file, and specify the following values.

    To obtain the values for this step, log in to your IBM Watson Discovery account and navigate to the collection that you previously created for your database data.

    • project_id
      Action: Enter the Project ID of the collection.

    • collection_id
      Action: Enter the Collection ID of the collection.

    • configuration_id
      Action: Enter the Configuration ID of the collection.

    • configuration
      Action: Enter the complete path of this discovery_service.conf file, such as /home/config/discovery/discovery_service.conf.

    • apikey
      Action: Enter the API key of your IBM Watson Discovery service instance.

    Important

    The base URL changes based on the region of the IBM Watson Discovery service instance.

  5. To upload the documents from the virtual machine to the IBM Watson Discovery collection, run the following command from the installation_directory/bin directory:

    Command for uploading documents
    ./crawler crawl --config /home/config/crawler.conf
  6. To verify that the documents are successfully uploaded to IBM Watson Discovery, check the console logs.

    You can also log in to IBM Watson Discovery and view the number of uploaded documents.

Where to go from here

To define the search data sets that you want to use for cognitive search, see Defining-search-data-sets.

 
