Collecting additional metrics using the Sysdig agent


You can use the Sysdig agent to collect additional metrics from your Linux and Windows virtual servers. These metrics are useful for gaining operational visibility into the performance and health of your applications, services, and platforms. The Sysdig agent collects these metrics and sends them to the Sysdig instance. When you run the IBM Cloud API ETL, these metrics are imported into the BMC Helix Continuous Optimization database.

Collecting Sysdig performance metrics from a Linux virtual server

  1. Log in to the virtual server by using your public IP address and root user name.
  2. Provision an instance of IBM Cloud Monitoring.

    One Sysdig service instance must be provisioned for each region. The user who creates the Sysdig instance must have "IBM Cloud Monitoring" privileges.

    Steps to manage user access
    1. Log in to the IBM Cloud console.

    2. In the IBM Cloud Console header, click Manage > Access (IAM).
    3. From the left navigation pane, select Users.
    4. In the Account users table, identify the user to whom you want to assign the access. From the Actions menu of that user, click Assign access.
    5. Select Assign access within a resource group.
    6. Select a resource group.
    7. If the user does not have a role already granted for the selected resource group, choose a role for the Assign access to a resource group field.
      Depending on the role that you select, the user can view the resource group on their dashboard, edit the resource group name, or manage user access to the group. Select No access if you want the user to have access only to IBM Cloud Monitoring in the resource group.
    8. Select IBM Cloud Monitoring.
    9. Select the platform role Administrator.
    10. Click Assign.
  3. Steps to provision an instance of the IBM Cloud Monitoring service

    To add monitoring features with IBM Cloud Monitoring in the IBM Cloud, you need to provision an instance of the IBM Cloud Monitoring service. You provision an instance within the context of a resource group. A resource group lets you organize your services for access control and billing purposes. You can provision the IBM Cloud Monitoring with Sysdig instance in the default resource group or in a custom resource group. When you provision an instance, you automatically get an ingestion key, known as the Sysdig access key.

    1. Log in to the IBM Cloud console.

    2. From the IBM Cloud dashboard, click the Menu icon > Observability to access the Observability dashboard.
    3. Select Monitoring > Options > Create.
    4. Select the region.
    5. Select a service plan. By default, the Trial plan is set. For more information about the service plans, see Service plans.

    6. Enter a service name.
    7. Select a resource group. By default, the Default resource group is set.
    8. Turn on automatic collection of platform metrics by clicking Enable.
    9. Click Create to provision an instance.
      The service UI is displayed.

    To provision an instance of Sysdig by using the CLI, see Provisioning a Sysdig instance by using the CLI.

  4. Steps to configure a Sysdig agent

    To configure your Linux host (Ubuntu server) to send metrics to your IBM Cloud Monitoring instance, install a Sysdig agent.

    Complete the following steps from the command line:

    1. Open the terminal.
    2. Run the following command to log in to the IBM Cloud:

      ibmcloud login -a cloud.ibm.com

      Select the account where the IBM Cloud Monitoring instance is available.

    3. Obtain the Sysdig access key.
      1. Log in to the IBM Cloud console.

      2. From the left navigation pane, select Observability.
      3. Select Monitoring. The IBM Cloud Monitoring dashboard is displayed with a list of the monitoring instances that are available on IBM Cloud.
      4. Identify the instance for which you want to get the access key. Select actions, then click View Key. A pop-up window opens with the key information.
      5. Click the eye icon to view the access key.

        To obtain the access key by using the CLI, see Getting the access key by using the CLI.

    4. Obtain the IBM region list. For information, see Regions and endpoints.

    5. Obtain the region-specific public collector endpoint. For information, see Public collector endpoints.

    6. Run the following command to deploy the Sysdig agent on the virtual server:
      curl -s https://s3.amazonaws.com/download.draios.com/stable/install-agent | sudo bash -s -- --access_key <SYSDIG_ACCESS_KEY> --collector <COLLECTOR_ENDPOINT> --collector_port 6443 --secure true --check_certificate false --tags TAG_DATA --additional_conf 'sysdig_capture_enabled: false'

      Example for the Frankfurt region:
      curl -s https://s3.amazonaws.com/download.draios.com/stable/install-agent | sudo bash -s -- --access_key 2cefff44-4cba-4c8d-afc0-a8563ee8049a --collector ingest.eu-de.monitoring.cloud.ibm.com --collector_port 6443 --secure true --check_certificate false --tags type:sysdig-agent,location:frankfurt,sourceType:virtualserver --additional_conf 'sysdig_capture_enabled: false'
    7. Verify the status of the dragent service by running the following command:
      systemctl status dragent.service
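If you deploy the agent on several servers, the command from step 6 can be assembled from its parts. The following Python sketch only builds the command string from the flags shown above and does not run anything. The function name build_install_command and the sample values are illustrative assumptions, not part of the product.

```python
import shlex

INSTALL_URL = "https://s3.amazonaws.com/download.draios.com/stable/install-agent"


def build_install_command(access_key, collector, tags):
    """Return the agent install command line as a single shell string.

    Only the flags shown in step 6 are used. `tags` is a dict that is
    rendered as comma-separated key:value pairs.
    """
    tag_data = ",".join(f"{key}:{value}" for key, value in tags.items())
    installer_args = [
        "--access_key", access_key,
        "--collector", collector,
        "--collector_port", "6443",
        "--secure", "true",
        "--check_certificate", "false",
        "--tags", tag_data,
        "--additional_conf", "sysdig_capture_enabled: false",
    ]
    # Quote each argument so that values with spaces survive the shell.
    quoted = " ".join(shlex.quote(arg) for arg in installer_args)
    return f"curl -s {INSTALL_URL} | sudo bash -s -- {quoted}"


if __name__ == "__main__":
    print(build_install_command(
        access_key="<SYSDIG_ACCESS_KEY>",  # placeholder, not a real key
        collector="ingest.eu-de.monitoring.cloud.ibm.com",
        tags={"type": "sysdig-agent", "location": "frankfurt"},
    ))
```

Review the printed command before you run it on the virtual server, and replace the placeholder with the access key of your own instance.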

    After the installation is done, check the contents of the /opt/draios/etc/dragent.yaml file. The values of ssl, ssl_verify_certificate, and sysdig_capture_enabled properties must be set to the following:

    • ssl: true 
    • ssl_verify_certificate: false
    • sysdig_capture_enabled: false

    If these values are not correct in the dragent.yaml file, set these properties manually and save the file.
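The check of the three properties can be scripted. The following Python sketch is a minimal example under the assumption that the properties appear as top-level key: value lines in the file; it is not a general YAML parser, and the function name find_wrong_settings is hypothetical.

```python
REQUIRED = {
    "ssl": "true",
    "ssl_verify_certificate": "false",
    "sysdig_capture_enabled": "false",
}


def find_wrong_settings(yaml_text):
    """Return {property: found value or None} for required properties
    that are missing or set to an unexpected value.

    Only top-level `key: value` lines are read.
    """
    found = {}
    for line in yaml_text.splitlines():
        if not line or line[0] in " \t#" or ":" not in line:
            continue  # skip blank lines, nested entries, and comments
        key, _, value = line.partition(":")
        value = value.split("#", 1)[0].strip()  # drop a trailing comment
        found[key.strip()] = value.lower()
    return {
        key: found.get(key)
        for key, expected in REQUIRED.items()
        if found.get(key) != expected
    }


if __name__ == "__main__":
    # Assumed path from the text above; reading it usually requires root.
    with open("/opt/draios/etc/dragent.yaml") as handle:
        print(find_wrong_settings(handle.read()))
```

An empty result means that all three properties have the expected values.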

    (Optional) To filter the metrics, add the metrics_filter property in the dragent.yaml file. For details, see Including and excluding metrics.

    To view the metrics in the IBM Cloud Sysdig UI, launch the Sysdig Web UI. For details, see Launch the Web UI. In the Hosts and Containers section, you can find the entry for your Ubuntu server.

     

Collecting Sysdig performance metrics from a Windows virtual server

The Prometheus WMI exporter runs as a Windows service. You can configure the metrics that you want to monitor by enabling the collectors.

The following collectors are supported by IBM:

  • CPU
  • Computer system metrics (cs)
  • Disk metrics
  • Network interface metrics
  1. Configure the Prometheus WMI exporter
    1. Log in to your Windows computer.
    2. Download the Prometheus exporter.
      BMC Helix Continuous Optimization does not support v0.13.0 and later versions of the Prometheus exporter.

    3. Identify the collectors that contain the information for the metric data that you want to send to the Sysdig agent.
    4. Run the wmi_exporter and configure the collectors that you want to enable.
      .\wmi_exporter-0.12.0-amd64.exe --collectors.enabled <COLLECTORS>

      where <COLLECTORS> is the comma-separated list of collectors that you want to enable.
      Example: To collect computer system metrics (cs), CPU metrics, disk metrics, and network interface I/O metrics, use the following command:

      .\wmi_exporter-0.12.0-amd64.exe --collectors.enabled "os,cpu,logical_disk,net,system"

    Important: The ETL does not support the latest version of the wmi exporter. Ensure that you download the 0.12.0 version (known as wmi_exporter and not the windows_exporter) of the exporter.
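To decide which collectors to enable, you can start from the metric names that you need. The following Python sketch assumes that each exporter metric is named wmi_<collector>_<name>, and it knows only the collectors whose metrics appear in this topic. The function names are hypothetical.

```python
# Collectors whose metrics are listed in this topic (an assumption of this sketch).
KNOWN_COLLECTORS = ("cpu", "cs", "logical_disk", "net", "os", "system")


def collectors_for(metric_names):
    """Return the collectors needed for the given wmi_* metrics, in first-seen order."""
    needed = []
    for metric in metric_names:
        for collector in KNOWN_COLLECTORS:
            if metric.startswith(f"wmi_{collector}_"):
                if collector not in needed:
                    needed.append(collector)
                break
        else:
            raise ValueError(f"no known collector for metric: {metric}")
    return needed


def exporter_command(metric_names, exe=r".\wmi_exporter-0.12.0-amd64.exe"):
    """Build the command line that enables only the required collectors."""
    enabled = ",".join(collectors_for(metric_names))
    return f'{exe} --collectors.enabled "{enabled}"'


if __name__ == "__main__":
    print(exporter_command([
        "wmi_logical_disk_free_bytes",
        "wmi_net_bytes_total",
        "wmi_cs_physical_memory_bytes",
    ]))
```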

  2. (Optional) Configure the network settings
    1. Configure the Windows firewall to allow access to wmi_exporter-0.12.0-amd64.exe.
    2. (Optional) Update the VPC rules. If you use private endpoints, add an inbound rule to the security group for port 9182 with source type = Security Group and choose the security group for the Windows system.
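After you change the firewall or security group rules, you can confirm that port 9182 accepts connections. The following Python sketch is a simple TCP reachability check; it assumes that the exporter listens on port 9182, as in the rule above, and the function name port_is_open is hypothetical.

```python
import socket


def port_is_open(host, port=9182, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print(port_is_open("localhost"))
```

A successful connection shows only that the port is reachable, not that the exporter returns metrics.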
  3. Collect metrics by running Prometheus as a client collector on Windows

    Use the Prometheus remote-write capabilities to push the metrics from the Windows system by running Prometheus as a client collector on Windows.

    1. Download the prometheus-2.15.2.windows-amd64.tar.gz file of the Prometheus monitoring system and time series database.

    2. Extract the prometheus-2.15.2.windows-amd64.tar.gz file.
    3. Edit the prometheus.yml file in a text editor.
    4. Configure the scrape_configs section of the prometheus.yml configuration file as follows so that Prometheus scrapes the Windows wmi_exporter.

      scrape_configs:
        # The job name is added as a label `job=<job_name>` to any timeseries scraped from this configuration.
        - job_name: 'wmi_exporter'
          static_configs:
            - targets: ['localhost:9182']
              labels:
                region: us-east
                instance: <HOSTNAME>
                job: <JOBNAME>

      where,

      • <HOSTNAME> is the name of the Windows system
      • <JOBNAME> is a custom attribute that identifies the role of the node that you are scraping. You can also use it to scope the data in Sysdig.
    5. Add the remote_write configuration at the end of the prometheus.yml file to configure the target Sysdig instance that will receive the metrics.

      remote_write:
        - url: "ENDPOINT/prometheus/remote/write"
          bearer_token_file: C:\Users\Administrator\prom\sysdig-apikey
          write_relabel_configs:
            # Drop forwarding the metrics generated by the exporter that are not supported
            - source_labels: ["__name__"]
              regex: "^wmi_(.*)"
              action: keep
            - regex: "(__name__)|(job)|(region)|(instance)|(status)|(core)|(name)|(start_mode)|(nic)|(volume)|(state)|(version)|(mode)|(branch)|(timezone)|(goversion)|(collector)|(revision)"
              action: labelkeep


      where,

      • ENDPOINT is the Sysdig collector endpoint. For the list of endpoints, see Sysdig Collector endpoints.

      • sysdig-apikey is the file that contains the Sysdig Monitor API Token. The file name does not have an extension.
        For information about how to get the API token, see Getting the Sysdig API token.
        Example: Completed version of the prometheus.yml

        # my global config
        global:
          scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
          evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
          # scrape_timeout is set to the global default (10s).

        # Alertmanager configuration
        alerting:
          alertmanagers:
            - static_configs:
                - targets:
                  # - alertmanager:9093

        # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
        rule_files:
          # - "first_rules.yml"
          # - "second_rules.yml"

        # A scrape configuration containing exactly one endpoint to scrape:
        # Here it's the Windows wmi_exporter.
        scrape_configs:
          # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
          - job_name: 'wmi_exporter'
            static_configs:
              - targets: ['localhost:9182']
                labels:
                  instance: "my-windows-hostname"
                  region: "us-south"

        # Connection to sysdig
        remote_write:
          - url: "https://ingest.prws.eu-de.monitoring.cloud.ibm.com/prometheus/remote/write"
            bearer_token_file: C:\Users\Administrator\prom\sysdig-apikey
            write_relabel_configs:
              - source_labels: ["__name__"]
                regex: "^wmi_(.*)"
                action: keep
              - regex: "(__name__)|(job)|(region)|(instance)|(status)|(core)|(name)|(start_mode)|(nic)|(volume)|(state)|(version)|(mode)|(branch)|(timezone)|(goversion)|(collector)|(revision)"
                action: labelkeep
    6. Start the Prometheus executable from the location containing the prometheus.yml file. Run .\prometheus.exe.
  4. To monitor Windows system metrics, use the default Windows Node Overview dashboard, which is located in the Hosts and Containers section.
  5. (Optional) Verify the uptime for Windows with Prometheus Blackbox exporter. For details, see Verifying uptime for Windows with Prometheus Blackbox exporter.
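The two write_relabel_configs rules in the remote_write configuration decide what is sent to Sysdig: the keep rule forwards only series whose name starts with wmi_, and the labelkeep rule removes every label that is not in the list. The following Python sketch imitates that behavior for a single series. It assumes that Prometheus matches relabel regular expressions against the whole value, and it uses the Python re module as a stand-in for the Prometheus regex engine. The function name relabel is hypothetical.

```python
import re

KEEP_NAME = re.compile(r"^wmi_(.*)")
KEEP_LABELS = re.compile(
    r"(__name__)|(job)|(region)|(instance)|(status)|(core)|(name)|(start_mode)"
    r"|(nic)|(volume)|(state)|(version)|(mode)|(branch)|(timezone)|(goversion)"
    r"|(collector)|(revision)"
)


def relabel(series):
    """Apply both rules to one series, given as a dict of label names to values.

    Returns the filtered label dict, or None if the series is dropped.
    """
    # Rule 1 (action: keep): drop the series unless its name matches.
    if not KEEP_NAME.fullmatch(series.get("__name__", "")):
        return None
    # Rule 2 (action: labelkeep): keep only labels whose name matches.
    return {name: value for name, value in series.items() if KEEP_LABELS.fullmatch(name)}


if __name__ == "__main__":
    print(relabel({"__name__": "go_goroutines", "job": "wmi_exporter"}))
    print(relabel({
        "__name__": "wmi_logical_disk_free_bytes",
        "volume": "C:",
        "instance": "my-windows-hostname",
        "custom_label": "removed by labelkeep",
    }))
```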

Metrics provided for Linux systems

| BMC Helix Continuous Optimization | IBM Cloud metric | IBM Cloud metric label key | Time Aggregation type | Group Aggregation type | Formula | Description |
|---|---|---|---|---|---|---|
| NET_IN_BIT_RATE | net.bytes.in | | timeAvg | sum | net.bytes.in*8 | This metric displays the inbound network bytes. |
| NET_OUT_BIT_RATE | net.bytes.out | | timeAvg | sum | net.bytes.out*8 | This metric displays the outbound network bytes. |
| NET_BIT_RATE | net.bytes.total | | timeAvg | sum | net.bytes.total*8 | This metric displays the total network bytes. |
| NET_CONNECTION_RATE | net.connection.count.total | | timeAvg | sum | | This metric displays the number of currently established connections. |
| NET_IN_ERROR_RATE | net.error.count | | timeAvg | sum | | This metric displays the number of network errors. |
| CPU_USED_NUM | cpu.cores.used | | timeAvg | avg | | This metric displays the CPU core usage of each container obtained from cgroups, and is equal to the number of cores used by the container. |
| CPU_UTIL_IDLE | cpu.idle.percent | | avg | avg | cpu.idle.percent/100 | This metric displays the percentage of time that the CPUs were idle and the system did not have an outstanding disk I/O request. |
| CPU_UTIL_WAIO | cpu.iowait.percent | | avg | avg | cpu.iowait.percent/100 | This metric displays the percentage of time that the CPUs were idle during which the system had an outstanding disk I/O request. |
| CPU_UTIL_WAIT | cpu.stolen.percent | | avg | avg | cpu.stolen.percent/100 | This metric measures the percentage of time that a virtual machine's CPU is in a state of involuntary wait because the physical CPU is shared among virtual machines. |
| CPU_UTIL_NICE | cpu.nice.percent | | avg | avg | cpu.nice.percent/100 | This metric displays the percentage of CPU utilization that occurred while executing at the user level with Nice priority. |
| CPU_UTIL_SYSTEM | cpu.system.percent | | avg | avg | cpu.system.percent/100 | This metric displays the percentage of CPU utilization that occurred while executing at the system level (kernel). |
| CPU_UTIL | cpu.used.percent | | avg | avg | cpu.used.percent/100 | This metric displays the CPU usage for each host, obtained from /proc and measured as the sum of the CPU usage of all cores, normalized by dividing by the number of cores. |
| CPU_UTIL_USER | cpu.user.percent | | avg | avg | cpu.user.percent/100 | This metric displays the percentage of CPU utilization that occurred while executing at the user level (application). |
| BYFS_FREE | fs.bytes.free | fs.mountDir | | avg | | This metric displays the available filesystem space. |
| BYFS_SIZE | fs.bytes.total | fs.mountDir | timeAvg | avg | | This metric displays the total filesystem size. |
| BYFS_USED | fs.bytes.used | fs.mountDir | timeAvg | avg | | This metric displays the used filesystem space. |
| BYFS_USED_SPACE_PCT | fs.bytes.used, fs.bytes.total | fs.mountDir, fs.mountDir | timeAvg, timeAvg | avg, avg | fs.bytes.used / fs.bytes.total | This metric displays the percentage of used disk space of a specific filesystem. |
| TOTAL_FS_UTIL | fs.used.percent | | avg | avg | | This metric displays the amount of space written by a single container instance. |
| TOTAL_FS_FREE | fs.bytes.free | | timeAvg | avg | | This metric displays the amount of free disk space on all filesystems, in bytes. |
| TOTAL_FS_SIZE | fs.bytes.total | | timeAvg | avg | | This metric displays the total size of all the filesystems, in bytes. |
| TOTAL_FS_USED | fs.bytes.used | | timeAvg | avg | | This metric displays the amount of used disk space on all filesystems, in bytes. |
| DISK_USED_INODES_PCT | fs.inodes.used.percent | | avg | avg | | |
| BYFS_TOTAL_INODES | fs.inodes.total.count | fs.mountDir | timeAvg | avg | | |
| BYFS_USED_INODES | fs.inodes.used.count | fs.mountDir | timeAvg | avg | | |
| BYFS_FREE_INODES | fs.inodes.total.count, fs.inodes.used.count | fs.mountDir | timeAvg | avg | fs.inodes.total.count - fs.inodes.used.count | |
| MEM_FREE | memory.bytes.available | | timeAvg | avg | | This metric displays the amount of available memory. |
| MEM_USED | memory.bytes.used | | timeAvg | avg | | This metric displays the amount of physical memory currently in use. |
| MEM_VIRTUAL_TOTAL | memory.bytes.virtual | | timeAvg | avg | | This metric displays the virtual memory size of the process, in bytes. |
| DISK_PAGING_IO_RATE | memory.pageFault.major | | timeAvg | sum | | This metric displays the count of the condition that occurs when a program accesses a memory page that is mapped in the virtual address space, but not loaded in physical memory. |
| TOTAL_REAL_MEM | memory.bytes.total | | timeAvg | avg | | This metric displays the total memory of a host, in bytes. |
| SWAP_SPACE_FREE | memory.swap.bytes.available | | timeAvg | avg | | This metric displays the swap memory available. |
| SWAP_SPACE_TOT | memory.swap.bytes.total | | timeAvg | avg | | This metric displays the total amount of swap memory. |
| SWAP_SPACE_USED | memory.swap.bytes.used | | timeAvg | avg | | This metric displays the amount of swap memory used. |
| SWAP_SPACE_UTIL | memory.swap.used.percent | | avg | avg | | This metric displays the percentage of swap memory used. |
| MEM_UTIL | memory.used.percent | | avg | avg | | This metric displays the percentage of physical memory in use. |
| UPTIME | uptime | | timeAvg | avg | (1-uptime)*3600 | This metric displays the percentage of time the selected entity or entities were down over the defined time window. |
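The Formula column shows how a raw Sysdig value is converted before it is stored under the BMC Helix Continuous Optimization metric name. The following Python sketch applies four of the formulas from the Linux metrics to one set of raw values. The function name to_bmc_metrics and the sample numbers are made up for illustration.

```python
def to_bmc_metrics(sample):
    """Apply a few Formula column entries to one sample of raw Sysdig values."""
    return {
        # Bytes are converted to bits.
        "NET_IN_BIT_RATE": sample["net.bytes.in"] * 8,
        # A percentage (0-100) is converted to a fraction (0-1).
        "CPU_UTIL_IDLE": sample["cpu.idle.percent"] / 100,
        # Used space as a fraction of the total space.
        "BYFS_USED_SPACE_PCT": sample["fs.bytes.used"] / sample["fs.bytes.total"],
        # Free inodes are derived from the total and used counts.
        "BYFS_FREE_INODES": sample["fs.inodes.total.count"] - sample["fs.inodes.used.count"],
    }


if __name__ == "__main__":
    raw = {
        "net.bytes.in": 1500,
        "cpu.idle.percent": 75.0,
        "fs.bytes.used": 25,
        "fs.bytes.total": 100,
        "fs.inodes.total.count": 1000,
        "fs.inodes.used.count": 400,
    }
    print(to_bmc_metrics(raw))
```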

Metrics provided for Windows systems

| BMC Helix Continuous Optimization | IBM Cloud metric | IBM Cloud metric label key | Time Aggregation type | Group Aggregation type | Formula | Description |
|---|---|---|---|---|---|---|
| CPU_MHZ | wmi_cpu_core_frequency_mhz | | avg | avg | | This metric displays the core frequency. |
| BYLDISK_IO_READ_RATE | wmi_logical_disk_reads_total | volume | timeAvg | avg | | |
| DISK_IO_READ_RATE | wmi_logical_disk_reads_total | | timeAvg | sum | | |
| BYLDISK_IO_WRITE_RATE | wmi_logical_disk_writes_total | volume | timeAvg | avg | | |
| DISK_IO_WRITE_RATE | wmi_logical_disk_writes_total | | timeAvg | sum | | |
| BYLDISK_SIZE | wmi_logical_disk_size_bytes | volume | avg | avg | | |
| BYLDISK_FREE_SPACE | wmi_logical_disk_free_bytes | volume | avg | avg | | |
| BYLDISK_USED_SPACE | wmi_logical_disk_size_bytes, wmi_logical_disk_free_bytes | volume | avg | avg | wmi_logical_disk_size_bytes - wmi_logical_disk_free_bytes | |
| TOTAL_LDISK_SIZE | wmi_logical_disk_size_bytes | | avg | sum | | |
| LDISK_FREE | wmi_logical_disk_free_bytes | | avg | sum | | |
| TOTAL_LDISK_USED | wmi_logical_disk_size_bytes, wmi_logical_disk_free_bytes | | avg | sum | wmi_logical_disk_size_bytes - wmi_logical_disk_free_bytes | |
| BYLDISK_READ_RESPONSE_TIME | wmi_logical_disk_read_seconds_total | volume | timeAvg | avg | | |
| DISK_READ_RESPONSE_TIME | wmi_logical_disk_read_seconds_total | | timeAvg | sum | | |
| BYLDISK_WRITE_RESPONSE_TIME | wmi_logical_disk_write_seconds_total | volume | timeAvg | avg | | |
| DISK_WRITE_RESPONSE_TIME | wmi_logical_disk_write_seconds_total | | timeAvg | sum | | |
| DISK_IO_RATE | wmi_logical_disk_reads_total, wmi_logical_disk_writes_total | volume | timeAvg, timeAvg | sum, sum | wmi_logical_disk_reads_total + wmi_logical_disk_writes_total | This metric displays the disk average I/O rate aggregated by the host. |
| NET_OUT_BIT_RATE | wmi_net_bytes_sent_total | | timeAvg | sum | wmi_net_bytes_sent_total*8 | This metric displays the total bytes transmitted by interface. |
| NET_IN_BIT_RATE | wmi_net_bytes_received_total | | timeAvg | sum | wmi_net_bytes_received_total*8 | This metric displays the total bytes received by interface. |
| NET_BIT_RATE | wmi_net_bytes_total | | timeAvg | sum | wmi_net_bytes_total*8 | This metric displays the total bytes received and transmitted by interface. |
| NET_OUT_PKT_ERROR_RATE | wmi_net_packets_outbound_errors | | timeAvg | sum | | This metric displays the total packets that could not be transmitted due to errors. |
| NET_IN_PKT_ERROR_RATE | wmi_net_packets_received_errors | | timeAvg | sum | | This metric displays the total packets that could not be received due to errors. |
| NET_IN_PKT_RATE | wmi_net_packets_received_total | | timeAvg | sum | | This metric displays the total packets received by interface. |
| NET_OUT_PKT_RATE | wmi_net_packets_sent_total | | timeAvg | sum | | This metric displays the total packets transmitted by interface. |
| NET_PKT_RATE | wmi_net_packets_total | | timeAvg | sum | | This metric displays the total packets received and transmitted by interface. |
| NET_BANDWIDTH | wmi_net_current_bandwidth | | avg | sum | | This metric displays the estimate of the interface's current bandwidth. |
| MEM_VIRTUAL_FREE | wmi_os_virtual_memory_free_bytes | | avg | sum | | This metric displays the bytes of virtual memory currently unused and available. |
| TOTAL_REAL_MEM | wmi_cs_physical_memory_bytes | | timeAvg | sum | | This metric displays the total installed physical memory. |
| MEM_FREE | wmi_os_physical_memory_free_bytes | | avg | sum | | This metric displays the bytes of physical memory currently unused and available. |
| MEM_USED | wmi_os_visible_memory_bytes, wmi_os_physical_memory_free_bytes | | avg | sum | wmi_os_visible_memory_bytes - wmi_os_physical_memory_free_bytes | This metric displays the total used memory, in bytes. |
| MEM_UTIL | wmi_os_visible_memory_bytes, wmi_os_physical_memory_free_bytes | | avg | sum | (wmi_os_visible_memory_bytes - wmi_os_physical_memory_free_bytes) / wmi_os_visible_memory_bytes | This metric displays the percentage of physical memory in use during the interval. |
| MEM_VIRTUAL_TOTAL | wmi_os_virtual_memory_bytes | | avg | sum | | This metric displays the bytes of virtual memory. |
| PROCESS_NUM_RUNNING | wmi_os_processes | | avg | sum | | This metric displays the number of process contexts currently loaded or running on the operating system. |
| MEM_COMMIT_LIMIT | wmi_os_paging_limit_bytes | | avg | sum | | This metric displays the total number of bytes that can be stored in the operating system paging files. |
| REQ_QUEUED | wmi_system_processor_queue_length | | avg | avg | | This metric displays the number of threads in the processor queue. |
| UPTIME | wmi_system_system_up_time | | avg | avg | | This metric displays the time of the last system boot. |
| CPU_UTIL_IDLE | wmi_cpu_time_total | Label key: mode; Label value: idle | timeAvg | | wmi_cpu_time_total | |
| CPU_UTIL_USER | wmi_cpu_time_total | Label key: mode; Label value: user | timeAvg | | wmi_cpu_time_total | |
| CPU_UTIL_INTERRUPT_HANDLING | wmi_cpu_time_total | Label key: mode; Label value: interrupt | timeAvg | | wmi_cpu_time_total | |
| CPU_UTIL_SYSTEM | wmi_cpu_time_total | Label key: mode; Label value: privileged | timeAvg | | wmi_cpu_time_total | |
| CPU_UTIL | wmi_cpu_time_total_privileged, wmi_cpu_time_total_user | wmi_cpu_time_total_privileged (Label key: mode; Label value: privileged), wmi_cpu_time_total_user (Label key: mode; Label value: user) | timeAvg | | wmi_cpu_time_total_privileged + wmi_cpu_time_total_user | |
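Several Windows metrics are derived from two exporter metrics, and the group aggregation decides whether the result is reported per volume or for the whole host. The following Python sketch illustrates the BYLDISK_USED_SPACE, TOTAL_LDISK_USED, and MEM_UTIL formulas. The function names and the sample values are assumptions made for illustration.

```python
def disk_usage(volumes):
    """Compute used disk space from (size_bytes, free_bytes) pairs keyed by volume.

    Returns (used_by_volume, total_used): BYLDISK_USED_SPACE for each volume
    and TOTAL_LDISK_USED as the sum over all volumes.
    """
    used_by_volume = {name: size - free for name, (size, free) in volumes.items()}
    return used_by_volume, sum(used_by_volume.values())


def memory_util(visible_bytes, free_bytes):
    """MEM_UTIL formula: (visible - free) / visible, as a fraction between 0 and 1."""
    return (visible_bytes - free_bytes) / visible_bytes


if __name__ == "__main__":
    per_volume, total = disk_usage({"C:": (100, 40), "D:": (200, 50)})
    print(per_volume, total)
    print(memory_util(8, 2))
```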

 

Troubleshooting Sysdig installation failure in Linux

If the Sysdig installation fails, perform the following tasks based on your operating system:

Type of error

Troubleshooting steps

Kernel headers are not available

Install the kernel headers manually

For Debian or Ubuntu Linux distribution

  1. Identify the distribution (cat /etc/os-release)
  2. Run the following command for the selected distribution:
    apt-get -y install linux-headers-$(uname -r)
  3. If the error still persists, run the following command:
    yum install kernel kernel-headers
  4. Deploy the Sysdig agent

For RHEL, CentOS, and Fedora Linux distributions

  1. Identify the distribution (cat /etc/os-release)
  2. Run the following command for the selected distribution:
    yum -y install kernel-devel-$(uname -r)
  3. If the error still persists, run the following command:
    yum install kernel kernel-headers
  4. Deploy the Sysdig agent

sysdig-probe kernel module is not installed on the kernel

  1. Install the kernel module using the following command:
    yum install kernel kernel-headers
  2. Deploy the Sysdig agent

Installation fails because the dkms_autoinstaller service is stopped

  1. Use the following commands to start the service:
    sudo yum -y install kernel-devel-$(uname -r)
    sudo /usr/lib/dkms/dkms_autoinstaller start
  2. Deploy the Sysdig agent

The kernel packages are not available

  1. Run the following command to get the names of the packages that are not available. The package names are available in the error that is generated after running the following command:
    yum -y install kernel-devel-$(uname -r)
  2. Download the package from https://rpmfind.net/linux/rpm2html/search.php?query=kernel-devel-x86_64 by using the wget command
  3. Install the missing package:
    sudo yum localinstall <RPM file name>
  4. Install the kernel headers:
    sudo yum -y install kernel-devel-$(uname -r)
    yum install kernel kernel-headers

If the Sysdig agent installation still fails, check the logs at /opt/draios/etc/draios.log and raise a support case with the IBM Support team.
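For the "Kernel headers are not available" error, the first two steps (checking /etc/os-release and choosing the matching command) can be sketched in Python. The mapping below uses only the commands listed in this topic. The function names are hypothetical, and the sketch assumes the standard ID and ID_LIKE fields of the os-release file.

```python
APT_COMMAND = "apt-get -y install linux-headers-$(uname -r)"
YUM_COMMAND = "yum -y install kernel-devel-$(uname -r)"

HEADER_COMMANDS = {
    "debian": APT_COMMAND,
    "ubuntu": APT_COMMAND,
    "rhel": YUM_COMMAND,
    "centos": YUM_COMMAND,
    "fedora": YUM_COMMAND,
}


def parse_os_release(text):
    """Parse /etc/os-release content into a dict (KEY=value lines, quotes removed)."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


def header_install_command(os_release_text):
    """Return the kernel header command for the distribution, or None if unknown."""
    info = parse_os_release(os_release_text)
    # ID names the distribution itself; ID_LIKE lists related distributions.
    candidates = [info.get("ID", "")] + info.get("ID_LIKE", "").split()
    for name in candidates:
        if name in HEADER_COMMANDS:
            return HEADER_COMMANDS[name]
    return None


if __name__ == "__main__":
    with open("/etc/os-release") as handle:
        print(header_install_command(handle.read()))
```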

 
