Collecting additional metrics using the Sysdig agent

You can use the Sysdig agent to collect additional metrics from your Linux and Windows virtual servers. These metrics provide operational visibility into the performance and health of your applications, services, and platforms. The Sysdig agent collects these metrics and sends them to the Sysdig instance. When you run the IBM Cloud API ETL, these metrics are imported into the BMC Helix Continuous Optimization database.

Collecting Sysdig performance metrics from a Linux virtual server

Log in to the virtual server by using your public IP address and root user name.
Provision an instance of IBM Cloud Monitoring.
1. One Sysdig service instance must be provisioned for each region. The user who creates the Sysdig instance must have "IBM Cloud Monitoring" privileges.
  Steps to manage user access
  1. Log in to the IBM Cloud console.
  2. In the IBM Cloud Console header, click Manage > Access (IAM).
  3. From the left navigation page, select Users.
  4. In the Account users table, identify the user to whom you want to assign the access. From the Actions menu of that user, click Assign access.
  5. Select Assign access within a resource group.
  6. Select a resource group.
  7. If the user does not have a role already granted for the selected resource group, choose a role for the Assign access to a resource group field.
    Depending on the role that you select, the user can view the resource group on their dashboard, edit the resource group name, or manage user access to the group. You can select No access, if you want the user to have access only to the IBM Cloud Monitoring in the resource group.
  8. Select IBM Cloud Monitoring.
  9. Select the platform role Administrator.
  10. Click Assign.
2. Steps to provision an instance of the IBM Cloud Monitoring service
  To add monitoring features with IBM Cloud Monitoring in the IBM Cloud, you need to provision an instance of the IBM Cloud Monitoring service. You provision an instance within the context of a resource group. A resource group lets you organize your services for access control and billing purposes. You can provision the IBM Cloud Monitoring with Sysdig instance in the default resource group or in a custom resource group. When you provision an instance, you automatically get an ingestion key, known as the Sysdig access key.
  1. Log in to the IBM Cloud console.
  2. From the IBM Cloud dashboard, navigate to the menu > Observability to access the Observability dashboard.
  3. Select Monitoring > Options > Create.
  4. Select the region.
  5. Select a service plan. By default, the Trial plan is set. For more information about the service plans, see Service plans.
  6. Enter a service name.
  7. Select a resource group. By default, the Default resource group is set.
  8. Turn on automatic collection of platform metrics by clicking Enable.
  9. Click Create to provision an instance.
    The service UI is displayed.
  To provision an instance of Sysdig by using the CLI, see Provisioning a Sysdig instance by using the CLI.
3. Steps to configure a Sysdig agent
  To configure your Linux host (Ubuntu server) to send metrics to your IBM Cloud Monitoring instance, install a Sysdig agent.
  Complete the following steps from the command line:
  1. Open the terminal.
  2. Run the following command to log in to the IBM Cloud:
    ibmcloud login -a cloud.ibm.com
    Select the account where the IBM Cloud Monitoring instance is available.
  3. Obtain the Sysdig access key.
    Log in to the IBM Cloud console.
    From the left navigation page, select Observability.
    Select Monitoring. The IBM Cloud Monitoring dashboard is displayed, with a list of the monitoring instances that are available on IBM Cloud.
    Identify the instance for which you want to get the access key. Select actions, then click View Key. A pop up window opens with the key information.
    Click the eye icon to view the access key.
    
    To obtain the access key by using the CLI, see Getting the access key by using the CLI.
  4. Obtain the IBM region list. For information, see Regions and endpoints.
  5. Obtain the region-specific public collector endpoint. For information, see Public collector endpoints.
  6. Run the following command to deploy the Sysdig agent on the virtual server:
    curl -s https://s3.amazonaws.com/download.draios.com/stable/install-agent | sudo bash -s -- --access_key <SYSDIG_ACCESS_KEY> --collector <COLLECTOR_ENDPOINT> --collector_port 6443 --secure true --check_certificate false --tags TAG_DATA --additional_conf 'sysdig_capture_enabled: false'
    
    Example (command for the Frankfurt region):
    curl -s https://s3.amazonaws.com/download.draios.com/stable/install-agent | sudo bash -s -- --access_key 2cefff44-4cba-4c8d-afc0-a8563ee8049a --collector ingest.eu-de.monitoring.cloud.ibm.com --collector_port 6443 --secure true --check_certificate false --tags type:sysdig-agent,location:frankfurt,sourceType:virtualserver --additional_conf 'sysdig_capture_enabled: false'
  7. Verify the status of the dragent service.
    Run the command: systemctl status dragent.service
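
The installer command in step 6 uses the same flags for every region; only the access key, the collector endpoint, and the tags change. The following Python sketch assembles that command line from those three inputs. The function name `build_install_command` and the dictionary form of the tags are illustrative assumptions, not part of the Sysdig installer; the flags and the installer URL are copied from step 6.

```python
import shlex

# Installer URL as given in step 6 of this procedure.
INSTALLER_URL = "https://s3.amazonaws.com/download.draios.com/stable/install-agent"


def build_install_command(access_key, collector, tags):
    """Return the installer pipeline from step 6 as a single shell string.

    `tags` is a dict that is rendered as key:value pairs joined by commas
    (hypothetical helper input; the installer itself takes one string).
    """
    installer_args = [
        "--access_key", access_key,
        "--collector", collector,
        "--collector_port", "6443",
        "--secure", "true",
        "--check_certificate", "false",
        "--tags", ",".join(f"{key}:{value}" for key, value in tags.items()),
        "--additional_conf", "sysdig_capture_enabled: false",
    ]
    # Quote each argument so values with spaces survive the shell.
    quoted = " ".join(shlex.quote(arg) for arg in installer_args)
    return f"curl -s {INSTALLER_URL} | sudo bash -s -- {quoted}"


print(build_install_command(
    "<SYSDIG_ACCESS_KEY>",
    "ingest.eu-de.monitoring.cloud.ibm.com",
    {"type": "sysdig-agent", "location": "frankfurt", "sourceType": "virtualserver"},
))
```

Review the printed command before you run it on the virtual server, and replace the access key placeholder with your own key.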
  After the installation is done, check the contents of the /opt/draios/etc/dragent.yaml file. The values of ssl, ssl_verify_certificate, and sysdig_capture_enabled properties must be set to the following:
  - ssl: true
  - ssl_verify_certificate: false
  - sysdig_capture_enabled: false
  If these values are not correct in the dragent.yaml file, set these properties manually and save the file.
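
The check of the three properties can be sketched in Python as follows. This is a minimal sketch that assumes the properties appear as unindented `key: value` lines in dragent.yaml; it reads only such lines and is not a full YAML parser. The names `REQUIRED_SETTINGS`, `find_incorrect_settings`, and `check_file` are hypothetical.

```python
# Values that this procedure requires in /opt/draios/etc/dragent.yaml.
REQUIRED_SETTINGS = {
    "ssl": "true",
    "ssl_verify_certificate": "false",
    "sysdig_capture_enabled": "false",
}


def find_incorrect_settings(yaml_text):
    """Return {key: found_value_or_None} for settings that are missing or wrong.

    Assumption: the settings are top-level `key: value` lines. Indented
    lines, list items, and comments are ignored.
    """
    found = {}
    for line in yaml_text.splitlines():
        if not line or line[0] in " \t#-" or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.split("#", 1)[0].strip()  # drop a trailing comment
        found[key.strip()] = value.lower()
    return {
        key: found.get(key)
        for key, expected in REQUIRED_SETTINGS.items()
        if found.get(key) != expected
    }


def check_file(path="/opt/draios/etc/dragent.yaml"):
    """Read the agent configuration file and report incorrect settings."""
    with open(path, encoding="utf-8") as handle:
        return find_incorrect_settings(handle.read())
```

Calling `check_file()` on the virtual server returns an empty dictionary when all three properties have the required values.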
  (Optional) To filter the metrics, add the metrics_filter property in the dragent.yaml file. For details, see Including and excluding metrics.
  To view the metrics in the IBM Cloud Sysdig UI, launch the Sysdig Web UI. For details, see Launch the Web UI. In the Hosts and Containers section, you can find the entry for your Ubuntu server.

Collecting Sysdig performance metrics from a Windows virtual server

The Prometheus WMI exporter runs as a Windows service. You can configure the metrics that you want to monitor by enabling the collectors.

The following collectors are supported by IBM:

CPU
Computer system metrics (cs)
Disk metrics
Network interface metrics

Configure the Prometheus WMI exporter
1. Log in to your Windows computer.
2. Download the Prometheus exporter.
  BMC Helix Continuous Optimization does not support v0.13.0 and later versions of the Prometheus exporter.
3. Identify the collectors that contain the information for the metric data that you want to send to the Sysdig agent.
4. Run the wmi_exporter and configure the collectors that you want to enable.
  .\wmi_exporter-0.12.0-amd64.exe --collectors.enabled <COLLECTORS>
  where, <COLLECTORS> indicates the comma-separated list of collectors that you want to enable
  Example: To collect computer system metrics (cs), CPU metrics, disk metrics, and network interface I/O metrics, use the following command:
  .\wmi_exporter-0.12.0-amd64.exe --collectors.enabled "os,cpu,logical_disk,net,system"
Note: The ETL does not support the latest version of the exporter. Ensure that you download version 0.12.0 of the exporter, which is named wmi_exporter and not windows_exporter.
(Optional) Configure the network settings
1. Configure the Windows firewall to allow access to wmi_exporter-0.12.0-amd64.exe.
2. (Optional) Update the VPC rules. If you use private endpoints, add an inbound rule to the security group for port 9182 with source type = Security Group and choose the security group for the Windows system.
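
Before you configure Prometheus, you can check that the exporter exposes metrics for each collector that you enabled. The following Python sketch parses the exporter's text output and reports which collectors are represented. It assumes that the exporter serves its output at the `/metrics` path on port 9182 and that metric names follow the `wmi_<collector>_...` pattern seen in the metrics tables on this page; the function names are hypothetical.

```python
from urllib.request import urlopen


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comment lines
        # A sample line is either `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def collector_prefixes_seen(exposition_text, collectors):
    """Map each collector name to True if a wmi_<collector>_ metric is present."""
    names = metric_names(exposition_text)
    return {
        collector: any(name.startswith(f"wmi_{collector}_") for name in names)
        for collector in collectors
    }


def fetch_exporter_output(url="http://localhost:9182/metrics", timeout=5):
    """Fetch the exporter output. The /metrics path is an assumption."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `collector_prefixes_seen(fetch_exporter_output(), ["os", "cpu", "logical_disk", "net", "system"])` run on the Windows system shows `False` for any collector that is not producing metrics.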
Collect metrics by running Prometheus as a client collector on Windows
Use the Prometheus remote-write capabilities to push the metrics from the Windows system by running Prometheus as a client collector on Windows.
1. Download the Prometheus monitoring system and time series database. Download the prometheus-2.15.2.windows-amd64.tar.gz file.
2. Extract the prometheus-2.15.2.windows-amd64.tar.gz file.
3. Edit the prometheus.yml file in a text editor.
4. Configure the scrape_configs section of the prometheus.yml configuration file as follows so that Prometheus scrapes the Windows wmi_exporter.
```
 scrape_configs:
   # The job name is added as a label `job=<job_name>` to any timeseries scraped from this configuration.
   - job_name: 'wmi_exporter'

     static_configs:
       - targets: ['localhost:9182']
         # Labels belong to the same static_configs entry as the targets that they describe.
         labels:
           region: us-east
           instance: <HOSTNAME>
           job: <JOBNAME>
```
  where,
  - <HOSTNAME> is the name of the Windows system
  - <JOBNAME> is a custom value that identifies the role of the node that you are scraping. You can also use it to scope the data in Sysdig.
5. Add the remote_write configuration at the end of the prometheus.yml file to configure the target Sysdig instance that will receive the metrics.
```
 remote_write:
   - url: "ENDPOINT/api/prometheus/write"

     bearer_token_file: C:\Users\Administrator\prom\sysdig-apikey

     write_relabel_configs:
       # Drop forwarding the metrics generated by the exporter that are not supported
       - source_labels: ["__name__"]
         regex: "^wmi_(.*)"
         action: keep

       - regex: "(__name__)|(job)|(region)|(instance)|(status)|(core)|(name)|(start_mode)|(nic)|(volume)|(state)|(version)|(mode)|(branch)|(timezone)|(goversion)|(collector)|(revision)"
         action: labelkeep
```
  where,
  - ENDPOINT is the Sysdig collector endpoint. For the list of endpoints, see Sysdig Collector endpoints.
  - sysdig-apikey is the file that contains the Sysdig Monitor API Token. The file name does not have an extension.
    For information about how to get the API token, see Getting the Sysdig API token.
    Example: Completed version of the prometheus.yml file
    # my global config
    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              # - alertmanager:9093

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'wmi_exporter'
        static_configs:
          - targets: ['localhost:9182']
            labels:
              instance: "my-windows-hostname"
              region: "us-south"

    # Connection to sysdig
    remote_write:
      - url: "https://ingest.eu-gb.monitoring.cloud.ibm.com/api/prometheus/write"
        bearer_token_file: C:\Users\Administrator\prom\sysdig-api
        write_relabel_configs:
          - source_labels: ["__name__"]
            regex: "^wmi_(.*)"
            action: keep
          - regex: "(__name__)|(job)|(region)|(instance)|(status)|(core)|(name)|(start_mode)|(nic)|(volume)|(state)|(version)|(mode)|(branch)|(timezone)|(goversion)|(collector)|(revision)"
            action: labelkeep
6. Start the Prometheus executable from the location containing the prometheus.yml file. Run .\prometheus.exe.
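
The two write_relabel_configs rules in step 5 decide which series reach Sysdig and which labels they carry. The following Python sketch imitates that behavior for a single series so that you can see the effect of each rule. It is a simplified model, not Prometheus code: it relies on the fact that Prometheus anchors relabeling regular expressions at both ends, and the function name `apply_write_relabel` is hypothetical.

```python
import re

# Regular expressions copied from the remote_write configuration in step 5.
KEEP_NAME_REGEX = r"^wmi_(.*)"
LABELKEEP_REGEX = (
    r"(__name__)|(job)|(region)|(instance)|(status)|(core)|(name)|(start_mode)"
    r"|(nic)|(volume)|(state)|(version)|(mode)|(branch)|(timezone)|(goversion)"
    r"|(collector)|(revision)"
)


def apply_write_relabel(series):
    """Return the relabeled series, or None if the keep rule drops it.

    `series` maps label names to values and includes the special
    __name__ label that holds the metric name.
    """
    # action: keep -> the series is forwarded only if __name__ matches.
    if not re.fullmatch(KEEP_NAME_REGEX, series.get("__name__", "")):
        return None
    # action: labelkeep -> labels whose names do not match are removed.
    return {
        name: value
        for name, value in series.items()
        if re.fullmatch(LABELKEEP_REGEX, name)
    }


print(apply_write_relabel({"__name__": "go_goroutines", "job": "wmi_exporter"}))
print(apply_write_relabel({
    "__name__": "wmi_logical_disk_free_bytes",
    "volume": "C:",
    "instance": "my-windows-hostname",
    "custom": "removed-by-labelkeep",
}))
```

The first call models a metric that does not start with wmi_, which is not forwarded. The second call models a wmi_ metric that is forwarded without its `custom` label.
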
To monitor Windows system metrics, use the default Windows Node Overview dashboard, which is located in the Hosts and Containers section.
(Optional) Verify the uptime for Windows with Prometheus Blackbox exporter. For details, see Verifying uptime for Windows with Prometheus Blackbox exporter.

Metrics provided for Linux systems

BMC Helix Continuous Optimization	IBM Cloud metric	IBM Cloud metric label key	Time Aggregation type	Group Aggregation type	Formula	Description
NET_IN_BIT_RATE	net.bytes.in		timeAvg	sum	net.bytes.in*8	This metric displays the inbound network bytes.
NET_OUT_BIT_RATE	net.bytes.out		timeAvg	sum	net.bytes.out*8	This metric displays the outbound network bytes.
NET_BIT_RATE	net.bytes.total		timeAvg	sum	net.bytes.total*8	This metric displays the total network bytes.
NET_CONNECTION_RATE	net.connection.count.total		timeAvg	sum		This metric displays the number of currently established connections.
NET_IN_ERROR_RATE	net.error.count		timeAvg	sum		This metric displays the number of network errors.
CPU_USED_NUM	cpu.cores.used		timeAvg	avg		This metric displays the CPU core usage of each container obtained from cgroups; and is equal to the number of cores used by the container.
CPU_UTIL_IDLE	cpu.idle.percent		avg	avg	cpu.idle.percent/100	This metric displays the percentage of time that the CPU/s were idle and the system did not have an outstanding disk I/O request.
CPU_UTIL_WAIO	cpu.iowait.percent		avg	avg	cpu.iowait.percent/100	This metric displays the percentage of time that the CPU/s were idle during which the system had an outstanding disk I/O request.
CPU_UTIL_WAIT	cpu.stolen.percent		avg	avg	cpu.stolen.percent/100	This metric measures the percentage of time that a virtual machine's CPU is in a state of involuntary wait because the physical CPU is shared among virtual machines.
CPU_UTIL_NICE	cpu.nice.percent		avg	avg	cpu.nice.percent/100	This metric displays the percentage of CPU utilization that occurred while executing at the user level with `Nice` priority.
CPU_UTIL_SYSTEM	cpu.system.percent		avg	avg	cpu.system.percent/100	This metric displays the percentage of CPU utilization that occurred while executing at the system level (kernel).
CPU_UTIL	cpu.used.percent		avg	avg	cpu.used.percent/100	This metric displays the CPU usage for each host, which is obtained from /proc and measured as the sum of the CPU usage of all cores, normalized by dividing by the number of cores.
CPU_UTIL_USER	cpu.user.percent		avg	avg	cpu.user.percent/100	This metric displays the percentage of CPU utilization that occurred while executing at the user level (application).
BYFS_FREE	fs.bytes.free	fs.mountDir		avg		This metric displays the available filesystem space.
BYFS_SIZE	fs.bytes.total	fs.mountDir	timeAvg	avg		This metric displays the total filesystem size.
BYFS_USED	fs.bytes.used	fs.mountDir	timeAvg	avg		This metric displays the used filesystem space.
BYFS_USED_SPACE_PCT	fs.bytes.used fs.bytes.total	fs.mountDir fs.mountDir	timeAvg timeAvg	avg avg	fs.bytes.used / fs.bytes.total	This metric displays the percentage of used disk space of a specific filesystem.
TOTAL_FS_UTIL	fs.used.percent		avg	avg		This metric displays the amount of space written by a single container instance.
TOTAL_FS_FREE	fs.bytes.free		timeAvg	avg		This metric displays the amount of free disk space on all filesystems, in Bytes.
TOTAL_FS_SIZE	fs.bytes.total		timeAvg	avg		This metric displays the total size of all the filesystems, in Bytes.
TOTAL_FS_USED	fs.bytes.used		timeAvg	avg		This metric displays the amount of used disk space on all filesystems, in Bytes.
DISK_USED_INODES_PCT	fs.inodes.used.percent		avg	avg
BYFS_TOTAL_INODES	fs.inodes.total.count	fs.mountDir	timeAvg	avg
BYFS_USED_INODES	fs.inodes.used.count	fs.mountDir	timeAvg	avg
BYFS_FREE_INODES	fs.inodes.total.count fs.inodes.used.count	fs.mountDir	timeAvg	avg	fs.inodes.total.count - fs.inodes.used.count
MEM_FREE	memory.bytes.available		timeAvg	avg		This metric displays the amount of available memory.
MEM_USED	memory.bytes.used		timeAvg	avg		This metric displays the amount of physical memory currently in use.
MEM_VIRTUAL_TOTAL	memory.bytes.virtual		timeAvg	avg		This metric displays the virtual memory size of the process, in bytes.
DISK_PAGING_IO_RATE	memory.pageFault.major		timeAvg	sum		This metric displays the count of the condition that occurs when a program accesses a memory page that is mapped in the virtual address space, but not loaded in physical memory.
TOTAL_REAL_MEM	memory.bytes.total		timeAvg	avg		This metric displays the total memory of a host, in bytes.
SWAP_SPACE_FREE	memory.swap.bytes.available		timeAvg	avg		This metric displays the swap memory available.
SWAP_SPACE_TOT	memory.swap.bytes.total		timeAvg	avg		This metric displays the total amount of swap memory.
SWAP_SPACE_USED	memory.swap.bytes.used		timeAvg	avg		This metric displays the amount of swap memory used.
SWAP_SPACE_UTIL	memory.swap.used.percent		avg	avg		This metric displays the percentage of swap memory used.
MEM_UTIL	memory.used.percent		avg	avg		This metric displays the percentage of physical memory in use.
UPTIME	uptime		timeAvg	avg	(1-uptime)*3600	This metric displays the percentage of time the selected entity or entities were down over the defined time window.
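
The Formula column in the preceding table uses simple arithmetic on the IBM Cloud metric values. The following Python sketch restates four of those formulas as functions. The function names are hypothetical, and the sketch only illustrates the arithmetic; how the ETL applies the formulas internally is not described on this page.

```python
def to_bit_rate(bytes_value):
    """NET_IN_BIT_RATE, NET_OUT_BIT_RATE, NET_BIT_RATE: bytes * 8."""
    return bytes_value * 8


def percent_to_fraction(percent):
    """CPU_UTIL and the other CPU_UTIL_* rows: percent / 100."""
    return percent / 100


def used_space_fraction(fs_bytes_used, fs_bytes_total):
    """BYFS_USED_SPACE_PCT: fs.bytes.used / fs.bytes.total."""
    return fs_bytes_used / fs_bytes_total


def free_inodes(total_count, used_count):
    """BYFS_FREE_INODES: fs.inodes.total.count - fs.inodes.used.count."""
    return total_count - used_count


# Sample values chosen only for illustration.
print(to_bit_rate(1250))             # net.bytes.in of 1250 bytes
print(percent_to_fraction(25))       # cpu.used.percent of 25
print(used_space_fraction(30, 120))  # 30 bytes used of 120 bytes total
print(free_inodes(1000, 250))        # 1000 inodes in total, 250 used
```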

Metrics provided for Windows systems

BMC Helix Continuous Optimization	IBM Cloud metric	IBM Cloud metric label key	Time Aggregation type	Group Aggregation type	Formula	Description
CPU_MHZ	wmi_cpu_core_frequency_mhz		avg	avg		This metric displays the core frequency.
BYLDISK_IO_READ_RATE	wmi_logical_disk_reads_total	volume	timeAvg	avg
DISK_IO_READ_RATE	wmi_logical_disk_reads_total		timeAvg	sum
BYLDISK_IO_WRITE_RATE	wmi_logical_disk_writes_total	volume	timeAvg	avg
DISK_IO_WRITE_RATE	wmi_logical_disk_writes_total		timeAvg	sum
BYLDISK_SIZE	wmi_logical_disk_size_bytes	volume	avg	avg
BYLDISK_FREE_SPACE	wmi_logical_disk_free_bytes	volume	avg	avg
BYLDISK_USED_SPACE	wmi_logical_disk_size_bytes wmi_logical_disk_free_bytes	volume	avg	avg	wmi_logical_disk_size_bytes - wmi_logical_disk_free_bytes
TOTAL_LDISK_SIZE	wmi_logical_disk_size_bytes		avg	sum
LDISK_FREE	wmi_logical_disk_free_bytes		avg	sum
TOTAL_LDISK_USED	wmi_logical_disk_size_bytes wmi_logical_disk_free_bytes		avg	sum	wmi_logical_disk_size_bytes - wmi_logical_disk_free_bytes
BYLDISK_READ_RESPONSE_TIME	wmi_logical_disk_read_seconds_total	volume	avg	avg
DISK_READ_RESPONSE_TIME	wmi_logical_disk_read_seconds_total		avg	sum
BYLDISK_WRITE_RESPONSE_TIME	wmi_logical_disk_write_seconds_total	volume	avg	avg
DISK_WRITE_RESPONSE_TIME	wmi_logical_disk_write_seconds_total		avg	sum
DISK_IO_RATE	wmi_logical_disk_reads_total wmi_logical_disk_writes_total	volume	timeAvg timeAvg	sum sum	wmi_logical_disk_reads_total + wmi_logical_disk_writes_total	This metric displays the disk Average I/O Rate aggregated by the host.
NET_OUT_BIT_RATE	wmi_net_bytes_sent_total		avg	sum	wmi_net_bytes_sent_total*8	This metric displays the total bytes transmitted by interface.
NET_IN_BIT_RATE	wmi_net_bytes_received_total		avg	sum	wmi_net_bytes_received_total*8	This metric displays the total bytes received by interface.
NET_BIT_RATE	wmi_net_bytes_total		avg	sum	wmi_net_bytes_total*8	This metric displays the total bytes received and transmitted by interface.
NET_OUT_PKT_ERROR_RATE	wmi_net_packets_outbound_errors		avg	sum		This metric displays the total packets that could not be transmitted due to errors.
NET_IN_PKT_ERROR_RATE	wmi_net_packets_received_errors		timeAvg	sum		This metric displays the total packets that could not be received due to errors.
NET_IN_PKT_RATE	wmi_net_packets_received_total		avg	sum		This metric displays the total packets received by interface.
NET_OUT_PKT_RATE	wmi_net_packets_sent_total		avg	sum		This metric displays the total packets transmitted by interface.
NET_PKT_RATE	wmi_net_packets_total		avg	sum		This metric displays the total packets received and transmitted by interface.
NET_BANDWIDTH	wmi_net_current_bandwidth		avg	sum		This metric displays the estimate of the interface's current bandwidth.
MEM_VIRTUAL_FREE	wmi_os_virtual_memory_free_bytes		avg	sum		This metric displays the bytes of virtual memory currently unused and available.
TOTAL_REAL_MEM	wmi_cs_physical_memory_bytes		timeAvg	sum		This metric displays the total installed physical memory.
MEM_FREE	wmi_os_physical_memory_free_bytes		avg	sum		This metric displays the bytes of physical memory currently unused and available.
MEM_USED	wmi_os_visible_memory_bytes wmi_os_physical_memory_free_bytes		avg	sum	wmi_os_visible_memory_bytes - wmi_os_physical_memory_free_bytes	This metric displays the total used memory, in bytes.
MEM_UTIL	wmi_os_visible_memory_bytes wmi_os_physical_memory_free_bytes		avg	sum	(wmi_os_visible_memory_bytes - wmi_os_physical_memory_free_bytes) / wmi_os_visible_memory_bytes	This metric displays the percentage of physical memory in use during the interval.
MEM_VIRTUAL_TOTAL	wmi_os_virtual_memory_bytes		avg	sum		This metric displays the bytes of virtual memory.
PROCESS_NUM_RUNNING	wmi_os_processes		avg	sum		This metric displays the number of process contexts currently loaded or running on the operating system.
MEM_COMMIT_LIMIT	wmi_os_paging_limit_bytes		avg	sum		This metric displays the total number of bytes that can be stored in the operating system paging files.
REQ_QUEUED	wmi_system_processor_queue_length		avg	avg		This metric displays the number of threads in the processor queue.
UPTIME	wmi_system_system_up_time		avg	avg		This metric displays the time of the last system boot.
CPU_UTIL_IDLE	wmi_cpu_time_total	Label key: mode Label value: idle	timeAvg		wmi_cpu_time_total
CPU_UTIL_USER	wmi_cpu_time_total	Label key: mode Label value: user	timeAvg		wmi_cpu_time_total
CPU_UTIL_INTERRUPT_HANDLING	wmi_cpu_time_total	Label key: mode Label value: interrupt	timeAvg		wmi_cpu_time_total
CPU_UTIL_SYSTEM	wmi_cpu_time_total	Label key: mode Label value: privileged	timeAvg		wmi_cpu_time_total
CPU_UTIL	wmi_cpu_time_total_privileged wmi_cpu_time_total_user	wmi_cpu_time_total_privileged Label key: mode Label value: privileged wmi_cpu_time_total_user Label key: mode Label value: user	timeAvg		wmi_cpu_time_total_privileged + wmi_cpu_time_total_user
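
Several Windows rows combine two exporter metrics. As a minimal sketch, the following Python function derives four of those rows from a dictionary of sample values that is keyed by exporter metric name. The function name `derive_windows_metrics` and the input format are assumptions for illustration; the formulas are taken from the Formula column.

```python
def derive_windows_metrics(samples):
    """Compute formula-based rows from a dict of exporter metric values."""
    visible = samples["wmi_os_visible_memory_bytes"]
    free = samples["wmi_os_physical_memory_free_bytes"]
    disk_size = samples["wmi_logical_disk_size_bytes"]
    disk_free = samples["wmi_logical_disk_free_bytes"]
    return {
        # wmi_os_visible_memory_bytes - wmi_os_physical_memory_free_bytes
        "MEM_USED": visible - free,
        # (visible - free) / visible
        "MEM_UTIL": (visible - free) / visible,
        # wmi_logical_disk_size_bytes - wmi_logical_disk_free_bytes
        "TOTAL_LDISK_USED": disk_size - disk_free,
        # wmi_net_bytes_total * 8
        "NET_BIT_RATE": samples["wmi_net_bytes_total"] * 8,
    }


# Sample values chosen only for illustration.
print(derive_windows_metrics({
    "wmi_os_visible_memory_bytes": 8000,
    "wmi_os_physical_memory_free_bytes": 2000,
    "wmi_logical_disk_size_bytes": 100,
    "wmi_logical_disk_free_bytes": 40,
    "wmi_net_bytes_total": 500,
}))
```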

Troubleshooting Sysdig installation failure in Linux

If the Sysdig installation fails, perform the following tasks based on the type of error and your Linux distribution:

Type of error	Troubleshooting steps
Kernel headers are not available	Install the kernel headers manually. For Debian or Ubuntu Linux distributions: (1) Identify your distribution (cat /etc/os-release). (2) Run the following command for that distribution: `apt-get -y install linux-headers-$(uname -r)` (3) If the error still persists, run the following command: `yum install kernel kernel-headers` (4) Deploy the Sysdig agent. For RHEL, CentOS, and Fedora Linux distributions: (1) Identify your distribution (cat /etc/os-release). (2) Run the following command for that distribution: `yum -y install kernel-devel-$(uname -r)` (3) If the error still persists, run the following command: `yum install kernel kernel-headers` (4) Deploy the Sysdig agent.
sysdig-probe kernel module is not installed on the kernel	(1) Install the kernel module by using the following command: `yum install kernel kernel-headers` (2) Deploy the Sysdig agent.
Installation fails because the dkms_autoinstaller service is stopped	(1) Use the following commands to start the service: `sudo yum -y install kernel-devel-$(uname -r)` `sudo /usr/lib/dkms/dkms_autoinstaller start` (2) Deploy the Sysdig agent.
The kernel packages are not available	(1) Run the following command. The error that it generates lists the names of the packages that are not available: `yum -y install kernel-devel-$(uname -r)` (2) Download the package from https://rpmfind.net/linux/rpm2html/search.php?query=kernel-devel-x86_64 by using the `wget` command. (3) Install the missing package: `sudo yum localinstall <RPM file name>` (4) Install the kernel headers: `sudo yum -y install kernel-devel-$(uname -r)` `yum install kernel kernel-headers`
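
The first row of the table chooses a command based on the distribution that /etc/os-release reports. The following Python sketch shows one way to make that choice from the `ID` field of the file. The mapping contains only the first command listed for each distribution in the table; the function names and the use of the `ID` field are assumptions for illustration.

```python
def parse_os_release(text):
    """Parse KEY=value lines from /etc/os-release content into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


# First kernel-header command from the table for each distribution ID.
HEADER_COMMANDS = {
    "debian": "apt-get -y install linux-headers-$(uname -r)",
    "ubuntu": "apt-get -y install linux-headers-$(uname -r)",
    "rhel": "yum -y install kernel-devel-$(uname -r)",
    "centos": "yum -y install kernel-devel-$(uname -r)",
    "fedora": "yum -y install kernel-devel-$(uname -r)",
}


def kernel_header_command(os_release_text):
    """Return the kernel-header install command, or None if the ID is unknown."""
    distro_id = parse_os_release(os_release_text).get("ID", "").lower()
    return HEADER_COMMANDS.get(distro_id)


# Sample os-release content for illustration.
print(kernel_header_command('NAME="Ubuntu"\nID=ubuntu\nVERSION_ID="20.04"\n'))
```

The function only returns the command text. Run the command yourself on the virtual server, and then deploy the Sysdig agent again.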

If the Sysdig agent installation still fails, check the logs at /opt/draios/etc/draios.log and raise a support case with the IBM Support team.