Troubleshooting issues reported by the monitoring solution

After setting up the monitoring of your on-premises BMC Helix IT Operations Management, you receive events based on the alert and alarm policies that you configured. Use the information in this section to resolve the events.

Postgres replica pod does not stream data

You receive an event based on the alarm policy that you configured for the Primary Replication Slot Lag attribute under the PostgreSQL Replications application class.
For more information, see Configure Self-Monitoring-PostgreSQL policy.

Here is an example of an alarm policy configuration:

[Screenshot: Alarm policy configuration]

To resolve the event: 

  1. To get the list of PostgreSQL pods, run the following command:

    oc get pods -n <ITOM namespace> | grep postgres-bmc-pg

    Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.

  2. Log in to any of the PostgreSQL pods by running the following command:

    oc exec -it postgres-bmc-pg-ha-15-0 -n <ITOM namespace> -- bash

    Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.

  3. To find the state of the PostgreSQL pods, run the following command:
    patronictl list

    [Screenshot: patronictl list output]
    In this example, postgres-bmc-pg-ha-15-2 is not streaming data. 
  4. Reinitialize the process by running the following command:

    patronictl reinit postgres-bmc-pg-ha-15 postgres-bmc-pg-ha-15-2
  5. To verify the pod is in a streaming state, re-run the patronictl list command:

    [Screenshot: patronictl list output]
    Pod postgres-bmc-pg-ha-15-2 is now streaming data. 
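
If this replica lag recurs, you can run the same checks without an interactive shell. The following is a minimal sketch, assuming the cluster is named postgres-bmc-pg-ha-15 as in the example; replace <ITOM namespace> and the replica pod name with values from your environment.

# Minimal sketch: check replication state and reinitialize a lagging replica.
# <ITOM namespace> and postgres-bmc-pg-ha-15-2 are placeholders for your environment.

# Show the state of every member of the Patroni cluster
oc exec postgres-bmc-pg-ha-15-0 -n <ITOM namespace> -- patronictl list

# Reinitialize the member that is not streaming; --force skips the confirmation prompt
oc exec postgres-bmc-pg-ha-15-0 -n <ITOM namespace> -- patronictl reinit --force postgres-bmc-pg-ha-15 postgres-bmc-pg-ha-15-2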

 

Elasticsearch cluster is in an unhealthy state due to unassigned shards

You receive an event based on the alarm policy configured for the Unassigned Shards attribute under the Elasticsearch Shards application class.

Here is an example of an alarm policy configuration:
[Screenshot: Unassigned shards alarm policy configuration]

For more information, see Configure the Self-Monitoring-Opensearch policy.

To resolve the event: 

  1. Log in to any of the event-service pods by running the following command:

    oc exec -it event-service-<pod id> -n <ITOM namespace> -- bash
  2. To verify that the issue exists, run the following command:

    curl -s -XGET http://opensearch-events-data:9200/_cluster/health

    Sample output:
    "active_shards_percent_as_number": 80

    If the active_shards_percent_as_number value is less than 100, an issue exists.

  3. To resolve the issue, run the following command:

    curl -XPOST http://opensearch-events-data:9200/_cluster/reroute?retry_failed=true
  4. To confirm all shards are in the started state, run the following command:

    curl -s -XGET "http://opensearch-events-data:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep -v STARTED
  5. To confirm that the cluster is in a healthy state, run the following command:

    curl -s -XGET http://opensearch-events-data:9200/_cluster/health

    Sample output:

    {"cluster_name":"opensearch-events","status":"green","timed_out":false,"number_of_nodes":6,"number_of_data_nodes":3,"discovered_master":true,"active_primary_shards":144,"active_shards":348,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
  6. Confirm that the status parameter value is green and the active_shards_percent_as_number parameter value is 100.
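
If the reroute call does not clear the shards, the cluster allocation explain API reports why a shard remains unassigned. A minimal sketch, run from the same event-service pod; note that the API returns an error if no shard is currently unassigned:

# Explain why the first unassigned shard cannot be allocated
curl -s -XGET http://opensearch-events-data:9200/_cluster/allocation/explain?pretty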

 

Application pods are getting empty service registration information from Zookeeper

The event is based on the Log Analytics connector for Kubernetes and a Log Analytics Alert Policy:

  • The 'Filtering Rule' Regex pattern for the Agent configuration in the values.yaml file is: 

     - regex: "regex log .*ERROR.*|.*NOT_ENOUGH_REPLICAS.*|.*Exception.*|.*invalid permission.*|.*TSMicroserviceNotAvailableException.*"

     [Screenshot: Filtering rule configuration]

  • An example of the Log Analytics Policy Selection Criteria in the alert policy is: 

    ( bmc_connector_name EQUALS 'log-analytics-k8s-connector' ) AND ( message CONTAINS 'com.bmc.truesight.tspodlibrary.TSMicroserviceNotAvailableException:' )

    The alert policy must include the following condition as an 'Or' criterion along with the other criteria: message CONTAINS 'com.bmc.truesight.tspodlibrary.TSMicroserviceNotAvailableException:'

For more information, see Creating Alert policies in BMC Helix Log Analytics.

To resolve the event, restart the services by running the following commands:

oc -n <ITOM namespace> rollout restart deploy agent-configuration-service agent-health-monitor-service anomaly-service autoanomaly-service cli-service data-download-gateway-service deployment-repository-service hm-preprocessor-service hm-preprocessor-query-service thirdparty-entity-reconciler-service impact-management-service itil-service managed-object-service notification-service open-data-gateway-service prometheus-entity-reconciler-service tag-service ui-service user-management-service
oc -n <ITOM namespace> rollout restart deploy anomaly-detection-service event-ingestion-service event-mgmt-service event-processor-service event-service anomaly-event-ingestion-service anomaly-event-processor-service timeframe-service tmf-events-service metric-aggregation-service metric-configuration-service metric-gateway-service metric-query-service metric-ingestion-service thirdparty-ingestion-service prometheus-ingestion-service ml-model-mgmt-service 

Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.
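
Optionally, you can wait for the restarts to finish instead of checking each pod by hand. A minimal sketch; the three deployment names below are an illustrative subset of the list above:

# Block until each restarted deployment reports a successful rollout
for d in event-service event-processor-service metric-gateway-service; do
  oc -n <ITOM namespace> rollout status deploy "$d" --timeout=300s
done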

 

Application calls are waiting for PostgreSQL database read operation

You receive an event based on the alarm policy configured for PostgreSQL Environment | Instance Availability:
[Screenshot: Alarm policy configuration]

For more information, see Configure Self-Monitoring-PostgreSQL policy.

To resolve the event, restart the PostgreSQL pool pods by running the following command:

oc -n <ITOM namespace> rollout restart deploy <PostgreSQL pool pod>

Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.

Example:

oc -n itom rollout restart deploy postgres-bmc-pg-ha-15-pool
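
If you are unsure of the pool deployment name in your environment, you can look it up first and then watch the restart finish. A minimal sketch; the grep pattern assumes the default postgres-bmc-pg-ha naming shown above:

# Confirm the pool deployment name before restarting
oc get deploy -n <ITOM namespace> | grep pool

# Watch the restart complete
oc -n <ITOM namespace> rollout status deploy postgres-bmc-pg-ha-15-pool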

 

Services cannot connect to the Redis master pod through the Redis HA proxy service

The event is based on the Log Analytics connector for Kubernetes and a Log Analytics alert policy. The event is generated when the following condition, set in the alert policy as an 'Or' criterion along with other criteria, matches: message CONTAINS 'org.redisson.client.RedisException: READONLY'

For more information, see "To create Alert policies in BMC Helix Log Analytics" in Configuring the log collection and alert policies.

To resolve the event, restart the Redis HA proxy pods by running the following command:

oc -n <ITOM namespace> rollout restart deploy redis-redis-ha-haproxy  

Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.
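
To confirm that the restart took effect, you can watch the rollout and verify that the proxy pods are back in the Running state. A minimal sketch using the deployment name above:

# Wait for the haproxy rollout to finish, then check the pods
oc -n <ITOM namespace> rollout status deploy redis-redis-ha-haproxy
oc get pods -n <ITOM namespace> | grep redis-redis-ha-haproxy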

 

Postgres replica pod does not stream data to the standby pod

You receive an event based on the alarm policy configured for the Standby Replication Lag attribute under the PostgreSQL Replications application class. 

[Screenshot: Standby Replication Lag alarm policy configuration]

To resolve the event:

  1. Log in to the postgres-bmc-pg-ha-15-0 pod:

    oc exec -it postgres-bmc-pg-ha-15-0 -n <ITOM namespace> -- bash
  2. Run the command patronictl list to validate whether the pods are streaming data.
    [Screenshot: patronictl list output]

    In the example, postgres-bmc-pg-ha-15-2 is not streaming data.

  3. Run the following command to reinitialize the process:

    patronictl reinit postgres-bmc-pg-ha-15 postgres-bmc-pg-ha-15-2

    [Screenshot: patronictl reinit output]

  4. Re-run the patronictl list command to verify the pod is streaming data.
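
To quantify the standby lag itself, you can query the standard pg_stat_replication view from inside the leader pod (step 1). A minimal sketch, assuming the default postgres user and that postgres-bmc-pg-ha-15-0 is the current leader:

# Show each standby's streaming state and replay lag from the primary
psql -U postgres -c "SELECT application_name, state, sync_state, replay_lag FROM pg_stat_replication;"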
     

 

Elasticsearch cluster is in an unhealthy state

You receive an event based on the alarm policy configured for the Cluster Status attribute under the Elasticsearch Cluster application class.

Example:

[Screenshot: Cluster Status alarm policy configuration]

To resolve the event:

  1. Log in to any of the event-service pods by running the following command:

    oc exec -it event-service-96d8b4cdd-27fhk -n <ITOM namespace> -- bash
  2. To verify that the issue exists, run the following command:

    curl -s -XGET http://opensearch-events-data:9200/_cluster/health

    Sample output:
    {"cluster_name":"opensearch-events","status":"red","timed_out":false, ...

    If the cluster status is not green, or if the active_shards_percent_as_number value is less than 100, an issue exists.

  3. To fix the issue, run the following curl command:

    curl -XPOST http://opensearch-events-data:9200/_cluster/reroute?retry_failed=true
  4. To make sure all shards are in the started state, run the following command:

    curl -s -XGET "http://opensearch-events-data:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep -v STARTED
  5. To make sure the cluster is in a healthy state, run the following command:

    curl -s -XGET http://opensearch-events-data:9200/_cluster/health

    Sample output:

    {"cluster_name":"opensearch-events","status":"green","timed_out":false,"number_of_nodes":6,"number_of_data_nodes":3,"discovered_master":true,"active_primary_shards":144,"active_shards":348,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}

    Confirm that the status parameter value is green and the active_shards_percent_as_number parameter value is 100.
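
Rather than polling the health endpoint repeatedly, you can ask the cluster to block until it turns green. A minimal sketch; wait_for_status and timeout are standard _cluster/health request parameters:

# Return as soon as the cluster reaches green, or after 60 seconds, whichever comes first
curl -s -XGET "http://opensearch-events-data:9200/_cluster/health?wait_for_status=green&timeout=60s"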

 

 

OpenSearch services are down

You receive an event based on the alarm policy configured for the Elasticsearch service status attribute under the Elasticsearch Node application class.

[Screenshot: Elasticsearch service status alarm policy configuration]

To resolve the event:

  1. Run the following commands:

    oc get pods -n <ITOM namespace> | grep opensearch-event

    oc get pods -n <ITOM namespace> | grep opensearch-logs

    Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.

  2. Make sure that all the pods are in the Running state.
  3. Delete any pod that is not in the Running state by running the following command:

    oc delete pod opensearch-logs-data-0 -n itom

    The pod's controller automatically re-creates the deleted pods.
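
To find every OpenSearch pod that is not in the Running state in one pass, you can filter on the pod phase. A minimal sketch; --field-selector is a standard oc option:

# List OpenSearch pods whose phase is not Running
oc get pods -n <ITOM namespace> --field-selector=status.phase!=Running | grep opensearch

# Delete each such pod so its controller re-creates it
oc delete pod <pod name> -n <ITOM namespace>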

 

 
