Troubleshooting issues reported by the monitoring solution

After setting up the monitoring of your on-premises BMC Helix IT Operations Management, you receive events based on the alert and alarm policies that you configured. Use the information in this section to resolve the events.

Postgres replica pod does not stream data

You receive an event based on the alarm policy that you configured for the Primary Replication Slot Lag attribute under the PostgreSQL Replications application class.
For more information, see Configure Self-Monitoring-PostgreSQL policy.

Here is an example of an alarm policy configuration:

[Screenshot: Alarm policy configuration]

To resolve the event: 

  1. To get the list of PostgreSQL pods, run the following command:

    oc get pods -n <ITOM namespace> | grep postgres-bmc-pg

    Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.

  2. Log in to any of the PostgreSQL pods by running the following command:

    oc exec -it postgres-bmc-pg-ha-15-0 -n <ITOM namespace> -- bash

    Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.

  3. To find the state of the PostgreSQL pods, run the following command:
    patronictl list

    [Screenshot: patronictl list output]
    In this example, postgres-bmc-pg-ha-15-2 is not streaming data. 
  4. Reinitialize the process by running the following command:

    patronictl reinit postgres-bmc-pg-ha-15 postgres-bmc-pg-ha-15-2
  5. To verify the pod is in a streaming state, re-run the patronictl list command:

    [Screenshot: patronictl list output]
    Pod postgres-bmc-pg-ha-15-2 is now streaming data. 
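
If this replica lag recurs, you can run the same checks without an interactive shell. The following is a minimal sketch, assuming the cluster is named postgres-bmc-pg-ha-15 as in the example; replace <ITOM namespace> and the replica pod name with values from your environment.

# Minimal sketch: check replication state and reinitialize a lagging replica.
# <ITOM namespace> and postgres-bmc-pg-ha-15-2 are placeholders for your environment.

# Show the state of every member of the Patroni cluster
oc exec postgres-bmc-pg-ha-15-0 -n <ITOM namespace> -- patronictl list

# Reinitialize the member that is not streaming; --force skips the confirmation prompt
oc exec postgres-bmc-pg-ha-15-0 -n <ITOM namespace> -- patronictl reinit --force postgres-bmc-pg-ha-15 postgres-bmc-pg-ha-15-2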

 

Elasticsearch cluster is in an unhealthy state due to unassigned shards

You receive an event based on the alarm policy configured for the Unassigned Shards attribute under the Elasticsearch Shards application class.

Here is an example of an alarm policy configuration:
[Screenshot: Unassigned shards alarm policy configuration]

For more information, see Configure the Self-Monitoring-Opensearch policy.

To resolve the event: 

  1. Log in to any of the event-service pods by running the following command:

    oc exec -it event-service-<pod id> -n <ITOM namespace> -- bash
  2. To verify that the issue exists, run the following command:

    curl -s -XGET http://opensearch-events-data:9200/_cluster/health

    Sample output:
    "active_shards_percent_as_number": 80

    If the active_shards_percent_as_number value is less than 100, an issue exists.

  3. To resolve the issue, run the following command:

    curl -XPOST http://opensearch-events-data:9200/_cluster/reroute?retry_failed=true
  4. To confirm all shards are in the started state, run the following command:

    curl -s -XGET "http://opensearch-events-data:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep -v STARTED
  5. To confirm that the cluster is in a healthy state, run the following command:

    curl -s -XGET http://opensearch-events-data:9200/_cluster/health

    Sample output:

    {"cluster_name":"opensearch-events","status":"green","timed_out":false,"number_of_nodes":6,"number_of_data_nodes":3,"discovered_master":true,"active_primary_shards":144,"active_shards":348,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
  6. Confirm that the status parameter value is green and the active_shards_percent_as_number parameter value is 100.
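
If the reroute call does not clear the shards, the cluster allocation explain API reports why a shard remains unassigned. A minimal sketch, run from the same event-service pod; note that the API returns an error if no shard is currently unassigned:

# Explain why the first unassigned shard cannot be allocated
curl -s -XGET http://opensearch-events-data:9200/_cluster/allocation/explain?pretty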

 

Application pods are getting empty service registration information from Zookeeper

The event is based on the Log Analytics connector for Kubernetes and a Log Analytics Alert Policy:

  • The 'Filtering Rule' Regex pattern for the Agent configuration in the values.yaml file is: 

     - regex: "regex log .*ERROR.*|.*NOT_ENOUGH_REPLICAS.*|.*Exception.*|.*invalid permission.*|.*TSMicroserviceNotAvailableException.*"

     [Screenshot: Filtering rule configuration]

  • An example of the Log Analytics Policy Selection Criteria in the alert policy is: 

    ( bmc_connector_name EQUALS 'log-analytics-k8s-connector' ) AND ( message CONTAINS 'com.bmc.truesight.tspodlibrary.TSMicroserviceNotAvailableException:' )

    The alert policy must include the following condition as an 'Or' criterion along with the other criteria: message CONTAINS 'com.bmc.truesight.tspodlibrary.TSMicroserviceNotAvailableException:'

For more information, see Creating Alert policies in BMC Helix Log Analytics.

To resolve the event, restart the services by running the following commands:

oc -n <ITOM namespace> rollout restart deploy agent-configuration-service agent-health-monitor-service anomaly-service autoanomaly-service cli-service data-download-gateway-service deployment-repository-service hm-preprocessor-service hm-preprocessor-query-service thirdparty-entity-reconciler-service impact-management-service itil-service managed-object-service notification-service open-data-gateway-service prometheus-entity-reconciler-service tag-service ui-service user-management-service
oc -n <ITOM namespace> rollout restart deploy anomaly-detection-service event-ingestion-service event-mgmt-service event-processor-service event-service anomaly-event-ingestion-service anomaly-event-processor-service timeframe-service tmf-events-service metric-aggregation-service metric-configuration-service metric-gateway-service metric-query-service metric-ingestion-service thirdparty-ingestion-service prometheus-ingestion-service ml-model-mgmt-service 

Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.
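
Optionally, you can wait for the restarts to finish instead of checking each pod by hand. A minimal sketch; the three deployment names below are an illustrative subset of the list above:

# Block until each restarted deployment reports a successful rollout
for d in event-service event-processor-service metric-gateway-service; do
  oc -n <ITOM namespace> rollout status deploy "$d" --timeout=300s
done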

 

Application calls are waiting for PostgreSQL database read operation

You receive an event based on the alarm policy configured for PostgreSQL Environment | Instance Availability:
[Screenshot: Alarm policy configuration]

For more information, see Configure Self-Monitoring-PostgreSQL policy.

To resolve the event, restart the PostgreSQL pool pods by running the following command:

oc -n <ITOM namespace> rollout restart deploy <PostgreSQL pool pod>

Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.

Example:

oc -n itom rollout restart deploy postgres-bmc-pg-ha-15-pool
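
If you are unsure of the pool deployment name in your environment, you can look it up first and then watch the restart finish. A minimal sketch; the grep pattern assumes the default postgres-bmc-pg-ha naming shown above:

# Confirm the pool deployment name before restarting
oc get deploy -n <ITOM namespace> | grep pool

# Watch the restart complete
oc -n <ITOM namespace> rollout status deploy postgres-bmc-pg-ha-15-pool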

 

Services cannot connect to the Redis master pod through the Redis HA proxy service

The event is based on the Log Analytics connector for Kubernetes and a Log Analytics alert policy. The event is generated when the following condition, set in the alert policy as an 'Or' criterion along with other criteria, matches: message CONTAINS 'org.redisson.client.RedisException: READONLY'

For more information, see "To create Alert policies in BMC Helix Log Analytics" in Configuring the log collection and alert policies.

To resolve the event, restart the Redis HA proxy pods by running the following command:

oc -n <ITOM namespace> rollout restart deploy redis-redis-ha-haproxy  

Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.
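
To confirm that the restart took effect, you can watch the rollout and verify that the proxy pods are back in the Running state. A minimal sketch using the deployment name above:

# Wait for the haproxy rollout to finish, then check the pods
oc -n <ITOM namespace> rollout status deploy redis-redis-ha-haproxy
oc get pods -n <ITOM namespace> | grep redis-redis-ha-haproxy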

 

Postgres replica pod does not stream data to the standby pod

You receive an event based on the alarm policy configured for the Standby Replication Lag attribute under the PostgreSQL Replications application class. 

[Screenshot: Standby Replication Lag alarm policy configuration]

To resolve the event:

  1. Log in to the postgres-bmc-pg-ha-15-0 pod:

    oc exec -it postgres-bmc-pg-ha-15-0 -n <ITOM namespace> -- bash
  2. Run the command patronictl list to validate whether the pods are streaming data.
    [Screenshot: patronictl list output]

    In the example, postgres-bmc-pg-ha-15-2 is not streaming data.

  3. Run the following command to reinitialize the process:

    patronictl reinit postgres-bmc-pg-ha-15 postgres-bmc-pg-ha-15-2

    [Screenshot: patronictl reinit output]

  4. Re-run the patronictl list command to verify the pod is streaming data.
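
To quantify the standby lag itself, you can query the standard pg_stat_replication view from inside the leader pod (step 1). A minimal sketch, assuming the default postgres user and that postgres-bmc-pg-ha-15-0 is the current leader:

# Show each standby's streaming state and replay lag from the primary
psql -U postgres -c "SELECT application_name, state, sync_state, replay_lag FROM pg_stat_replication;"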
     

 

Elasticsearch cluster is in an unhealthy state

You receive an event based on the alarm policy configured for the Cluster Status attribute under the Elasticsearch Cluster application class.

Example:

[Screenshot: Cluster Status alarm policy configuration]

To resolve the event:

  1. Log in to any of the event-service pods by running the following command:

    oc exec -it event-service-96d8b4cdd-27fhk -n <ITOM namespace> -- bash
  2. To verify that the issue exists, run the following command:

    curl -s -XGET http://opensearch-events-data:9200/_cluster/health

    Sample output:
    {"cluster_name":"opensearch-events","status":"red","timed_out":false, ...

    If the cluster status is not green, or if the active_shards_percent_as_number value is less than 100, an issue exists.

  3. To fix the issue, run the following curl command:

    curl -XPOST http://opensearch-events-data:9200/_cluster/reroute?retry_failed=true
  4. To make sure all shards are in the started state, run the following command:

    curl -s -XGET "http://opensearch-events-data:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep -v STARTED
  5. To make sure the cluster is in a healthy state, run the following command:

    curl -s -XGET http://opensearch-events-data:9200/_cluster/health

    Sample output:

    {"cluster_name":"opensearch-events","status":"green","timed_out":false,"number_of_nodes":6,"number_of_data_nodes":3,"discovered_master":true,"active_primary_shards":144,"active_shards":348,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}

    Confirm that the status parameter value is green and the active_shards_percent_as_number parameter value is 100.
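
Rather than polling the health endpoint repeatedly, you can ask the cluster to block until it turns green. A minimal sketch; wait_for_status and timeout are standard _cluster/health request parameters:

# Return as soon as the cluster reaches green, or after 60 seconds, whichever comes first
curl -s -XGET "http://opensearch-events-data:9200/_cluster/health?wait_for_status=green&timeout=60s"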

 

 

OpenSearch services are down

You receive an event based on the alarm policy configured for the Elasticsearch service status attribute under the Elasticsearch Node application class.

[Screenshot: Elasticsearch service status alarm policy configuration]

To resolve the event:

  1. Run the following commands:

    oc get pods -n <ITOM namespace> | grep opensearch-event

    oc get pods -n <ITOM namespace> | grep opensearch-logs

    Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.

  2. Make sure that all the pods are in the Running state.
  3. Delete any pod that is not in the Running state by running the following command:

    oc delete pod opensearch-logs-data-0 -n itom

    The pod's controller automatically re-creates the deleted pods.
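
To find every OpenSearch pod that is not in the Running state in one pass, you can filter on the pod phase. A minimal sketch; --field-selector is a standard oc option:

# List OpenSearch pods whose phase is not Running
oc get pods -n <ITOM namespace> --field-selector=status.phase!=Running | grep opensearch

# Delete each such pod so its controller re-creates it
oc delete pod <pod name> -n <ITOM namespace>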

 

 
