Troubleshooting issues reported by the monitoring solution
Postgres replica pod does not stream data
You receive an event based on the alarm policy you configured for PostgreSQL Replications or Primary Replication Slot Lag.
For more information, see Configure Self-Monitoring-PostgreSQL policy.
To resolve the event:
To get the list of PostgreSQL pods, run the following command:
oc get pods -n <ITOM namespace>|grep postgres-bmc-pg
Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.
Log in to any of the PostgreSQL pods by running the following command:
oc exec -it postgres-bmc-pg-ha-15-0 -n <ITOM namespace> -- bash
Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.
To find the state of the PostgreSQL pods, run the following command:
patronictl list
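The output resembles the following. This is an illustrative sketch only; the host addresses and timeline values are placeholders, and the exact state labels depend on your Patroni version (a healthy replica reports a streaming or running state):
+ Cluster: postgres-bmc-pg-ha-15 ----------+---------+-----------+----+-----------+
| Member                  | Host      | Role    | State     | TL | Lag in MB |
+-------------------------+-----------+---------+-----------+----+-----------+
| postgres-bmc-pg-ha-15-0 | 10.0.0.11 | Leader  | running   |  5 |           |
| postgres-bmc-pg-ha-15-1 | 10.0.0.12 | Replica | streaming |  5 |         0 |
| postgres-bmc-pg-ha-15-2 | 10.0.0.13 | Replica | stopped   |    |   unknown |
+-------------------------+-----------+---------+-----------+----+-----------+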
In this example, postgres-bmc-pg-ha-15-2 is not streaming data. Reinitialize the replica by running the following command:
patronictl reinit postgres-bmc-pg-ha-15 postgres-bmc-pg-ha-15-2
To verify that the pod is in a streaming state, re-run the patronictl list command.
Pod postgres-bmc-pg-ha-15-2 is now streaming data.
Elasticsearch cluster is in an unhealthy state due to unassigned shards
You receive an event based on the alarm policy configured for Elasticsearch, Shards, or unassigned shards.
For more information, see Configure the Self-Monitoring-Opensearch policy.
To resolve the event:
Log in to any of the event-service pods by running the following command:
oc exec -it event-service-<pod id> -- bash
To verify that the issue exists, run the following command:
curl -s -XGET http://opensearch-events-data:9200/_cluster/health
Sample output:
"active_shards_percent_as_number": 80
If the active_shards_percent_as_number value is less than 100, an issue exists. To resolve the issue, run the following command:
curl -XPOST "http://opensearch-events-data:9200/_cluster/reroute?retry_failed=true"
To confirm that all shards are in the started state, run the following command:
curl -s -XGET "http://opensearch-events-data:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep -v STARTED
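If the command returns no output, all shards are started. Otherwise, it lists the problem shards, one per line, in the requested column order (index, shard, prirep, state, unassigned.reason). A hypothetical output line, with a placeholder index name:
itom-events-000012 1 r UNASSIGNED ALLOCATION_FAILED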
To confirm that the cluster is in a healthy state, run the following command:
curl -s -XGET http://opensearch-events-data:9200/_cluster/health
Sample output:
{"cluster_name":"opensearch-events","status":"green","timed_out":false,"number_of_nodes":6,"number_of_data_nodes":3,"discovered_master":true,"active_primary_shards":144,"active_shards":348,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
Confirm that the status parameter value is green and the active_shards_percent_as_number parameter value is 100.
Application pods are getting empty service registration information from Zookeeper
The event is based on the Log Analytics connector for Kubernetes and a Log Analytics Alert Policy.
The 'Filtering Rule' Regex pattern for the Agent configuration in the values.yaml file is:
- regex: "regex log .*ERROR.*|.*NOT_ENOUGH_REPLICAS.*|.*Exception.*|.*invalid permission.*|.*TSMicroserviceNotAvailableException.*"
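For context, a sketch of where such a rule might sit in values.yaml. Only the - regex line above comes from this document; the surrounding keys are illustrative assumptions, not the product's actual schema, and the value appears to follow the Fluent Bit grep-filter form (match the log field against the alternation of patterns):
# Hypothetical structure -- every key except the regex value is an assumption
logCollection:
  filteringRules:
    - regex: "regex log .*ERROR.*|.*NOT_ENOUGH_REPLICAS.*|.*Exception.*|.*invalid permission.*|.*TSMicroserviceNotAvailableException.*"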
An example of the Log Analytics Policy Selection Criteria in the alert policy is:
( bmc_connector_name EQUALS 'log-analytics-k8s-connector' ) AND ( message CONTAINS 'com.bmc.truesight.tspodlibrary.TSMicroserviceNotAvailableException:' )
The alert policy must include the following condition as an 'Or' criterion along with the other criteria: message CONTAINS 'com.bmc.truesight.tspodlibrary.TSMicroserviceNotAvailableException:'
For more information, see Creating Alert policies in BMC Helix Log Analytics.
To resolve the event, restart the affected services.
Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.
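A minimal sketch of one way to restart them, assuming each affected service runs as a Kubernetes deployment (the deployment name is a placeholder):
oc rollout restart deployment <service deployment> -n <ITOM namespace>
The deployment re-creates the pods, forcing them to reconnect to Zookeeper.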
Application calls are waiting for PostgreSQL database read operation
You receive an event based on the alarm policy configured for PostgreSQL Environment | Instance Availability.
For more information, see Configure Self-Monitoring-PostgreSQL policy.
To resolve the event, restart the PostgreSQL pool pods.
Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.
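A minimal sketch, assuming the pool pods are managed by a deployment that automatically re-creates deleted pods (the pod name is a placeholder):
oc delete pod <postgres pool pod> -n <ITOM namespace>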
Services could not connect to Redis master pod through the Redis HA proxy service
The event is based on the Log Analytics connector for Kubernetes and a Log Analytics Alert Policy. The event is generated when the following condition, set in the alert policy as an 'Or' criterion along with other criteria, matches: message CONTAINS 'org.redisson.client.RedisException: READONLY'
For more information, see "To create Alert policies in BMC Helix Log Analytics" in Configuring the log collection and alert policies.
To resolve the event, restart the Redis HA proxy pods.
Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.
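A minimal sketch, assuming the proxy pods are managed by a deployment that automatically re-creates deleted pods (the pod name is a placeholder):
oc delete pod <redis HA proxy pod> -n <ITOM namespace>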
Postgres replica pod does not stream data to the standby pod
You receive an event based on the alarm policy configured for the Standby Replication Lag attribute under the PostgreSQL Replications application class.
To resolve the event:
Log in to the postgres-bmc-pg-ha-15-0 pod:
oc exec -it postgres-bmc-pg-ha-15-0 -- bash
Run the patronictl list command to validate whether the pods are streaming data.
If a pod, in this example postgres-bmc-pg-ha-15-2, is not streaming data, run the following command to reinitialize it:
patronictl reinit postgres-bmc-pg-ha-15 postgres-bmc-pg-ha-15-2
Re-run the patronictl list command to verify that the pod is streaming data.
Elasticsearch cluster is in an unhealthy state
You receive an event based on the alarm policy configured for Elasticsearch, Cluster or Cluster Status.
To resolve the event:
Log in to any of the event-service pods by running the following command:
oc exec -it event-service-96d8b4cdd-27fhk -- bash
To verify that the issue exists, run the following command:
curl -s -XGET http://opensearch-events-data:9200/_cluster/health
Sample output:
{"cluster_name":"opensearch-events","status":"red","timed_out":false
If the status of the cluster is not green, it indicates an issue exists.
If the active_shards_percent_as_number value is less than 100, an issue exists. To fix the issue, run the following command:
curl -XPOST "http://opensearch-events-data:9200/_cluster/reroute?retry_failed=true"
To make sure that all shards are in the started state, run the following command:
curl -s -XGET "http://opensearch-events-data:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep -v STARTED
To make sure that the cluster is in a healthy state, run the following command:
curl -s -XGET http://opensearch-events-data:9200/_cluster/health
Sample output:
{"cluster_name":"opensearch-events","status":"green","timed_out":false,"number_of_nodes":6,"number_of_data_nodes":3,"discovered_master":true,"active_primary_shards":144,"active_shards":348,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}event-service-96d8b4cdd-27fhk:/opt/bmc/deployer/bin$Confirm that the status parameter value is green and the active_shards_percent_as_number parameter value is 100.
OpenSearch services are down
You receive an event based on the alarm policy configured for Elasticsearch, Node or Elasticsearch service status.
To resolve the event:
Run the following commands:
oc get pods -n <ITOM namespace>|grep opensearch-event
oc get pods -n <ITOM namespace>|grep opensearch-logs
Here <ITOM namespace> is the name of the namespace where you installed BMC Helix IT Operations Management.
Make sure that all the pods are in the Running state.
Delete any pod that is not in the Running state by running the following command:
oc delete pod opensearch-logs-data-0 -n itom
The deployment automatically re-creates the deleted pods.
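To verify, list the OpenSearch pods again and confirm that the re-created pods reach the Running state:
oc get pods -n <ITOM namespace>|grep opensearch-logs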