Troubleshooting PostgreSQL issues
PostgreSQL is running out of disk space, and applications fail to connect to the PostgreSQL server.
Scope
This issue occurs when the pg_wal directory on the PostgreSQL (PG) PVC, mounted at /data/pg_wal, consumes all the available space.
Workaround
To resolve the issue, make sure that the replication state is active for secondary PostgreSQL instances, check the PostgreSQL logs, and take corrective actions.
For more information, see PostgreSQL is running out of disk space due to pg_wal (Write-Ahead Logging).
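To check the replication state and the WAL disk usage before taking corrective action, you can run commands similar to the following against the primary PG pod. This is a minimal sketch: the pod name postgres-bmc-pg-ha-15-0, the postgres superuser, and the /data/pg_wal path are taken from the examples in this topic and might differ in your deployment; if the pod runs more than one container, add -c <database-container> to the exec commands.
kubectl -n <namespace> exec -it postgres-bmc-pg-ha-15-0 -- psql -U postgres -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"
kubectl -n <namespace> exec -it postgres-bmc-pg-ha-15-0 -- du -sh /data/pg_wal
A secondary instance whose state is not streaming, or that is missing from the pg_stat_replication output, indicates that replication for that instance is not active.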
Error in PostgreSQL pods when the connection to the server fails
The following error is displayed in the PostgreSQL logs:
pg_basebackup: error: connection to server at "xx.xx.x.xxx", port 5432 failed: Connection timed out
Example:
pg_basebackup: error: connection to server at "10.42.76.7", port 5432 failed: Connection timed out
Scope
The error occurs when the connection to the PostgreSQL server times out, typically because the node that hosts the pod with the reported IP is not reachable.
Workaround
You must move the affected PostgreSQL (PG) pod from its assigned node, which is not reachable, to another node.
Perform the following steps:
To identify the pod to which the IP in the error message belongs and the node on which the pod is running, run the following command:
kubectl -n <namespace> get po -o wide | grep postgres-bmc-pg-ha-15- | grep -v pool
Sample output:
#kubectl -n helix get po -o wide | grep postgres-bmc-pg-ha-15- | grep -v pool
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
postgres-bmc-pg-ha-15-0 2/2 Running 0 3d19h 10.42.76.7 vl-aus-domqa353 <none> <none>
postgres-bmc-pg-ha-15-1 2/2 Running 0 3d19h 10.42.174.154 vl-aus-domqa371 <none> <none>
postgres-bmc-pg-ha-15-2 2/2 Running 0 3d19h 10.42.3.161 vl-aus-domqa369 <none> <none>
For example, the IP 10.42.76.7 belongs to the pod postgres-bmc-pg-ha-15-0, which runs on the node vl-aus-domqa353.
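Alternatively, you can look up the pod directly by its IP instead of scanning the list, assuming your Kubernetes version supports the status.podIP field selector for pods:
kubectl -n <namespace> get po -o wide --field-selector status.podIP=<IP from the error message>
Example:
kubectl -n helix get po -o wide --field-selector status.podIP=10.42.76.7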
To make sure that the IP in the error message is reachable from another PG pod, run the following command from one of the other PG pods (for example, postgres-bmc-pg-ha-15-1):
curl <IP from the error message>:5432
You will not receive any response if the IP in the error message is unreachable.
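If you are not already logged in to the other PG pod, you can run the same check through kubectl exec. This is a sketch that assumes the other PG pod is postgres-bmc-pg-ha-15-1 and that curl is available in its image; add -c <container> if the pod runs more than one container:
kubectl -n <namespace> exec -it postgres-bmc-pg-ha-15-1 -- curl <IP from the error message>:5432
Example:
kubectl -n helix exec -it postgres-bmc-pg-ha-15-1 -- curl 10.42.76.7:5432
If the port is reachable, curl usually fails quickly with a message such as "Empty reply from server" because PostgreSQL does not speak HTTP; a long hang that ends in a timeout indicates that the IP is unreachable.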
If the IP is unreachable, you must move the pod that you identified in Step 1 to another node:
To cordon the node (vl-aus-domqa353) where the affected pod (postgres-bmc-pg-ha-15-0) is running, run the following command:
kubectl -n <namespace> cordon <node>
Example:
kubectl -n helix cordon vl-aus-domqa353
To delete the pod, run the following command:
kubectl -n <namespace> delete po <pod-name>
Example:
kubectl -n helix delete po postgres-bmc-pg-ha-15-0
To confirm the deleted pod is recreated and running on another node, run the following command:
kubectl -n <namespace> get po -o wide | grep postgres-bmc-pg-ha-15- | grep -v pool
Example:
#kubectl -n helix get po -o wide | grep postgres-bmc-pg-ha-15- | grep -v pool
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
postgres-bmc-pg-ha-15-0 2/2 Running 0 15s 10.42.76.7 vl-aus-domqa384 <none> <none>
postgres-bmc-pg-ha-15-1 2/2 Running 0 3d19h 10.42.174.154 vl-aus-domqa371 <none> <none>
postgres-bmc-pg-ha-15-2 2/2 Running 0 3d19h 10.42.3.161 vl-aus-domqa369 <none> <none>
Note that the pod postgres-bmc-pg-ha-15-0 is running on another node, vl-aus-domqa384.
To uncordon the node (for example, vl-aus-domqa353), run the following command:
kubectl -n <namespace> uncordon <node>
Example:
kubectl -n helix uncordon vl-aus-domqa353
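If you expect to repeat this procedure, you can combine the cordon, delete, and uncordon steps into a small script. This is a sketch under the example values used in this topic (namespace helix, pod postgres-bmc-pg-ha-15-0, node vl-aus-domqa353); adjust them for your environment.
NS=helix
POD=postgres-bmc-pg-ha-15-0
NODE=vl-aus-domqa353
kubectl -n "$NS" cordon "$NODE"
kubectl -n "$NS" delete po "$POD"
# Wait until the recreated pod is Ready on its new node before uncordoning.
kubectl -n "$NS" wait --for=condition=Ready pod/"$POD" --timeout=300s
kubectl -n "$NS" get po "$POD" -o wide
kubectl -n "$NS" uncordon "$NODE"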
The postgres-replication-monitor pod is continuously in the CrashLoopBackOff state
Scope
The postgres-replication-monitor pod monitors the replication of the PostgreSQL pods. If the postgres-replication-monitor pod detects issues with any one of the replicas, it tries to remediate the issue. If both replicas have issues, or if the postgres-replication-monitor pod is unable to fix the issue, it gets into the CrashLoopBackOff state.
Resolution
Contact the BMC Support team.
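Before you open the case, you can collect the status and recent logs of the postgres-replication-monitor pod so that they are available to the support team. This is a sketch that assumes the pod name starts with postgres-replication-monitor; replace <postgres-replication-monitor-pod> with the name returned by the first command:
kubectl -n <namespace> get po | grep postgres-replication-monitor
kubectl -n <namespace> describe po <postgres-replication-monitor-pod>
kubectl -n <namespace> logs <postgres-replication-monitor-pod> --previous
The --previous flag returns the logs of the last terminated container, which is usually the relevant output for a pod in the CrashLoopBackOff state.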