Troubleshooting PostgreSQL issues


Use the information in this topic to troubleshoot the following PostgreSQL issues:


PostgreSQL is running out of disk space and applications fail to connect to the PostgreSQL server

Scope

This issue occurs when the pg_wal directory, mounted at /data/pg_wal, consumes all the available space in the PostgreSQL (PG) PVC.

Workaround

To resolve the issue, make sure that the replication state is active for the secondary PostgreSQL instances, check the PostgreSQL logs, and take corrective action.
For more information, see PostgreSQL is running out of disk space due to pg_wal (Write-Ahead Logging).
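
For example, to get a quick view of how much space pg_wal is consuming and whether replication is active, you can run commands similar to the following from a PG pod. This is a minimal sketch based on the pod names and data path shown in this topic; the container name (postgres) and database user (postgres) are assumptions that might differ in your deployment:

    # Check how much space the WAL directory is using inside the pod
    kubectl -n <namespace> exec -it postgres-bmc-pg-ha-15-0 -c postgres -- du -sh /data/pg_wal

    # On the primary, list the replication state of the secondary instances
    kubectl -n <namespace> exec -it postgres-bmc-pg-ha-15-0 -c postgres -- psql -U postgres -c "SELECT client_addr, state FROM pg_stat_replication;"

A healthy secondary typically shows the state streaming in the pg_stat_replication output.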


Error in PostgreSQL pods when the connection to the server fails

The following error is displayed in the PostgreSQL logs:

pg_basebackup: error: connection to server at "xx.xx.x.xxx", port 5432 failed: Connection timed out

Example:

pg_basebackup: error: connection to server at "10.42.76.7", port 5432 failed: Connection timed out

Scope

The error occurs when the connection to the PostgreSQL server times out, typically because the node that hosts the pod with that IP is not reachable.

Workaround

You must move the affected PostgreSQL pod (PG pod), which is running on an unreachable node, to another node.

Perform the following steps:

  1. To identify the pod to which the IP in the error message belongs and the node on which the pod is running, run the following command:

    kubectl -n <namespace> get po -o wide | grep postgres-bmc-pg-ha-15- | grep -v pool

    Sample output:

    #kubectl -n helix  get po -o wide  | grep postgres-bmc-pg-ha-15- | grep -v pool
    NAME                            READY   STATUS      RESTARTS         AGE     IP              NODE              NOMINATED NODE   READINESS GATES
    postgres-bmc-pg-ha-15-0         2/2     Running     0                3d19h   10.42.76.7      vl-aus-domqa353   <none>           <none>
    postgres-bmc-pg-ha-15-1         2/2     Running     0                3d19h   10.42.174.154   vl-aus-domqa371   <none>           <none>
    postgres-bmc-pg-ha-15-2         2/2     Running     0                3d19h   10.42.3.161     vl-aus-domqa369   <none>           <none>

    For example, the IP 10.42.76.7 belongs to the pod postgres-bmc-pg-ha-15-0 and runs on the node vl-aus-domqa353.

  2. To check whether the IP in the error message is reachable, run the following command from a PG pod other than the one identified in Step 1 (for example, postgres-bmc-pg-ha-15-1); see the example after these steps for running the command from inside a pod:

    curl <IP from the error message>:5432

    If the IP in the error message is unreachable, you do not receive a response. If it is reachable, you receive a response.

  3. If the IP is unreachable, you must move the pod you identified in Step 1 to another node:
    1. To cordon a node (vl-aus-domqa353) where the affected pod (postgres-bmc-pg-ha-15-0) is running, run the following command:

      kubectl -n <namespace> cordon <node>

      Example:

      kubectl -n helix cordon vl-aus-domqa353
    2. To delete the pod, run the following command:

      kubectl -n <namespace> delete po <pod-name>

      Example:

      kubectl -n helix delete po postgres-bmc-pg-ha-15-0
    3. To confirm the deleted pod is recreated and running on another node, run the following command:

      kubectl -n <namespace> get po -o wide | grep postgres-bmc-pg-ha-15- | grep -v pool

      Example:

      #kubectl -n helix  get po -o wide  | grep postgres-bmc-pg-ha-15- | grep -v pool
      NAME                            READY   STATUS      RESTARTS         AGE     IP              NODE              NOMINATED NODE   READINESS GATES
      postgres-bmc-pg-ha-15-0         2/2     Running     0                15s     10.42.76.7      vl-aus-domqa384   <none>           <none>
      postgres-bmc-pg-ha-15-1         2/2     Running     0                3d19h   10.42.174.154   vl-aus-domqa371   <none>           <none>
      postgres-bmc-pg-ha-15-2         2/2     Running     0                3d19h   10.42.3.161     vl-aus-domqa369   <none>           <none>

      Note that the pod postgres-bmc-pg-ha-15-0 is running on another node, vl-aus-domqa384.

    4. To uncordon the node (vl-aus-domqa353), run the following command:

       kubectl -n <namespace> uncordon <node>

      Example:

      kubectl -n helix uncordon vl-aus-domqa353
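
To run the reachability check in Step 2 from inside another PG pod, you can use kubectl exec. This is a sketch; the container name (postgres) is an assumption and might differ in your deployment:

    kubectl -n <namespace> exec -it <another-pg-pod> -c postgres -- curl <IP from the error message>:5432

    Example:

    kubectl -n helix exec -it postgres-bmc-pg-ha-15-1 -c postgres -- curl 10.42.76.7:5432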

The postgres-replication-monitor pod is continuously in the CrashLoopBackOff state

Scope

The postgres-replication-monitor pod monitors the replication of the PostgreSQL pods. If it detects an issue with one of the replicas, it tries to remediate the issue. If both replicas have issues, or if the pod cannot fix the issue, it enters the CrashLoopBackOff state.

Resolution

Contact the BMC Support team.
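
Before you contact support, you can capture the status and recent logs of the postgres-replication-monitor pod to share with the support team. A minimal sketch; the --previous flag shows logs from the previously crashed container instance:

    kubectl -n <namespace> get po | grep postgres-replication-monitor

    kubectl -n <namespace> logs <postgres-replication-monitor-pod-name> --previous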





 
