Troubleshooting
This section helps you identify and resolve common issues in BMC AMI Platform.
| Issue | Solution |
|---|---|
| Unable to get a response when the code exceeds 400 lines | This issue occurs because the engine was built without the max_num_tokens parameter, which defaults to a lower value. To resolve it, rebuild the engine with the parameter added: <br>`trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} --gemm_plugin float16 --output_dir ${ENGINE_PATH} --max_num_tokens 32768`<br>Here, 32768 is the context length for Mixtral, which is deployed on the Triton server through the BYOLLM feature. |
| The load-embeddings job is in a failed state. To check the job status, run:<br>`kubectl get jobs -n bmcami-prod-amiai-services` | Run the following commands to restart the job:<br>`kubectl delete job load-embeddings -n bmcami-prod-amiai-services`<br>`helm upgrade amiai-services /<extracted_dir>/BMC-AMI-PLATFORM-2.0.00/helm_charts/07-helm-amiai-chart/ --namespace bmcami-prod-amiai-services --reuse-values` |
| You get the following error in BMC AMI Assistant chat: "The Assistant service is currently unavailable. Try again. If the problem persists, contact your BMC AMI Platform administrator." | Check the AI services health. |
| During login, the Uptrace UI displays the following error message: "READONLY You can't write against a read only replica" | |
| The CES instance isn't launching from BMC AMI Platform. | Make sure that your CES host is running and is accessible via HTTPS. Verify the host connectivity by clicking Test connection. |
| You can't add a CES instance. | |
| Authentication fails when you're adding a CES instance. | Confirm that the CES credentials are correct. For CES versions earlier than 23.04.06, you must enter credentials each time. |
| The CES version is not displayed during the setup. | Make sure that the CES instance is running and accessible. The version number appears only after successful authentication. |
| The CES instance is displayed as unavailable. | Click Test connection to check the host status. If required, restart the CES host. |
| An HTTPS requirement error has occurred. | CES must use HTTPS for integration with BMC AMI Platform. Update the CES configuration to enable HTTPS. |
| Credentials are not saved for future access. | Upgrade to CES version 23.04.06 or 24.05.01 to enable credential storage in the BMC AMI Platform database. BMC AMI Platform natively supports CES versions 23.04.06 and later modification levels within the 23.04 release, as well as 24.05.01 and later. When you add a CES instance by using these versions, the credentials you enter are securely stored in the database and automatically reused for future access. BMC AMI Platform also supports CES version 20.15.03 or later, but you must enter your credentials each time you access CES from BMC AMI Platform. |
| The deployment completed, but the container images could not be pulled because they do not exist in the repository. | Remove the deployment by using the teardown script that follows this table, and then run the deployment again. |
| A shared vLLM instance is unstable. When a vLLM instance is shared across multiple applications, it may return errors such as "Error: Server disconnected without sending a response". | This is typically caused by resource contention or high load. |

Teardown script:

```bash
#!/bin/bash
# Kubernetes namespace cleanup script
# This script cleans up Helm releases and all resources in specified namespaces

# Define namespaces
namespaces="bmcami-prod-user-management bmcami-prod-amiai-services bmcami-prod-data-service bmcami-prod-observability"

echo "===== Deleting Helm releases (if any) ====="
for ns in $namespaces; do
  echo "Namespace: $ns"
  releases=$(helm list -n "$ns" --short)
  if [ -n "$releases" ]; then
    echo "$releases" | xargs -r -I{} helm uninstall {} -n "$ns"
  else
    echo "No Helm releases found in namespace $ns"
  fi
done

echo "===== Deleting all resources in those namespaces ====="
for ns in $namespaces; do
  echo "Cleaning namespace: $ns"
  kubectl delete all --all -n "$ns" --ignore-not-found=true
  kubectl delete pvc --all -n "$ns" --ignore-not-found=true
  kubectl delete secret --all -n "$ns" --ignore-not-found=true
  kubectl delete configmap --all -n "$ns" --ignore-not-found=true
  kubectl delete rolebinding,role,serviceaccount,networkpolicy,ingress -n "$ns" --all --ignore-not-found=true
done

echo "===== Deleting the namespaces ====="
kubectl get namespace --no-headers -o custom-columns=:metadata.name | grep -E "bmcami-prod-(user-management|amiai-services|data-service|observability)" | xargs -r kubectl delete namespace

echo "===== Deleting related PVs ====="
kubectl get pv --no-headers -o custom-columns=:metadata.name | grep -E "bmcami-prod-(user-management|amiai-services|data-service|observability)" | xargs -r kubectl delete pv

echo "===== Cleaning up NFS storage ====="
if [ -d "/mnt/nfs" ]; then
  echo "Deleting all contents in /mnt/nfs/"
  sudo rm -rf /mnt/nfs/*
  if [ $? -eq 0 ]; then
    echo "NFS storage cleaned successfully"
  else
    echo "Failed to clean NFS storage"
  fi
else
  echo "NFS directory /mnt/nfs/ does not exist"
fi

echo "===== Verifying cleanup ====="
for ns in $namespaces; do
  echo "Checking $ns..."
  if kubectl get namespace "$ns" &>/dev/null; then
    echo "Namespace $ns still exists"
    kubectl get all -n "$ns" 2>/dev/null || echo "No resources found in namespace $ns"
  else
    echo "Namespace $ns fully removed."
  fi
done

remaining_pvs=$(kubectl get pv --no-headers -o custom-columns=:metadata.name 2>/dev/null | grep -E "bmcami-prod-(user-management|amiai-services|data-service|observability)" || true)
if [ -n "$remaining_pvs" ]; then
  echo "Remaining PVs:"
  echo "$remaining_pvs"
else
  echo "No PVs remaining."
fi

echo "===== Cleanup completed ====="
rm -rf /mnt/nfs
```
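Several of the checks above come down to inspecting `kubectl get jobs` output for jobs that never completed. As a minimal sketch (the `incomplete_jobs` helper is not part of the product; it only assumes the default `NAME  COMPLETIONS  DURATION  AGE` column layout of `kubectl get jobs`), a shell function can filter that output down to the jobs that need attention:

```shell
# Hypothetical helper: read `kubectl get jobs` output on stdin and
# print the name of every job whose COMPLETIONS column (e.g. 0/1)
# shows fewer successful runs than desired.
incomplete_jobs() {
  awk 'NR > 1 {                      # skip the header line
         split($2, c, "/")           # COMPLETIONS is "done/desired"
         if (c[1] != c[2]) print $1  # report jobs not fully complete
       }'
}
```

For example, `kubectl get jobs -n bmcami-prod-amiai-services | incomplete_jobs` would print `load-embeddings` if that job has not completed, at which point the restart steps in the table apply.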
BMC AMI AI knowledge hub and OCR service troubleshooting
This table helps you identify and resolve common issues in the BMC AMI AI knowledge hub and OCR service.
| Issue | Resolution |
|---|---|
| OCR service crashes or restarts when processing large or image-heavy PDFs | The OCR service requires significant memory, and large files or high-resolution images can exceed the configured limits. Administrator—Increase the resources.limits.memory and resources.requests.memory values for the OCR service in the deployment configuration, for example, in the Helm chart. End user—Compress the PDF to reduce image resolution and file size, and then upload it again. |
| OCR requests fail with a 504 Gateway Timeout error | The OCR process exceeds the configured timeout, typically because of document size or complexity. Administrator—Increase the OCR_SERVICE_TIMEOUT_S value (default is 7200 seconds) in the environment configuration. End user—Split the document into smaller files or fewer pages, then try again. |
| Asset publication fails because the document is password-protected or encrypted | The system cannot extract text from secured files. End user—Open the file in its native application, for example, Acrobat or Word. Remove the password or encryption, save the file, and upload it again. |
| AAPKNW027E—Unable to update records in the database during publication | The system cannot communicate with the vector search index (Milvus). Administrator—Verify that the vector database service is running and reachable. Check the host, port, and credentials, and make sure the database has sufficient disk space and memory. |
| Asset upload fails with "Unrecognized file type" even though the extension is correct | The system cannot detect a valid MIME type, or the file content does not match the extension. End user—Re-export the file from the original application to make sure it is not corrupted. Make sure the file format matches the extension. |
| Asset is locked because another user is working with it | The asset is in use. For example, publishing, unpublishing, or cancelling. End user—Refresh the asset table and wait for the operation to complete. If the issue persists, contact your administrator. Administrator—Check the workflow status in the logs and verify that the process completes successfully. |
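When raising the OCR service memory values described above, the request must stay at or below the limit, or Kubernetes rejects the configuration. As a quick sanity check, a hypothetical shell helper (not part of BMC AMI Platform; the function names and the "OK" output are illustrative) can convert Kubernetes memory quantities to bytes and compare them before you apply the change:

```shell
# Hypothetical helper: convert a Kubernetes memory quantity
# (e.g. 512Mi, 8Gi, or plain bytes) to a byte count.
mem_to_bytes() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$1" ;;  # assume the value is already in bytes
  esac
}

# Compare resources.requests.memory against resources.limits.memory
# and fail fast if the request exceeds the limit.
check_mem() {
  req=$(mem_to_bytes "$1")
  lim=$(mem_to_bytes "$2")
  [ "$req" -le "$lim" ] && echo "OK" || echo "request exceeds limit"
}
```

For example, `check_mem 4Gi 8Gi` prints `OK`, while `check_mem 8Gi 4Gi` flags the misconfiguration before you run helm upgrade.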