Troubleshooting
This section helps you identify and resolve common issues in BMC AMI Platform.
| Issue | Solution |
|---|---|
| Unable to get a response when the code exceeds 400 lines | This issue occurs because the engine was built without the max_num_tokens parameter, which defaults to a lower value. To resolve it, rebuild the engine with the parameter added: <br>`trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} --gemm_plugin float16 --output_dir ${ENGINE_PATH} --max_num_tokens 32768`<br>Here, 32768 is the context length for Mixtral, which is deployed on the Triton server through the BYOLLM feature. |
| The load-embeddings job is in a failed state. To check the job status, run:<br>`kubectl get jobs -n bmcami-prod-amiai-services` | Run the following commands to restart the job:<br>`kubectl delete job load-embeddings -n bmcami-prod-amiai-services`<br>`helm upgrade amiai-services /<extracted_dir>/BMC-AMI-PLATFORM-2.0.00/helm_charts/07-helm-amiai-chart/ --namespace bmcami-prod-amiai-services --reuse-values` |
| You get the following error in BMC AMI Assistant chat: "The Assistant service is currently unavailable. Try again. If the problem persists, contact your BMC AMI Platform administrator." | Check the AI services health. |
| During login, the Uptrace UI displays the following error message: "READONLY You can't write against a read only replica" | |
| The CES instance isn't launching from BMC AMI Platform. | Make sure that your CES host is running and is accessible via HTTPS. Verify the host connectivity by clicking Test connection. |
| You can't add a CES instance. | |
| Authentication fails when you're adding a CES instance. | Confirm that the CES credentials are correct. For CES versions earlier than 23.04.06, you must enter credentials each time. |
| The CES version is not displayed during the setup. | Make sure that the CES instance is running and accessible. The version number appears only after successful authentication. |
| The CES instance is displayed as unavailable. | Click Test connection to check the host status. If required, restart the CES host. |
| An HTTPS requirement error has occurred. | CES must use HTTPS for integration with BMC AMI Platform. Update the CES configuration to enable HTTPS. |
| Credentials are not saved for future access. | Upgrade to CES version 23.04.06 or 24.05.01 to enable credential storage in the BMC AMI Platform database. BMC AMI Platform natively supports CES versions 23.04.06 and later modification levels within the 23.04 release, as well as 24.05.01 and later. When you add a CES instance by using these versions, the credentials you enter are securely stored in the database and automatically reused for future access. BMC AMI Platform also supports CES version 20.15.03 or later, but you must enter your credentials each time you access CES from BMC AMI Platform. |
| The deployment completed, but the container images could not be pulled because they do not exist in the repository. | Remove the deployment by using the teardown script that follows this table, and then run the deployment again. |
| A shared vLLM instance is unstable. When a vLLM instance is shared across multiple applications, it may return errors such as "Error: Server disconnected without sending a response". | This is typically caused by resource contention or high load. |

Teardown script:

```bash
#!/bin/bash
# Kubernetes namespace cleanup script
# This script cleans up Helm releases and all resources in specified namespaces

# Define namespaces
namespaces="bmcami-prod-user-management bmcami-prod-amiai-services bmcami-prod-data-service bmcami-prod-observability"

echo "===== Deleting Helm releases (if any) ====="
for ns in $namespaces; do
  echo "Namespace: $ns"
  releases=$(helm list -n "$ns" --short)
  if [ -n "$releases" ]; then
    echo "$releases" | xargs -r -I{} helm uninstall {} -n "$ns"
  else
    echo "No Helm releases found in namespace $ns"
  fi
done

echo "===== Deleting all resources in those namespaces ====="
for ns in $namespaces; do
  echo "Cleaning namespace: $ns"
  kubectl delete all --all -n "$ns" --ignore-not-found=true
  kubectl delete pvc --all -n "$ns" --ignore-not-found=true
  kubectl delete secret --all -n "$ns" --ignore-not-found=true
  kubectl delete configmap --all -n "$ns" --ignore-not-found=true
  kubectl delete rolebinding,role,serviceaccount,networkpolicy,ingress -n "$ns" --all --ignore-not-found=true
done

echo "===== Deleting the namespaces ====="
kubectl get namespace --no-headers -o custom-columns=:metadata.name | grep -E "bmcami-prod-(user-management|amiai-services|data-service|observability)" | xargs -r kubectl delete namespace

echo "===== Deleting related PVs ====="
kubectl get pv --no-headers -o custom-columns=:metadata.name | grep -E "bmcami-prod-(user-management|amiai-services|data-service|observability)" | xargs -r kubectl delete pv

echo "===== Cleaning up NFS storage ====="
if [ -d "/mnt/nfs" ]; then
  echo "Deleting all contents in /mnt/nfs/"
  sudo rm -rf /mnt/nfs/*
  if [ $? -eq 0 ]; then
    echo "NFS storage cleaned successfully"
  else
    echo "Failed to clean NFS storage"
  fi
else
  echo "NFS directory /mnt/nfs/ does not exist"
fi

echo "===== Verifying cleanup ====="
for ns in $namespaces; do
  echo "Checking $ns..."
  if kubectl get namespace "$ns" &>/dev/null; then
    echo "Namespace $ns still exists"
    kubectl get all -n "$ns" 2>/dev/null || echo "No resources found in namespace $ns"
  else
    echo "Namespace $ns fully removed."
  fi
done

remaining_pvs=$(kubectl get pv --no-headers -o custom-columns=:metadata.name 2>/dev/null | grep -E "bmcami-prod-(user-management|amiai-services|data-service|observability)" || true)
if [ -n "$remaining_pvs" ]; then
  echo "Remaining PVs:"
  echo "$remaining_pvs"
else
  echo "No PVs remaining."
fi

echo "===== Cleanup completed ====="
rm -rf /mnt/nfs
```
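Several of the checks above come down to inspecting `kubectl get jobs` output for jobs that never completed. As a minimal sketch (the `incomplete_jobs` helper is not part of the product; it only assumes the default `NAME  COMPLETIONS  DURATION  AGE` column layout of `kubectl get jobs`), a shell function can filter that output down to the jobs that need attention:

```shell
# Hypothetical helper: read `kubectl get jobs` output on stdin and
# print the name of every job whose COMPLETIONS column (e.g. 0/1)
# shows fewer successful runs than desired.
incomplete_jobs() {
  awk 'NR > 1 {                      # skip the header line
         split($2, c, "/")           # COMPLETIONS is "done/desired"
         if (c[1] != c[2]) print $1  # report jobs not fully complete
       }'
}
```

For example, `kubectl get jobs -n bmcami-prod-amiai-services | incomplete_jobs` would print `load-embeddings` if that job has not completed, at which point the restart steps in the table apply.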
BMC AMI AI knowledge hub and OCR service troubleshooting
This table helps you identify and resolve common issues in the BMC AMI AI knowledge hub and OCR service.
| Issue | Resolution |
|---|---|
| OCR service crashes or restarts when processing large or image-heavy PDFs | The OCR service requires significant memory, and large files or high-resolution images can exceed the configured limits. Administrator—Increase the resources.limits.memory and resources.requests.memory values for the OCR service in the deployment configuration, for example, in the Helm chart. End user—Compress the PDF to reduce image resolution and file size, and then upload it again. |
| OCR requests fail with a 504 Gateway Timeout error | The OCR process exceeds the configured timeout, typically because of document size or complexity. Administrator—Increase the OCR_SERVICE_TIMEOUT_S value (default is 7200 seconds) in the environment configuration. End user—Split the document into smaller files or fewer pages, then try again. |
| Asset publication fails because the document is password-protected or encrypted | The system cannot extract text from secured files. End user—Open the file in its native application, for example, Acrobat or Word. Remove the password or encryption, save the file, and upload it again. |
| AAPKNW027E—Unable to update records in the database during publication | The system cannot communicate with the vector search index (Milvus). Administrator—Verify that the vector database service is running and reachable. Check the host, port, and credentials, and make sure the database has sufficient disk space and memory. |
| Asset upload fails with "Unrecognized file type" even though the extension is correct | The system cannot detect a valid MIME type, or the file content does not match the extension. End user—Re-export the file from the original application to make sure it is not corrupted. Make sure the file format matches the extension. |
| Asset is locked because another user is working with it | The asset is in use. For example, publishing, unpublishing, or cancelling. End user—Refresh the asset table and wait for the operation to complete. If the issue persists, contact your administrator. Administrator—Check the workflow status in the logs and verify that the process completes successfully. |
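When raising the OCR service memory values described above, the request must stay at or below the limit, or Kubernetes rejects the configuration. As a quick sanity check, a hypothetical shell helper (not part of BMC AMI Platform; the function names and the "OK" output are illustrative) can convert Kubernetes memory quantities to bytes and compare them before you apply the change:

```shell
# Hypothetical helper: convert a Kubernetes memory quantity
# (e.g. 512Mi, 8Gi, or plain bytes) to a byte count.
mem_to_bytes() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$1" ;;  # assume the value is already in bytes
  esac
}

# Compare resources.requests.memory against resources.limits.memory
# and fail fast if the request exceeds the limit.
check_mem() {
  req=$(mem_to_bytes "$1")
  lim=$(mem_to_bytes "$2")
  [ "$req" -le "$lim" ] && echo "OK" || echo "request exceeds limit"
}
```

For example, `check_mem 4Gi 8Gi` prints `OK`, while `check_mem 8Gi 4Gi` flags the misconfiguration before you run helm upgrade.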