Deploying LLM
You can deploy the BMC-provided LLM by using one of the following methods:
- Using the LLM library (UI)—This is the recommended method. For more information, see LLM library.
- Manual deployment—Use this method only if deployment through the LLM library isn't possible. Manual deployment involves multiple manual steps.
If you have an existing LLM that you want to integrate into BMC AMI Platform, then use the following method:
- Bring your own LLM (BYOLLM): Integrate a large language model (LLM) hosted in your environment into BMC AMI Platform, whether self-hosted via vLLM or provided through the OpenAI service. For more information, see Bring your own LLM.
Before deploying an LLM, you must configure kubectl access and the environment for the RKE2 Kubernetes cluster.
To configure kubectl access and the environment for the RKE2 Kubernetes cluster
Ensure that you are logged in to the Kubernetes master node as the root user and run the following commands in sequence. The shell profile (~/.bashrc) and SSH environment file (~/.ssh/environment) used as targets are assumed defaults; adjust them to match your environment.
# Append the kubectl environment variables to the shell profile (assumed target: ~/.bashrc)
cat <<'EOF' >> ~/.bashrc
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH=$PATH:/var/lib/rancher/rke2/bin
EOF
# Allow SSH sessions to read per-user environment files (assumed check on the existing setting)
if grep -q '^PermitUserEnvironment' /etc/ssh/sshd_config; then
  sed -i 's/^PermitUserEnvironment.*/PermitUserEnvironment yes/' /etc/ssh/sshd_config
else
  echo 'PermitUserEnvironment yes' >> /etc/ssh/sshd_config
fi
# Make the same variables available to non-interactive SSH sessions (assumed target: ~/.ssh/environment)
cat <<'EOF' > ~/.ssh/environment
KUBECONFIG=/etc/rancher/rke2/rke2.yaml
PATH=/var/lib/rancher/rke2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
EOF
To verify the configuration
To verify the configuration is successful, run the following command.
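A minimal sketch of such a check, assuming the goal is to confirm that kubectl can reach the RKE2 cluster from a new shell session:
# List the cluster nodes; output depends on your node names
kubectl get nodes
If the command returns the cluster nodes in a Ready state, the kubeconfig and PATH settings are in effect for new sessions.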
Deploying LLM manually
The platform supports three LLMs that can be deployed independently:
- Llama 3.1
- Mixtral
- Granite 3.1
You can also deploy a second LLM service on BMC AMI Platform. For more information, see To add a second LLM deployment.
To deploy the Llama model on Kubernetes
- Log in to the primary manager node of Kubernetes and access /<extracted_dir>/BMC-AMI-PLATFORM-2.0.00.
- Verify that the scripts/llama.sh file is present.
- Run the llama.sh file and provide the following input (an example invocation follows this list):
- GPU node: Select the GPU from the list.
- CPU KVCache space: Specify the memory allocated for the LLM KV cache. Higher values allow more parallel requests. For more information, see CPU KVCache value.
- Number of GPUs: You can find this value by running the following command on the GPU node:
nvidia-smi --query-gpu=name --format=csv,noheader | wc -l
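For example, an invocation might look like the following; the directory layout matches the steps above, and the prompts are paraphrased rather than quoted from the script:
# Run the deployment script from the extracted platform directory
cd /<extracted_dir>/BMC-AMI-PLATFORM-2.0.00
# The script prompts for the GPU node, CPU KVCache space, and number of GPUs
./scripts/llama.sh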
To verify the deployment
Verify that the services and pods are running successfully under the bmcami-prod-amiai-services namespace by using the following commands:
kubectl get pods --namespace bmcami-prod-amiai-services
NAME READY STATUS RESTARTS AGE
assistant-c6dd4bb6b-4xbkc 1/1 Running 0 22h
discovery-7c68bcd776-chnz6 1/1 Running 1 (22h ago) 22h
docs-expert-5957c5d845-cn2hg 1/1 Running 0 22h
download-embeddings-qfpgl 0/1 Completed 0 22h
download-expert-model-nw9km 0/1 Completed 0 22h
download-llama-model-582bh 0/1 Completed 0 4h53m
gateway-775f4476d9-h6jk6 1/1 Running 0 5h10m
llama-gpu-6f8c675c4b-j7vhm 1/1 Running 0 4h53m
load-embeddings-4pg6r 1/1 Running 0 22h
platform-75c4997dc5-fk8fq 1/1 Running 0 22h
security-65c8c568db-gqsks 1/1 Running 0 22h
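You can also list the services; this assumes the Llama service is created under the same bmcami-prod-amiai-services namespace:
kubectl get services --namespace bmcami-prod-amiai-services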
To deploy the Mixtral model on Kubernetes
- Log in to the primary manager node of Kubernetes and access /<extracted_dir>/BMC-AMI-PLATFORM-2.0.00.
- Verify that the scripts/mixtral.sh file is present.
- Run the mixtral.sh file and provide the following input:
- GPU node: Select the GPU from the list.
- CPU KVCache space: Specify the memory allocated for the LLM KV cache. Higher values allow more parallel requests. For more information, see CPU KVCache value.
- Number of GPUs: You can find this value by running the following command on the GPU node:
nvidia-smi --query-gpu=name --format=csv,noheader | wc -l
To verify the deployment
Verify that the service and pod are running successfully under the namespace by using the following command:
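A minimal check, assuming the Mixtral pod and service are created in the same bmcami-prod-amiai-services namespace as the other platform services:
# Assumed namespace; adjust if your Mixtral deployment uses a different one
kubectl get pods --namespace bmcami-prod-amiai-services
kubectl get services --namespace bmcami-prod-amiai-services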
To deploy the Granite model on Kubernetes
- Log in to the primary manager node of Kubernetes and access /<extracted_dir>/BMC-AMI-PLATFORM-2.0.00.
- Verify that the scripts/granite.sh file is present.
- Run the granite.sh file and provide the following input:
- GPU node: Select the GPU from the list.
- CPU KVCache space: Specify the memory allocated for the LLM KV cache. Higher values allow more parallel requests. For more information, see CPU KVCache value.
- Number of GPUs: You can find this value by running the following command on the GPU node:
nvidia-smi --query-gpu=name --format=csv,noheader | wc -l
To verify the deployment
Verify that the service and pod are running successfully under the namespace by using the following command:
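A minimal check, assuming the Granite pod and service are created in the same bmcami-prod-amiai-services namespace as the other platform services:
# Assumed namespace; adjust if your Granite deployment uses a different one
kubectl get pods --namespace bmcami-prod-amiai-services
kubectl get services --namespace bmcami-prod-amiai-services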
To add a second LLM deployment
The following instructions describe how to deploy a second LLM service on BMC AMI Platform.
Before you begin
- Make sure your machine meets the minimum system requirements for LLM deployment. For more information, see LLM GPU.
- Make sure the NVIDIA GPU operator is installed. For more information, see To manually install the NVIDIA GPU operator.
To add a new node to the cluster
- Add your node to the Kubernetes cluster (see the sketch after this list).
- Mount the existing NFS on the newly created node.
- Verify the node exists in the Kubernetes cluster by running the following command: kubectl get nodes
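A minimal sketch of the first two items, assuming the new node joins the cluster as an RKE2 agent; the values in angle brackets are placeholders for your environment:
# On the new node (assumes the rke2-agent binary is already installed)
mkdir -p /etc/rancher/rke2
cat <<'EOF' > /etc/rancher/rke2/config.yaml
server: https://<primary-manager-node>:9345
token: <cluster-join-token>
EOF
systemctl enable --now rke2-agent.service

# Mount the existing NFS share on the new node (server, export, and mount point are placeholders)
mount -t nfs <nfs-server>:<nfs-export-path> <nfs-mount-point>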
To add a label to the node
Label the node using the following command:
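The specific label key and value depend on your deployment; the generic form of the command, with placeholder values, is:
# <node-name>, <label-key>, and <label-value> are placeholders for your environment
kubectl label node <node-name> <label-key>=<label-value>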
To run the script file to deploy the model
Navigate to the Helm chart directory and run the appropriate script:
./llama.sh # For Llama 3.1
./granite.sh # For Granite 3.1
./mixtral.sh # For Mixtral
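After the script completes, a follow-up check is to confirm that the new model pod is scheduled on the labeled node; the namespace is assumed to match the other platform services:
# The NODE column shows where each pod is scheduled
kubectl get pods -o wide --namespace bmcami-prod-amiai-services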
Where to go from here
After you deploy the LLM, you can proceed to the following topics: