Key concepts


Kubernetes

Kubernetes clusters provide the control plane and worker infrastructure that schedule, run, and heal containerized workloads. Control-plane nodes host the API server, scheduler, controller manager, and etcd store; worker nodes run the application pods via kubelet and kube-proxy. Together, they expose a uniform interface for deploying services, balancing traffic, scaling on demand, and recovering automatically when nodes or pods fail.
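
Both node types surface through the same API server. As a minimal illustration, the following Python sketch uses the official Kubernetes client (an assumption; kubectl get nodes reports the same information) to list each node's role and readiness:

  from kubernetes import client, config

  # Load credentials from the local kubeconfig; inside a pod, use
  # config.load_incluster_config() instead.
  config.load_kube_config()

  core = client.CoreV1Api()
  for node in core.list_node().items:
      # Control-plane nodes carry the node-role.kubernetes.io/control-plane label.
      labels = node.metadata.labels or {}
      role = "control-plane" if "node-role.kubernetes.io/control-plane" in labels else "worker"
      ready = next((c.status for c in node.status.conditions if c.type == "Ready"), "Unknown")
      print(f"{node.metadata.name}: role={role}, Ready={ready}")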

Core objectives

  • Standardize how services run across environments.
  • Scale quickly while keeping deployments reproducible.
  • Enable controlled rollouts and fast recovery.

Key Kubernetes primitives

  • Deployment: declares the desired app version, pod count, and rollback strategy (see the sketch after this list).
  • ReplicaSet: guarantees the Deployment keeps the correct number of identical pods running.
  • Pod: wraps one or more containers with shared networking and storage.
  • Service: exposes pods through a stable DNS/IP, with built-in load balancing.
  • ConfigMap / Secret: externalize configuration and sensitive values.
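
As a minimal sketch of how these primitives fit together, the following snippet uses the official Kubernetes Python client to create a Deployment and expose it through a Service. The demo-app name, image, and core namespace are hypothetical:

  from kubernetes import client, config

  config.load_kube_config()
  app_labels = {"app": "demo-app"}  # hypothetical application name

  # Deployment: desired image, replica count, and pod template.
  deployment = client.V1Deployment(
      metadata=client.V1ObjectMeta(name="demo-app", namespace="core"),
      spec=client.V1DeploymentSpec(
          replicas=3,
          selector=client.V1LabelSelector(match_labels=app_labels),
          template=client.V1PodTemplateSpec(
              metadata=client.V1ObjectMeta(labels=app_labels),
              spec=client.V1PodSpec(containers=[
                  client.V1Container(
                      name="demo-app",
                      image="registry.example.com/demo-app:1.0",  # hypothetical image
                      ports=[client.V1ContainerPort(container_port=8080)],
                  )
              ]),
          ),
      ),
  )
  client.AppsV1Api().create_namespaced_deployment(namespace="core", body=deployment)

  # Service: stable DNS name/IP that load-balances across the pods.
  service = client.V1Service(
      metadata=client.V1ObjectMeta(name="demo-app", namespace="core"),
      spec=client.V1ServiceSpec(
          selector=app_labels,
          ports=[client.V1ServicePort(port=80, target_port=8080)],
      ),
  )
  client.CoreV1Api().create_namespaced_service(namespace="core", body=service)

The Deployment creates and owns a ReplicaSet behind the scenes, which in turn maintains the three identical pods.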

Cluster strategy

  • Control vs. compute nodes: Control-plane nodes run etcd, the API server, scheduler, and controllers; worker nodes run business workloads managed by kubelet.
  • Namespace isolation: Group workloads by service group (for example, AI, Core, DataService, and observability) to simplify quota and security policies.
  • Scheduling and availability: Rely on node labels/taints and pod anti-affinity to spread replicas, with the Horizontal Pod Autoscaler for demand spikes and PodDisruptionBudgets to protect availability (a scaling sketch follows this list).
  • Cluster health: Monitor control-plane latency, node readiness, resource saturation (CPU/memory/disk), and kube-system pods to catch degradation early.
  • Takeaway: Kubernetes provides consistent packaging plus automated deployment mechanics. Our cluster strategy keeps workloads isolated but portable, makes rollouts safe, and ensures that we can scale and recover quickly.
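
As an illustration of the scaling and availability mechanics named above, the following sketch attaches a HorizontalPodAutoscaler and a PodDisruptionBudget to the hypothetical demo-app Deployment from the previous example:

  from kubernetes import client, config

  config.load_kube_config()

  # HPA: scale demo-app between 3 and 10 replicas at ~70% average CPU.
  hpa = client.V1HorizontalPodAutoscaler(
      metadata=client.V1ObjectMeta(name="demo-app", namespace="core"),
      spec=client.V1HorizontalPodAutoscalerSpec(
          scale_target_ref=client.V1CrossVersionObjectReference(
              api_version="apps/v1", kind="Deployment", name="demo-app"
          ),
          min_replicas=3,
          max_replicas=10,
          target_cpu_utilization_percentage=70,
      ),
  )
  client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
      namespace="core", body=hpa
  )

  # PDB: keep at least 2 pods running during voluntary disruptions,
  # such as node drains.
  pdb = client.V1PodDisruptionBudget(
      metadata=client.V1ObjectMeta(name="demo-app", namespace="core"),
      spec=client.V1PodDisruptionBudgetSpec(
          min_available=2,
          selector=client.V1LabelSelector(match_labels={"app": "demo-app"}),
      ),
  )
  client.PolicyV1Api().create_namespaced_pod_disruption_budget(
      namespace="core", body=pdb
  )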

Large Language Models (LLMs)

BMC AMI Platform currently supports the following large language models (LLMs):

  • Mixtral 8x7B
  • Llama 3.1 8B 
  • Granite 3.1 8B

This topic describes the key capabilities and hardware support of these LLMs, as well as the test evaluation framework that we used to select them.

Mixtral 8x7B

Mixtral 8x7B is a state-of-the-art, instruction-tuned LLM developed by Mistral AI. The deployed model is quantized to 3-bit precision to reduce hardware requirements and improve computational efficiency. It employs a Sparse Mixture of Experts (SMoE) architecture, which activates only a subset of its parameters for each input, delivering high performance from a relatively compact compute footprint. Trained on diverse data, it provides broad understanding while delivering compact yet accurate responses across a variety of tasks.

Key capabilities

  • High performance: Outperforms many other open-source LLMs across various benchmarks
  • Efficiency: The SMoE architecture enables faster inference, making it more cost-effective
  • Context length: Supports up to 32,768 tokens, allowing for more complex and detailed interactions
  • Language support: Supports multiple languages, including English, French, Italian, German, and Spanish
  • Code generation: Excels in code generation tasks
  • Instruction following: Can be fine-tuned to follow instructions effectively

Benefits of the SMoE architecture

  • Increased capacity: Allows for a larger model size without sacrificing efficiency
  • Improved performance: Boosts performance on various tasks
  • Reduced cost: Uses only a subset of the model’s parameters for each input, reducing computational costs

Hardware support

GPU support: Optimized for running efficiently on GPU clusters, leveraging its architecture for faster and more resource-efficient inference. 
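
The following is a minimal inference sketch using the Hugging Face transformers library and the publicly available mistralai/Mixtral-8x7B-Instruct-v0.1 checkpoint. This is an illustration only: it loads the half-precision public weights rather than the platform's 3-bit build, the checkpoint is gated behind a license acceptance, and a multi-GPU setup is assumed:

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  # device_map="auto" shards the experts across the available GPUs.
  model = AutoModelForCausalLM.from_pretrained(
      model_id, torch_dtype=torch.float16, device_map="auto"
  )

  messages = [{"role": "user", "content": "Explain JCL return codes in three sentences."}]
  inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
  outputs = model.generate(inputs, max_new_tokens=200)
  # Decode only the newly generated tokens, not the echoed prompt.
  print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))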

Llama 3.1 8B

Meta Llama 3.1 8B is an eight-billion-parameter, instruction-tuned model optimized for question-answering tasks.

The model is pre-trained and fine-tuned for dialogue use cases, and it outperforms many open-source chat models on industry benchmarks. Llama 3.1 was developed to optimize both helpfulness and safety.

Key capabilities

  • Auto-Regressive model: Uses an optimized transformer architecture for efficient text generation
  • Instruction-tuned: Fine-tuned via Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to align with human preferences for helpfulness and safety
  • Context length: Handles up to 128k tokens, allowing for detailed interactions
  • Benchmark performance: Excels in dialogue and assistant-like chat tasks, outperforming other open-source models
  • Language support: Primarily optimized for English, but can be adapted for a variety of natural language generation tasks

Hardware support

GPU support: Optimized for GPUs, enabling faster training and inference.
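
A minimal sketch of dialogue use, assuming the model is served behind an OpenAI-compatible endpoint (for example, a vLLM server); the URL and served model name below are hypothetical:

  from openai import OpenAI

  # Hypothetical local endpoint; a self-hosted OpenAI-compatible server
  # does not require a real API key.
  llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

  response = llm.chat.completions.create(
      model="meta-llama/Llama-3.1-8B-Instruct",
      messages=[
          {"role": "system", "content": "You are a concise mainframe assistant."},
          {"role": "user", "content": "What does an S0C7 abend indicate?"},
      ],
      max_tokens=150,
  )
  print(response.choices[0].message.content)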

Granite 3.1 8B

Granite 3.1 8B Instruct is a powerful instruction-tuned LLM designed for a wide range of natural language tasks. It is built on the Granite 3.1 base model, a state-of-the-art LLM pre-trained on a massive dataset of text and code.

The Granite 3.1 8B Instruct model is further fine-tuned on a diverse set of instruction data to improve its ability to follow instructions, generate different kinds of creative content, and answer questions in an informative way.

Key capabilities

  • Instruction following: Highly adept at understanding and following instructions, from simple question answering to complex content generation
  • High performance: Delivers state-of-the-art performance on a variety of benchmarks, including question answering, text generation, and translation
  • Text generation: Generates high-quality text, including stories, poems, articles, and code
  • Translation: Translates text between multiple languages
  • Versatility: Applicable to a wide range of tasks, making it a valuable tool for businesses and researchers
  • Ease of use: Easy to use and accessible through a variety of APIs

Hardware support

GPU support: The model is optimized for running on GPUs, which can significantly accelerate its performance.
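
A minimal sketch of the model's instruction following and translation, using the transformers pipeline API and the public ibm-granite/granite-3.1-8b-instruct checkpoint (an assumption; the platform's own deployment may differ):

  from transformers import pipeline

  generator = pipeline(
      "text-generation",
      model="ibm-granite/granite-3.1-8b-instruct",
      torch_dtype="auto",
      device_map="auto",
  )

  messages = [{"role": "user", "content": "Translate to French: The nightly batch job completed successfully."}]
  result = generator(messages, max_new_tokens=60)
  # With chat-style input, generated_text holds the full conversation;
  # the last entry is the assistant's reply.
  print(result[0]["generated_text"][-1]["content"])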

Test evaluation framework

We tested multiple LLMs and selected models producing higher-quality outcomes based on evaluations from subject-matter experts.

Following the selection, we fine-tuned inference hyperparameters and applied effective prompt engineering to further improve the output quality. 

We evaluated all tests automatically by using an in-house test evaluation framework to benchmark and maintain result quality. Each response was assessed against the following groups of metrics:

  • Similarity metrics (an illustrative sketch follows this list)
  • Generation metrics
  • End-to-end metrics
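
The framework itself is internal, but as an illustration of the general shape of a similarity metric, the sketch below compares a model response against a subject-matter-expert reference answer and returns a score in [0, 1]. It uses Python's standard-library SequenceMatcher purely as an example and is not the framework's actual metric:

  from difflib import SequenceMatcher

  def similarity(generated: str, reference: str) -> float:
      # Character-level similarity ratio in [0, 1]; higher means the
      # response is closer to the reference answer.
      return SequenceMatcher(None, generated.lower(), reference.lower()).ratio()

  score = similarity(
      "The job failed with an S0C4 protection exception.",
      "The job abended with an S0C4 protection exception.",
  )
  print(f"similarity = {score:.2f}")  # near 1.0 for close matches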


