SageMaker Variant (AWS_SAGEMAKER_VARIANT)
Attributes (parameters)
The following attributes are available for this monitor type:
Name | Description | Unit | Default Key Performance Indicator (KPI) |
---|---|---|---|
CPU Utilization (CPUUtilization) | The sum of each individual CPU core's utilization. The utilization of each core ranges from 0% to 100%. For example, if there are four CPUs, the CPUUtilization range is 0%–400%. For processing jobs, the value is the CPU utilization of the processing container on the instance. For endpoint variants, the value is the sum of the CPU utilization of the primary and supplementary containers on the instance. | % | No |
Disk Utilization (DiskUtilization) | The percentage of disk space used by the containers on an instance. The value range is 0%–100%. This metric is not supported for batch transform jobs. For endpoint variants, the value is the sum of the disk space utilization of the primary and supplementary containers on the instance. | % | No |
GPU Memory Utilization (GPUMemoryUtilization) | The percentage of GPU memory used by the containers on an instance. The value range is 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the GPUMemoryUtilization range is 0%–400%. For endpoint variants, the value is the sum of the GPU memory utilization of the primary and supplementary containers on the instance. | % | No |
GPU Utilization (GPUUtilization) | The percentage of GPU units that are used by the containers on an instance. The value range is 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the GPUUtilization range is 0%–400%. For endpoint variants, the value is the sum of the GPU utilization of the primary and supplementary containers on the instance. | % | No |
Memory Utilization (MemoryUtilization) | The percentage of memory that is used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the memory utilization of the primary and supplementary containers on the instance. | % | No |
Loaded Model Count (LoadedModelCount) | The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance. The models that this metric tracks are not necessarily unique because a model might be loaded in multiple containers at the endpoint. | # | No |
Model Cache Hit (ModelCacheHit) | The number of InvokeEndpoint requests sent to the multi-model endpoint for which the model was already loaded. | # | No |
Model Loading Time (ModelLoadingTime) | The interval of time that it took to load the model through the container's LoadModel API call. | ms | No |
Model Downloading Time (ModelDownloadingTime) | The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3). | ms | No |
Model Unloading Time (ModelUnloadingTime) | The interval of time that it took to unload the model through the container's UnloadModel API call. | ms | No |
Model Loading Wait Time (ModelLoadingWaitTime) | The interval of time that an invocation request has waited for the target model to be downloaded, loaded, or both in order to perform inference. | ms | No |
Model Setup Time (ModelSetupTime) | The time it takes to launch new compute resources for a serverless endpoint. The time can vary depending on the model size, how long it takes to download the model, and the start-up time of the container. | ms | No |
Overhead Latency (OverheadLatency) | The interval of time added to the time taken to respond to a client request by SageMaker overheads. This interval is measured from the time SageMaker receives the request until it returns a response to the client, minus the ModelLatency. Overhead latency can vary depending on multiple factors, including request and response payload sizes, request frequency, and authentication/authorization of the request. | ms | No |
Model Latency (ModelLatency) | The interval of time taken by a model to respond as viewed from SageMaker. This interval includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container. | ms | No |
Invocations Per Instance (InvocationsPerInstance) | The number of invocations sent to a model, normalized by InstanceCount in each ProductionVariant. 1/numberOfInstances is sent as the value on each request, where numberOfInstances is the number of active instances for the ProductionVariant behind the endpoint at the time of the request. | # | No |
Invocations (Invocations) | The number of InvokeEndpoint requests sent to a model endpoint. | # | No |
Invocation5XXErrors (Invocation5XXErrors) | The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. For each 5xx response, 1 is sent; otherwise, 0 is sent. | # | No |
Invocation4XXErrors (Invocation4XXErrors) | The number of InvokeEndpoint requests where the model returned a 4xx HTTP response code. For each 4xx response, 1 is sent; otherwise, 0 is sent. | # | No |
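As a sketch of how these attributes can be pulled directly from CloudWatch, the snippet below builds a `GetMetricStatistics` query for an endpoint variant. The endpoint and variant names are placeholders (assumptions, not values from this document). Note the namespace split: instance metrics such as CPUUtilization and MemoryUtilization are published under `/aws/sagemaker/Endpoints`, while invocation metrics such as Invocations and ModelLatency are published under `AWS/SageMaker`.

```python
import datetime

# Hypothetical endpoint/variant names -- replace with your own.
ENDPOINT_NAME = "my-endpoint"
VARIANT_NAME = "AllTraffic"


def build_metric_query(metric_name, namespace, period_seconds=300):
    """Build keyword arguments for CloudWatch get_metric_statistics.

    Instance metrics (CPUUtilization, MemoryUtilization, Disk/GPU metrics)
    live in the /aws/sagemaker/Endpoints namespace; invocation metrics
    (Invocations, ModelLatency, OverheadLatency, ...) live in AWS/SageMaker.
    """
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": ENDPOINT_NAME},
            {"Name": "VariantName", "Value": VARIANT_NAME},
        ],
        "StartTime": now - datetime.timedelta(hours=1),
        "EndTime": now,
        "Period": period_seconds,
        "Statistics": ["Average"],
    }


if __name__ == "__main__":
    import boto3  # requires AWS credentials to actually run the query

    cloudwatch = boto3.client("cloudwatch")
    resp = cloudwatch.get_metric_statistics(
        **build_metric_query("CPUUtilization", "/aws/sagemaker/Endpoints")
    )
    for point in resp["Datapoints"]:
        print(point["Timestamp"], point["Average"])
```

Because CPUUtilization and the GPU metrics are summed across cores or devices, an `Average` datapoint can legitimately exceed 100% on multi-core or multi-GPU instances.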