Add GPU saturation monitoring
As a follow-on from !125 (merged), we now have GPU metrics exposed on the Triton server.
We should add GPU saturation monitoring based on these metrics:
$ curl --silent --fail localhost:8082/metrics|grep gpu
# HELP nv_gpu_utilization GPU utilization rate [0.0 - 1.0)
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-c06bbf33-c250-ca45-579b-19134b518844"} 0.000000
# HELP nv_gpu_memory_total_bytes GPU total memory, in bytes
# TYPE nv_gpu_memory_total_bytes gauge
nv_gpu_memory_total_bytes{gpu_uuid="GPU-c06bbf33-c250-ca45-579b-19134b518844"} 42949672960.000000
# HELP nv_gpu_memory_used_bytes GPU used memory, in bytes
# TYPE nv_gpu_memory_used_bytes gauge
nv_gpu_memory_used_bytes{gpu_uuid="GPU-c06bbf33-c250-ca45-579b-19134b518844"} 33763098624.000000
# HELP nv_gpu_power_usage GPU power usage in watts
# TYPE nv_gpu_power_usage gauge
nv_gpu_power_usage{gpu_uuid="GPU-c06bbf33-c250-ca45-579b-19134b518844"} 51.091000
# HELP nv_gpu_power_limit GPU power management limit in watts
# TYPE nv_gpu_power_limit gauge
nv_gpu_power_limit{gpu_uuid="GPU-c06bbf33-c250-ca45-579b-19134b518844"} 400.000000
nv_energy_consumption{gpu_uuid="GPU-c06bbf33-c250-ca45-579b-19134b518844"} 3594017.400000
cc @lmcandrew @rnienaber @mray2020 @reprazent @cfeick
GPU Saturation Types to Add
- GPU Utilization: nv_gpu_utilization
- GPU Memory: nv_gpu_memory_used_bytes/nv_gpu_memory_total_bytes
- GPU Power Usage: nv_gpu_power_usage/nv_gpu_power_limit
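
To sanity-check the three ratios above against the sample output, here is a rough Python sketch. It is not the proposed implementation, only an illustration: the helper names `parse_gpu_metrics` and `saturation_ratios` are made up for this example, and it assumes every GPU series carries a single `gpu_uuid` label exactly as in the curl output in this issue.

```python
import re
from collections import defaultdict

# Matches sample lines such as: nv_gpu_utilization{gpu_uuid="GPU-..."} 0.000000
# Assumption: gpu_uuid is the only label, as in the output pasted in this issue.
SAMPLE_RE = re.compile(r'^(nv_\w+)\{gpu_uuid="([^"]+)"\}\s+(\S+)$')


def parse_gpu_metrics(text):
    """Return {gpu_uuid: {metric_name: value}} from Prometheus text output."""
    gpus = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            name, uuid, value = match.groups()
            gpus[uuid][name] = float(value)
    return dict(gpus)


def saturation_ratios(metrics):
    """Compute the three saturation ratios listed in this issue for one GPU."""
    return {
        "utilization": metrics["nv_gpu_utilization"],
        "memory": metrics["nv_gpu_memory_used_bytes"]
        / metrics["nv_gpu_memory_total_bytes"],
        "power": metrics["nv_gpu_power_usage"] / metrics["nv_gpu_power_limit"],
    }


# Sample values copied from the curl output in this issue.
SAMPLE = """\
nv_gpu_utilization{gpu_uuid="GPU-c06bbf33-c250-ca45-579b-19134b518844"} 0.000000
nv_gpu_memory_total_bytes{gpu_uuid="GPU-c06bbf33-c250-ca45-579b-19134b518844"} 42949672960.000000
nv_gpu_memory_used_bytes{gpu_uuid="GPU-c06bbf33-c250-ca45-579b-19134b518844"} 33763098624.000000
nv_gpu_power_usage{gpu_uuid="GPU-c06bbf33-c250-ca45-579b-19134b518844"} 51.091000
nv_gpu_power_limit{gpu_uuid="GPU-c06bbf33-c250-ca45-579b-19134b518844"} 400.000000
"""

if __name__ == "__main__":
    for uuid, metrics in parse_gpu_metrics(SAMPLE).items():
        for kind, ratio in saturation_ratios(metrics).items():
            print(f"{uuid} {kind}: {ratio:.3f}")
```

With the sample above, memory saturation comes out at roughly 0.79 and power at roughly 0.13, so all three values land on the same 0 to 1 scale. In practice the same ratios would presumably be expressed as queries over the scraped series rather than computed in a script.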