
Replace usage of vis_vram with vram throughout the AMD GPU-related codebase

Description

This Merge Request (MR) corrects how our device-management-service reports GPU VRAM for AMD GPUs. Previously, the codebase passed the vis_vram parameter to rocm-smi, which reports only the visible portion of the VRAM. As a result, total VRAM was underreported, especially for high-capacity GPUs such as the AMD Radeon RX 5700 XT. To report and utilize AMD GPUs accurately, this MR makes the following modifications:

  • Replaced vis_vram with vram in every rocm-smi command execution across the codebase, so that the reported total VRAM matches the GPU's specifications.
  • Adjusted the regular expression patterns in amd.go and related files to match the output format of rocm-smi when using the vram parameter.
  • Updated the GPU status reporting functions in nunet scripts to accurately calculate and display the total and used VRAM based on the vram parameter.
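
As a rough illustration of the regular expression change described above, the following Go sketch parses total and used VRAM from rocm-smi output. It is not the code in amd.go. The type VRAM, the function parseVRAM, the pattern itself, and the sample output lines are hypothetical; the line format is assumed to resemble what `rocm-smi --showmeminfo vram` prints and may differ between ROCm versions.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// VRAM holds the byte counts parsed for one GPU (hypothetical type).
type VRAM struct {
	TotalBytes uint64
	UsedBytes  uint64
}

// vramLine matches lines assumed to look like:
//   GPU[0]		: vram Total Memory (B): 8573157376
//   GPU[0]		: vram Total Used Memory (B): 17051648
// Matching is case-insensitive because label casing may vary by version.
var vramLine = regexp.MustCompile(
	`(?i)GPU\[(\d+)\]\s*:\s*vram Total (Used )?Memory \(B\):\s*(\d+)`)

// parseVRAM extracts total and used VRAM per GPU index from the given text.
// Lines that do not match the assumed format are ignored.
func parseVRAM(output string) map[int]VRAM {
	result := make(map[int]VRAM)
	for _, m := range vramLine.FindAllStringSubmatch(output, -1) {
		idx, err := strconv.Atoi(m[1])
		if err != nil {
			continue
		}
		n, err := strconv.ParseUint(m[3], 10, 64)
		if err != nil {
			continue
		}
		v := result[idx]
		if m[2] != "" {
			v.UsedBytes = n // the "Used " group matched
		} else {
			v.TotalBytes = n
		}
		result[idx] = v
	}
	return result
}

func main() {
	// Sample text written by hand for illustration, not captured output.
	sample := "GPU[0]\t\t: vram Total Memory (B): 8573157376\n" +
		"GPU[0]\t\t: vram Total Used Memory (B): 17051648\n"
	for idx, v := range parseVRAM(sample) {
		fmt.Printf("GPU %d: %d MiB total, %d MiB used\n",
			idx, v.TotalBytes>>20, v.UsedBytes>>20)
	}
}
```

Keying the pattern on the literal word vram means output produced with vis_vram would not match, which is one way to make a stale command invocation fail visibly instead of silently reporting the smaller figure.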

These changes address the discrepancy between reported and actual VRAM, ensuring that the device-management-service accurately reflects the hardware capabilities of AMD GPUs. More details here.

Checklist

  • I have updated the @version string in main.go. See https://semver.org/
  • I have updated CHANGELOG.md with a short description of the changes. This is for end-user quick reference.
  • I have run swag init to update the swagger docs. Note: This is only applicable if you added/deleted any REST endpoints.

Closes #364

Edited by Dagim Sisay
