Resolve "GPU Deployment Issues in Relevant Refactored Modules (Cross Vendor GPUs, TTY Configuration, Log Handling)"

Merge Request

Description

This merge request addresses the GPU deployment issues described in Issue #466. It reinstates host configurations for AMD GPUs, ensures proper device mappings for Intel GPUs, enables TTY in container configurations, simplifies image name determination, and updates log handling. Additionally, it reintroduces logic to select the GPU with the highest available free VRAM irrespective of the vendor. These changes are critical for resolving deployment issues and improving system robustness.
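The vendor-agnostic selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in this MR: the `GPU` struct, its field names, `ErrNoGPU`, and `SelectMaxFreeVRAM` are all hypothetical, and the real DMS types may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// GPU is a hypothetical record for one detected device.
// The field names are assumptions and do not mirror the DMS types.
type GPU struct {
	Index      int
	Vendor     string // e.g. "nvidia", "amd", "intel"
	FreeVRAMMB uint64 // currently free VRAM in megabytes
}

// ErrNoGPU is returned when the input list is empty.
var ErrNoGPU = errors.New("no GPUs available")

// SelectMaxFreeVRAM returns the GPU with the most free VRAM,
// ignoring the vendor. On a tie, the earliest entry wins.
func SelectMaxFreeVRAM(gpus []GPU) (GPU, error) {
	if len(gpus) == 0 {
		return GPU{}, ErrNoGPU
	}
	best := gpus[0]
	for _, g := range gpus[1:] {
		if g.FreeVRAMMB > best.FreeVRAMMB {
			best = g
		}
	}
	return best, nil
}

func main() {
	gpus := []GPU{
		{Index: 0, Vendor: "nvidia", FreeVRAMMB: 2048},
		{Index: 1, Vendor: "amd", FreeVRAMMB: 8192},
		{Index: 2, Vendor: "intel", FreeVRAMMB: 4096},
	}
	best, err := SelectMaxFreeVRAM(gpus)
	if err != nil {
		fmt.Println("selection failed:", err)
		return
	}
	fmt.Printf("selected %s GPU %d (%d MB free)\n", best.Vendor, best.Index, best.FreeVRAMMB)
}
```

Comparing on free rather than total VRAM means a large but busy card can lose to a smaller idle one, which appears to be the intent of the change.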

Type of Change

  • feat: A new feature for the user or a significant enhancement
  • fix: A bug fix
  • refactor: Code changes that neither fix a bug nor add a feature
  • perf: Performance improvements
  • test: Adding or modifying tests
  • style: Code style changes (e.g., formatting)
  • docs: Documentation changes
  • revert: Reverting a previous commit
  • build: Changes that affect the build system or external dependencies (e.g. npm, docker, nexus)
  • chore: Routine tasks, maintenance, or general refactoring
  • ci: Changes to the project's Continuous Integration (CI) configuration
  • release: Version number changes, typically associated with creating a new release
  • other (please specify): Restores previously working GPU deployment behavior for DMS containers that is currently missing from the refactored code

Semantic Versioning

  • This MR introduces a breaking change. (If yes, use the prefix BREAKING CHANGE: <merge request title> and ensure version bumps to major.)
  • This MR adds a new feature in a backwards-compatible manner. (If yes, use the prefix feat: <merge request title> and ensure version bumps to minor.)
  • This MR includes critical bug fixes or patches in a backwards-compatible manner bringing back necessary features. (If yes, use the prefix - fix: <merge request title> and ensure version bumps to patch.)
  • This MR does not necessitate a change in the version, since it contains routine maintenance, housekeeping, or improvements that don't affect the external API or user experience. (If yes, use the prefix that fits better e.g., build/chore/ci - chore: <merge request title> and ensure version doesn't bump.)

Specification and Related Issues

Checklist

  • I have tested the changes thoroughly
  • The code follows the project's coding standards
  • This MR is linked to the specification and complies with it
  • Documentation has been updated or added
  • Tests were added/modified to cover the changes
  • All tests pass successfully
  • The branch is up-to-date with the target branch
  • This MR is linked to any relevant issues
  • Assign this MR to the appropriate reviewer(s)
  • I have run swag init to update the swagger docs

Additional Information

This MR reinstates and simplifies the image name determination using a switch statement based on the GPU vendor. It ensures the GPU with the highest available free VRAM is selected, irrespective of the vendor, and the corresponding vendor-specific container is launched. It also addresses the following:

  • Reintroduce the bindings and device mappings for AMD GPUs, as implemented in DMS v0.4, in executor.go.
  • Add specific bindings and device mappings for Intel GPUs in executor.go.
  • Update GPU info management.
  • Set Tty: true, AttachStdout: true, and AttachStderr: true in the container configuration.
  • Update handler.go to handle logs as a single combined output stream and prevent blank outputs for applications running as jobs.
  • Add gpu_resource_test.go and gpu_executor_test.go so that tests cover the new configurations.
  • Optimize GPU utilization by selecting the GPU with the highest available free VRAM, for better performance and resource usage.
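
As a rough illustration of how the vendor switch, device mappings, and TTY settings listed above might fit together, here is a minimal sketch. It does not reproduce executor.go: `ContainerSpec` and `BuildSpec` are hypothetical, the image names are placeholders, and the device paths are the ones commonly exposed for these vendors (`/dev/kfd` and `/dev/dri` for AMD ROCm, `/dev/dri` for Intel), which is an assumption about what this MR maps. The three boolean fields mirror the identically named fields of the Docker SDK's `container.Config`.

```go
package main

import "fmt"

// ContainerSpec is a hypothetical, simplified stand-in for the container
// and host configuration built by the executor.
type ContainerSpec struct {
	Image        string
	Devices      []string // host device nodes to map into the container
	Tty          bool
	AttachStdout bool
	AttachStderr bool
}

// BuildSpec picks an image and device mappings from the GPU vendor.
// Image names are placeholders, not the images used by DMS.
func BuildSpec(vendor string) (ContainerSpec, error) {
	// TTY and stream attachment are enabled for every vendor.
	spec := ContainerSpec{Tty: true, AttachStdout: true, AttachStderr: true}

	switch vendor {
	case "nvidia":
		// Assumption: NVIDIA access is granted through the container
		// runtime, so no device nodes are listed here.
		spec.Image = "example/ml-cuda:latest"
	case "amd":
		spec.Image = "example/ml-rocm:latest"
		spec.Devices = []string{"/dev/kfd", "/dev/dri"}
	case "intel":
		spec.Image = "example/ml-intel:latest"
		spec.Devices = []string{"/dev/dri"}
	default:
		return ContainerSpec{}, fmt.Errorf("unsupported GPU vendor %q", vendor)
	}
	return spec, nil
}

func main() {
	for _, v := range []string{"nvidia", "amd", "intel"} {
		spec, err := BuildSpec(v)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s: image=%s devices=%v tty=%t\n", v, spec.Image, spec.Devices, spec.Tty)
	}
}
```

With a TTY enabled, Docker delivers container output as one raw stream instead of multiplexed stdout and stderr frames, which is presumably why the log handling in handler.go is updated alongside the TTY setting.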
Edited by Avimanyu Bandyopadhyay