Resolve "GPU Deployment Issues in Relevant Refactored Modules (Cross Vendor GPUs, TTY Configuration, Log Handling)"
Merge Request
Description
This merge request addresses the GPU deployment issues described in Issue #466. It reinstates host configurations for AMD GPUs, ensures proper device mappings for Intel GPUs, enables TTY in container configurations, simplifies image name determination, and updates log handling. Additionally, it reintroduces logic to select the GPU with the highest available free VRAM irrespective of the vendor. These changes are critical for resolving deployment issues and improving system robustness.
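The vendor-agnostic selection described above can be illustrated with a minimal sketch. It is not the code from this MR: the `GPU` type, its field names, and `selectBestGPU` are hypothetical, and it assumes free VRAM is already known per device.

```go
package main

import (
	"errors"
	"fmt"
)

// GPU is a hypothetical record for a detected device; the field names are
// illustrative and are not the DMS types.
type GPU struct {
	Vendor      string // e.g. "nvidia", "amd", "intel"
	Model       string
	FreeVRAMMiB uint64
}

// selectBestGPU returns the GPU with the most free VRAM, regardless of vendor.
// On a tie, the earliest entry in the slice is kept.
func selectBestGPU(gpus []GPU) (GPU, error) {
	if len(gpus) == 0 {
		return GPU{}, errors.New("no GPUs detected")
	}
	best := gpus[0]
	for _, g := range gpus[1:] {
		if g.FreeVRAMMiB > best.FreeVRAMMiB {
			best = g
		}
	}
	return best, nil
}

func main() {
	// Made-up inventory used only to exercise the function.
	gpus := []GPU{
		{Vendor: "nvidia", Model: "example-a", FreeVRAMMiB: 4096},
		{Vendor: "amd", Model: "example-b", FreeVRAMMiB: 12288},
		{Vendor: "intel", Model: "example-c", FreeVRAMMiB: 8192},
	}
	best, err := selectBestGPU(gpus)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("selected %s %s with %d MiB free\n", best.Vendor, best.Model, best.FreeVRAMMiB)
}
```

Comparing on free rather than total VRAM means a large card that is already busy can lose to a smaller idle one, which matches the stated goal of picking the device with the most available memory.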
Type of Change
- feat: A new feature for the user or a significant enhancement
- fix: A bug fix
- refactor: Code changes that neither fix a bug nor add a feature
- perf: Performance improvements
- test: Adding or modifying tests
- style: Code style changes (e.g., formatting)
- docs: Documentation changes
- revert: Reverting a previous commit
- build: Changes that affect the build system or external dependencies (e.g. npm, docker, nexus)
- chore: Routine tasks, maintenance, or general refactoring
- ci: Changes to the project's Continuous Integration (CI) configuration
- release: Version number changes, typically associated with creating a new release
- other (please specify): Brings back previously functional aspects of GPU deployment through DMS containers currently missing in refactored code
Semantic Versioning
- This MR introduces a breaking change. (If yes, use the prefix BREAKING CHANGE: <merge request title> and ensure version bumps to major.)
- This MR adds a new feature in a backwards-compatible manner. (If yes, use the prefix feat: <merge request title> and ensure version bumps to minor.)
- This MR includes critical bug fixes or patches in a backwards-compatible manner bringing back necessary features. (If yes, use the prefix fix: <merge request title> and ensure version bumps to patch.)
- This MR does not necessitate a change in the version, since it contains routine maintenance, housekeeping, or improvements that don't affect the external API or user experience. (If yes, use the prefix that fits better, e.g., build/chore/ci, as in chore: <merge request title>, and ensure version doesn't bump.)
Specification and Related Issues
- Issue #466 (closed): GPU Deployment Issues in Relevant Refactored Modules (Cross Vendor GPUs, TTY Configuration, Log Handling)
Checklist
- I have tested the changes thoroughly
- The code follows the project's coding standards
- This MR is linked to the specification and complies with it
- Documentation has been updated or added
- Tests were added/modified to cover the changes
- All tests pass successfully
- The branch is up-to-date with the target branch
- This MR is linked to any relevant issues
- Assign this MR to the appropriate reviewer(s)
- I have run swag init to update the swagger docs
Additional Information
This MR reinstates and simplifies the image name determination using a switch statement based on the GPU vendor. It ensures the GPU with the highest available free VRAM is selected, irrespective of the vendor, and the corresponding vendor-specific container is launched. It also addresses the following:
- Reintroduce the binding and device mappings for AMD GPUs as implemented in DMS v0.4 in executor.go.
- Add specific bindings and device mappings for Intel GPUs in executor.go.
- Update GPU info management.
- Set Tty: true, AttachStdout: true, and AttachStderr: true in the container configuration.
- Update handler.go to handle logs as a single combined output stream and prevent blank outputs for applications running as jobs.
- Add gpu_resource_test.go and gpu_executor_test.go to ensure tests cover the new configurations.
- Optimize GPU utilization by selecting the GPU with the highest available free VRAM, ensuring better performance and resource usage.