Problem onboarding GPU
Summary
When onboarding a machine with a GPU, the onboarding process does not onboard the GPU.
Steps to reproduce
This seems to affect only certain machines. I have two identical machines, each running an NVIDIA L4, and both show the same problem. These machines used to onboard correctly.
What is the current bug behavior?
The GPU is detected and can be selected as part of the onboarding process.
However, when onboarding is complete, the GPU is not shown in the onboarded resources.
What is the expected correct behavior?
The GPU should be onboarded.
Relevant logs and/or screenshots
ubuntu@L4-1-PoC:~$ nunet -c dms actor cmd /dms/node/onboarding/onboard -C 6 -D 60G -R 42G
NVIDIA L4
Done
-----------------------------------
Selected GPU: NVIDIA L4
Total VRAM: 24 GB
Used VRAM: 0 GB
Available VRAM: 23 GB
Enter new VRAM allocation in GB: 20
-----------------------------------
{
"success": true,
"config": {
"ID": "",
"CreatedAt": "0001-01-01T00:00:00Z",
"UpdatedAt": "0001-01-01T00:00:00Z",
"DeletedAt": "0001-01-01T00:00:00Z",
"is_onboarded": true,
"onboarded_resources": {
"cpu": {
"clock_speed": 0,
"cores": 6
},
"ram": {
"size": 42000000000
},
"disk": {
"size": 60000000000
}
}
}
}
ubuntu@L4-1-PoC:~$
ubuntu@L4-1-PoC:~$
ubuntu@L4-1-PoC:~$
ubuntu@L4-1-PoC:~$ nvidia-smi
Wed May 14 10:50:45 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03 Driver Version: 550.144.03 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 Off | 00000000:01:00.0 Off | 0 |
| N/A 32C P8 11W / 72W | 1MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
ubuntu@L4-1-PoC:~$
I also ran the onboarding manually on the second machine, passing the GPU with -G, to check:
ubuntu@L4-1-PoC-2:~$ nvidia-smi
Wed May 14 11:03:17 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03 Driver Version: 550.144.03 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 Off | 00000000:01:00.0 Off | 0 |
| N/A 36C P8 16W / 72W | 1MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
ubuntu@L4-1-PoC-2:~$ nunet gpu list
GPU Details:
Model: NVIDIA L4, Total VRAM: 24 GB, Used VRAM: 0 GB, Vendor: NVIDIA, PCI Address: 0000:01:00.0, UUID: GPU-800784f5-3780-00e1-39ee-22d0391e5bed, Index: 0
ubuntu@L4-1-PoC-2:~$ nunet -c dms actor cmd /dms/node/onboarding/onboard -C 6 -R 40G -D 60G -G 0:23
{
"success": true,
"config": {
"ID": "",
"CreatedAt": "0001-01-01T00:00:00Z",
"UpdatedAt": "0001-01-01T00:00:00Z",
"DeletedAt": "0001-01-01T00:00:00Z",
"is_onboarded": true,
"onboarded_resources": {
"cpu": {
"clock_speed": 0,
"cores": 6
},
"ram": {
"size": 40000000000
},
"disk": {
"size": 60000000000
}
}
}
}
ubuntu@L4-1-PoC-2:~$
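The missing GPU can be checked mechanically rather than by eye. The following is a minimal sketch, not part of the NuNet tooling: the helper name `onboarded_gpus` is hypothetical, and it assumes that a successful GPU onboarding would add a `gpus` list under `onboarded_resources`, using the same key name that appears in the `/dms/node/resources/onboarded` output of the working L40S machine in this report.

```python
import json


def onboarded_gpus(response_text):
    """Return the GPU entries found in an onboarding response, or [] if none."""
    response = json.loads(response_text)
    resources = response.get("config", {}).get("onboarded_resources", {})
    # Assumption: GPUs would be listed under a "gpus" key, as in the
    # /dms/node/resources/onboarded output. A missing key counts as no GPUs.
    return resources.get("gpus") or []


# Response returned on L4-1-PoC-2 for "-C 6 -R 40G -D 60G -G 0:23",
# copied from the transcript above.
sample = """
{
  "success": true,
  "config": {
    "ID": "",
    "CreatedAt": "0001-01-01T00:00:00Z",
    "UpdatedAt": "0001-01-01T00:00:00Z",
    "DeletedAt": "0001-01-01T00:00:00Z",
    "is_onboarded": true,
    "onboarded_resources": {
      "cpu": {"clock_speed": 0, "cores": 6},
      "ram": {"size": 40000000000},
      "disk": {"size": 60000000000}
    }
  }
}
"""

gpus = onboarded_gpus(sample)
print(f"GPUs in onboarding response: {len(gpus)}")  # prints 0 for this response
```

For the response above this reports zero GPUs even though `success` is true and `-G 0:23` was passed.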
These are the details from an L40S that is currently working. I am going to re-onboard it to test.
ubuntu@L40S-1-PoC:~$ nunet -c dms actor cmd /dms/node/resources/onboarded
{
"OK": true,
"Resources": {
"cpu": {
"clock_speed": 2650000000,
"cores": 7
},
"gpus": [
{
"index": 0,
"vendor": "NVIDIA",
"pci_address": "0000:01:00.0",
"model": "NVIDIA L40S",
"vram": 43000000000,
"uuid": "GPU-71a05ae8-5bb2-657b-6d00-9cd45c4d472c"
}
],
"ram": {
"size": 83900000000
},
"disk": {
"size": 1634270000000
}
}
}
ubuntu@L40S-1-PoC:~$ nvidia-smi
Wed May 14 10:51:04 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03 Driver Version: 550.144.03 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S Off | 00000000:01:00.0 Off | 0 |
| N/A 32C P8 32W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
ubuntu@L40S-1-PoC:~$
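A note on units, since the figures in this report mix them: nvidia-smi reports memory in MiB, while the nunet output appears to use decimal GB. That is an inference from the numbers, not something confirmed from the code. The sketch below converts the nvidia-smi totals; the 46068 MiB of the L40S comes out at exactly the `VRAM:48305799168` value shown in the onboarding log of that machine.

```python
MIB = 1024 ** 2  # bytes per MiB, the unit nvidia-smi prints
GB = 1000 ** 3   # bytes per decimal GB (assumed unit of the nunet output)


def mib_to_bytes(mib):
    """Convert an nvidia-smi MiB figure to bytes."""
    return mib * MIB


def bytes_to_gb(num_bytes):
    """Convert bytes to decimal gigabytes."""
    return num_bytes / GB


# Total memory figures copied from the nvidia-smi output in this report.
for model, total_mib in (("NVIDIA L4", 23034), ("NVIDIA L40S", 46068)):
    total_bytes = mib_to_bytes(total_mib)
    print(model, total_bytes, round(bytes_to_gb(total_bytes), 2))
# NVIDIA L4 24152899584 24.15
# NVIDIA L40S 48305799168 48.31
```

If the second number in `-G 0:23` and `-G 0:44` is a VRAM amount in GB (also an assumption), both requests are below the totals computed here, so the missing GPU does not look like an over-allocation.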
This is the re-onboarding of the L40S machine. The response again contains no GPU entry:
ubuntu@L40S-1-PoC:~$ nunet -c dms actor cmd /dms/node/onboarding/onboard -C 6 -R 80G -D 80G -G 0:44
{
"success": true,
"config": {
"ID": "",
"CreatedAt": "0001-01-01T00:00:00Z",
"UpdatedAt": "0001-01-01T00:00:00Z",
"DeletedAt": "0001-01-01T00:00:00Z",
"is_onboarded": true,
"onboarded_resources": {
"cpu": {
"clock_speed": 0,
"cores": 6
},
"ram": {
"size": 80000000000
},
"disk": {
"size": 80000000000
}
}
}
}
ubuntu@L40S-1-PoC:~$
This is the log of the onboarding event from that machine:
May 14 11:11:48 L40S-1-PoC rundms.sh[1063196]: [GIN] 2025/05/14 - 11:11:48 | 200 | 373.782µs | 127.0.0.1 | GET "/api/v1/actor/handle"
May 14 11:11:48 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:48.267Z#011#033[35mDEBUG#033[0m#011observability#011observability/tracing.go:326#011Operation started with new request-like transactionoperationactor_invoke_durationtrace.ide0c66de9e723c97ff3b02609f1c76f87transaction.ide0c66de9e723c97f#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "labels": ["default"]}
May 14 11:11:48 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:48.268Z#011#033[35mDEBUG#033[0m#011actor#011actor/dispatch.go:221#011dispatching message from {BAAREIETRZFTIKPBACRVXWXNGZWQFJEDNV2GTXHNE6DD7XYPLSOFR5GMPA====== did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF {12D3KooWGkrvqHLk1NonfvtDDXo5MoynFR7VYZAqDZPRUEujMsPo user-1747221108266312306}} to /dms/node/onboarding/onboard#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "labels": ["default"]}
May 14 11:11:48 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:48.268Z#011#033[35mDEBUG#033[0m#011onboarding#011onboarding/onboarding.go:284#011onboarding machine with config: {BaseDBModel:{ID: CreatedAt:0001-01-01 00:00:00 +0000 UTC UpdatedAt:0001-01-01 00:00:00 +0000 UTC DeletedAt:0001-01-01 00:00:00 +0000 UTC} IsOnboarded:false OnboardedResources:{CPU:{ClockSpeed:0 Cores:6 Model: Vendor: Threads:0 Architecture: CacheSize:0} GPUs:[] RAM:{Size:80000000000 ClockSpeed:0 Type:} Disk:{Size:80000000000 Model: Vendor: Type: Interface: ReadSpeed:0 WriteSpeed:0}}}#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "labels": ["default"]}
May 14 11:11:48 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:48.268Z#011#033[34mINFO#033[0m#011onboarding#011onboarding/onboarding.go:227#011machine_hardware_resources#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "cpuCores": 8, "ramGB": 101, "gpuCount": 1, "labels": ["node"], "es_index": "node-index"}
May 14 11:11:48 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:48.268Z#011#033[34mINFO#033[0m#011onboarding#011onboarding/onboarding.go:234#011machine_hardware_gpu#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "gpuIndex": 0, "gpuModel": "NVIDIA L40S", "gpuVramGB": 48, "gpuLogIndex": 0, "labels": ["node"], "es_index": "node-index"}
May 14 11:11:49 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:49.269Z#011#033[35mDEBUG#033[0m#011hardware#011hardware/hardware.go:75#011cpu_usage_computed#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "usage": {"clock_speed":2650000000,"cores":0.020025032}, "labels": ["accounting"], "es_index": "accounting-index"}
May 14 11:11:49 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:49.269Z#011#033[35mDEBUG#033[0m#011hardware#011hardware/hardware.go:84#011ram_usage_computed#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "usedMemoryBytes": 1067544576, "labels": ["accounting"], "es_index": "accounting-index"}
May 14 11:11:49 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:49.270Z#011#033[35mDEBUG#033[0m#011hardware#011hardware/hardware.go:93#011disk_usage_computed#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "usedStorageBytes": 60056151552, "labels": ["accounting"], "es_index": "accounting-index"}
May 14 11:11:49 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:49.270Z#011#033[35mDEBUG#033[0m#011hardware#011hardware/hardware.go:103#011gpu_usage_computed#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "gpuUUID": "GPU-71a05ae8-5bb2-657b-6d00-9cd45c4d472c", "vendor": "NVIDIA", "usedVRAM": 502398976, "labels": ["accounting"], "es_index": "accounting-index"}
May 14 11:11:49 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:49.270Z#011#033[35mDEBUG#033[0m#011hardware#011hardware/hardware.go:130#011system resource usage: {CPU:{ClockSpeed:2.65e+09 Cores:0.020025032 Model: Vendor: Threads:0 Architecture: CacheSize:0} GPUs:[{Index:0 Vendor:NVIDIA PCIAddress:0000:01:00.0 Model:NVIDIA L40S VRAM:502398976 UUID:GPU-71a05ae8-5bb2-657b-6d00-9cd45c4d472c}] RAM:{Size:1067544576 ClockSpeed:0 Type:} Disk:{Size:60056151552 Model: Vendor: Type: Interface: ReadSpeed:0 WriteSpeed:0}}
May 14 11:11:49 L40S-1-PoC rundms.sh[1063196]: system resource available: {BaseDBModel:{ID: CreatedAt:0001-01-01 00:00:00 +0000 UTC UpdatedAt:0001-01-01 00:00:00 +0000 UTC DeletedAt:0001-01-01 00:00:00 +0000 UTC} Resources:{CPU:{ClockSpeed:2.65e+09 Cores:8 Model: Vendor: Threads:0 Architecture: CacheSize:0} GPUs:[{Index:0 Vendor:NVIDIA PCIAddress:0000:01:00.0 Model:NVIDIA L40S VRAM:48305799168 UUID:GPU-71a05ae8-5bb2-657b-6d00-9cd45c4d472c}] RAM:{Size:101242486784 ClockSpeed:0 Type:} Disk:{Size:1695237533184 Model: Vendor: Type: Interface: ReadSpeed:0 WriteSpeed:0}}}#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "labels": ["default"]}
May 14 11:11:49 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:49.270Z#011#033[34mINFO#033[0m#011onboarding#011onboarding/onboarding.go:255#011machine_free_resources#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "freeCpuCores": 7.98, "freeRamGB": 100, "freeGpuCount": 1, "labels": ["node"], "es_index": "node-index"}
May 14 11:11:49 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:49.270Z#011#033[34mINFO#033[0m#011onboarding#011onboarding/onboarding.go:262#011machine_free_gpu#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "gpuIndex": 0, "gpuModel": "NVIDIA L40S", "gpuVramGB": 47, "gpuLogIndex": 0, "labels": ["node"], "es_index": "node-index"}
May 14 11:11:49 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:49.273Z#011#033[34mINFO#033[0m#011onboarding#011onboarding/onboarding.go:294#011onboarded_resources_assigned#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "cpuCoresAssigned": 6, "ramGBAssigned": 80, "diskMBAssigned": 76293, "gpuCountAssigned": 0, "labels": ["node"], "es_index": "node-index"}
May 14 11:11:49 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:49.275Z#011#033[34mINFO#033[0m#011onboarding#011onboarding/onboarding.go:307#011machine_onboarded_successfully#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "labels": ["node"], "es_index": "node-index"}
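The relevant numbers in this log are easy to miss. The hardware and free-resource events report one GPU, the config logged at onboarding.go:284 already has `GPUs:[]`, and `onboarded_resources_assigned` ends with `gpuCountAssigned: 0`. The sketch below pulls those counts out of the log lines. It is a hypothetical helper, not part of DMS, and it assumes the layout seen above: tab characters are escaped as `#011` and the last field of a structured line is a JSON object.

```python
import json


def log_event(line):
    """Return (message, fields) for a structured log line, else None."""
    parts = line.split("#011")  # syslog has escaped tab characters as '#011'
    if len(parts) < 2:
        return None
    try:
        fields = json.loads(parts[-1])
    except json.JSONDecodeError:
        return None  # the last field is not JSON (e.g. a Go struct dump)
    if not isinstance(fields, dict):
        return None
    return parts[-2], fields


def gpu_counts(lines):
    """Collect the GPU count fields reported by the onboarding log events."""
    counts = {}
    for line in lines:
        event = log_event(line)
        if event is None:
            continue
        _, fields = event
        for key in ("gpuCount", "freeGpuCount", "gpuCountAssigned"):
            if key in fields:
                counts[key] = fields[key]
    return counts


# Three lines copied verbatim from the log above.
sample_lines = [
    'May 14 11:11:48 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:48.268Z#011#033[34mINFO#033[0m#011onboarding#011onboarding/onboarding.go:227#011machine_hardware_resources#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "cpuCores": 8, "ramGB": 101, "gpuCount": 1, "labels": ["node"], "es_index": "node-index"}',
    'May 14 11:11:49 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:49.270Z#011#033[34mINFO#033[0m#011onboarding#011onboarding/onboarding.go:255#011machine_free_resources#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "freeCpuCores": 7.98, "freeRamGB": 100, "freeGpuCount": 1, "labels": ["node"], "es_index": "node-index"}',
    'May 14 11:11:49 L40S-1-PoC rundms.sh[1063196]: 2025-05-14T11:11:49.273Z#011#033[34mINFO#033[0m#011onboarding#011onboarding/onboarding.go:294#011onboarded_resources_assigned#011{"did": "did:key:z6MkmPkogQ1UA22mNYDVGW1aS7SwafLhvkLhwywKvXirP9TF", "cpuCoresAssigned": 6, "ramGBAssigned": 80, "diskMBAssigned": 76293, "gpuCountAssigned": 0, "labels": ["node"], "es_index": "node-index"}',
]

print(gpu_counts(sample_lines))
# {'gpuCount': 1, 'freeGpuCount': 1, 'gpuCountAssigned': 0}
```

One GPU detected and free, zero assigned, and an empty GPU list in the received config may indicate that the `-G` selection is lost before it reaches the onboarding handler. This has not been confirmed.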
Version number of NuNet components
vv0.6.0-56-c9ae8e59
OS version, emulator/virtual machine type and version, network type (including NAT type), environment variables, parameters, etc
Fluxus cloud machines (exact same issue on ecoblox test lab)
Possible fixes
Edited by Samuel Lake