BareMetalHosts powered off without reason during rolling updates

During rolling updates, some machines remain stuck in `Provisioning`, as in this CI run: https://gitlab.com/sylva-projects/sylva-core/-/jobs/7852801460

    v1.27.16
    sylva-system         mgmt-1458191268-kubeadm-capm3-virt-control-plane-dblfv   mgmt-1458191268-kubeadm-capm3-virt                                                                                                                                                                                          Provisioning   54m   v1.28.12
    sylva-system         mgmt-1458191268-kubeadm-capm3-virt-control-plane-rt8bw   mgmt-1458191268-kubeadm-capm3-virt   mgmt-1458191268-kubeadm-capm3-virt-management-cp-2   metal3://sylva-system/mgmt-1458191268-kubeadm-capm3-virt-management-cp-2/mgmt-1458191268-kubeadm-capm3-virt-control-plane-rt8bw   Running        92m   v1.28.12
    sylva-system         mgmt-1458191268-kubeadm-capm3-virt-control-plane-s7qbm   mgmt-1458191268-kubeadm-capm3-virt   mgmt-1458191268-kubeadm-capm3-virt-management-cp-0   metal3://sylva-system/mgmt-1458191268-kubeadm-capm3-virt-management-cp-0/mgmt-1458191268-kubeadm-capm3-virt-control-plane-s7qbm   Running        92m   v1.28.12
    sylva-system         mgmt-1458191268-kubeadm-capm3-virt-md0-jbzpl-q56fg       mgmt-1458191268-kubeadm-capm3-virt   mgmt-1458191268-kubeadm-capm3-virt-management-md-0   metal3://sylva-system/mgmt-1458191268-kubeadm-capm3-virt-management-md-0/mgmt-1458191268-kubeadm-capm3-virt-md0-jbzpl-q56fg       Running        56m   v1.28.12
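To spot machines stuck mid-rollout, a jq filter along these lines can help (a sketch; the namespace is the one from this run, and the live `kubectl` invocation is shown as a comment):

```shell
# jq filter that lists Machines whose phase is not "Running" (name + phase)
NOT_RUNNING='.items[] | select(.status.phase != "Running") | "\(.metadata.name)\t\(.status.phase)"'

# Against a live cluster (namespace taken from the output above):
#   kubectl get machines -n sylva-system -o json | jq -r "$NOT_RUNNING"

# Demo on a minimal sample resembling the stuck control-plane machine:
printf '%s' '{"items":[{"metadata":{"name":"control-plane-dblfv"},"status":{"phase":"Provisioning"}}]}' \
  | jq -r "$NOT_RUNNING"
```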

The BMH is in provisioning state, with `online: true`:

    operationHistory:
      deprovision:
        end: "2024-09-18T08:49:48Z"
        start: "2024-09-18T08:49:37Z"
      inspect:
        end: "2024-09-18T07:51:40Z"
        start: "2024-09-18T07:46:51Z"
      provision:
        end: null
        start: "2024-09-18T08:57:52Z"
      register:
        end: "2024-09-18T08:49:52Z"
        start: "2024-09-18T08:49:51Z"
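A quick way to correlate the provisioning state with the power state is to dump those fields side by side (a sketch; the field paths are from the Metal3 BareMetalHost status, and the live call is shown as a comment):

```shell
# jq filter summarising each BMH: name, provisioning state, spec.online, poweredOn
BMH_SUMMARY='.items[] | [.metadata.name, .status.provisioning.state, (.spec.online|tostring), (.status.poweredOn|tostring)] | @tsv'

# Against a live cluster:
#   kubectl get bmh -n sylva-system -o json | jq -r "$BMH_SUMMARY"

# Demo on a sample matching the situation above (provisioning, online wanted, but powered off):
printf '%s' '{"items":[{"metadata":{"name":"management-cp-2"},"spec":{"online":true},"status":{"provisioning":{"state":"provisioning"},"poweredOn":false}}]}' \
  | jq -r "$BMH_SUMMARY"
```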

But we can see in the libvirt-console logs that the machine has been powered off:

[2024-09-18 08:58:35,610] INFO in main: System "c0014001-b10b-f001-c0de-feeb1e54ee15" power state set to "ForceOff"

Looking at the Ironic logs in the bootstrap cluster, we can see that it is Ironic that powered off the VM:

2024-09-18 08:58:35.183 1 DEBUG sushy.connector [None req-4fd42ab1-69bd-4f90-a788-f73ed40ee46c - - - - - -] HTTP request: POST https://172.18.0.2:8001/redfish/v1/Systems/c0014001-b10b-f001-c0de-feeb1e54ee15/Actions/ComputerSystem.Reset; headers: {'Content-Type': 'application/json', 'OData-Version': '4.0'}; body: {'ResetType': 'ForceOff'}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.11/site-packages/sushy/connector.py:149[00m
2024-09-18 08:58:35.611 1 DEBUG sushy.connector [None req-4fd42ab1-69bd-4f90-a788-f73ed40ee46c - - - - - -] HTTP response for POST https://172.18.0.2:8001/redfish/v1/Systems/c0014001-b10b-f001-c0de-feeb1e54ee15/Actions/ComputerSystem.Reset: status code: 204 _op /usr/lib/python3.11/site-packages/sushy/connector.py:283[00m
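These ForceOff requests can be grepped out of the Ironic logs; a sketch (the deployment and namespace names in the commented command are assumptions, not taken from this run):

```shell
# Pattern matching Redfish ForceOff reset requests in Ironic's debug output
FORCEOFF='ComputerSystem\.Reset.*ForceOff'

# Against the bootstrap cluster (names are assumptions, adapt to the deployment):
#   kubectl logs -n baremetal-operator-system deploy/ironic | grep -E "$FORCEOFF"

# Demo on the request line quoted above:
printf '%s\n' "HTTP request: POST .../Actions/ComputerSystem.Reset; body: {'ResetType': 'ForceOff'}" \
  | grep -E "$FORCEOFF"
```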

This is unexpected: after pivoting, there are no more BMHs defined in the bootstrap cluster, so there is probably a bug in Metal3. In the meantime, we should uninstall Ironic from the bootstrap cluster after pivoting.
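As a guard for that workaround, it is worth checking that the bootstrap cluster really has no BMHs left before removing Ironic (a sketch; the kubeconfig path is an assumption for illustration):

```shell
# jq check that succeeds (exit 0) only when no BareMetalHosts remain
NO_BMH='.items | length == 0'

# Against the bootstrap cluster (kubeconfig path is an assumption), before
# removing Ironic with whatever tooling was used to deploy it:
#   kubectl --kubeconfig bootstrap.kubeconfig get bmh -A -o json | jq -e "$NO_BMH"

# Demo: an empty list passes the check
printf '%s' '{"items":[]}' | jq -e "$NO_BMH"
```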

Edited Sep 20, 2024 by Francois Eleouet