Fix CDS breaking when the set of nodes gets smaller

Downscaling a CDS (i.e. removing the applicable label from one or more nodes) leads to the following crash in cds-operator and prevents the CDS from being reconciled.

2025-05-08 08:59:35,634   ERROR  yaook.op.tasks  task TaskItem(func=<bound method OperatorDaemon._reconcile_cr of <yaook.op.daemon.OperatorDaemon object at 0x7f1979649750>>, data=(<CustomResource configureddaemonsets.apps.yaook.cloud/v1>, 'yaook', 'octavia-health-manager-zvmcg')) failed. retrying in 123.0645667974818s
Traceback (most recent call last):
  ...
  File "/usr/local/lib/python3.11/site-packages/yaook/statemachine/resources/instancing.py", line 1029, in _compile_full_state
    node = await v1.read_node(node_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes_asyncio/client/api_client.py", line 192, in __call_api
    raise e
  File "/usr/local/lib/python3.11/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
    response_data = await self.request(
                    ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes_asyncio/client/rest.py", line 193, in GET
    return (await self.request("GET", url,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes_asyncio/client/rest.py", line 187, in request
    raise ApiException(http_resp=r)
kubernetes_asyncio.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: <CIMultiDictProxy('Audit-Id': 'd0db08ea-cdce-496e-aa3e-ba071c0be7d3', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '229285a4-468b-4c3a-ba89-5c600ec0b16b', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'b8f4df52-7d68-4920-be78-6cbb09be562c', 'Date': 'Thu, 08 May 2025 08:59:35 GMT', 'Content-Length': '238')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodes \"octavia-health-manager-zvmcg-cj5xr\" not found","reason":"NotFound","details":{"name":"octavia-health-manager-zvmcg-cj5xr","kind":"nodes"},"code":404}

The operator looks up nodes via the following code in yaook.statemachine.resources.instancing.StatefulInstancedResource, which is the base class of yaook.op.cds.resources.CDSPods:

    def _get_node_name(self, resource_info: ResourceInfo) -> str:
        return resource_info.metadata["name"]

    ...

    async def _compile_full_state(
        ...
        for wrong_instance in assignment_keys - instance_keys:
            prevent_deletion = False
            resource_info = list(assignment[wrong_instance])[0]
            node_name = self._get_node_name(resource_info)
            node = await v1.read_node(node_name)

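However, for CDS Pods, resource_info.metadata["name"] is the name of the instanced Pod, not the node name. As a result, v1.read_node is called with a Pod name (octavia-health-manager-zvmcg-cj5xr in the log above) and fails with the 404 shown in the traceback.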
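To make the failure mode concrete, here is a minimal, self-contained sketch. It is not the code from this merge request: `FakeCoreV1`, `read_node_or_none`, and the local `ApiException` class are hypothetical stand-ins (the last one only mimics the `status` attribute of `kubernetes_asyncio.client.exceptions.ApiException`), and the node name `worker-1` is made up. It shows that passing a Pod name to `read_node` yields a 404, and one possible way a caller could tolerate a missing node instead of aborting the reconcile.

```python
import asyncio
from typing import Optional


class ApiException(Exception):
    """Hypothetical stand-in for kubernetes_asyncio's ApiException."""

    def __init__(self, status: int) -> None:
        super().__init__(f"({status})")
        self.status = status


class FakeCoreV1:
    """Hypothetical in-memory replacement for the CoreV1Api client."""

    def __init__(self, nodes: set[str]) -> None:
        self._nodes = nodes

    async def read_node(self, name: str) -> dict:
        # Like the real API, an unknown node name results in a 404.
        if name not in self._nodes:
            raise ApiException(404)
        return {"metadata": {"name": name}}


async def read_node_or_none(v1: FakeCoreV1, node_name: str) -> Optional[dict]:
    """Return the node, or None if it does not exist (assumed policy)."""
    try:
        return await v1.read_node(node_name)
    except ApiException as exc:
        if exc.status == 404:
            return None
        raise  # any other API error is still propagated


async def demo() -> tuple[Optional[dict], Optional[dict]]:
    v1 = FakeCoreV1({"worker-1"})
    found = await read_node_or_none(v1, "worker-1")
    # A Pod name is not a node name, so the lookup cannot succeed.
    missing = await read_node_or_none(v1, "octavia-health-manager-zvmcg-cj5xr")
    return found, missing


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Tolerating the 404 only hides the symptom. The underlying mismatch is that `_get_node_name` returns a Pod name for CDS Pods, so a fix would more plausibly resolve the correct node name for that resource type.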
