Failure when creating more than ten pods simultaneously with bond-cni plugin

Summary

When pods use bond-cni, fewer than ten of them can be created simultaneously; beyond that, some pod creations fail.

Steps to reproduce

A CNF whose pods use SR-IOV and bond-cni ports executes a "system reset" command. All of its pods are then rebuilt simultaneously, but not all of them are created successfully.

What is the current bug behavior?

Some pods are not created, and the following error is logged:

(combined from similar events): Failed to create pod sandbox: 
  rpc error:code = Unknown desc = failed to setup network for sandbox"***": 
  plugin type="multus" name="***" failed (add):[***/***/***:***-bond]: 
  error adding container to network"***-bond": 
   Failed to lookup physicalfunctions links, error: interrupted system call

The message "Failed to lookup physicalfunctions links, error" originates in the following file: https://github.com/k8snetworkplumbingwg/bond-cni/blob/master/bond/util/validation.go
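
To gauge how many sandbox creations hit this error after a reset, the event messages can be filtered for the bond-cni error string. The following is a minimal sketch, not a verified procedure: the helper name `count_bond_lookup_failures` is hypothetical, and because Kubernetes aggregates repeated events ("combined from similar events" in the log above), the result should be read as a lower bound.

```shell
# Hypothetical helper: reads event messages on stdin (one per line) and
# prints how many lines contain the bond-cni error string.
count_bond_lookup_failures() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the exit
  # status at 0 so the helper also works under "set -e".
  grep -c -F 'Failed to lookup physicalfunctions links' || true
}

# Example (needs cluster access; FailedCreatePodSandBox is the event reason
# attached to "Failed to create pod sandbox" messages):
#   kubectl get events -A --field-selector reason=FailedCreatePodSandBox \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' \
#     | count_bond_lookup_failures
```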

What is the expected correct behavior?

All pods are created without errors, with the correct bond configuration inside each pod.

Possible fixes

The CNF vendor suspects that the root cause is linked to the netlink library version and that netlink v1.3.1 (the change is described in https://github.com/vishvananda/netlink/pull/1018) will mitigate the issue.

A recent bond-cni commit already uses netlink v1.3.1 (details below), so a potential fix may already be available.

To test this fix, we need a new image for the rancher/hardened-cni-plugins container (the cni-plugins initContainer, part of the multus pod, which installs the bond-cni binary on the node):

$ kubectl -n kube-system logs multus-wdrpb -c cni-plugins | grep bond
copied /opt/cni/bin/bond to /host/opt/cni/bin correctly
$ kubectl get pods -n kube-system multus-wdrpb -o yaml | yq '.spec.initContainers[0].image'
rancher/hardened-cni-plugins:v1.7.1-build20250611

That new hardened-cni-plugins image tag needs to be built from bond-cni commit 0945e95a2e6e9ff911d698d4c0764a7c64dcbd02 (May 16, 2025) or a more recent one, since bond-cni uses netlink v1.3.1 starting with that commit.

The latest tag (and main branch) of the Rancher image-build-cni-plugins repo is based on an older bond-cni commit: 80bef2cd60be32bef9dc08b1a30aaea5282c0311, from January 7, 2025.
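
Whether a given bond-cni commit pins netlink v1.3.1 or later can be checked from its go.mod. The sketch below assumes that go.mod sits at the repository root, that no replace directive overrides the netlink module, and that `sort -V` is available (as in GNU coreutils). The helper names `netlink_version` and `version_ge` are hypothetical.

```shell
# Hypothetical helper: reads a go.mod on stdin and prints the version
# required for github.com/vishvananda/netlink (first match only).
netlink_version() {
  awk '
    # Entry inside a "require ( ... )" block: <module> <version>
    $1 == "github.com/vishvananda/netlink" { print $2; exit }
    # Single-line form: require <module> <version>
    $1 == "require" && $2 == "github.com/vishvananda/netlink" { print $3; exit }
  '
}

# Hypothetical helper: succeeds if version $1 is greater than or equal to $2.
version_ge() {
  # sort -V orders version strings; if $2 sorts first (or ties), then $1 >= $2.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example (needs network access), using the commit id quoted above:
#   v=$(curl -fsSL https://raw.githubusercontent.com/k8snetworkplumbingwg/bond-cni/0945e95a2e6e9ff911d698d4c0764a7c64dcbd02/go.mod \
#         | netlink_version)
#   version_ge "$v" v1.3.1 && echo "netlink $v is new enough"
```

Running the same check against commit 80bef2cd60be32bef9dc08b1a30aaea5282c0311 would show which netlink version the current Rancher build carries.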

Edited Aug 28, 2025 by Thomas Morin