
atlantic: fix deadlock at aq_nic_stop

Íñigo Huguet requested to merge ihuguet-rh/kernel-cs9:atlantic_deadlock into main

Bugzilla: https://bugzilla.redhat.com/2125601
Tested: during development of the upstream patch, using the reproducer from QA and a stress test suggested by the maintainer

The NIC is stopped with rtnl_lock held, and during the stop it cancels the 'service_task' work and frees the IRQs.

However, if CONFIG_MACSEC is set, rtnl_lock is acquired both from aq_nic_service_task and aq_linkstate_threaded_isr. A deadlock then happens if aq_nic_stop tries to cancel/disable them after they have already started executing: aq_nic_stop holds rtnl_lock while waiting for them to finish, and they are waiting for rtnl_lock.
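The lock dependency described above can be illustrated with a small userspace analogy. This is only a sketch using pthreads, not driver code: `fake_rtnl`, `old_service_task` and `old_stop_path_would_deadlock` are hypothetical names, and `pthread_mutex_trylock` replaces the blocking lock so that the demo reports the conflict instead of hanging.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Stand-in for the kernel's global rtnl_lock (userspace analogy only). */
static pthread_mutex_t fake_rtnl = PTHREAD_MUTEX_INITIALIZER;

/*
 * Stand-in for the pre-fix service task, which wants the global lock.
 * trylock is used only so the demo can observe the conflict and exit;
 * a blocking lock here would wait forever.
 */
static void *old_service_task(void *arg)
{
	int *result = arg;

	*result = pthread_mutex_trylock(&fake_rtnl);
	if (*result == 0)
		pthread_mutex_unlock(&fake_rtnl);
	return NULL;
}

/*
 * Stand-in for the stop path: holds the global lock while waiting for
 * the task to finish. Returns 1 if the task found the lock busy, which
 * is the situation that deadlocks with a blocking lock.
 */
static int old_stop_path_would_deadlock(void)
{
	pthread_t task;
	int result = -1;
	int rc;

	pthread_mutex_lock(&fake_rtnl);
	rc = pthread_create(&task, NULL, old_service_task, &result);
	assert(rc == 0);
	/* Stands in for the synchronous cancel/disable in the stop path. */
	rc = pthread_join(task, NULL);
	assert(rc == 0);
	pthread_mutex_unlock(&fake_rtnl);

	return result == EBUSY;
}
```

Calling `old_stop_path_would_deadlock()` shows that the task cannot obtain the lock while the stop path is waiting for that same task.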

Because the deadlock involves rtnl_lock, it stalls many other processes, not only those related to atlantic.

Fix it by introducing a mutex that protects each NIC's macsec related data, and locking it instead of the rtnl_lock from the service task and the threaded IRQ.
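The shape of the fix can be sketched in userspace with pthreads. This is an analogy under stated assumptions, not the driver code: only `macsec_mutex` is named in the description, while `fake_rtnl`, `struct demo_nic`, `demo_service_task` and `demo_nic_stop_cycle` are hypothetical stand-ins.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in for the kernel's global rtnl_lock (userspace analogy only). */
static pthread_mutex_t fake_rtnl = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical per-NIC state; the mutex protects macsec_updates only. */
struct demo_nic {
	pthread_mutex_t macsec_mutex;
	int macsec_updates;
	pthread_t service_task;
};

/* Stand-in for the service task: takes the per-NIC mutex, never fake_rtnl. */
static void *demo_service_task(void *arg)
{
	struct demo_nic *nic = arg;

	pthread_mutex_lock(&nic->macsec_mutex);
	nic->macsec_updates++;
	pthread_mutex_unlock(&nic->macsec_mutex);
	return NULL;
}

/*
 * Stand-in for the stop path: waits for the task while holding fake_rtnl.
 * The wait completes because the task no longer needs the global lock.
 */
static int demo_nic_stop_cycle(void)
{
	struct demo_nic nic = { .macsec_updates = 0 };
	int rc;

	rc = pthread_mutex_init(&nic.macsec_mutex, NULL);
	assert(rc == 0);

	pthread_mutex_lock(&fake_rtnl);
	rc = pthread_create(&nic.service_task, NULL, demo_service_task, &nic);
	assert(rc == 0);
	/* Stands in for the synchronous cancel in the stop path. */
	rc = pthread_join(nic.service_task, NULL);
	assert(rc == 0);
	pthread_mutex_unlock(&fake_rtnl);

	pthread_mutex_destroy(&nic.macsec_mutex);
	return nic.macsec_updates;
}
```

`demo_nic_stop_cycle()` returns the number of updates the task performed. The design point is that the lock taken by the task is private to one NIC, so a caller that holds the global lock can safely wait for the task.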

Before this patch, all macsec data was protected with rtnl_lock, but maybe not all of it needs to be protected. With this new mutex, further efforts can be made to limit the protected data to only that which requires it. However, it is probably not worth it, because all accesses to macsec data are infrequent, and almost all are done from macsec_ops or ethtool callbacks, which are called holding rtnl_lock, so macsec_mutex will never be heavily contended.

The issue appeared when repeatedly attaching and detaching the NIC to and from a bond interface. After this patch, I cannot reproduce the bug by doing that.

Signed-off-by: Íñigo Huguet ihuguet@redhat.com
