e1000 / e1000e randomly stop sending packets to VM with DPDK app in VM
I am experiencing problems with high data rates in a VM with emulated e1000 or e1000e NICs. I first saw this with libvirt and Proxmox, and then reproduced it in local tests with bare QEMU as well.
Someone else reported the same issue in the past and created a repository to reproduce it:
- https://lists.gnu.org/archive/html/qemu-devel/2019-12/msg05363.html
- https://github.com/BASM/qemu_dpdk_e1000_test
I modified it a bit and tried it with QEMU master. It was easy to reproduce with this setup (flood ping to the VM while a DPDK app that replies to ICMP/ARP runs in the VM).
The problem does not occur with real e1000 / e1000e NICs, so there seems to be a race condition somewhere within QEMU.
I tried to debug this stuff, but I am a bit lost.
I enabled all log messages in e1000.c (#define E1000_DEBUG, debugflags = ~0).
These are the last lines before the problem occurs:
e1000: set_ics 80, ICR 87, IMR 0
e1000: index 65: 0x5a15ef80 : 23100062 0
e1000: set_ics 2, ICR 87, IMR 0
e1000: set_ics 80, ICR 87, IMR 0
e1000: index 66: 0x5a15e640 : 23100062 0
e1000: set_ics 2, ICR 87, IMR 0
e1000: set_ics 80, ICR 87, IMR 0
e1000: index 67: 0x5a15dd00 : 23100062 0
e1000: set_ics 2, ICR 87, IMR 0
e1000: set_ics 80, ICR 87, IMR 0
e1000: index 68: 0x5a15d3c0 : 23100062 0
e1000: index 69: 0x5a15ca80 : 23100062 0
e1000: set_ics 2, ICR 87, IMR 0
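To make the set_ics lines easier to read, here is a minimal sketch that decodes the cause bits. It assumes the values are printed in hexadecimal and that the bit names match the E1000_ICR_* definitions as I remember them (TXDW, TXQE, LSC, RXT0 and so on); both assumptions should be checked against the register headers in the QEMU tree. The function names are made up for this example.

```python
import re

# Interrupt cause bits. Names follow the E1000_ICR_* definitions
# (assumption: verify against the e1000 register headers).
ICR_BITS = {
    0x01: "TXDW",    # transmit descriptor written back
    0x02: "TXQE",    # transmit queue empty
    0x04: "LSC",     # link status change
    0x08: "RXSEQ",   # receive sequence error
    0x10: "RXDMT0",  # receive descriptor minimum threshold
    0x40: "RXO",     # receiver overrun
    0x80: "RXT0",    # receiver timer interrupt
}

# Assumes the debug line format shown in the log above, values in hex.
SET_ICS = re.compile(r"set_ics ([0-9a-f]+), ICR ([0-9a-f]+), IMR ([0-9a-f]+)")


def decode_bits(value):
    """Return the names of all known bits set in value, lowest bit first."""
    return [name for bit, name in sorted(ICR_BITS.items()) if value & bit]


def decode_set_ics(line):
    """Decode one 'set_ics' log line, or return None for other lines."""
    m = SET_ICS.search(line)
    if m is None:
        return None
    cause, icr, imr = (int(g, 16) for g in m.groups())
    return {"cause": decode_bits(cause), "icr": decode_bits(icr), "imr": imr}


if __name__ == "__main__":
    print(decode_set_ics("e1000: set_ics 80, ICR 87, IMR 0"))
```

If these names are right, set_ics 80 is a receive cause (RXT0) and set_ics 2 is a transmit cause (TXQE). IMR 0 would mean all interrupts are masked, which fits a polling driver such as DPDK.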
The index differs each time the problem triggers, but the set_ics messages always appear in this order. Before the 'NIC crash', the order is always the same: set_ics 2, set_ics 80, index .... At the crash, it is always set_ics 2, set_ics 80, index, index, set_ics 2, and then it stops.
I also tried to debug using gdb. Before the crash, a breakpoint at e1000_mmio_write() is hit, but after the crash it is never reached. I have not found a good place to break where the packets are read from the host and then forwarded to the VM. The packets do arrive on the host side: when capturing with tcpdump on the tap device, I see them.
Restarting the DPDK app reinitializes the NIC, and packet processing works again.
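The pattern can also be checked mechanically on a longer log. The following is a rough sketch, not part of my actual setup: it counts how many "index" lines follow each "set_ics 80" line and reports the cycles where that count is not one. The helper names and the embedded sample (the last lines of the log above) are only for illustration, and the check assumes the log format shown there.

```python
# Tail of the debug log from above, used as sample input.
SAMPLE = """\
e1000: set_ics 80, ICR 87, IMR 0
e1000: index 65: 0x5a15ef80 : 23100062 0
e1000: set_ics 2, ICR 87, IMR 0
e1000: set_ics 80, ICR 87, IMR 0
e1000: index 66: 0x5a15e640 : 23100062 0
e1000: set_ics 2, ICR 87, IMR 0
e1000: set_ics 80, ICR 87, IMR 0
e1000: index 67: 0x5a15dd00 : 23100062 0
e1000: set_ics 2, ICR 87, IMR 0
e1000: set_ics 80, ICR 87, IMR 0
e1000: index 68: 0x5a15d3c0 : 23100062 0
e1000: index 69: 0x5a15ca80 : 23100062 0
e1000: set_ics 2, ICR 87, IMR 0
""".splitlines()


def classify(line):
    """Map a log line to 'ics:<value>', 'tx', or None for unrelated lines."""
    if "set_ics " in line:
        return "ics:" + line.split("set_ics ")[1].split(",")[0]
    if "index" in line:
        return "tx"
    return None


def tx_per_receive(lines):
    """Count the 'index' lines that follow each 'set_ics 80' line."""
    counts = []
    for line in lines:
        kind = classify(line)
        if kind == "ics:80":
            counts.append(0)      # a new cycle starts
        elif kind == "tx" and counts:
            counts[-1] += 1       # one more descriptor in this cycle
    return counts


def anomalies(lines):
    """Return the positions of cycles that do not contain exactly one 'index'."""
    return [i for i, n in enumerate(tx_per_receive(lines)) if n != 1]


if __name__ == "__main__":
    print(tx_per_receive(SAMPLE))
    print(anomalies(SAMPLE))
```

On the sample, only the last cycle is flagged, because it contains two "index" lines before the final set_ics 2.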
If someone gives me a good hint for further debugging, I could send more details. I could also share my changes to the github repo that I used to reproduce the issue.