Time wasted in poll() when network is busy, bandwidth 25% of what it should be.
I am using ocserv-1.1.1-1.fc33.x86_64 and attempting to do some bandwidth tests.
I only have 1G Ethernet at home, running 'iperf3 -u -b 1000M' between the boxes reports about 960Mb/s in either direction, while 'iftop' shows 930Mb/s during the test run.
I run ocserv (using the openconnect test-user-pass.config) on one box (AMD Ryzen 5 3400G), and openconnect on another (an older Haswell workstation, 'Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz'). The session uses (DTLS1.2)-(PSK)-(AES-128-GCM), but that doesn't seem very relevant, as the crypto doesn't show up in any of the perf traces.
Transmitting over the VPN from the client seems tolerable, although there is packet loss:
[ ID] Interval           Transfer     Bitrate         Jitter     Lost/Total Datagrams
[  5]   0.00-10.00  sec  1.05 GBytes   900 Mbits/sec  0.000 ms   0/833304 (0%)        sender
[  5]   0.00-10.02  sec   945 MBytes   792 Mbits/sec  63.651 ms  98758/833033 (12%)   receiver
iftop was reporting the same 930Mb/s when this test happened, and openconnect was saturating the network while using only about 60% of a CPU on the Haswell side. I think the packet loss is on the receive side, although ocserv-worker is only taking 60% of a CPU there too.
I'd investigate that, but there's a much bigger problem in the other direction, where I only get about a quarter of a gigabit of traffic:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  1.05 GBytes   900 Mbits/sec  0.000 ms  0/833605 (0%)         sender
[  5]   0.00-10.00  sec   282 MBytes   237 Mbits/sec  0.006 ms  612185/831590 (74%)   receiver
During that test run, the 'vpns0' interface reports 607088 dropped packets; those packets are being dropped because ocserv isn't reading them out of the /dev/net/tun device quickly enough. iftop confirms that only that much traffic is actually leaving the box, so the packets aren't being dropped on the client side. But ocserv-worker is only taking about 10% of a CPU; what's it doing?
14:00:55.230518 ppoll([{fd=15, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 4, {tv_sec=10, tv_nsec=0}, [], 8) = 1 ([{fd=5, revents=POLLIN}], left {tv_sec=9, tv_nsec=999998948})
14:00:55.230541 read(5, "E\0\5b0H@\0@\21\2013\300\250\1\1\300\250\1\276\24Q\273[\5N\370[\0/\3112"..., 1414) = 1378
14:00:55.230560 sendto(6, "\27\376\375\0\1\0\0\0\20\352\224\5{\0\1\0\0\0\20\352\224d\276\251g\236/\1\17|#\277"..., 1416, 0, NULL, 0) = 1416
14:00:55.230584 ppoll([{fd=15, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 4, {tv_sec=10, tv_nsec=0}, [], 8) = 1 ([{fd=5, revents=POLLIN}], left {tv_sec=9, tv_nsec=999999089})
14:00:55.230607 read(5, "E\0\5b0\337@\0@\21\200\234\300\250\1\1\300\250\1\276\24Q\273[\5N\360/\0/\3112"..., 1414) = 1378
14:00:55.230642 sendto(6, "\27\376\375\0\1\0\0\0\20\352\225\5{\0\1\0\0\0\20\352\225\32~~\246MIr\216\261\273<"..., 1416, 0, NULL, 0) = 1416
14:00:55.230666 ppoll([{fd=15, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 4, {tv_sec=10, tv_nsec=0}, [], 8) = 1 ([{fd=5, revents=POLLIN}], left {tv_sec=9, tv_nsec=999995642})
14:00:55.230705 read(5, "E\0\5b0\340@\0@\21\200\233\300\250\1\1\300\250\1\276\24Q\273[\5N\360$\0/\3112"..., 1414) = 1378
14:00:55.230831 sendto(6, "\27\376\375\0\1\0\0\0\20\352\226\5{\0\1\0\0\0\20\352\2262\331\203\232\3049\26\332t\2\375"..., 1416, 0, NULL, 0) = 1416
14:00:55.230864 ppoll([{fd=15, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 4, {tv_sec=10, tv_nsec=0}, [], 8^C) = 1 ([{fd=5, revents=POLLIN}], left {tv_sec=9, tv_nsec=999998087})
It's... polling! At 20-30µs per system call, it's using two system calls for every packet, then wasting a third on poll(). Even before we talk about using vhost-net to eliminate the read(), and sendmmsg() to reduce the time spent in sendto(), we should be able to get a long way just by eliminating the polling while the device is busy.
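A minimal sketch of that first fix, assuming the tun fd is opened O_NONBLOCK (this isn't ocserv's actual code, and dtls_send() is a hypothetical stand-in for its encrypt-and-sendto() path): keep reading until the device runs dry, and only then go back to ppoll().

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for ocserv's encrypt-and-sendto() path. */
extern void dtls_send(int dtls_fd, const void *pkt, size_t len);

static void drain_tun(int tun_fd, int dtls_fd)
{
    unsigned char pkt[2048];
    ssize_t len;

    /* tun_fd is O_NONBLOCK: keep reading until the device runs dry,
     * instead of bouncing through ppoll() for every single packet. */
    while ((len = read(tun_fd, pkt, sizeof(pkt))) > 0)
        dtls_send(dtls_fd, pkt, len);

    /* EAGAIN just means the queue is empty and it's finally worth
     * sleeping in ppoll() again; anything else is a real error. */
    if (len < 0 && errno != EAGAIN && errno != EINTR)
        perror("tun read");
}

That amortizes the ppoll() over a whole burst of packets instead of paying for it on every single one.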
The strace itself will have perturbed the timings a lot. Perf shows it taking 18% of its time in __x64_sys_ppoll(), but I think that's only 18% of its on-CPU time, and doesn't account for the time it then spends off the CPU because it was scheduled out.
+   69.98%     0.54%  ocserv-worker  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
-   68.74%     0.61%  ocserv-worker  [kernel.kallsyms]  [k] do_syscall_64
   - 68.13% do_syscall_64
      + 31.00% __x64_sys_sendto
      + 18.00% __x64_sys_ppoll
      + 17.35% ksys_read
      + 0.86% syscall_enter_from_user_mode
      + 0.67% __x64_sys_clock_nanosleep
In OpenConnect we use a model of "if we did anything this time around the loop, go around again without polling", which means that with vhost-net and sendmmsg() I can actually get to the point where strace shows just a bunch of sendmmsg() calls, and the occasional eventfd_write() to kick the vhost-net side if the tun ring gets empty and the consumer needs to be woken.
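That model is roughly the following shape; a sketch only, not OpenConnect's actual mainloop, with hypothetical tun_work()/dtls_work() handlers that drain their fd without blocking and return nonzero if they moved any packets.

#include <poll.h>

/* Hypothetical per-direction handlers: each drains its fd without
 * blocking and returns nonzero if it moved any packets. */
extern int tun_work(int tun_fd);
extern int dtls_work(int dtls_fd);

static void mainloop(int tun_fd, int dtls_fd)
{
    struct pollfd pfd[2] = {
        { .fd = tun_fd,  .events = POLLIN },
        { .fd = dtls_fd, .events = POLLIN },
    };
    int did_work;

    for (;;) {
        /* Run both directions every pass (no short-circuit), so one
         * busy direction can't starve the other. */
        did_work  = tun_work(tun_fd);
        did_work |= dtls_work(dtls_fd);

        /* If anything moved, more is probably already queued: go
         * around again and only block after a completely idle pass. */
        if (did_work)
            continue;

        poll(pfd, 2, -1);
    }
}

Under sustained load the poll() never runs at all; it only costs anything once the link goes quiet.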
On a Skylake VM with 10G Ethernet I can move 2.75Gb/s of traffic from OpenConnect, and there's still some fairly low-hanging fruit left. But it's hard to find anything to test against when the kernel's ESP implementation manages less than half that, and ocserv is an order of magnitude behind ;)