tcp: enforce receive buffer memory limits by allowing the tcp window to shrink

JIRA: https://issues.redhat.com/browse/RHEL-11592

commit b650d953cd391595e536153ce30b4aab385643ac
Author: mfreemon@cloudflare.com <mfreemon@cloudflare.com>
Date: Sun Jun 11 22:05:24 2023 -0500

tcp: enforce receive buffer memory limits by allowing the tcp window to shrink  

Under certain circumstances, the tcp receive buffer memory limit  
set by autotuning (sk_rcvbuf) is increased due to incoming data  
packets as a result of the window not closing when it should be.  
This can result in the receive buffer growing all the way up to  
tcp_rmem[2], even for tcp sessions with a low BDP.  

To reproduce:  Connect a TCP session with the receiver doing  
nothing and the sender sending small packets (an infinite loop  
of socket send() with 4 bytes of payload with a sleep of 1 ms  
in between each send()).  This will cause the tcp receive buffer  
to grow all the way up to tcp_rmem[2].  
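
The reproduction steps above can be sketched roughly as follows. This is a minimal illustration, not the author's test program: the helper names `idle_receiver` and `send_small_packets`, the loopback address, the `count` parameter, and the use of `TCP_NODELAY` (so that each small send() is not coalesced) are all assumptions added for the example.

```python
import socket
import time


def idle_receiver(host="127.0.0.1", port=0):
    """Hypothetical helper: a listening socket whose peer data is never read."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, port))  # port 0 lets the kernel pick a free port
    srv.listen(1)
    return srv


def send_small_packets(host, port, count=None, payload=b"abcd", interval=0.001):
    """Send a 4-byte payload every 1 ms; loops forever when count is None."""
    sent = 0
    iterations = 0
    with socket.create_connection((host, port)) as sock:
        # Assumption: disable Nagle so every send() leaves as its own segment.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        while count is None or iterations < count:
            sock.sendall(payload)
            sent += len(payload)
            iterations += 1
            time.sleep(interval)
    return sent


if __name__ == "__main__":
    srv = idle_receiver()
    port = srv.getsockname()[1]
    # The receiver never calls recv(); data accumulates in the kernel buffer.
    print(send_small_packets("127.0.0.1", port, count=1000))
    srv.close()
```

While such a sender runs, the receive-side memory of the socket can be watched with a tool such as `ss -tmi` on the receiving host.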

As a result, a host can have individual tcp sessions with receive  
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem  
limits, causing the host to go into tcp memory pressure mode.  

The fundamental issue is the relationship between the granularity  
of the window scaling factor and the number of bytes ACKed back  
to the sender.  This problem has previously been identified in  
RFC 7323, appendix F [1].  
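
The arithmetic behind this can be illustrated with a deliberately simplified model. It is a sketch, not the kernel's code: it assumes a window scale of 7 (128-byte granularity), an initial offered window of 1280 bytes, a receiver that has no memory left (it would like to offer 0), and a "never shrink" rule that rounds the remaining window up to the scaling granularity. All function names are hypothetical.

```python
def align_up(value, granularity):
    """Round value up to the next multiple of granularity."""
    return -(-value // granularity) * granularity


def next_window(cur_win, wanted_win, wscale, allow_shrink):
    """Pick the next advertised window in this simplified model."""
    granularity = 1 << wscale
    if wanted_win >= cur_win or allow_shrink:
        # Offer what memory permits, expressed in window-scale units.
        return wanted_win // granularity * granularity
    # Never shrink: keep the old right edge, rounded UP to the granularity.
    return align_up(cur_win, granularity)


def right_edge_after(packets, payload, window, wscale, allow_shrink, wanted_win=0):
    """Return the right window edge after the sender has sent what it may."""
    rcv_nxt = 0
    for _ in range(packets):
        if window < payload:
            break  # window closed: the sender has to stop and probe
        rcv_nxt += payload
        cur_win = window - payload  # what is left of the previous offer
        window = next_window(cur_win, wanted_win, wscale, allow_shrink)
    return rcv_nxt + window


# 1000 segments of 4 bytes each, window scale 7, receiver out of memory.
print(right_edge_after(1000, 4, 1280, 7, allow_shrink=False))  # 5280
print(right_edge_after(1000, 4, 1280, 7, allow_shrink=True))   # 4
```

In the non-shrinking case each 4-byte segment leaves 1276 bytes of the old window, which rounds back up to 1280, so the right edge creeps forward by 4 bytes per ACK and the window never closes. When shrinking is allowed, the model closes the window after the first segment.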

The Linux kernel currently adheres to never shrinking the window.  

In addition to the overallocation of memory mentioned above, the  
current behavior is functionally incorrect, because once tcp_rmem[2]  
is reached and no remediations remain (i.e. tcp collapse fails to  
free up any more memory and there are no packets to prune from the  
out-of-order queue), the receiver will drop in-window packets  
resulting in retransmissions and an eventual timeout of the tcp  
session.  A receive buffer full condition should instead result  
in a zero window and an indefinite wait.  

In practice, this problem is largely hidden for most flows.  It  
is not applicable to mice flows.  Elephant flows can send data  
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),  
triggering a zero window.  

But this problem does show up for other types of flows.  Examples  
are websockets and other types of flows that send small amounts of  
data spaced apart slightly in time.  In these cases, we directly  
encounter the problem described in [1].  

RFC 7323, section 2.4 [2], says there are instances when a retracted  
window can be offered, and that TCP implementations MUST ensure  
that they handle a shrinking window, as specified in RFC 1122,  
section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window  
management have made clear that the sender must accept a shrunk window  
from the receiver, including RFC 793 [4] and RFC 1323 [5].  

This patch implements the functionality to shrink the tcp window  
when necessary to keep the right edge within the memory limit set by  
autotuning (sk_rcvbuf).  This new functionality is enabled with  
the new sysctl: net.ipv4.tcp_shrink_window  
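
As a small sketch of how the setting might be inspected on a running host: sysctls under net.ipv4 are exposed as files under /proc/sys/net/ipv4/, so the new knob can be read from there. The helper name `read_tcp_shrink_window` is hypothetical, and the interpretation that a nonzero value means "enabled" is an assumption of this example.

```python
from pathlib import Path

# procfs location corresponding to the sysctl net.ipv4.tcp_shrink_window
DEFAULT_PATH = "/proc/sys/net/ipv4/tcp_shrink_window"


def read_tcp_shrink_window(path=DEFAULT_PATH):
    """Return the sysctl value as an int, or None if the kernel lacks it."""
    sysctl_file = Path(path)
    if not sysctl_file.exists():
        return None  # kernel without this patch, or not a Linux host
    return int(sysctl_file.read_text().strip())


if __name__ == "__main__":
    value = read_tcp_shrink_window()
    if value is None:
        print("net.ipv4.tcp_shrink_window is not available on this kernel")
    else:
        # Assumption: nonzero means window shrinking is enabled.
        print("window shrinking enabled" if value else "window shrinking disabled")
```

Changing the value requires root, for example `sysctl -w net.ipv4.tcp_shrink_window=1`.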

Additional information can be found at:  
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/  

[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F  
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4  
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91  
[4] https://www.rfc-editor.org/rfc/rfc793  
[5] https://www.rfc-editor.org/rfc/rfc1323  

Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>  
Reviewed-by: Eric Dumazet <edumazet@google.com>  
Signed-off-by: David S. Miller <davem@davemloft.net>  

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
