Why is HAProxy often sending TCP RESETs instead of FINs to close connections to its backend nodes?
Problem (maybe)
HAProxy is often abruptly closing connections to backend nodes using TCP RESET rather than a graceful TCP FIN exchange. This may be benign or it may be a symptom of distress.
We know the mechanism but not the cause.
Let's discover why haproxy is doing that and determine whether or not it's a practical concern.
For reference, the mechanism is that under certain conditions (to be determined), haproxy explicitly sets a socket option that forces the kernel to abruptly close the connection rather than asynchronously close it cleanly with a TCP FIN exchange. See background notes for more details.
Background
While reviewing this merge request gitlab-cookbooks/gitlab-exporters!174 (comment 363100784) to start collecting the kernel counter for outgoing TCP RESET packets, I discovered that HAProxy is sending numerous TCP RESETs to the members of its backend pools (i.e. the web-XX, api-XX, and git-XX hosts that run our main GitLab Rails application).
This behavior may be a benign tactic by haproxy to proactively conserve resources (e.g. avoid port starvation, reduce memory overhead, etc.).
Or it may be a symptom of trouble that warrants further investigation.
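For context on the counter mentioned above, here is a rough sketch of how an outgoing-RESET count can be read on Linux. It assumes the counter in question is the OutRsts field of the Tcp: lines in /proc/net/snmp; the merge request may collect it differently. The function name read_tcp_out_rsts and the helper next_token are made up for this example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: skip whitespace and return the next token in s,
 * storing its length in *len. Returns NULL when no token is left. */
static const char *next_token(const char *s, size_t *len)
{
    s += strspn(s, " \t\r\n");
    *len = strcspn(s, " \t\r\n");
    return *len ? s : NULL;
}

/* Hypothetical reader for the /proc/net/snmp layout: a "Tcp:" line of
 * field names followed by a "Tcp:" line of values. Stores the OutRsts
 * value in *out and returns 0, or returns -1 if it cannot be found. */
int read_tcp_out_rsts(FILE *fp, unsigned long long *out)
{
    char header[2048], values[2048];

    while (fgets(header, sizeof(header), fp)) {
        const char *h = header, *v = values;
        size_t hl, vl;

        if (strncmp(header, "Tcp:", 4) != 0)
            continue;
        /* The values line is expected right after the names line. */
        if (!fgets(values, sizeof(values), fp) ||
            strncmp(values, "Tcp:", 4) != 0)
            return -1;

        /* Walk both lines in lockstep so names and values stay aligned. */
        while ((h = next_token(h, &hl)) && (v = next_token(v, &vl))) {
            if (hl == 7 && strncmp(h, "OutRsts", 7) == 0) {
                *out = strtoull(v, NULL, 10);
                return 0;
            }
            h += hl;
            v += vl;
        }
        return -1;
    }
    return -1;
}
```

Calling this on fopen("/proc/net/snmp", "r") at two points in time and subtracting the values would give a host-wide count of RESETs sent in between. Note that the counter covers every TCP socket on the host, not only HAProxy's backend connections.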
These notes in the merge request summarize the findings so far -- that haproxy induces the kernel to use RESET instead of FIN when closing a connection by enabling the socket option SO_LINGER and setting it to time out immediately.
gitlab-cookbooks/gitlab-exporters!174 (comment 363100784)
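To make the mechanism concrete, here is a minimal sketch (not HAProxy code) of an abortive close. The helper names close_with_reset and observe_peer_close are invented for illustration; only the setsockopt call with SO_LINGER mirrors what the notes describe. It assumes a Linux host where loopback TCP sockets are usable.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: enable SO_LINGER with a zero timeout, then close.
 * On Linux this makes close() drop the connection with a RST rather
 * than starting a FIN exchange. Returns 0 on success, -1 on error. */
int close_with_reset(int fd)
{
    struct linger nolinger = { .l_onoff = 1, .l_linger = 0 };
    int rc;

    assert(fd >= 0);
    rc = setsockopt(fd, SOL_SOCKET, SO_LINGER, &nolinger, sizeof(nolinger));
    if (close(fd) < 0)
        return -1;
    return rc;
}

/* Hypothetical demo: connect to a loopback listener, close the client
 * side (abortively or normally), and report what the accepted side sees.
 * Returns the errno from recv(), 0 on a clean EOF, or -1 on setup failure. */
int observe_peer_close(int abortive)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    char buf[1];
    ssize_t n;
    int result = -1;
    int sfd = -1;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int cfd = socket(AF_INET, SOCK_STREAM, 0);

    if (lfd < 0 || cfd < 0)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0)
        goto out;

    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    sfd = accept(lfd, NULL, NULL);
    if (sfd < 0)
        goto out;

    if (abortive)
        close_with_reset(cfd);
    else
        close(cfd);
    cfd = -1; /* already closed above */

    /* Blocks until the peer's FIN or RST arrives. */
    n = recv(sfd, buf, sizeof(buf), 0);
    result = (n < 0) ? errno : 0;

out:
    if (sfd >= 0)
        close(sfd);
    if (cfd >= 0)
        close(cfd);
    if (lfd >= 0)
        close(lfd);
    return result;
}
```

On Linux, observe_peer_close(1) should report ECONNRESET on the accepted side, while observe_peer_close(0) should report 0 (a clean end-of-stream after the FIN). That difference is what the backend nodes would see when haproxy applies this tactic.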
Next steps
Find what conditions cause haproxy to use this tactic.
- In the haproxy source code, src/fd.c implements a cache of socket file descriptors.
- Function fd_dodelete sets the SO_LINGER option with a 0-second timeout (by passing a struct called nolinger) only if that file descriptor has the linger_risk flag set.
- Find what sets that linger_risk flag.
For reference, here is the fd_dodelete function in src/fd.c of haproxy v1.8.0:
...
struct fdtab *fdtab = NULL; /* array of all the file descriptors */
...
/* Deletes an FD from the fdsets, and recomputes the maxfd limit.
 * The file descriptor is also closed.
 */
static void fd_dodelete(int fd, int do_close)
{
    HA_SPIN_LOCK(FD_LOCK, &fdtab[fd].lock);
    if (fdtab[fd].linger_risk) {
        /* this is generally set when connecting to servers */
        setsockopt(fd, SOL_SOCKET, SO_LINGER,
                   (struct linger *) &nolinger, sizeof(struct linger));
    }
...