Investigate HTTP 499 responses from Cloudflare to clients
Problem statement
Roughly 1 million times per day, Cloudflare is responding to clients with the (unintuitive) HTTP response code 499 "Client Closed Request". At least some (maybe all) of these requests are slow enough to suggest timeout behavior (e.g. durations of 15 seconds, 20 seconds, etc.).
Some of these events correlate with intermittently failing attempts by CI Runners to upload job trace output to https://gitlab.com/api/v4/jobs/XXXX/trace. Other HTTP endpoints (including web and API endpoints) can also be affected.
This may be a symptom of intermittent connectivity delay or failure between Cloudflare and origin, resulting in slowness or errors from the client's perspective. If this inferred bottleneck occurs before requests reach workhorse and rails, it may not be detected by our standard response-time and saturation metrics, so this may be a hint that we have an undiagnosed bottleneck.
Background
Recently we investigated several reports of CI Runners intermittently failing to upload job trace output to the GitLab.com API endpoint. The following comments summarize the findings so far:
- gitlab-org/gitlab#220041 (comment 354064702) -- Cloudflare logs showed the example request arriving, but HAProxy and Workhorse logs did not. Cloudflare's log shows that it waited 15 seconds before sending the HTTP 499 response to its client (CI Runner).
- gitlab-org/gitlab#220041 (comment 355768731) -- Additional examples of failed CI Runner job trace uploads exhibit the same pattern as the 1st example described above, suggesting this is a consistent pattern.
The Cloudflare Analytics dashboard shows that it frequently sends HTTP 499 responses (1-2 million per day on weekdays and 0.5 million per day on weekends).
Not all of these HTTP 499 responses necessarily share the same cause. However, an initial review shows that many of them do exhibit very long durations: `EdgeEndTimestamp` minus `EdgeStartTimestamp` is multi-modal with local spikes at 15 seconds, 20 seconds, etc. Many but not all of these slow HTTP 499 responses do result from requests to the job trace upload endpoint (`/api/v4/jobs/[job_id]/trace`).
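As a concrete way to reproduce that duration distribution, it can be rebuilt from a Cloudflare Logpush export of the HTTP requests dataset. The sketch below is illustrative only: it assumes an NDJSON export containing the standard `EdgeStartTimestamp`, `EdgeEndTimestamp`, `EdgeResponseStatus`, and `ClientRequestURI` fields, assumes the timestamps were exported as unix nanoseconds (the encoding depends on the Logpush job settings), and uses a placeholder file name.

```python
# Sketch: bucket Cloudflare HTTP 499 responses by duration, from a Logpush
# NDJSON export. Assumes timestamps are unix nanoseconds and that the file
# "cloudflare_http_requests.ndjson" exists; adjust both to match the export.
import json
from collections import Counter

durations = Counter()
trace_uploads = Counter()

with open("cloudflare_http_requests.ndjson") as f:
    for line in f:
        event = json.loads(line)
        if event.get("EdgeResponseStatus") != 499:
            continue
        # Duration as seen at the edge, in seconds.
        seconds = (event["EdgeEndTimestamp"] - event["EdgeStartTimestamp"]) / 1e9
        bucket = round(seconds)  # 1-second buckets to expose the modes (15s, 20s, ...)
        durations[bucket] += 1
        if "/trace" in event.get("ClientRequestURI", ""):
            trace_uploads[bucket] += 1

for bucket, count in sorted(durations.items()):
    print(f"{bucket:>4}s  total={count:<8} job-trace-uploads={trace_uploads[bucket]}")
```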
Anecdotally, people started noticing failed CI job trace output uploads a few weeks ago, but Cloudflare's HTTP 499 response rate has been high for longer than that. This may mean the problem went unnoticed for a while, or it may mean the CI job trace uploads somehow became more prone to the problem than before, or the HTTP 499 response rate may be a false lead.
Tentative interpretation and potential remedy
This is far from proven, but as a working hypothesis...
When Cloudflare sends its client an HTTP 499 response after a delay of many seconds, this may indicate that Cloudflare's connection to origin has in some way failed or become unresponsive, and after reaching some timeout or retry threshold, Cloudflare gives up talking to origin and responds to its client with HTTP 499 "Client Closed Request".
Note: We know from CI Runner logs that Cloudflare really does send an HTTP 499 response; it's not just a log event. Cloudflare's documentation does not explain what conditions would cause it to send HTTP 499 to its caller, other than having received that response from origin. But because Cloudflare's logs of HTTP 499 events consistently have no values for the `Origin*` fields, I'm guessing that when Cloudflare says "Client Closed Request", it may be referring to itself as the client to our origin. However, Cloudflare typically uses other, more distinctive response codes (HTTP 520-527) to describe trouble talking to origin, so I'm not sure why it would ever choose to use HTTP 499. But lacking a better explanation, this is my current best guess.
If that interpretation is correct, then one potential explanation for stalled API calls is haproxy itself starving for TCP client ports on the loopback interface where it proxies back to itself:
```
backend api_rate_limit
    option splice-auto
    mode http
    server localhost 127.0.0.1:4444 send-proxy
```
Note that because this configuration is specific to API calls, it would not explain Cloudflare's HTTP 499 responses to non-API requests. So this hypothesis can at most explain part of the problem.
These localhost connections from `backend api_rate_limit` are very short-lived and occur very frequently, so they have to aggressively reuse TCP client ports. Normally the kernel imposes a cool-down period (TIME_WAIT) before a client port can be reused to connect again to the same remote endpoint, which can lead to port starvation. In this hypothetical scenario, Cloudflare would successfully establish a TCP and SSL connection to HAProxy's `https` frontend, and HAProxy would receive the HTTP request, but HAProxy would then stall when routing that request to its `api_rate_limit` backend because it cannot open a new TCP connection back to itself on port 4444. So from Cloudflare's perspective, origin accepted the request and is just slow to respond; but from the perspective of our rails app, workhorse, and even haproxy's `api` backend, the request never arrived, because it is stuck in haproxy's `backend api_rate_limit` waiting to open a TCP connection via loopback to `frontend api_rate_limit`.
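To put rough numbers on that scenario: all of these loopback connections compete for one pool of ephemeral client ports toward the single destination 127.0.0.1:4444. The back-of-envelope sketch below uses common Linux defaults that I have not verified against the fe hosts' sysctls or haproxy's source-port handling, so treat it as illustrative only.

```python
# Back-of-envelope sketch of loopback client-port exhaustion, using common
# Linux defaults. Real limits depend on net.ipv4.ip_local_port_range,
# tcp_tw_reuse, and how haproxy binds source ports, so this is illustrative.

port_range = 60999 - 32768 + 1   # default net.ipv4.ip_local_port_range (~28k ports)
time_wait_seconds = 60           # Linux TCP_TIMEWAIT_LEN (fixed at 60s)

# Every connection shares the same destination (127.0.0.1:4444), so the only
# variable in the 4-tuple is the client port. If each closed connection parks
# its port in TIME_WAIT for 60s, the sustainable rate of new connections is:
max_new_connections_per_second = port_range / time_wait_seconds
print(f"~{max_new_connections_per_second:.0f} new loopback connections/sec "
      f"before client ports run out")   # ~470/sec
```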
On a related note, I recently discovered that haproxy may try to avoid port starvation under certain conditions by asking the kernel to skip the port-reuse cool-down period (by closing connections via TCP RESET instead of exchanging TCP FINs), but I'm not yet sure when haproxy invokes that tactic or how effective it is.
When this idea of a TCP-layer bottleneck first occurred to me here and later here, I initially was thinking of SYN retries as being another symptom of the port starvation behavior, but that's not quite right. If there are no available client ports, the connection attempt stalls before even sending the initial SYN. The SYN retransmit counter indicates a different kind of stall, where the listener has not promptly accepted the connection request. So the fact that SYN retries are occasionally occurring on the loopback to port 4444 is one definite indication that our `api_rate_limit` backend is sometimes struggling at the TCP layer, and the hypothetical (untested) port starvation scenario described above would be an independent TCP-layer bottleneck.
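One way to check for both symptoms directly on an fe host would be to watch the loopback sockets targeting 127.0.0.1:4444 and see whether they pile up in TIME_WAIT (port-reuse cool-down pressure) or linger in SYN_SENT (connects the listener has not yet accepted). The sketch below just reads /proc/net/tcp; the address and port come from the haproxy config quoted above, and the same counts are available from ss.

```python
# Sketch: count loopback sockets to 127.0.0.1:4444 by TCP state, by reading
# /proc/net/tcp. Run it periodically (or in a watch loop) to see whether
# TIME_WAIT accumulates or SYN_SENT entries linger.
from collections import Counter

TCP_STATES = {"01": "ESTABLISHED", "02": "SYN_SENT", "06": "TIME_WAIT"}
TARGET = "0100007F:115C"  # 127.0.0.1:4444 in /proc/net/tcp's little-endian hex encoding

def count_states(path="/proc/net/tcp"):
    counts = Counter()
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            remote, state = fields[2], fields[3]
            if remote == TARGET:
                counts[TCP_STATES.get(state, state)] += 1
    return counts

if __name__ == "__main__":
    for state, n in count_states().most_common():
        print(f"{state:12} {n}")
```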
Why is this happening now?
When we recently reduced the number of `fe-XX` hosts (i.e. when we upgraded their machine type to get faster CPUs), we implicitly concentrated the same total TCP connection rate onto fewer TCP stacks. That would amplify any TCP resource contention and may have nudged haproxy closer to a tipping point.
Possible remedies
If the above interpretation is correct that a TCP-layer bottleneck on the haproxy hosts is causing intermittent slowness and errors, then the following remedies may help.
As a tentative short-term remedy, re-adding some `fe-XX` hosts to the pool should reduce the per-host connection rate, plausibly reducing the risk of both port starvation and SYN retries.
As a long-term remedy, we could rewrite our haproxy config for the API rate limiting logic to stop making frequent short-lived loopback connections to itself. One way to do this is to merge the `frontend api_rate_limit` stanza into the `frontend https` stanza so we can remove the proxying `backend api_rate_limit` stanza. Another option is to try @igorwwwwwwwwwwwwwwwwwwww's suggestion to let `backend api_rate_limit` use long-lived TCP connections to `frontend api_rate_limit`. This would reduce TCP connection churn with a smaller change to the haproxy config.