Object Storage should have a strict TTFB timeout
We had/have an outage related to GCS used by GitLab Pages:
## TTFB
We clearly see that TTFB skyrocketed once we disabled NFS. In regular operation we expect around 25ms; during the GCS outage we were seeing around 20s; but once we disabled NFS, it appears that some requests had a TTFB close to "a few minutes".
## Connections
This created a huge backlog of ~6k open connections to GCS, because our per-request timeout is 30 minutes.
## Summary
This outage highlighted some aspects to improve in how Pages handles timeouts, and which timeouts it should use to ensure sane service recovery and to reduce amplification pressure when an outage happens elsewhere.
## Proposal
Define the TTFB timeout to be no longer than 15s. This should ensure that only a "sane amount" of requests is open at a time, and that the system can fast-reject instead of hanging.
```go
var httpClient = &http.Client{
	// The longest time the request can take to execute
	Timeout: 30 * time.Minute,
	Transport: httptransport.NewTransportWithMetrics(
		"httprange_client",
		metrics.HTTPRangeTraceDuration,
		metrics.HTTPRangeRequestDuration,
		metrics.HTTPRangeRequestsTotal,
	),
}
```
We should configure the following `http.Transport` fields:

- `TLSHandshakeTimeout time.Duration`, i.e. the limit before being able to write the request
- `ResponseHeaderTimeout time.Duration`, i.e. close to TTFB
- `ExpectContinueTimeout time.Duration`, i.e. similar to TTFB
Or, implement this differently as part of the `RoundTrip` of `httprange`, to define a timeout for receiving the response (but not for fully reading it).
More here: https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/