Object Storage should have a strict TTFB timeout
We had/have an outage related to GCS used by GitLab Pages:
## TTFB
We clearly see that TTFB skyrocketed once we disabled NFS. In regular operation we expect around 25ms; during the GCS outage we were seeing around 20s; but once we disabled NFS, it appears that some requests had a TTFB close to "a few minutes".
## Connections
This created a huge backlog of ~6k open connections to GCS, because our per-request timeout is 30 minutes.
## Summary
This outage highlighted some aspects to improve in how Pages handles timeouts, and which timeouts it should use to ensure sane service recovery and to reduce amplification pressure when an outage happens elsewhere.
## Proposal
Define the TTFB timeout to be no longer than 15s. This should ensure that only a "sane amount" of requests is open at a time, and that the system can fast-reject instead of hanging.
```go
var httpClient = &http.Client{
	// The longest time the request can take to execute
	Timeout: 30 * time.Minute,
	Transport: httptransport.NewTransportWithMetrics(
		"httprange_client",
		metrics.HTTPRangeTraceDuration,
		metrics.HTTPRangeRequestDuration,
		metrics.HTTPRangeRequestsTotal,
	),
}
```
We should configure the following `http.Transport` fields:

- `TLSHandshakeTimeout time.Duration`, i.e. the limit before being able to write the request
- `ResponseHeaderTimeout time.Duration`, i.e. close to TTFB
- `ExpectContinueTimeout time.Duration`, i.e. similar to TTFB
Or, implement this differently as part of the `RoundTrip` of `httprange`, to define a timeout for receiving the response (but not for fully reading it).
More here: https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/