gitlab.net zone has elevated HTTP 5xx error rate
TLDR
We think that (a) the planned downtime for Kibana and (b) the saturation of the new release.gitlab.net domain combined to push the error rate above the alerting threshold for the whole gitlab.net Cloudflare zone.
The degradation of both of those services was known prior to this alert triggering, but we did not realize that those two services being down would trigger this alert.
Summary
The gitlab.net zone in Cloudflare includes many subdomains, and this alert triggered because they collectively exceeded our configured error-rate threshold. We think the following two sources of errors were probably the main contributing factors to elevating the error rate:
- `log.gprd.gitlab.net` (our production Kibana frontend) was down for maintenance.
- `release.gitlab.net` (a new site being hit by QA) needed more capacity.
Kibana is now back up, and capacity has been added to `release.gitlab.net`. The rate of HTTP 522 errors from Cloudflare's perspective stopped right after we grew the instance backing `release.gitlab.net`.
We did not ingest the Cloudflare logs to directly determine which subdomains and request paths were resulting in HTTP response codes that are considered errors (e.g. HTTP 400-599).
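If we do export and ingest those logs in the future, a simple tally like the following could identify the offending subdomains and paths. This is a minimal sketch, not something we ran during this incident: it assumes an NDJSON export with Cloudflare-style field names (`ClientRequestHost`, `ClientRequestURI`, `EdgeResponseStatus`), and the file path is made up.

```python
# Minimal sketch: tally Cloudflare HTTP request logs by subdomain/path/status.
# Assumes NDJSON records with ClientRequestHost, ClientRequestURI, and
# EdgeResponseStatus fields; adjust to whatever fields the actual export has.
import json
from collections import Counter

error_counts = Counter()

with open("gitlab_net_zone_logs.ndjson") as f:  # hypothetical export file
    for line in f:
        record = json.loads(line)
        status = int(record["EdgeResponseStatus"])
        if 400 <= status <= 599:  # the "error" convention assumed in this report
            key = (record["ClientRequestHost"], record["ClientRequestURI"], status)
            error_counts[key] += 1

# Show the top contributors to the zone-wide error rate.
for (host, path, status), count in error_counts.most_common(20):
    print(f"{count:8d}  HTTP {status}  {host}{path}")
```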
PagerDuty alert: https://gitlab.pagerduty.com/incidents/P5GSFVN
Grafana dashboard: The multi-window SLO dashboard shows the composite metric underlying the alert, as well as the underlying burn-rate metrics. Essentially, it indicates that our 5-minute and 30-minute error rates exceeded their alerting thresholds.
Timeline
All times UTC.
2020-06-30
- 17:24 - 1st PagerDuty alert: https://gitlab.pagerduty.com/incidents/P7GTONF
- 17:29 - 1st PagerDuty alert self-resolved.
- 17:35 - Found that Cloudflare's `gitlab.net` zone started showing spikes in the rates of HTTP 522 "Origin Connection Time-out", HTTP 524 "Origin Time-out", and HTTP 520 "Origin Error". Because the alert had resolved, EOC shifted focus back to other unrelated alerts.
- 17:41 - 2nd PagerDuty alert: https://gitlab.pagerduty.com/incidents/P5GSFVN
- 17:51 - @msmiley declares incident in Slack using the `/incident declare` command.
- 17:54 - Incident zoom with @skarbek and @AnthonySandoval. We pooled our situational knowledge and determined that the Kibana downtime and the saturation of `release.gitlab.net` were probably both contributing to the collective error rate for `*.gitlab.net`.
- 18:10 - Rate of HTTP 522 "Origin Connection Time-out" errors drops to 0. This correlates well with @skarbek growing the VM backing the new `release.gitlab.net`, which QA is using for testing: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1881. At this point, we considered the incident resolved.
- 18:41 - 2nd PagerDuty alert resolved. The 31-minute lag before the alert self-resolved was presumably due to the 30-minute burn rate taking 30 minutes for its window to scroll past the trailing edge of the errors-per-minute histogram (see the sketch after this timeline).
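To illustrate that lag, here is a small, self-contained sketch that models only the 30-minute window with synthetic numbers (the traffic volume, error counts, and burst shape are made up for illustration, not taken from the real metric). When the error ratio during the burst sits well above the 0.60% threshold, the rolling window has to scroll almost entirely past the burst before it clears, which is consistent with the ~31-minute lag we observed.

```python
# Sketch of the 30-minute burn-rate lag using synthetic per-minute data.
# An error burst ends at minute 20, yet the 30-minute rolling error ratio
# only falls back under the threshold much later, because the burst stays
# inside the trailing window until the window scrolls past it.
THRESHOLD = 0.006            # 0.60% error ratio, as used by the 30m/6h window
REQUESTS_PER_MINUTE = 10_000  # assumed constant traffic, for simplicity

# Minutes 0-19: 20% errors (far above threshold); afterwards: zero errors.
errors_per_minute = [2000] * 20 + [0] * 60

def rolling_error_ratio(series, minute, window=30):
    """Mean error ratio over the trailing `window` minutes ending at `minute`."""
    start = max(0, minute - window + 1)
    window_errors = sum(series[start : minute + 1])
    window_requests = REQUESTS_PER_MINUTE * (minute - start + 1)
    return window_errors / window_requests

for minute in range(len(errors_per_minute)):
    ratio = rolling_error_ratio(errors_per_minute, minute)
    if minute >= 20 and ratio <= THRESHOLD:
        print(f"Rolling 30m ratio drops below 0.60% at minute {minute}, "
              f"{minute - 20} minutes after the errors stopped.")
        break
```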
Graphs
The composite metric that triggered the alert
This graph shows a composite metric defined here that essentially alerts when either:
- the 5-minute and 1-hour mean error rate violates SLO for at least 2 minutes, or
- the 30-minute and 6-hour mean error rate violates SLO for at least 2 minutes
Grafana dashboard for this event: https://dashboards.gitlab.net/d/alerts-component_multiburn_error/alerts-component-multi-window-multi-burn-rate-out-of-slo?orgId=1&from=1593535200000&to=1593542400000&var-environment=gprd&var-type=waf&var-stage=main&var-component=gitlab_net_zone
This graph shows the whole duration of the event. This style of alert is sadly not easy to interpret unless you have prior experience with it. In a pinch, the important aspects are:
- Scope: The `type`, `component`, and `stage` labels are mentioned in the alert text and shown as filters on the dashboard. In this case:
  - `type=waf` indicates Cloudflare (which includes a web application firewall among its features).
  - `component=gitlab_net_zone` indicates Cloudflare's `gitlab.net` zone (as opposed to its `gitlab.com` zone). Importantly, this clue tells us that gitlab.com is out of scope for the alert.
  - `stage=main` indicates the alert is not talking about canary (which in this context has no meaning, but for different types of service can be a useful differentiator).
- Severity:
- The alert only triggers when both a short duration and a medium duration burn rate are concurrently exceeding the defined alerting threshold. The graph shows this pair of burn rates together with the alerting threshold, so we can compare them. This is the "multi-burn-rate" aspect of the graph.
- Also, we want more than one time scale for this multi-burn-rate behavior. So we repeat that pattern with a second pair of burn-rates (and a different alerting threshold). This is the "multi-window" aspect of the graph.
- The first window (green lines in the graph) alerts when both the 5-minute and 1-hour burn rates (i.e. mean error rate) exceed 1.44% of the overall request rate.
- The second window (blue lines in the graph) alerts when both the 30-minute and 6-hour burn rates (i.e. mean error rates) exceed 0.60% of the overall request rate. (A sketch of how these conditions combine follows this list.)
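For readers without prior exposure to this alert style, here is a minimal sketch of the evaluation logic. The error ratios would come from the Prometheus recording rules behind the dashboard; this function only shows how the two window pairs and their thresholds combine, and the numbers in the example call are invented.

```python
# Minimal sketch of the multi-window, multi-burn-rate condition behind this
# alert. Each "burn rate" here is simply the mean error ratio (errors /
# requests) over the named window; the real values come from Prometheus
# recording rules, not from this code.
FAST_THRESHOLD = 0.0144   # 1.44% of requests, for the 5m/1h window pair
SLOW_THRESHOLD = 0.0060   # 0.60% of requests, for the 30m/6h window pair

def out_of_slo(error_ratio_5m, error_ratio_1h,
               error_ratio_30m, error_ratio_6h):
    """Return True if either window pair is burning error budget too fast.

    Pairing a short window with a long one means a brief spike alone
    (5m high, 1h still low) does not page, and neither does stale history
    (1h high, 5m already recovered). In the real rule, the condition must
    also hold for at least 2 minutes before the alert fires.
    """
    fast_pair = error_ratio_5m > FAST_THRESHOLD and error_ratio_1h > FAST_THRESHOLD
    slow_pair = error_ratio_30m > SLOW_THRESHOLD and error_ratio_6h > SLOW_THRESHOLD
    return fast_pair or slow_pair

# Hypothetical readings around the time of the 2nd page (illustrative only):
print(out_of_slo(error_ratio_5m=0.020, error_ratio_1h=0.016,
                 error_ratio_30m=0.011, error_ratio_6h=0.007))  # -> True
```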
Start of the problem
As noted above, the alert indicated that the Cloudflare `gitlab.net` zone had a high ratio of errors to total requests. That error ratio is summed over all subdomains, and the alert does not explicitly define what counts as an "error".
I'm guessing that this refers to HTTP traffic (rather than, say, failed SSH connections via Cloudflare's Spectrum service). Further, I'm guessing that the definition of an error follows the common convention of using HTTP response codes and treats the range of HTTP 400-599 as an "error" event. Cloudflare has an analytics dashboard that summarizes the timeline of recent HTTP requests tallied by HTTP response code. Looking at the HTTP responses in the range of 500-599 turned up the following lead:
As of 17:35 UTC, Cloudflare's Analytics dashboard showed the following elevated rate of HTTP requests having HTTP 5xx response codes. Note: The graph's time zone is UTC-7 (so 10:15 PDT = 17:15 UTC).
End of the problem
As of 18:10 UTC, Cloudflare's Analytics dashboard showed the rate of HTTP 5xx errors drop back to zero for the gitlab.net zone. Note: The graph's time zone is UTC-7 (so 11:10 PDT = 18:10 UTC).
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?



