2021-01-19 - Elevated Error rates across fleet, workhorse
GitLab.com is currently stable, but we are continuing to investigate the root cause of the error spikes which occurred at 15:07 and again at 18:28 UTC. This could be related to a Puma issue, and we are engaging more teams to look at the conditions that arose around the times of the errors.
All times UTC.
14:02 - A deployment to canary starts https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/429843 with this diff https://gitlab.com/gitlab-org/security/gitlab/compare/d06ebade1b4...1194e3e3c4d
16:05 - A deployment to prod starts https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/429843 with this diff https://gitlab.com/gitlab-org/security/gitlab/compare/d06ebade1b4...1194e3e3c4d
17:08 - Errors start increasing.
17:21 - @alex declares incident in Slack.
17:26 - Web deployment finishes.
17:36 - Second large error spike.
17:41 - Error rates return to normal.
18:14 - Deployment finishes.
18:27 - Errors spike again.
18:29 - Error rates return to normal.
- 09:00 - `"ENABLE_PUMA_NAKAYOSHI_FORK": "false"` deployed to cny.
- 10:10 - `"ENABLE_PUMA_NAKAYOSHI_FORK": "false"` deployed to all of gprd.
More information will be added as we investigate the issue.
"ENABLE_PUMA_NAKAYOSHI_FORK": "false"env vars after we got a fix deployed in code, as this is a delta between staging and gprd now (here and here).
Cleanup elastic cloud watcher, or decide if we want to keep this alert in some way: gitlab-com/runbooks!3135 (merged)
Between 2021-01-19 17:08 and 18:29 UTC, GitLab.com experienced a rise in error rates. The underlying cause has been determined to be an issue in an upstream library (Hamlit) triggered by an upstream change in Ruby. The incident was mitigated by setting an environment variable that disables this new behavior until upstream Hamlit has been patched.
- Service(s) affected: web and api
- Team attribution:
- Time to detection: 13 minutes
- Minutes downtime or degradation: ~81 minutes (17:08–18:29 UTC)
Web and API graphs showing an increase in error rates:
Who was impacted by this incident? (i.e. external customers, internal customers)
- all GitLab.com users
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- Users may have received 500 errors or experienced slow requests
How many customers were affected?
- There were 82,592 errors during this time period, measured against the same window in the previous and following weeks. Of these errors, 26,878 came from unique IP addresses.
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
- The error rate was about 1%
What were the root causes?
- Puma was segfaulting because of memory issues relating to a `GC.compact` call in Hamlit.
`GC.compact` attempts to defragment memory by moving data from one heap slot to another. This helps reduce memory bloat, since new heap pages don't need to be allocated if enough contiguous slots are available. Ruby gems that use C extensions have two options (see the sketch after this list):
- Handle these "move" events.
- Pin the pointers so that the garbage collector will not attempt to move this data.
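A minimal sketch of what these two options can look like for a C extension object that holds a Ruby reference (illustrative names, not Hamlit's actual code; assumes Ruby 2.7+, which added `rb_gc_mark_movable`, `rb_gc_location`, and the `dcompact` callback):

```c
#include <ruby.h>

struct wrapper { VALUE cached; };  /* C struct holding a reference to a Ruby object */

/* Option 1: handle "move" events. Mark the reference as movable, then update
 * it in the compaction callback when GC.compact relocates the object. */
static void wrapper_mark(void *ptr)
{
    struct wrapper *w = ptr;
    rb_gc_mark_movable(w->cached);          /* mark, but allow the object to be moved */
}

static void wrapper_compact(void *ptr)
{
    struct wrapper *w = ptr;
    w->cached = rb_gc_location(w->cached);  /* fetch the object's new address */
}

static const rb_data_type_t movable_wrapper_type = {
    "wrapper(movable)",
    { wrapper_mark, RUBY_DEFAULT_FREE, NULL, wrapper_compact },
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};

/* Option 2: pin. rb_gc_mark() marks *and* pins, so the compactor never moves
 * the referenced object and no compaction callback is needed. */
static void wrapper_mark_pinned(void *ptr)
{
    struct wrapper *w = ptr;
    rb_gc_mark(w->cached);
}

static const rb_data_type_t pinned_wrapper_type = {
    "wrapper(pinned)",
    { wrapper_mark_pinned, RUBY_DEFAULT_FREE, NULL, NULL },
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```

Pinning is the simpler fix, at the cost of leaving those slots unmovable and therefore slightly reducing how much compaction can achieve.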
In Hamlit, there were a number of statically-defined pointers to constants (e.g. `=`) that were not pinned. A call to `rb_const_get()` to look up these pointers would occasionally fail with a segfault. While we were not able to confirm that these pointers had been moved by the garbage collector, the Hamlit maintainer worked around the segfaults by removing the `rb_const_get()` calls and pinning the values. This appears to have solved the problem, although one remaining `rb_const_get()` call was removed later.
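For illustration only (this is not Hamlit's actual source; the cached constant and function names below are made up), the unsafe pattern and the two ways out look roughly like this:

```c
#include <ruby.h>

/* Unsafe pattern: cache the result of a constant lookup in a static C variable.
 * The GC is never told about this reference, so GC.compact is free to move the
 * object behind it; the stale VALUE can then point at a stale heap slot and
 * later use of it can segfault. */
static VALUE cached_helpers;  /* hypothetical cached constant */

static VALUE
unsafe_lookup(void)
{
    if (!cached_helpers) {
        cached_helpers = rb_const_get(rb_cObject, rb_intern("Comparable"));
    }
    return cached_helpers;  /* may reference a moved object after GC.compact */
}

/* Fix A: drop the cache and look the constant up every time it is needed. */
static VALUE
safe_lookup(void)
{
    return rb_const_get(rb_cObject, rb_intern("Comparable"));
}

/* Fix B: keep the cache, but register the static slot as a GC root so the
 * value stays alive and (in CRuby) is pinned during compaction. */
void
Init_example(void)
{
    cached_helpers = rb_const_get(rb_cObject, rb_intern("Comparable"));
    rb_gc_register_address(&cached_helpers);
}
```

Under these assumptions, the maintainer's workaround of removing the `rb_const_get()` calls and pinning the values corresponds roughly to Fix B.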
Incident Response Analysis
How was the incident detected?
- PagerDuty alerted the EOC to the issue.
How could detection time be improved?
- The issue was showing in Sentry before the alert; however, we do not alert on Sentry errors.
How was the root cause diagnosed?
- We had to go deep into the code and stacktraces to track down where these errors were coming from.
How could time to diagnosis be improved?
- This particular issue was quite tricky, as it wasn't readily apparent what was even happening. Even after we were able to get dev and others involved, it took a while to narrow the problem down to Hamlit and the underlying Ruby change.
How did we reach the point where we knew how to mitigate the impact?
- We had to dig into the GitLab code and onward into upstream libraries to trace the changes, before determining that we could set an environment variable to disable the new, broken behavior.
How could time to mitigation be improved?
- This was a very obscure issue that required lots of people to track down. There isn't really a way to improve mitigation time on this kind of deep, upstream issue.
What went well?
- We were able to get dev and others involved quickly.
- There was excellent collaboration from everyone in order to drill down very deep into an obscure problem.
Post Incident Analysis
Did we have other events in the past with the same root cause?
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
- Not to my knowledge
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
- The issue was made manifest by a deployment, but it was not caused by a deployment.
- This went deeper into Ruby and its system interactions than SRE was really prepared to troubleshoot. We should consider additional training to deepen our Ruby knowledge.
- Feature flags would have let us test this change safely.