2021-01-19 - Elevated Error rates across fleet, workhorse
Summary
GitLab.com is currently stable, but we are continuing to investigate the root cause of the error spikes which occurred at 15:07 and again at 18:28 UTC. This could be related to a Puma issue, and we are engaging more teams to look at the conditions which arose around the times of the errors.
Timeline
All times UTC.
2021-01-19

- 14:02 - A deployment to canary starts (https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/429843) with this diff: https://gitlab.com/gitlab-org/security/gitlab/compare/d06ebade1b4...1194e3e3c4d
- 16:05 - A deployment to prod starts (https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/429843) with this diff: https://gitlab.com/gitlab-org/security/gitlab/compare/d06ebade1b4...1194e3e3c4d
- 17:08 - Error rates start increasing.
- 17:21 - @alex declares an incident in Slack.
- 17:26 - Web deployment finishes.
- 17:36 - Second large error spike.
- 17:41 - Error rates return to normal.
- 18:14 - Deployment finishes.
- 18:27 - Errors spike again.
- 18:29 - Error rates return to normal.

2021-01-20

- 09:00 - `"DISABLE_PUMA_NAKAYOSHI_FORK": "true"` and `"ENABLE_PUMA_NAKAYOSHI_FORK": "false"` deployed to cny.
- 10:10 - `"DISABLE_PUMA_NAKAYOSHI_FORK": "true"` and `"ENABLE_PUMA_NAKAYOSHI_FORK": "false"` deployed to all of gprd.
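For context on the mitigation above: Puma 5 ships an experimental `nakayoshi_fork` option that runs several garbage-collection passes plus `GC.compact` in the master process before forking workers, which is presumably why disabling it mitigated the `GC.compact`-related segfaults. A hypothetical sketch of how an environment variable like the one deployed above could gate this option in a Puma config (the config shape is an assumption, not GitLab's actual configuration):

```ruby
# config/puma.rb -- hypothetical sketch, not GitLab's actual configuration.
# Puma 5's experimental nakayoshi_fork runs several GCs plus GC.compact in
# the master process before forking workers, to improve copy-on-write memory
# sharing. Gating it on an env var lets it be switched off with a
# deploy-time setting rather than a code change.
workers 4

nakayoshi_fork ENV["ENABLE_PUMA_NAKAYOSHI_FORK"] == "true"
```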
More information will be added as we investigate the issue.
Corrective Actions
- Clean up the `"DISABLE_PUMA_NAKAYOSHI_FORK": "true"` and `"ENABLE_PUMA_NAKAYOSHI_FORK": "false"` env vars once a fix has been deployed in code, as these are now a delta between staging and gprd (here and here).
- Clean up the Elastic Cloud watcher, or decide if we want to keep this alert in some way: gitlab-com/runbooks!3135 (merged)
Incident Review
Summary
Between 2021-01-19 17:08 and 2021-01-19 18:29 UTC, GitLab.com experienced elevated error rates. The underlying cause was determined to be an issue with an upstream library (Hamlit), triggered by an upstream change in Ruby. The incident was mitigated by setting an environment variable that disables this new behavior until upstream Hamlit has been patched.
- Service(s) affected: web and api
- Team attribution:
- Time to detection: 13 minutes
- Minutes downtime or degradation: 1.5 hours
Metrics
Web and API graphs showing an increase in error rates:
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - All GitLab.com users
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Users may have received 500 errors or slow requests.
- How many customers were affected?
  - There were 82,592 excess errors during this time period, compared with the same window in the previous and following weeks. These errors came from 26,878 unique IP addresses.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - The error rate was about 1%.
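The ~1% rate and the excess error count together imply a rough request volume for the incident window; a back-of-the-envelope check (derived arithmetic, not a measured number):

```ruby
# Back-of-the-envelope check: implied request volume during the incident,
# assuming ~82,592 excess errors at an ~1% error rate.
errors = 82_592
error_rate = 0.01
implied_requests = (errors / error_rate).round
puts implied_requests  # on the order of 8 million requests over the window
```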
What were the root causes?
Puma was segfaulting because of memory issues relating to a `GC.compact` call in Hamlit. `GC.compact` attempts to defragment memory by moving data from one heap slot to another. This helps reduce memory bloat, since new heap pages don't need to be allocated if enough contiguous slots are available. Ruby gems that use C extensions have two options:
- Handle these "move" events.
- Pin the pointers so that the garbage collector will not attempt to move this data.
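Since Ruby 2.7, compaction can also be triggered manually from Ruby code, which is a common way to reproduce this class of bug; a minimal sketch (the exact statistics keys vary by Ruby version):

```ruby
# Minimal sketch: manually triggering heap compaction (Ruby 2.7+).
# GC.compact moves live objects into as few heap pages as possible and
# returns a Hash of compaction statistics (e.g. which object types were
# considered for moving and which were actually moved).
stats = GC.compact
puts stats.keys.inspect
```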
In Hamlit, there were a number of statically-defined pointers to constants (e.g. `aria`, `data`, `=`) that were not pinned. A call to `rb_const_get()` to look up these pointers would occasionally fail with a segfault. While we were not able to confirm that these pointers had been moved by the garbage collector, the Hamlit maintainer worked around the segfaults by removing the `rb_const_get()` calls and pinning the values. This appears to have solved the problem, although one remaining `rb_const_get()` call was removed later.
Incident Response Analysis
- How was the incident detected?
  - PagerDuty alerted the EOC to the issue.
- How could detection time be improved?
  - The issue was showing in Sentry before the alert fired; however, we do not alert on Sentry errors.
- How was the root cause diagnosed?
  - We had to go deep into the code and stack traces to track down where these errors were coming from.
- How could time to diagnosis be improved?
  - This particular issue was quite tricky, as it wasn't readily apparent what was even happening. Even after we were able to get dev and others involved, it took a while to narrow the problem down to Hamlit and the underlying Ruby change.
- How did we reach the point where we knew how to mitigate the impact?
  - We had to look into the GitLab code and onward to upstream libraries to trace the changes before determining that we could set an environment variable to disable the new, broken behavior.
- How could time to mitigation be improved?
  - This was a very obscure issue that required many people to track down. There isn't really a way to improve mitigation time for this kind of deep, upstream issue.
- What went well?
  - We were able to get dev and others involved quickly.
  - There was excellent collaboration from everyone in drilling down into an obscure problem.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Not to my knowledge.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - The issue was made manifest by a deployment, but it was not caused by the deployment itself.
Lessons Learned
- This went deeper into Ruby and its system interaction than SRE was really prepared to troubleshoot. We should consider more training on Ruby internals.
- Feature flags would have let us test this change safely.
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private).