2020-06-16: Python dependency not installed in our CNG images breaks RST file rendering
Summary
The rendering engine used for RST files shells out to Python. We embed Python in our Omnibus installations, but not in the containers built for Kubernetes installations (our CNG images). An error in Sentry led to the discovery that calls made by Ruby inside of Sidekiq were failing. gitlab-org/gitlab#222637 (moved)
Python not installed on Sidekiq Pods
Without Python installed in our container images, functionality that depends on Python is currently broken.
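For context, here is a minimal sketch of the kind of docutils call the RST render path ultimately depends on. This is illustrative only, not GitLab's actual code, and assumes the `docutils` package; when the `python` binary is absent from the image, the shell-out fails before anything like this ever runs.

```python
# Illustrative only: the kind of docutils rendering the Ruby side shells out
# to. Not GitLab's actual code; assumes the docutils package is installed.
from docutils.core import publish_parts

rst_source = """
Title
=====

Some *reStructuredText* content.
"""

# publish_parts renders RST into HTML fragments; "html_body" holds the markup.
parts = publish_parts(source=rst_source, writer_name="html")
print(parts["html_body"])
```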
Timeline
All times UTC.
2020-06-09
- We started running the `urgent-other` queue inside of Kubernetes
2020-06-16
- 14:43 - Engineer notices errors originating from our Kubernetes infrastructure for specific files that are rendered: https://sentry.gitlab.net/gitlab/gitlabcom/issues/1646755/
- 15:02 - This error is identified as a P1/S1
- 15:13 - skarbek starts bringing our Sidekiq VMs online for the `urgent-other` shard
- 15:20 - final Sentry error reported
- 15:42 - skarbek declares an incident in Slack using the `/incident declare` command
- 15:53 - VMs are validated to be online and working; Pods running this shard are turned down to minimize errors
- 15:54 - incident is declared as remediated
Incident Review
Summary
- Service(s) affected: Service::Sidekiq
- Team attribution: group::distribution
- Minutes downtime or degradation: Degraded between June 9th and June 16th
Metrics
https://log.gprd.gitlab.net/goto/e628aa51ead5b0ddd017e2d292bf54de
Customer Impact
- Who was impacted by this incident? Any user may be impacted when an RST file is rendered
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...) Some users were served an HTTP 500 status when requesting these files
- How many customers were affected? 101 unique user accounts (Source)
- If a precise customer impact number is unknown, what is the estimated potential impact? 845 errors reported by Sentry; 229 errors related to this incident
Incident Response Analysis
- How was the event detected? Engineer saw error reported in Sentry
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
The root cause was found when Sentry reported the server responsible for running the shell-out to Python. A quick look at the Docker image being utilized revealed that Python is not installed in the image (see the sketch after this list).
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
Change issue associated with the shard migration to Kubernetes: #2254 (closed)
Epic covering the migration of this effort: &256 (closed)
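As a hypothetical illustration of the failure mode (not the actual diagnosis steps): shelling out to a binary that is not on `PATH`, as `python` was not in the CNG image, fails immediately, before any RST is processed. The binary name below is a stand-in used to simulate the missing `python`.

```python
# Hypothetical illustration: invoking a binary that is absent from PATH fails
# before any rendering work happens, which is what the Sidekiq Pods hit when
# `python` was missing from the CNG image.
import subprocess

try:
    subprocess.run(["python-missing-binary", "--version"], check=True)
except FileNotFoundError as exc:
    # The Ruby side raises the equivalent (Errno::ENOENT) inside Sidekiq.
    print(f"shell-out failed: {exc}")
```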
5 Whys
Lessons Learned
- TODO: Get a link to our markdown interpreter for documentation purposes
- After moving the shard back to VMs, we lost metrics for roughly 30 minutes
- This was due to our use of chef-search to populate our Prometheus nodes for scraping
- Picking up a new node requires two Chef runs: one on the Sidekiq node itself to complete its registration with Chef, followed by a run on each Prometheus node to reconfigure it to scrape the new node (see the sketch after this list)
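To make the two-run dependency concrete, here is a hypothetical sketch (all hostnames, ports, and file names are assumptions, not our actual Chef recipe) of how a chef-search result might be rendered into a Prometheus file_sd target list on each Prometheus node:

```python
# Hypothetical sketch: a Chef run on a Prometheus node queries the Chef server
# for Sidekiq nodes and renders a file_sd target file. A new VM only appears
# here after it has registered itself (run 1, on the VM) and the Prometheus
# node has re-converged (run 2) -- hence the ~30 minute metrics gap.
import json

# Stand-in for the result of a Chef search such as search(:node, "roles:sidekiq")
sidekiq_nodes = [
    "sidekiq-urgent-other-01.internal",
    "sidekiq-urgent-other-02.internal",
]

targets = [{
    "targets": [f"{host}:9100" for host in sidekiq_nodes],  # node_exporter port
    "labels": {"shard": "urgent-other"},
}]

# Prometheus picks up file_sd changes without a restart once the file lands.
with open("sidekiq_targets.json", "w") as f:
    json.dump(targets, f, indent=2)
```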
Corrective Actions
- Can we move our markdown interpreter from Python to Ruby?
- Can we make any improvements to QA to capture this failure scenario?
- The thought here is that we could have run the migration of this shard in staging, executed QA, seen this error, and failed the QA jobs
- If possible, this would have prevented us from moving the shard and allowed us to investigate the errors before they showed up in production
- gitlab-org/gitlab#225190 (closed)