2020-06-16: Python dependency not installed in our CNG images breaks RST file rendering
Summary
The rendering engine used for RST files shells out to Python. We embed Python in our Omnibus installations, but not in the containers built for Kubernetes installations (our CNG images). An error in Sentry led to the discovery that calls made by Ruby inside of Sidekiq were failing. gitlab-org/gitlab#222637 (moved)
Python not installed on Sidekiq Pods
Without Python installed in our container images, functionality that depends on Python is currently broken.
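For context, here is a minimal sketch of the kind of docutils call the RST render path ultimately depends on. This is illustrative only, not GitLab's actual code, and assumes the `docutils` package; when the `python` binary is absent from the image, the shell-out fails before anything like this ever runs.

```python
# Illustrative only: the kind of docutils rendering the Ruby side shells out
# to. Not GitLab's actual code; assumes the docutils package is installed.
from docutils.core import publish_parts

rst_source = """
Title
=====

Some *reStructuredText* content.
"""

# publish_parts renders RST into HTML fragments; "html_body" holds the markup.
parts = publish_parts(source=rst_source, writer_name="html")
print(parts["html_body"])
```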
Timeline
All times UTC.
2020-06-09
- We started running the `urgent-other` queue inside of Kubernetes
2020-06-16
- 14:43 - Engineer notices errors originating from our Kubernetes infrastructure for specific files that are rendered: https://sentry.gitlab.net/gitlab/gitlabcom/issues/1646755/
- 15:02 - This error is identified as a P1/S1
- 15:13 - skarbek starts bringing our Sidekiq VMs online for the `urgent-other` shard
- 15:20 - final Sentry error reported
- 15:42 - skarbek declares an incident in Slack using the `/incident declare` command
- 15:53 - VMs are validated to be online and working; Pods running this shard are turned down to minimize errors
- 15:54 - incident is declared as remediated
Incident Review
Summary
- Service(s) affected: Service::Sidekiq
- Team attribution: group::distribution
- Minutes downtime or degradation: Degraded between June 9th and June 16th
Metrics
https://log.gprd.gitlab.net/goto/e628aa51ead5b0ddd017e2d292bf54de
Customer Impact
- Who was impacted by this incident? Any user may be impacted when an RST file is rendered
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...) Some users were served an HTTP 500 status when requesting these files
- How many customers were affected? 101 unique user accounts (Source)
- If a precise customer impact number is unknown, what is the estimated potential impact? 845 errors reported by Sentry; 229 errors related to this incident
Incident Response Analysis
- How was the event detected? Engineer saw error reported in Sentry
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
The root cause was found when Sentry reported the server responsible for running the shell-out to Python. A quick look at the Docker image being utilized revealed that Python is not installed in the image (see the sketch after this list).
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
Change issue associated with the shard migration to Kubernetes: #2254 (closed)
Epic covering the migration of this effort: &256 (closed)
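As a hypothetical illustration of the failure mode (not the actual diagnosis steps): shelling out to a binary that is not on `PATH`, as `python` was not in the CNG image, fails immediately, before any RST is processed. The binary name below is a stand-in used to simulate the missing `python`.

```python
# Hypothetical illustration: invoking a binary that is absent from PATH fails
# before any rendering work happens, which is what the Sidekiq Pods hit when
# `python` was missing from the CNG image.
import subprocess

try:
    subprocess.run(["python-missing-binary", "--version"], check=True)
except FileNotFoundError as exc:
    # The Ruby side raises the equivalent (Errno::ENOENT) inside Sidekiq.
    print(f"shell-out failed: {exc}")
```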
5 Whys
Lessons Learned
- TODO: Get a link to our markdown interpreter for documentation purposes
- After moving the shard back to VMs, we lost metrics for roughly 30 minutes
- This was due to our use of chef-search to populate our Prometheus nodes for scraping
- Picking up a new node requires two Chef runs: one on the Sidekiq node itself to complete its registration with Chef, followed by a run on each Prometheus node to reconfigure it to scrape the new node (see the sketch after this list)
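To make the two-run dependency concrete, here is a hypothetical sketch (all hostnames, ports, and file names are assumptions, not our actual Chef recipe) of how a chef-search result might be rendered into a Prometheus file_sd target list on each Prometheus node:

```python
# Hypothetical sketch: a Chef run on a Prometheus node queries the Chef server
# for Sidekiq nodes and renders a file_sd target file. A new VM only appears
# here after it has registered itself (run 1, on the VM) and the Prometheus
# node has re-converged (run 2) -- hence the ~30 minute metrics gap.
import json

# Stand-in for the result of a Chef search such as search(:node, "roles:sidekiq")
sidekiq_nodes = [
    "sidekiq-urgent-other-01.internal",
    "sidekiq-urgent-other-02.internal",
]

targets = [{
    "targets": [f"{host}:9100" for host in sidekiq_nodes],  # node_exporter port
    "labels": {"shard": "urgent-other"},
}]

# Prometheus picks up file_sd changes without a restart once the file lands.
with open("sidekiq_targets.json", "w") as f:
    json.dump(targets, f, indent=2)
```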
Corrective Actions
- Can we move our markdown interpreter from Python to Ruby?
- Can we make any improvements to QA to capture this failure scenario?
- The thought here is that we could have run the migration of this shard in staging, executed QA, seen this error, and failed the QA jobs
- If possible, this would have prevented us from moving the shard and allowed us to investigate the errors before they showed up in production
- gitlab-org/gitlab#225190 (closed)