Notes and reflections on Customer Emergency On-Call Rotation: week of 22 February 2021
My first on-call rotation at GitLab was the week of 22 February 2021. It was a very eventful week. In this issue, I wanted to capture my notes and thoughts on the experience in order to:
- provide insight for other Support Engineers who participate in the Customer Emergency On-Call Rotation
- capture feedback/suggestions on process improvements to make the experience better the next go-around
- give me something to refer to later to see how I have grown
Emergencies
It was quite a busy week. Here's a quick summary of the emergencies that I saw come in:
Monday
There was 1 emergency:
| Ticket | Problem | Steps taken toward resolution | Notes |
|---|---|---|---|
| #195564 | Disk space: lots consumed, increasing rapidly | Manually expire the millions of artifacts created over the period from 2018 to present; identify the heaviest projects, including one with scheduled pipelines running too frequently (see the disk-usage sketch below). | This call was multiple hours; see the emergency Slack thread. |
Skills: artifacts, NFS, gitlab-rails console | Tools: GitLab Rails Console
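For context, here's a minimal sketch of the disk-usage triage from the shell. It assumes a single-node Omnibus install with artifacts on a local or NFS-mounted path; the default path below is an assumption, not something from the ticket. The actual expiry of the old artifacts was done from the GitLab Rails console, following the docs.

```shell
# Find out where the space is going and which directories are the heaviest consumers.
# Default Omnibus artifacts path; adjust if artifacts live elsewhere (e.g. on NFS).
ARTIFACTS_DIR=/var/opt/gitlab/gitlab-rails/shared/artifacts

# Overall usage of the filesystem holding the artifacts.
df -h "$ARTIFACTS_DIR"

# Largest subdirectories, biggest first; these point at the heaviest projects.
sudo du -h --max-depth=2 "$ARTIFACTS_DIR" 2>/dev/null | sort -rh | head -n 20
```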
Tuesday
There were 2 emergencies:
| Ticket | Problem | Steps taken toward resolution | Notes |
|---|---|---|---|
| #195802 | Runners failing with `500` | Apply the suggestion in a recently opened issue: set `endpoint => "https://s3.amazonaws.com"` in the `gitlab_rails['object_store']['connection']` section of `/etc/gitlab/gitlab.rb` (see the sketch below). | This was a known issue later fixed in 13.9.1. Reported as impacting k8s, but found on Omnibus in this ticket and in the issue. |
| #195876 | Following a failed upgrade, did a restore; now failing when running `CHEF_FIPS="" sudo gitlab-ctl reconfigure` with error: `This version of OpenSSL does not support FIPS mode` | Omit the `sudo`; you're running as `root`. Just do `CHEF_FIPS="" gitlab-ctl reconfigure`. | |
Skills: Linux systems administration, familiarity with recently opened issues
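As a reference for #195802, here is a sketch of what the workaround looked like in practice. Only the explicit `endpoint` key comes from the ticket; the provider and region values are illustrative assumptions, and the snippet should be merged into the instance's existing connection block rather than pasted verbatim.

```shell
# Sketch only: add the explicit S3 endpoint to the consolidated object storage
# connection in /etc/gitlab/gitlab.rb, for example:
#
#   gitlab_rails['object_store']['connection'] = {
#     'provider' => 'AWS',                        # illustrative
#     'region'   => 'us-east-1',                  # illustrative
#     'endpoint' => 'https://s3.amazonaws.com'    # the workaround from the ticket
#   }
#
# Then apply the change:
sudo gitlab-ctl reconfigure
```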
Wednesday
There was 1 emergency:
| Ticket | Problem | Steps taken toward resolution | Notes |
|---|---|---|---|
| #196096 | GitLab inaccessible after upgrade, getting `500` on login | Identify that there are down migrations. Run them with `sudo gitlab-rake db:migrate`, then confirm that `sudo gitlab-rake db:migrate:status` reports no remaining `down` migrations (see the commands below). Everything's good. | Use the docs! This was the same instance as Tuesday's emergency in #195876. |
Skills: docs | Tools: Display status of database migrations Rake tasks
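Roughly, the commands behind that resolution (standard Omnibus Rake tasks from the docs; the `awk` filter is just one way to spot-check for pending migrations):

```shell
# List migration status; rows whose first column is "down" are pending.
sudo gitlab-rake db:migrate:status | awk '$1 == "down"'

# Run the pending migrations, then re-check: the filter above should print nothing.
sudo gitlab-rake db:migrate
sudo gitlab-rake db:migrate:status | awk '$1 == "down"'
```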
Thursday
There were 0 emergencies. I prepped for #196366 potentially becoming an emergency. That customer ended up filing an emergency on Friday requesting an RCA (#196595, below).
Friday
There were 3 emergencies:
| Ticket | Problem | Steps taken toward resolution | Notes |
|---|---|---|---|
| #196595 | No production outage, seeking RCA | The customer whose potential emergency we prepped for on Thursday (but who never called in) filed an emergency ticket requesting assistance performing a root cause analysis. | Without logs, error messages, or info about changes, it's hard to say what caused the spikes. |
| #196637 | Instability in the instance: web interface a bit slow, `git clone` fails with `503`, `Deadline exceeded` | 80% of traffic was coming from a single user: applied rate limiting to authenticated API requests and performance immediately recovered. | |
| #196670 | Lots of `sshd` processes consuming CPU, spiking load to triple digits | Very frequent authentication failures (multiple times per minute) from a few top talkers associated with a single organizational unit; blocked the top-talking IPs and things stabilized. Following up with the team (see the sketch below). | This came in during the call for the emergency shown above. @wchandler kindly continued assisting in that emergency while I joined this one. |
Skills: gitaly, sshd | Tools: fast-stats
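For the `sshd` emergency, the top-talker hunt was essentially log analysis plus a firewall rule (alongside fast-stats for the GitLab-side logs). A hedged sketch of the idea; the auth log path varies by distro, the IP address is a placeholder, and blocking should be coordinated with the customer:

```shell
# Count failed SSH authentications per source IP, busiest first.
# Debian/Ubuntu log to /var/log/auth.log; RHEL/CentOS use /var/log/secure.
sudo grep 'Failed' /var/log/auth.log \
  | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' \
  | sort | uniq -c | sort -rn | head -n 10

# Illustrative block of a single abusive source address.
sudo iptables -I INPUT -s 203.0.113.10 -j DROP
```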
Saturday
Sunday
Tips
Use the docs.
Our documentation is amazing! There's a very good reason we take a docs-first approach. It's wonderful to have confidence in documentation when you need to rely on it the most. One of the emergencies that I handled came down to:
- The docs weren't followed during an upgrade.
- We found the docs for what to do if you didn't follow those docs, and followed them to solve the problem.
Back to basics
For one of the emergencies that I handled, the customer was running `CHEF_FIPS="" sudo gitlab-ctl reconfigure` as the user `root`. We ran `CHEF_FIPS="" gitlab-ctl reconfigure` instead and that did the trick.
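A quick sketch of why dropping `sudo` mattered, as I understand it: `sudo`'s default `env_reset` strips ad-hoc variables like `CHEF_FIPS`, so the override never reached `gitlab-ctl`; the exact behavior depends on the sudoers configuration.

```shell
# Already root? Then sudo only gets in the way here.
whoami   # => root

# The inline variable is set in the environment for the command...
CHEF_FIPS="" sudo gitlab-ctl reconfigure   # ...but sudo's env_reset likely drops CHEF_FIPS

# Running directly as root keeps the override in place.
CHEF_FIPS="" gitlab-ctl reconfigure
```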
Things I would do differently
- Avoid taking customer calls during the week of my Customer Emergencies rotation
My on-call shift was 11A-7P in my time zone. I should have updated Calendly at the beginning of the week before my on-call shift to make sure that no customer calls were scheduled during that week at all. While I was not in a customer call during an emergency, I had a customer call just before my on-call shift.
At a minimum: don't be in a customer call when you get paged for an emergency. Ideally: consider not taking any customer calls during the week of your on-call rotation. Reason: be as fresh and clear in the mind as possible to soak in all context associated with the emergency. Favor quick and easy New tickets, docs updates, and professional development activities.
🔑 Keys to Success
- We all already know this, but collaboration and communication are paramount!
- We are a team: ask for help when you need it. I am thankful to a great number of people who provided suggestions and guidance and hopped right onto calls with me.
Suggestions and Process Improvements
- Proposal: count people out of Support Response Crew while they are in the Customer Emergency On-Call Rotation. This gives us a more realistic perspective on how the crew is staffed on any given day. (This week was also one of the weeks I was splitting Support Response Crew work into 2x half-days. I found myself in emergencies on both of the days I was doing crew.)
- Consider reviewing the GitLab Support On-Call Guide on the business day before your on-call shift.
- Consider reducing the number of tickets you Assign to yourself in the week leading up to your time in the Customer Emergency On-Call Rotation.
- Consider using the GitLab Support Emergency runbook.
- Stay hydrated and take a deep breath.