Notes and reflections on Customer Emergency On-Call Rotation: week of 22 February 2021
My first on-call rotation at GitLab was the week of 22 February 2021. It was a very eventful week. In this issue, I wanted to capture my notes and thoughts on the experience in order to:
- provide insight for other Support Engineers who participate in the Customer Emergency On-Call Rotation
- capture feedback/suggestions on process improvements to make the experience better the next go-around
- give me something to refer to later to see how I have grown
Emergencies
It was quite a busy week. Here's a quick summary of the emergencies that I saw come in:
Monday
There was 1 emergency:
| Ticket | Problem | Steps taken toward resolution | Notes |
|---|---|---|---|
| #195564 | Disk space: lots consumed, increasing rapidly | Manually expire the millions of artifacts created over the period from 2018 to present; identify the heaviest projects, including one with scheduled pipelines running too frequently (see the disk-usage sketch below). | This call was multiple hours; see the emergency Slack thread. |
Skills: artifacts, NFS, gitlab-rails console | Tools: GitLab Rails Console
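For context, here's a minimal sketch of the disk-usage triage from the shell. It assumes a single-node Omnibus install with artifacts on a local or NFS-mounted path; the default path below is an assumption, not something from the ticket. The actual expiry of the old artifacts was done from the GitLab Rails console, following the docs.

```shell
# Find out where the space is going and which directories are the heaviest consumers.
# Default Omnibus artifacts path; adjust if artifacts live elsewhere (e.g. on NFS).
ARTIFACTS_DIR=/var/opt/gitlab/gitlab-rails/shared/artifacts

# Overall usage of the filesystem holding the artifacts.
df -h "$ARTIFACTS_DIR"

# Largest subdirectories, biggest first; these point at the heaviest projects.
sudo du -h --max-depth=2 "$ARTIFACTS_DIR" 2>/dev/null | sort -rh | head -n 20
```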
Tuesday
There were 2 emergencies:
| Ticket | Problem | Steps taken toward resolution | Notes |
|---|---|---|---|
| #195802 | Runners failing with `500` | Apply the suggestion in a recently opened issue: set `endpoint => "https://s3.amazonaws.com"` in the `gitlab_rails['object_store']['connection']` section of `/etc/gitlab/gitlab.rb` (see the sketch below). | This was a known issue later fixed in 13.9.1. Reported as impacting k8s, but found on Omnibus in this ticket and in the issue. |
| #195876 | Following a failed upgrade, did a restore; now failing when running `CHEF_FIPS="" sudo gitlab-ctl reconfigure` with error: `This version of OpenSSL does not support FIPS mode` | Omit the `sudo`; you're running as `root`. Just do `CHEF_FIPS="" gitlab-ctl reconfigure`. | |
Skills: Linux systems administration, familiarity with recently opened issues
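As a reference for #195802, here is a sketch of what the workaround looked like in practice. Only the explicit `endpoint` key comes from the ticket; the provider and region values are illustrative assumptions, and the snippet should be merged into the instance's existing connection block rather than pasted verbatim.

```shell
# Sketch only: add the explicit S3 endpoint to the consolidated object storage
# connection in /etc/gitlab/gitlab.rb, for example:
#
#   gitlab_rails['object_store']['connection'] = {
#     'provider' => 'AWS',                        # illustrative
#     'region'   => 'us-east-1',                  # illustrative
#     'endpoint' => 'https://s3.amazonaws.com'    # the workaround from the ticket
#   }
#
# Then apply the change:
sudo gitlab-ctl reconfigure
```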
Wednesday
There was 1 emergency:
| Ticket | Problem | Steps taken toward resolution | Notes |
|---|---|---|---|
| #196096 | GitLab inaccessible after upgrade, getting `500` on login | Identify that there are down migrations. Run them with `sudo gitlab-rake db:migrate`, then confirm that `sudo gitlab-rake db:migrate:status` reports no remaining `down` migrations (see the commands below). Everything's good. | Use the docs! This was the same instance as Tuesday's emergency in #195876. |
Skills: docs | Tools: Display status of database migrations Rake tasks
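Roughly, the commands behind that resolution (standard Omnibus Rake tasks from the docs; the `awk` filter is just one way to spot-check for pending migrations):

```shell
# List migration status; rows whose first column is "down" are pending.
sudo gitlab-rake db:migrate:status | awk '$1 == "down"'

# Run the pending migrations, then re-check: the filter above should print nothing.
sudo gitlab-rake db:migrate
sudo gitlab-rake db:migrate:status | awk '$1 == "down"'
```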
Thursday
There were 0 emergencies. I prepped for #196366 potentially becoming an emergency. That customer ended up filing an emergency on Friday requesting an RCA (#196595, below).
Friday
There were 3 emergencies:
| Ticket | Problem | Steps taken toward resolution | Notes |
|---|---|---|---|
| #196595 | No production outage, seeking RCA | The customer whose potential emergency we prepped for on Thursday (but who never called in) filed an emergency ticket requesting assistance performing a root cause analysis. | Without logs, error messages, or info about changes, it's hard to say what caused the spikes. |
| #196637 | Instability in the instance: web interface a bit slow, `git clone` fails with `503`, `Deadline exceeded` | 80% of traffic was coming from a single user: applied rate limiting to authenticated API requests and performance immediately recovered. | |
| #196670 | Lots of `sshd` processes consuming CPU, spiking load to triple digits | Very frequent authentication failures (multiple times per minute) from a few top talkers associated with a single organizational unit; blocked the top-talking IPs and things stabilized. Following up with the team (see the sketch below). | This came in during the call for the emergency shown above. @wchandler kindly continued assisting in that emergency while I joined this one. |
Skills: gitaly, sshd | Tools: fast-stats
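For the `sshd` emergency, the top-talker hunt was essentially log analysis plus a firewall rule (alongside fast-stats for the GitLab-side logs). A hedged sketch of the idea; the auth log path varies by distro, the IP address is a placeholder, and blocking should be coordinated with the customer:

```shell
# Count failed SSH authentications per source IP, busiest first.
# Debian/Ubuntu log to /var/log/auth.log; RHEL/CentOS use /var/log/secure.
sudo grep 'Failed' /var/log/auth.log \
  | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' \
  | sort | uniq -c | sort -rn | head -n 10

# Illustrative block of a single abusive source address.
sudo iptables -I INPUT -s 203.0.113.10 -j DROP
```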
Saturday
Sunday
Tips
Use the docs.
Our documentation is amazing! There's a very good reason we take a docs-first approach. It's wonderful to have confidence in documentation when you need to rely on it the most. One of the emergencies that I handled came down to:
- The docs weren't followed during an upgrade.
- We found the docs for what to do if you didn't follow those docs, and followed them to solve the problem.
Back to basics
For one of the emergencies that I handled, the customer was running `CHEF_FIPS="" sudo gitlab-ctl reconfigure` as the user `root`. We ran `CHEF_FIPS="" gitlab-ctl reconfigure` instead and that did the trick.
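A quick sketch of why dropping `sudo` mattered, as I understand it: `sudo`'s default `env_reset` strips ad-hoc variables like `CHEF_FIPS`, so the override never reached `gitlab-ctl`; the exact behavior depends on the sudoers configuration.

```shell
# Already root? Then sudo only gets in the way here.
whoami   # => root

# The inline variable is set in the environment for the command...
CHEF_FIPS="" sudo gitlab-ctl reconfigure   # ...but sudo's env_reset likely drops CHEF_FIPS

# Running directly as root keeps the override in place.
CHEF_FIPS="" gitlab-ctl reconfigure
```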
Things I would do differently
- Avoid taking customer calls during the week of my Customer Emergencies rotation
My on-call shift was 11A-7P in my time zone. I should have updated Calendly at the beginning of the week before my on-call shift to make sure that no customer calls were scheduled during that week at all. While I was not in a customer call during an emergency, I had a customer call just before my on-call shift.
At a minimum: don't be in a customer call when you get paged for an emergency. Ideally: consider not taking any customer calls during the week of your on-call rotation. Reason: be as fresh and clear in the mind as possible to soak in all context associated with the emergency. Favor quick and easy New tickets, docs updates, and professional development activities.
🔑 Keys to Success
- We all already know this, but collaboration and communication are paramount!
- We are a team: ask for help when you need it. I am thankful to a great number of people who provided suggestions and guidance and hopped right onto calls with me.
Suggestions and Process Improvements
- Proposal: count people out of Support Response Crew while they are in the Customer Emergency On-Call Rotation. This gives us a more realistic perspective on how the crew is staffed on any given day. (This week was also one of the weeks I was splitting Support Response Crew work into 2x half-days. I found myself in emergencies on both of the days I was doing crew.)
- Consider reviewing the GitLab Support On-Call Guide on the business day before your on-call shift.
- Consider reducing the number of tickets you Assign to yourself in the week leading up to your time in the Customer Emergency On-Call Rotation.
- Consider using the GitLab Support Emergency runbook.
- Stay hydrated and take a deep breath.