RCA: Redis HLL monthly metrics outage in Usage Ping for Weeks 1, 2, 3, 4 of 2021
Summary
We had a temporary outage of Redis HLL monthly metrics in Usage Ping during weeks 1, 2, 3, and 4 of 2021 on GitLab versions 13.4, 13.5, 13.6, and 13.7. These metrics will continue to work as normal during weeks 5-52 of 2021 even if the GitLab instance does not upgrade to the latest version. We have also released a patch fix gitlab-org/gitlab!50358 (merged) to resolve this issue on future versions of GitLab.
Redis HLL monthly counters encountered an error as we rolled over from 2020 into 2021. The error was due to how we calculate monthly metrics: we take the union of the last four weekly counters. As we rolled over the year, retrieving "the last four weekly counters" resulted in an error because we did not have logic to handle the case where the week number of the current year is smaller than the week number of the previous year.
For example, if the current week is week 52, we retrieve the last four weeks as 51, 50, 49, and 48 and union the result. When we roll over to the new year, if the current week is week 3, we would retrieve the last four weeks as 2, 1, 0, and -1 instead of week 2 of 2021, week 1 of 2021, week 52 of 2020, and week 51 of 2020. This logic threw an error, causing the counter to fail.
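To illustrate the failure mode, here is a minimal Python sketch (not the actual GitLab implementation; the function names are hypothetical) contrasting the naive week-number subtraction with a date-based calculation that wraps correctly across the ISO year boundary:

```python
from datetime import date, timedelta

def last_four_weeks_buggy(current_week):
    # Naive subtraction on the week number alone: early in a new year this
    # produces invalid week keys such as 0 and -1 (e.g. week 3 -> [2, 1, 0, -1]).
    return [current_week - offset for offset in range(1, 5)]

def last_four_weeks_fixed(today):
    # Step back in 7-day increments and let isocalendar() resolve the
    # (year, week) pair, so the range wraps cleanly into the previous ISO year.
    weeks = []
    for offset in range(1, 5):
        iso = (today - timedelta(weeks=offset)).isocalendar()
        weeks.append((iso[0], iso[1]))  # (iso_year, iso_week)
    return weeks

print(last_four_weeks_buggy(3))                  # [2, 1, 0, -1] -> no such Redis keys exist
print(last_four_weeks_fixed(date(2021, 1, 20)))  # wraps into weeks of the previous ISO year
```

The actual fix (gitlab-org/gitlab!50358, listed under Corrective actions below) addresses the Redis HLL weekly key generation; the sketch above only illustrates the year-boundary problem.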
This outage only impacted Redis HLL monthly metrics on versions 13.4, 13.5, 13.6, and 13.7. Redis HLL weekly metrics were not impacted, and other Usage Ping metrics continued to operate as normal.
One thing to note is that the investments we previously made in Usage Ping hardening and Redis HLL hardening helped catch and isolate this error in Production. As a result, this outage did not cause the entire Usage Ping to fail and was isolated to only the impacted counter (Redis HLL monthly).
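For context, this isolation works by rescuing failures at the level of an individual counter and reporting a fallback value (the -1 fallback mentioned in the "5 whys" below) instead of aborting the whole payload. Below is a simplified sketch of that pattern, assuming hypothetical counter functions; it is not the actual GitLab code:

```python
FALLBACK = -1  # value reported when a single counter fails

def with_fallback(compute):
    # Isolate one metric: if its computation raises, report the fallback
    # value instead of failing the entire Usage Ping payload.
    try:
        return compute()
    except Exception:
        # In the real system the error would also be logged/captured.
        return FALLBACK

# Hypothetical counters, purely for illustration.
def monthly_union():
    raise ValueError("invalid week key: -1")  # simulates the week-rollover bug

def weekly_count():
    return 42

payload = {
    "redis_hll_monthly": with_fallback(monthly_union),  # -> -1, failure is isolated
    "redis_hll_weekly": with_fallback(weekly_count),    # -> 42, unaffected
}
print(payload)
```

This is why the outage was limited to the Redis HLL monthly counter rather than the entire Usage Ping.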
- Service(s) affected : Self-managed instances running release versions 13.4, 13.5, 13.6, 13.7
- Team attribution : Product, Engineering
- Minutes downtime or degradation : Weeks 1, 2, 3, 4 of 2021.
Impact & Metrics
Question | Answer |
---|---|
What was the impact | Missing data for Redis HLL counters during the first 4 weeks of the new year |
Who was impacted | Self-managed instances on 13.4, 13.5, 13.6, 13.7 |
How did this impact customers | No end customer impact. It only impacted internal teams, as we were unable to collect product usage data from Redis HLL sources. |
How many attempts made to access | 0 |
How many customers affected | 0 |
How many customers tried to access | 0 |
Detection & Response
Start with the following:
Question | Answer |
---|---|
When was the incident detected? | 2020-12-21 1:31AM UTC |
How was the incident detected? | Broken master, failing tests |
Did alarming work as expected? | Partially. We had a failing test catch this; however, the test only failed once the run date of the test approached the new year |
How long did it take from the start of the incident to its detection? | 5 hours; first note for the potential bug added at 2020-12-21 6:32 AM UTC |
How long did it take from detection to remediation? | 11 hours 12 min until merge to master (2020-12-21 12:43 UTC) |
What steps were taken to remediate? | MR with fix, patch release 13.7 |
Were there any issues with the response? | Issue created |
Timeline
2020-12-21
- 2020-12-21 1:31 UTC Broken master, failing tests
- 2020-12-21 4:14 UTC @caalberts opens issue gitlab-org/gitlab#295173 (closed)
- 2020-12-21 4:55 UTC @caalberts mentions @alinamihaila and @iamricecake for the potential bug
- 2020-12-21 8:02 UTC @alinamihaila communicates the message in #g_product_intelligence
- 2020-12-21 9:08 UTC @jeromezng sets priority1 severity1
- 2020-12-21 9:50 UTC Pair programming session to review the fix @alinamihaila @mikolaj_wawrzyniak @a_akgun
- 2020-12-21 12:43 UTC Fix MR merged to master; patch applied to 13.7
Root Cause Analysis
The purpose of this document is to understand the reasons that caused an incident, and to create mechanisms to prevent it from recurring in the future. A root cause can never be a person; the write-up has to refer to the system and the context rather than to the specific actors.
Follow the "5 whys" in a blameless manner as the core of the root cause analysis.
For this it is necessary to start with the incident and question why it happened. Keep iterating, asking "why?" 5 times. While it's not a hard rule that it has to be exactly 5 times, it helps the questions dig deeper toward the actual root cause.
Keep in mind that one "why?" may produce more than one answer; consider following the different branches.
"5 whys"
- Why? - Weekly Redis keys were computed incorrectly for the beginning of the year
- Why? - Redis HLL unique_events not fully hardened and captured in a rescue with a -1 fallback
- Why? - Redis counters not fully hardened and captured in a rescue with a -1 fallback
- Why? - Our usage data framework doesn't have an error capturing system
- Why? - ...
What went well
- The investments we previously made in Usage Ping hardening and Redis HLL hardening helped catch and isolate this error in Production. As a result, this outage did not cause the entire Usage Ping to fail and was isolated to only the impacted counter (Redis HLL monthly).
- We followed our process for resolving S1/P1 issues and had a very quick and collaborative response and resolution from our team.
What can be improved
- Improve hardening for any metric we add in Usage Ping
- Prioritize refactoring of Redis HLL in order to accommodate new requirements
- Implement an error capturing system
- Build a Usage Ping Health Dashboard with alerts
Corrective actions
Immediate:
- Fix Redis HLL weekly keys gitlab-org/gitlab!50358 (merged)
Short-Term:
- Specifically improve hardening for Redis HLL counters gitlab-org/gitlab#295181 (closed)
- Improve hardening for any metric we add in Usage Ping gitlab-org/gitlab#295268 (closed)
Long-Term:
- Move to independent jobs for each Usage Ping metric as a part of Usage Ping 2.0 gitlab-org&5155 (closed)