RCA: Redis HLL monthly metrics outage in Usage Ping for Weeks 1, 2, 3, 4 of 2021
Summary
We had a temporary outage of Redis HLL monthly metrics in Usage Ping during weeks 1, 2, 3, and 4 of 2021 on GitLab versions 13.4, 13.5, 13.6, and 13.7. These metrics will continue to work as normal during weeks 5-52 of 2021 even if the GitLab instance does not upgrade to the latest version. We have also released a patch fix gitlab-org/gitlab!50358 (merged) to resolve this issue on future versions of GitLab.
Redis HLL monthly counters encountered an error as we rolled over from 2020 into 2021. The error was due to how we calculate monthly metrics: we take the union of the last four weekly counters. As we rolled over the year, retrieving "the last four weekly counters" resulted in an error because we did not have logic to handle the case where the week number of the current year is smaller than the week number of the previous year.
For example, if the current week is week 52, we retrieve the last four weeks as 51, 50, 49, and 48 and union the result. When we roll over to the new year, if the current week is week 3, we would retrieve the last four weeks as 2, 1, 0, and -1 instead of week 2 of 2021, week 1 of 2021, week 52 of 2020, and week 51 of 2020. This logic threw an error, causing the counter to fail.
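To illustrate the failure mode, here is a minimal Python sketch (not the actual GitLab implementation; the function names are hypothetical) contrasting the naive week-number subtraction with a date-based calculation that wraps correctly across the ISO year boundary:

```python
from datetime import date, timedelta

def last_four_weeks_buggy(current_week):
    # Naive subtraction on the week number alone: early in a new year this
    # produces invalid week keys such as 0 and -1 (e.g. week 3 -> [2, 1, 0, -1]).
    return [current_week - offset for offset in range(1, 5)]

def last_four_weeks_fixed(today):
    # Step back in 7-day increments and let isocalendar() resolve the
    # (year, week) pair, so the range wraps cleanly into the previous ISO year.
    weeks = []
    for offset in range(1, 5):
        iso = (today - timedelta(weeks=offset)).isocalendar()
        weeks.append((iso[0], iso[1]))  # (iso_year, iso_week)
    return weeks

print(last_four_weeks_buggy(3))                  # [2, 1, 0, -1] -> no such Redis keys exist
print(last_four_weeks_fixed(date(2021, 1, 20)))  # wraps into weeks of the previous ISO year
```

The actual fix (gitlab-org/gitlab!50358, listed under Corrective actions below) addresses the Redis HLL weekly key generation; the sketch above only illustrates the year-boundary problem.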
This outage only impacted Redis HLL monthly metrics on versions 13.4, 13.5, 13.6, and 13.7. Redis HLL weekly metrics were not impacted, and other Usage Ping metrics continued to operate as normal.
One thing to note is that the investments we previously made in Usage Ping hardening and Redis HLL hardening helped catch and isolate this error in Production. As a result, this outage did not cause the entire Usage Ping to fail and was isolated to only the impacted counter (Redis HLL monthly).
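For context, this isolation works by rescuing failures at the level of an individual counter and reporting a fallback value (the -1 fallback mentioned in the "5 whys" below) instead of aborting the whole payload. Below is a simplified sketch of that pattern, assuming hypothetical counter functions; it is not the actual GitLab code:

```python
FALLBACK = -1  # value reported when a single counter fails

def with_fallback(compute):
    # Isolate one metric: if its computation raises, report the fallback
    # value instead of failing the entire Usage Ping payload.
    try:
        return compute()
    except Exception:
        # In the real system the error would also be logged/captured.
        return FALLBACK

# Hypothetical counters, purely for illustration.
def monthly_union():
    raise ValueError("invalid week key: -1")  # simulates the week-rollover bug

def weekly_count():
    return 42

payload = {
    "redis_hll_monthly": with_fallback(monthly_union),  # -> -1, failure is isolated
    "redis_hll_weekly": with_fallback(weekly_count),    # -> 42, unaffected
}
print(payload)
```

This is why the outage was limited to the Redis HLL monthly counter rather than the entire Usage Ping.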
- Service(s) affected : Self-managed instances running release versions 13.4, 13.5, 13.6, 13.7
- Team attribution : Product, Engineering
- Minutes downtime or degradation : Weeks 1, 2, 3, 4 of 2021.
Impact & Metrics
Question | Answer |
---|---|
What was the impact | Missing data for Redis HLL counters during the first 4 weeks of the new year |
Who was impacted | Self-managed instances on 13.4, 13.5, 13.6, 13.7 |
How did this impact customers | No end customer impact. It only impacted internal teams, as we were unable to collect product usage data from Redis HLL sources. |
How many attempts made to access | 0 |
How many customers affected | 0 |
How many customers tried to access | 0 |
Detection & Response
Start with the following:
Question | Answer |
---|---|
When was the incident detected? | 2020-12-21 1:31AM UTC |
How was the incident detected? | Broken master, failing tests |
Did alarming work as expected? | Partially. We had a failing test catch this; however, the test only failed once the run date of the test approached the new year |
How long did it take from the start of the incident to its detection? | 5 hours; first note for the potential bug added at 2020-12-21 6:32 AM UTC |
How long did it take from detection to remediation? | 11 hours 12 min until merge to master (2020-12-21 12:43 UTC) |
What steps were taken to remediate? | MR with fix, patch release 13.7 |
Were there any issues with the response? | Issue created |
Timeline
2020-12-21
- 2020-12-21 1:31 UTC Broken master, failing tests
- 2020-12-21 4:14 UTC @caalberts opens issue gitlab-org/gitlab#295173 (closed)
- 2020-12-21 4:55 UTC @caalberts mentions @alinamihaila and @iamricecake for the potential bug
- 2020-12-21 8:02 UTC @alinamihaila communicates the message in #g_product_intelligence
- 2020-12-21 9:08 UTC @jeromezng sets priority1 severity1
- 2020-12-21 9:50 UTC Pair programming session to review the fix @alinamihaila @mikolaj_wawrzyniak @a_akgun
- 2020-12-21 12:43 UTC Fix MR merged to master; patch applied to 13.7
Root Cause Analysis
The purpose of this document is to understand the reasons that caused an incident, and to create mechanisms to prevent it from recurring in the future. A root cause can never be a person; the write-up has to refer to the system and the context rather than to the specific actors.
Follow the "5 whys" in a blameless manner as the core of the root cause analysis.
For this it is necessary to start with the incident and question why it happened. Keep iterating, asking "why?" 5 times. While it's not a hard rule that it has to be exactly 5 times, it helps the questions dig deeper toward the actual root cause.
Keep in mind that one "why?" may produce more than one answer; consider following the different branches.
"5 whys"
- Why? - Weekly Redis keys were computed incorrectly for the beginning of the year
- Why? - Redis HLL unique_events not fully hardened and captured in a rescue with a -1 fallback
- Why? - Redis counters not fully hardened and captured in a rescue with a -1 fallback
- Why? - Our usage data framework doesn't have an error capturing system
- Why? - ...
What went well
- The investments we previously made in Usage Ping hardening and Redis HLL hardening helped catch and isolate this error in Production. As a result, this outage did not cause the entire Usage Ping to fail and was isolated to only the impacted counter (Redis HLL monthly).
- We followed our process for resolving S1/P1 issues and had a very quick and collaborative response and resolution from our team.
What can be improved
- Improve hardening for any metric we add in Usage Ping
- Prioritize refactoring of Redis HLL in order to accommodate new requirements
- Implement an error capturing system
- Build a Usage Ping Health Dashboard with alerts
Corrective actions
Immediate:
- Fix Redis HLL weekly keys gitlab-org/gitlab!50358 (merged)
Short-Term:
- Specifically improve hardening for Redis HLL counters gitlab-org/gitlab#295181 (closed)
- Improve hardening for any metric we add in Usage Ping gitlab-org/gitlab#295268 (closed)
Long-Term:
- Move to independent jobs for each Usage Ping metric as a part of Usage Ping 2.0 gitlab-org&5155 (closed)