RCA: Dramatic usage ping speed improvement
Summary
Weekly usage ping generation for gitlab.com took around 40 hours until 2021-02-05. On that day it dropped to ~10 hours, and since then it has taken between 8 and 10 hours. That is roughly a 4x to 5x speed boost, but we don't know the root cause.
Service(s) affected : Background job usage ping generation
Team attribution : database, product intelligence, infra
Minutes downtime or degradation : No downtime or degradation. There is an improvement of up to 5x in usage ping speed.
Usage ping executes over 500 SQL queries, in batches, against the primary database.
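To illustrate what running a query "in batches" can look like, here is a minimal sketch. It assumes that batching means splitting a count into fixed-size ranges of the primary key, which this issue does not confirm. The function `count_in_batches`, the `events` table, and the batch size are hypothetical and are not GitLab's implementation; SQLite stands in for the production database only so that the example is runnable.

```python
import sqlite3


def count_in_batches(conn, table, batch_size=1000):
    """Count rows by scanning the primary key in fixed-size id ranges.

    Hypothetical helper for illustration only. The table name is
    interpolated directly, so it must come from trusted code.
    """
    lo, hi = conn.execute(f"SELECT MIN(id), MAX(id) FROM {table}").fetchone()
    if lo is None:  # empty table
        return 0
    total = 0
    start = lo
    while start <= hi:
        end = start + batch_size
        # Each batch is a short query over a bounded id range.
        (n,) = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE id >= ? AND id < ?",
            (start, end),
        ).fetchone()
        total += n
        start = end
    return total


# Demo data: an in-memory table with 2500 rows (made-up, not production data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO events (id) VALUES (?)", [(i,) for i in range(1, 2501)])
print(count_in_batches(conn, "events"))
```

Each batch stays short, so no single statement holds the database for long, but the total duration grows with the number of batches across all metrics.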
Currently, product intelligence team members generate the usage ping from the production Rails console. See the steps at https://gitlab.com/gitlab-org/gitlab/-/issues/325248#how-to-generate-usage-ping-for-gtilabcom
bin/rake gitlab:usage_data:dump_sql_in_yaml
Impact & Metrics
Start with the following:
Question | Answer |
---|---|
What was the impact | Usage ping generation became up to 5x faster |
Who was impacted | GitLab.com primary database |
How did this impact customers | Positively, since usage ping generation now puts less load on the servers |
Detection & Response
Start with the following:
Question | Answer |
---|---|
When was the incident detected? | 2021-02-05 (first occurrence) |
How was the incident detected? | In the usage ping monitoring issue https://gitlab.com/gitlab-org/gitlab/-/issues/298366#note_502413019 |
Did alarming work as expected? | Monitoring worked |
Timeline
- Until 2021-02-05 - Usage ping background work took ~40 hours, getting slower and slower.
- On 2021-02-05 - It took ~10 hours.
- Since 2021-02-05 - It takes 8 to 10 hours.
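The speedup implied by these durations can be checked with a short calculation. This is only a sketch using the approximate figures from this issue (about 40 hours before, 8 to 10 hours after); the function name `speedup` is illustrative.

```python
def speedup(before_hours: float, after_hours: float) -> float:
    """Return how many times faster the new run is than the old one."""
    if after_hours <= 0:
        raise ValueError("after_hours must be positive")
    return before_hours / after_hours


BEFORE_HOURS = 40           # approximate duration before 2021-02-05
AFTER_HOURS_RANGE = (8, 10)  # observed range since 2021-02-05

best = speedup(BEFORE_HOURS, AFTER_HOURS_RANGE[0])
worst = speedup(BEFORE_HOURS, AFTER_HOURS_RANGE[1])
print(f"speedup between {worst:.1f}x and {best:.1f}x")
# prints "speedup between 4.0x and 5.0x"
```

The 5x figure corresponds to the fastest runs (8 hours); a 10-hour run is a 4x improvement.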
Root Cause Analysis
Some possible explanations:
- Some challenging queries might time out, which would reduce the total duration at the expense of failed queries. However, the "monitor usage ping" issues (see https://gitlab.com/gitlab-org/gitlab/-/issues/298366) show no dramatic change in failed metrics.
- Some infrastructure or database optimization.
- A change in how we reach the console: instead of direct rails console access (ssh aakgun-rails@console-01-sv-gprd.), we go through ssh -A lb-bastion.gprd.gitlab.com, with screen -x. But could that change anything?
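The timeout explanation can be sanity-checked with a small model: if slow queries are cut off at a timeout, the total duration drops, but the number of failed metrics must rise at the same time. The sketch below uses made-up per-query durations and a hypothetical timeout value, not measured data, and the function `run_summary` is illustrative.

```python
import math


def run_summary(durations_s, timeout_s):
    """Return (total_seconds, failed_count) when each query is cut off at timeout_s."""
    total = 0.0
    failed = 0
    for d in durations_s:
        if d > timeout_s:
            # The query is aborted at the timeout and counted as failed.
            total += timeout_s
            failed += 1
        else:
            total += d
    return total, failed


# Hypothetical per-query durations in seconds (assumption, not production data).
durations = [30, 120, 600, 7200, 14400]

print(run_summary(durations, timeout_s=math.inf))  # no timeout: nothing fails
print(run_summary(durations, timeout_s=3600))      # 1-hour timeout: two queries fail
```

In this model, the run only gets shorter when more queries fail. Since the monitoring issue shows no dramatic change in failed metrics, timeouts alone seem unlikely to explain the whole improvement.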
Example of the usage of "5 whys"
Usage ping generation took a long time.
- Why? - The battery is dead.
- Why? - The alternator is not functioning.
- Why? - The alternator belt has broken.
- Why? - The alternator belt was well beyond its useful service life and not replaced.
- Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)