Revisit the Database query Apdex
Introduction
The Database Query Apdex is the Database Group's main Product Performance Indicator, our North Star Metric to track the long term impact of our efforts as a group.
Quoting from the issue in which the Database Query Apdex was introduced:
The number of queries per week which meet the defined Service Level Objective (99%) is an important metric, as it represents the number of queries for the instance which were serviced in our defined time range. As additional features and users are added to the system, this count will continue to grow. If the database scales poorly, this count could decrease.
Used in conjunction with a percentage of queries which meet the SLO, this metric will provide good insight into whether the database is adequately serving the needs of the application.
Definition of Database Query Apdex (100ms target, 250ms tolerable) for GitLab.com (link to thanos)
(
sum by (environment) ( sli_aggregations:gitlab_sql_primary_duration_seconds_bucket_rate1h{env="gprd", le="0.1"} )
+
sum by (environment) ( sli_aggregations:gitlab_sql_primary_duration_seconds_bucket_rate1h{env="gprd", le="0.25"} )
)
/
2
/
(
sum by (environment) ( sli_aggregations:gitlab_sql_primary_duration_seconds_bucket_rate1h{env="gprd", le="+Inf"} ) > 0
)
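The arithmetic in the query above can be sketched in Python. This is a minimal illustration, assuming the three series are cumulative histogram bucket rates, so the `le="0.25"` bucket already contains every query counted in `le="0.1"`. Under that assumption, averaging the two buckets is equivalent to the usual Apdex form (satisfied + tolerating / 2) / total. The function name `query_apdex` and the sample numbers are hypothetical, not taken from production data.

```python
def query_apdex(satisfied_le, tolerable_le, total):
    """Apdex from cumulative histogram bucket rates.

    satisfied_le: rate of queries finishing within 100ms (bucket le="0.1")
    tolerable_le: rate of queries finishing within 250ms (bucket le="0.25");
                  being cumulative, it includes the satisfied queries
    total:        rate of all queries (bucket le="+Inf")
    """
    if total <= 0:
        # Mirrors the `> 0` guard in the query: no traffic, no Apdex value.
        return None
    return (satisfied_le + tolerable_le) / 2 / total


# Hypothetical rates: 1000 queries/s, 980 within 100ms, 995 within 250ms.
apdex = query_apdex(980.0, 995.0, 1000.0)
print(apdex)  # approximately 0.9875
```

In this made-up example 980 queries are satisfied and 15 are merely tolerated, so the score is (980 + 15 / 2) / 1000, just under the 0.99 target.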
The definition for self-managed instances is similar, but aggregated over a one-week period (gitlab-org/gitlab!39256 (merged)).
- Sisense chart - Performance Indicator(NSM): Average Query Apdex per version - All Editions
- Sisense chart - Performance Indicator(NSM): Average Query Apdex per version - CE vs EE
Problem
We have set the target for the Database query Apdex to 0.99.
This is reasonable for GitLab.com, where the Apdex is both expected and observed to drop below that target only during database-related production incidents (Apdex on GitLab.com). Under normal circumstances, the Database query Apdex is above 0.997 for GitLab.com.
Our experience with self-managed instances, however, is that it is very difficult to get above 0.98 in general, and above 0.97 for instances running GitLab EE.
One reason may be that we calculate a global average that counts all instances equally, even though they vary widely in size, platform, database used, and available resources. There may also be other root causes behind those values.
As an additional concern, the Database query Apdex is defined at such a high level that it does not allow us to easily investigate further or take actionable steps towards addressing the problems we observe. As an example, there was a drop in the Database query Apdex on GitLab.com in October 2021 that we were never able to explain or correlate with a root cause (gitlab-org/gitlab#343906 (closed)).
Goal
We may have to rethink how we model the Database query Apdex and whether we should introduce some type of clustering or weighted average.
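To make the weighted-average option concrete, here is a rough sketch comparing an unweighted mean across instances with a mean weighted by query volume. Everything in it is an assumption for illustration: the per-instance Apdex values and query counts are invented, weighting by query count is only one possible choice, and the function names are hypothetical.

```python
def unweighted_mean_apdex(instances):
    """Every instance counts equally, regardless of how many queries it ran."""
    return sum(apdex for apdex, _ in instances) / len(instances)


def weighted_mean_apdex(instances):
    """Each instance contributes in proportion to its weekly query count."""
    total_queries = sum(queries for _, queries in instances)
    return sum(apdex * queries for apdex, queries in instances) / total_queries


# Hypothetical (apdex, weekly_query_count) pairs: one large, well-provisioned
# instance and two small instances on constrained hardware.
instances = [
    (0.999, 9_000_000),
    (0.950, 500_000),
    (0.940, 500_000),
]

print(unweighted_mean_apdex(instances))  # approximately 0.963
print(weighted_mean_apdex(instances))    # approximately 0.9936
```

With these invented numbers the unweighted mean falls well below the 0.99 target while the volume-weighted mean stays above it. This shows how sensitive the aggregate is to the choice of model. It does not show which model is the right one.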
If the existing definition of the Database query Apdex is not adequate to track the impact of the initiatives from group::database and cannot help us drive forward new initiatives and our roadmap, we may even have to rethink it as our North Star Metric and make a different decision.