Add WAL rate indicator to db health check
What does this MR do and why?
This adds WAL rate as a new indicator to the existing database health check framework. It is behind the db_health_check_wal_rate feature flag (rollout issue).
There are two ways to get the WAL numbers: by querying the database or via Prometheus. We chose Prometheus (Thanos) because it makes time-ranged values easier to get.
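For comparison, the database route would mean sampling the WAL position twice and dividing by the elapsed time ourselves. Below is a rough sketch of that arithmetic, assuming LSNs in PostgreSQL's textual `high/low` hex format. The helpers `lsn_to_bytes` and `wal_rate` are hypothetical and not part of this MR.

```ruby
# Convert a textual LSN such as "16/B374D848" into an absolute byte position.
# The part before the slash is the high 32 bits, the part after is the low 32 bits.
def lsn_to_bytes(lsn)
  high, low = lsn.split('/')
  (high.to_i(16) << 32) + low.to_i(16)
end

# Average WAL bytes per second between two samples.
# Each sample is a hash with :lsn (String) and :at (Time, or Numeric seconds).
def wal_rate(older, newer)
  elapsed = (newer[:at] - older[:at]).to_f
  raise ArgumentError, 'samples must be taken at different times' unless elapsed.positive?

  (lsn_to_bytes(newer[:lsn]) - lsn_to_bytes(older[:lsn])) / elapsed
end

# Illustrative samples taken five minutes apart (values are made up).
older = { lsn: '0/1000000', at: 0 }
newer = { lsn: '0/2000000', at: 300 }
puts wal_rate(older, newer)
```

With Prometheus, the `rate5m` series used in the queries below appears to carry this rate already, so the application would not need to keep sampling state.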
Dependencies
- !128742 (merged) - Renames database_apdex_settings => prometheus_alert_db_indicators_settings.
- !128674 (merged) - Allows prometheus_alert_db_indicators_settings to have additionalProperties.
Script to add wal_rate settings in application_settings:
This is done via a change request rather than a migration, so that we keep control over the environment-specific values in the query string.
application_setting = ApplicationSetting.last
prometheus_alert_db_indicators_settings = application_setting.prometheus_alert_db_indicators_settings
application_setting.update(prometheus_alert_db_indicators_settings: prometheus_alert_db_indicators_settings.merge(
  wal_rate_sli_query: {
    main: "avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni'})",
    ci: "avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni-ci'})"
  },
  wal_rate_slo: {
    main: 70000000,
    ci: 70000000
  }
))
CR issues:
- gstg: gitlab-com/gl-infra/production#16138 (closed)
- gprd: gitlab-com/gl-infra/production#16191 (closed)
Response from Thanos:
The commands below were run from the staging console.
# gstg
[ gstg ] production> client.query("avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni'})")
=> [{"metric"=>{}, "value"=>[1691402826.928, "235689.48166859735"]}]
[ gstg ] production> client.query("avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni-ci'})")
=> [{"metric"=>{}, "value"=>[1691402831.056, "265188.8820589126"]}]
# gprd
[ gstg ] production> client.query("avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gprd', type='patroni'})")
=> [{"metric"=>{}, "value"=>[1691402817.84, "51125421.45258106"]}]
[ gstg ] production> client.query("avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gprd', type='patroni-ci'})")
=> [{"metric"=>{}, "value"=>[1691402836.828, "31577090.206746936"]}]
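Each response above is a single sample whose value is a `[timestamp, "string value"]` pair. The following is a minimal sketch of pulling the number out and comparing it with the configured `wal_rate_slo`. `extract_sli` is a hypothetical helper, not the method used in this MR.

```ruby
# Returns the sample value as a Float, or nil when the result is nil or empty.
def extract_sli(response)
  value = response&.first&.dig('value', 1)
  value && Float(value)
end

# gprd main response copied from the console output above.
response = [{ 'metric' => {}, 'value' => [1691402817.84, '51125421.45258106'] }]
slo = 70_000_000 # wal_rate_slo from the settings script

sli = extract_sli(response)
puts "sli=#{sli} within_slo=#{sli <= slo}"
```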
How to set up and validate locally
Prerequisite: Thanos can't be reached from a local machine, so the PromQL result has to be mocked locally.
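One possible way to mock the result is a small stand-in client that returns canned data in the same shape Thanos does. This is only a sketch: `FakePrometheusClient` is a hypothetical class, and wiring it into the indicator (for example by stubbing in a spec) is not covered here.

```ruby
# Stand-in for a Prometheus client: answers `ready?` and `query` from canned data.
class FakePrometheusClient
  # values_by_query maps a PromQL string to the numeric value to return for it.
  def initialize(values_by_query)
    @values_by_query = values_by_query
  end

  def ready?
    true
  end

  # Mimics an instant query result: one sample with [timestamp, "value"].
  # Queries without a canned value return an empty result.
  def query(promql)
    value = @values_by_query[promql]
    return [] if value.nil?

    [{ 'metric' => {}, 'value' => [Time.now.to_f, value.to_s] }]
  end
end

main_query = "avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni'})"
client = FakePrometheusClient.new(main_query => 80_000_000)
p client.query(main_query)
```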
Scenario 1: Signals::NotAvailable
without the required feature flag
Feature.enabled?(:db_health_check_wal_rate, type: :ops)
=> false
context = OpenStruct.new(gitlab_schema: :gitlab_main)
indicator = Gitlab::Database::HealthStatus::Indicators::WalRate.new(context)
indicator.evaluate
=> #<Gitlab::Database::HealthStatus::Signals::NotAvailable:0x00000001631d5100 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalRate, @reason="indicator disabled">
Scenario 2: Signals::Unknown
on empty prometheus alert settings
Feature.enable(:db_health_check_wal_rate)
application_setting = ApplicationSetting.last
old_prometheus_alert_db_indicators_settings = application_setting.prometheus_alert_db_indicators_settings
application_setting.update(prometheus_alert_db_indicators_settings: nil)
indicator.evaluate
=> #<Gitlab::Database::HealthStatus::Signals::Unknown:0x00000001663051d0 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalRate, @reason="Prometheus Settings not configured">
Scenario 3: Signals::Stop
on WAL rate condition not being met
application_setting.update(prometheus_alert_db_indicators_settings: old_prometheus_alert_db_indicators_settings.merge(
  wal_rate_sli_query: {
    main: "avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni'})",
    ci: "avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni-ci'})"
  },
  wal_rate_slo: {
    main: 70000000,
    ci: 70000000
  }
))
# Manually change Gitlab::PrometheusClient.ready? to `return true`
# Manually change Indicators::PrometheusAlertIndicator.fetch_sli to return a value above 70000000, e.g. 80000000
reload!
indicator.evaluate # re-assign variable as needed
=> #<Gitlab::Database::HealthStatus::Signals::Stop:0x0000000162f47e10 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalRate, @reason="WalRate SLI condition not met">
Scenario 4: Signals::Normal
on WAL rate condition being met
# Manually change Indicators::PrometheusAlertIndicator.fetch_sli to return a value below 70000000, e.g. 50000000
reload!
indicator.evaluate # re-assign variable as needed
=> #<Gitlab::Database::HealthStatus::Signals::Normal:0x0000000164087f18 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalRate, @reason="WalRate SLI condition met">
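The four scenarios can be summarised as one decision function. This is a simplified sketch rather than the code in this MR: `evaluate_wal_rate` is a hypothetical name, symbols stand in for the `Signals::*` classes, and treating a value exactly equal to the SLO as "met" is an assumption.

```ruby
# Returns [signal, reason] for the given inputs.
# settings uses the same keys as the wal_rate settings script; schema is :main or :ci.
def evaluate_wal_rate(flag_enabled:, settings:, sli:, schema: :main)
  return [:not_available, 'indicator disabled'] unless flag_enabled

  query = settings&.dig(:wal_rate_sli_query, schema)
  slo = settings&.dig(:wal_rate_slo, schema)
  return [:unknown, 'Prometheus Settings not configured'] if query.nil? || slo.nil?

  # Assumption: a value exactly equal to the SLO counts as met.
  if sli <= slo
    [:normal, 'WalRate SLI condition met']
  else
    [:stop, 'WalRate SLI condition not met']
  end
end

settings = {
  wal_rate_sli_query: { main: "avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni'})" },
  wal_rate_slo: { main: 70_000_000 }
}

p evaluate_wal_rate(flag_enabled: true, settings: settings, sli: 80_000_000)
p evaluate_wal_rate(flag_enabled: true, settings: settings, sli: 50_000_000)
```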
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #357251 (closed)