Add WAL rate indicator to db health check (!128365)

Prabakaran Murugesan requested to merge 357251-db-throttle-by-wal-segments-rate into master Aug 03, 2023

What does this MR do and why?

This MR adds WAL rate as a new indicator to the existing database health check framework. It is behind the db_health_check_wal_rate feature flag (rollout issue).

There are two ways to get the WAL numbers: querying the database directly or querying Prometheus. We chose Prometheus (Thanos) because it makes time-ranged values easier to obtain.
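For comparison, the database route would need two samples of the WAL position (for example from `pg_current_wal_lsn()`) taken some time apart, with the rate derived from the difference. The sketch below illustrates that arithmetic only; the helper names `lsn_to_bytes` and `wal_bytes_per_second` are hypothetical and not part of this MR.

```ruby
# Convert a PostgreSQL LSN such as "16/B374D848" into an absolute byte position.
# The text form is two hexadecimal numbers: the high and the low 32 bits.
def lsn_to_bytes(lsn)
  high, low = lsn.split("/").map { |part| part.to_i(16) }
  (high << 32) + low
end

# Hypothetical helper: average WAL bytes per second between two LSN samples.
def wal_bytes_per_second(lsn_before, lsn_after, elapsed_seconds)
  raise ArgumentError, "elapsed_seconds must be positive" unless elapsed_seconds.positive?

  (lsn_to_bytes(lsn_after) - lsn_to_bytes(lsn_before)) / elapsed_seconds.to_f
end

# Two made-up samples taken 60 seconds apart.
puts wal_bytes_per_second("0/1000000", "0/2000000", 60)
```

The caller has to store the earlier sample and its timestamp somewhere, which is the bookkeeping that a precomputed rate metric in Prometheus avoids.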

Dependencies

!128742 (merged) - Renames database_apdex_settings => prometheus_alert_db_indicators_settings.
!128674 (merged) - Allows prometheus_alert_db_indicators_settings to have additionalProperties.

Script to add wal_rate settings in application_settings:

This is done through a change request instead of a migration, so that we can control the environment-specific values in the query string.

application_setting = ApplicationSetting.last
prometheus_alert_db_indicators_settings = application_setting.prometheus_alert_db_indicators_settings

application_setting.update(prometheus_alert_db_indicators_settings: prometheus_alert_db_indicators_settings.merge(
  wal_rate_sli_query: {
    main: "avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni'})",
    ci: "avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni-ci'})"
  },
  wal_rate_slo: {
    main: 70000000,
    ci: 70000000
  }
))
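The script relies on `Hash#merge`, which returns a new hash, keeps keys that are not mentioned, and replaces a mentioned key wholesale rather than merging nested hashes. A minimal sketch with plain hashes follows; the `existing_indicator_slo` key is a made-up placeholder for whatever the settings already contain.

```ruby
# Hypothetical existing settings; the real column may hold different keys.
existing = { existing_indicator_slo: { main: 0.99 } }

# Adding the WAL rate SLO keeps the existing key untouched.
updated = existing.merge(wal_rate_slo: { main: 70_000_000, ci: 70_000_000 })

# Merging the same top-level key again replaces the nested hash as a whole,
# so the `ci` entry is dropped here.
overwritten = updated.merge(wal_rate_slo: { main: 80_000_000 })

p updated.keys
p overwritten[:wal_rate_slo]
```

Because the merge is shallow, changing a single environment's threshold later means passing the full nested hash again.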

CR issues:

Response from Thanos:

The commands below were run from the staging console.

# gstg

[ gstg ] production> client.query("avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni'})")
=> [{"metric"=>{}, "value"=>[1691402826.928, "235689.48166859735"]}]

[ gstg ] production> client.query("avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni-ci'})")
=> [{"metric"=>{}, "value"=>[1691402831.056, "265188.8820589126"]}]

# gprd

[ gstg ] production> client.query("avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gprd', type='patroni'})")
=> [{"metric"=>{}, "value"=>[1691402817.84, "51125421.45258106"]}]

[ gstg ] production> client.query("avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gprd', type='patroni-ci'})")
=> [{"metric"=>{}, "value"=>[1691402836.828, "31577090.206746936"]}]
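Each response is an array with one sample whose `value` is a `[timestamp, string]` pair, so the string has to be converted before it can be compared with the numeric SLO. The sketch below shows one way to do that; `extract_sli` is a hypothetical helper, not the method used by the indicator.

```ruby
# Response copied from the gprd `patroni` query above.
response = [{ "metric" => {}, "value" => [1691402817.84, "51125421.45258106"] }]

# Hypothetical helper: pull the numeric sample out of a query result.
# Returns nil when the result is nil or empty.
def extract_sli(response)
  sample = response&.first
  return nil if sample.nil?

  _timestamp, raw_value = sample.fetch("value")
  Float(raw_value)
end

slo = 70_000_000 # bytes per second, as configured in wal_rate_slo
sli = extract_sli(response)

puts format("SLI %.0f B/s is %.1f%% of the SLO", sli, sli / slo * 100)
```

With the figures above, both staging values sit far below the 70000000 threshold, while the production values are within the same order of magnitude.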

How to set up and validate locally

Prerequisite: Thanos can't be reached from a local machine, so the PromQL result has to be mocked locally.
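One way to picture such a mock is a small stand-in object that answers `ready?` and `query` with canned data in the same shape as the Thanos responses recorded in this MR. The class below is purely illustrative: `FakePrometheusClient` is a hypothetical name and is not wired into GitLab.

```ruby
# Hypothetical stand-in for a Prometheus client; not part of GitLab.
class FakePrometheusClient
  # canned_values maps a PromQL string to the numeric value to return.
  def initialize(canned_values)
    @canned_values = canned_values
  end

  def ready?
    true
  end

  # Returns a one-sample array shaped like the recorded Thanos responses,
  # or an empty array for a query that has no canned value.
  def query(promql)
    value = @canned_values[promql]
    return [] if value.nil?

    [{ "metric" => {}, "value" => [Time.now.to_f, value.to_s] }]
  end
end

promql = "avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni'})"
client = FakePrometheusClient.new(promql => 80_000_000)
p client.query(promql)
```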

Scenario 1: Signals::NotAvailable without the required feature flag

Feature.enabled?(:db_health_check_wal_rate, type: :ops)
=> false

context = OpenStruct.new(gitlab_schema: :gitlab_main)
indicator = Gitlab::Database::HealthStatus::Indicators::WalRate.new(context)
indicator.evaluate

=> #<Gitlab::Database::HealthStatus::Signals::NotAvailable:0x00000001631d5100 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalRate, @reason="indicator disabled">

Scenario 2: Signals::Unknown on empty prometheus alert settings

Feature.enable(:db_health_check_wal_rate)

application_setting = ApplicationSetting.last
old_prometheus_alert_db_indicators_settings = application_setting.prometheus_alert_db_indicators_settings
application_setting.update(prometheus_alert_db_indicators_settings: nil)

indicator.evaluate

=> #<Gitlab::Database::HealthStatus::Signals::Unknown:0x00000001663051d0 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalRate, @reason="Prometheus Settings not configured">

Scenario 3: Signals::Stop on WAL rate condition not being met

application_setting.update(prometheus_alert_db_indicators_settings: old_prometheus_alert_db_indicators_settings.merge(
  wal_rate_sli_query: {
    main: "avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni'})",
    ci: "avg(postgres:pg_xlog_bytes_per_second:rate5m{env='gstg', type='patroni-ci'})"
  },
  wal_rate_slo: {
    main: 70000000,
    ci: 70000000
  }
))

# Manually change Gitlab::PrometheusClient.ready? to `return true`
# Manually change Indicators::PrometheusAlertIndicator.fetch_sli to return a value above 70000000, e.g. 80000000

reload!
indicator.evaluate # re-assign variable as needed

=> #<Gitlab::Database::HealthStatus::Signals::Stop:0x0000000162f47e10 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalRate, @reason="WalRate SLI condition not met">

Scenario 4: Signals::Normal on WAL rate condition being met

# Manually change Indicators::PrometheusAlertIndicator.fetch_sli to return a value below 70000000, e.g. 50000000

reload!
indicator.evaluate # re-assign variable as needed

=> #<Gitlab::Database::HealthStatus::Signals::Normal:0x0000000164087f18 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalRate, @reason="WalRate SLI condition met">
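The four scenarios can be summarized as a small decision function. This is a sketch of the observed behavior only, not the indicator's implementation: `expected_signal` is a hypothetical helper, the returned symbols merely mirror the Signals class names, and the handling of a value exactly equal to the SLO is an assumption, since the scenarios only cover values above and below it.

```ruby
# Hypothetical summary of Scenarios 1-4; not the actual WalRate code.
def expected_signal(flag_enabled:, settings:, sli:, schema: :main)
  return :not_available unless flag_enabled # Scenario 1
  return :unknown if settings.nil?          # Scenario 2

  slo = settings.dig(:wal_rate_slo, schema)
  # Assumption: a missing SLO or SLI is also treated as unknown.
  return :unknown if slo.nil? || sli.nil?

  # Assumption: a value equal to the SLO counts as normal.
  sli > slo ? :stop : :normal               # Scenarios 3 and 4
end

settings = { wal_rate_slo: { main: 70_000_000, ci: 70_000_000 } }

p expected_signal(flag_enabled: true, settings: settings, sli: 80_000_000)
p expected_signal(flag_enabled: true, settings: settings, sli: 50_000_000)
```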

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Related to #357251 (closed)

Edited Aug 21, 2023 by Prabakaran Murugesan
