
Pause batched migration on Patroni apdex below SLO for GitLab.com

Krasimir Angelov requested to merge 357250-bbm-patroni-apdex-signaling into master

What does this MR do and why?

This builds on top of Pause data migrations upon active autovacuum (!85196 - merged) and Pause batched background migration when WAL pen... (!84555 - merged), and adds the ability to put batched migrations on hold if the Patroni apdex drops below the defined SLO.

Database Apdex settings

  • Unlike the two existing indicators which fetch data from the database, this one uses Prometheus (Thanos) as the source.
  • The connection API URL and the SLI queries are kept in the application settings, so they can be configured per environment: a few query labels (such as env) differ between environments.
  • The Apdex SLO itself is stored directly in the settings instead of being queried from Thanos, since this number shouldn't change often.

For GitLab.com (prod), database_apdex_settings will be:

{
  prometheus_api_url: 'http://thanos-query-frontend-internal.ops.gke.gitlab.net:9090',
  apdex_sli_query: {
    main: "avg_over_time(gitlab_service_apdex:ratio_5m{env='gprd',environment='gprd',monitor='global',type='patroni'}[5m])",
    ci: "avg_over_time(gitlab_service_apdex:ratio_5m{env='gprd',environment='gprd',monitor='global',type='patroni-ci'}[5m])"
  },
  apdex_slo: {
    main: 0.999,
    ci: 0.999
  }
}
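To illustrate how the per-database keys fit together, here is a minimal sketch of looking up the query and SLO for one schema. The helper name `apdex_config_for` and the use of symbol keys are assumptions made for illustration; the actual indicator may read the setting differently (for example with string keys).

```ruby
# Settings hash copied from the description above.
DATABASE_APDEX_SETTINGS = {
  prometheus_api_url: 'http://thanos-query-frontend-internal.ops.gke.gitlab.net:9090',
  apdex_sli_query: {
    main: "avg_over_time(gitlab_service_apdex:ratio_5m{env='gprd',environment='gprd',monitor='global',type='patroni'}[5m])",
    ci: "avg_over_time(gitlab_service_apdex:ratio_5m{env='gprd',environment='gprd',monitor='global',type='patroni-ci'}[5m])"
  },
  apdex_slo: {
    main: 0.999,
    ci: 0.999
  }
}.freeze

# Hypothetical helper (not part of this MR): returns the URL, SLI query and
# SLO for the given schema, or nil when the settings are missing or incomplete.
def apdex_config_for(settings, schema)
  return nil if settings.nil?

  query = settings.dig(:apdex_sli_query, schema)
  slo = settings.dig(:apdex_slo, schema)
  return nil if query.nil? || slo.nil?

  { url: settings[:prometheus_api_url], query: query, slo: slo }
end

config = apdex_config_for(DATABASE_APDEX_SETTINGS, :ci)
puts config[:slo] # prints 0.999
```

Returning nil for an unconfigured schema keeps the "settings not configured" case easy to detect in the caller.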

Thanos responses on fetching Apdex SLIs:

[ gprd ] production> client = Gitlab::PrometheusClient.new('http://thanos-query-frontend-internal.ops.gke.gitlab.net:9090', allow_local_requests: true, verify: true)
=> #<Gitlab::PrometheusClient:0x00007fd34a21a668 @api_url="http://thanos-query-frontend-internal.ops.gke.gitlab.net:9090", @options={:allow_local_requests=>true, :verify=>true}>

[ gprd ] production> client.query('avg_over_time(gitlab_service_apdex:ratio_5m{env="gprd",environment="gprd",monitor="global",type="patroni"}[5m])')
=> [{"metric"=>{"env"=>"gprd", "environment"=>"gprd", "monitor"=>"global", "ruler_cluster"=>"thanos", "stage"=>"main", "tier"=>"db", "type"=>"patroni"}, "value"=>[1677849262.085, "0.9977587091684327"]}]

[ gprd ] production> client.query('avg_over_time(gitlab_service_apdex:ratio_5m{env="gprd",environment="gprd",monitor="global",type="patroni-ci"}[5m])')
=> []
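The responses above are instant vectors: an array of samples, each with a `value` pair of timestamp and stringified number, or an empty array when no series matches. A minimal sketch of turning such a response into a Float could look like the following. The method name `extract_sli` is hypothetical, and treating an empty result as nil is an assumption rather than the behaviour confirmed by this MR.

```ruby
# Hypothetical helper: pulls the numeric SLI out of a Prometheus/Thanos
# instant-query response, returning nil when there is no usable sample.
def extract_sli(response)
  sample = (response || []).first
  return nil unless sample.is_a?(Hash)

  _timestamp, value = sample['value']
  return nil if value.nil?

  # Float(..., exception: false) returns nil instead of raising on bad input.
  Float(value, exception: false)
end

patroni = [{ 'metric' => { 'type' => 'patroni' }, 'value' => [1677849262.085, '0.9977587091684327'] }]
extract_sli(patroni) # => 0.9977587091684327
extract_sli([])      # => nil
```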

CR issue to update the settings:

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

How to set up and validate locally

Prerequisite: Thanos can't be reached from a local machine, so we have to mock some data locally and skip the client.ready? check.
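One possible way to mock the data is a small stand-in for the Prometheus client that always reports ready and returns a canned sample. This is only a sketch: `FakePrometheusClient` is a hypothetical class, and how it would be wired into the indicator is not covered by this description (the scenarios below simply edit `fetch_sli` by hand).

```ruby
# Hypothetical stand-in for Gitlab::PrometheusClient, for local use only.
class FakePrometheusClient
  def initialize(value)
    @value = value
  end

  # Always ready, so the readiness check is effectively skipped.
  def ready?
    true
  end

  # Mimics the shape of the Thanos responses shown earlier:
  # one sample when a value is configured, an empty array otherwise.
  def query(_promql)
    return [] if @value.nil?

    [{ 'metric' => { 'type' => 'patroni' }, 'value' => [Time.now.to_f, @value.to_s] }]
  end
end

client = FakePrometheusClient.new(0.995)
puts client.query('ignored').first['value'].last # prints 0.995
```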

Scenario 1: Returns Signals::NotAvailable when database_apdex_settings is not configured

> application_setting = ApplicationSetting.last
> application_setting.database_apdex_settings
=> nil

> context = OpenStruct.new(gitlab_schema: :main)
> indicator = Gitlab::Database::BackgroundMigration::HealthStatus::Indicators::PatroniApdex.new(context)
> indicator.evaluate

=> #<Gitlab::Database::BackgroundMigration::HealthStatus::Signals::NotAvailable:0x0000000134b37100
 @indicator_class=Gitlab::Database::BackgroundMigration::HealthStatus::Indicators::PatroniApdex,
 @reason="indicator disabled">

Scenario 2: Returns Signals::NotAvailable without the required feature flag

> application_setting.update(database_apdex_settings: {
  prometheus_api_url: 'http://thanos-query-frontend-internal.ops.gke.gitlab.net:9090',
  apdex_sli_query: {
    main: "avg_over_time(gitlab_service_apdex:ratio_5m{env='gprd',environment='gprd',monitor='global',type='patroni'}[5m])",
    ci: "avg_over_time(gitlab_service_apdex:ratio_5m{env='gprd',environment='gprd',monitor='global',type='patroni'}[5m])"
  },
  apdex_slo: {
    main: 0.999,
    ci: 0.999
  }
})

> Feature.enabled?(:batched_migrations_health_status_patroni_apdex, type: :ops)
=> false

> indicator.evaluate
=> #<Gitlab::Database::BackgroundMigration::HealthStatus::Signals::NotAvailable:0x0000000151a8bdb0
 @indicator_class=Gitlab::Database::BackgroundMigration::HealthStatus::Indicators::PatroniApdex,
 @reason="indicator disabled">

Scenario 3: Returns Signals::Stop when the Apdex SLI is below the Apdex SLO

> Feature.enable(:batched_migrations_health_status_patroni_apdex)
=> true

# Manually change Indicators::PatroniApdex#fetch_sli method to return 0.995 (below SLO)

> indicator.evaluate
=> #<Gitlab::Database::BackgroundMigration::HealthStatus::Signals::Stop:0x0000000166ddc518
 @indicator_class=Gitlab::Database::BackgroundMigration::HealthStatus::Indicators::PatroniApdex,
 @reason="Patroni service apdex is below SLO">

Scenario 4: Returns Signals::Normal when the Apdex SLI is above the Apdex SLO

# Manually change Indicators::PatroniApdex#fetch_sli method to return 0.9991 (above SLO)

> indicator.evaluate
=> #<Gitlab::Database::BackgroundMigration::HealthStatus::Signals::Normal:0x0000000143452628
 @indicator_class=Gitlab::Database::BackgroundMigration::HealthStatus::Indicators::PatroniApdex,
 @reason="Patroni service apdex is above SLO">
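The four scenarios can be condensed into a small decision function. The sketch below is not the indicator's implementation: `Signal` and `evaluate_apdex` are hypothetical names, the reasons are copied from the console output above, and treating an SLI exactly equal to the SLO as normal is an assumption. It also does not cover the case where Thanos returns no sample.

```ruby
# Hypothetical, simplified stand-in for the Signals::* classes.
Signal = Struct.new(:type, :reason)

def evaluate_apdex(settings_present:, flag_enabled:, sli:, slo:)
  # Scenarios 1 and 2: missing settings or a disabled feature flag.
  unless settings_present && flag_enabled
    return Signal.new(:not_available, 'indicator disabled')
  end

  # Scenario 3: the measured SLI is below the configured SLO.
  return Signal.new(:stop, 'Patroni service apdex is below SLO') if sli < slo

  # Scenario 4: the SLI meets the SLO (equality handling is an assumption).
  Signal.new(:normal, 'Patroni service apdex is above SLO')
end

evaluate_apdex(settings_present: true, flag_enabled: true, sli: 0.995, slo: 0.999).type  # => :stop
evaluate_apdex(settings_present: true, flag_enabled: true, sli: 0.9991, slo: 0.999).type # => :normal
```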

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #357250 (closed)

Edited by Prabakaran Murugesan
