Pause batched migration on Patroni apdex below SLO for GitLab.com
What does this MR do and why?
This builds on top of Pause data migrations upon active autovacuum (!85196 - merged) and Pause batched background migration when WAL pen... (!84555 - merged), and adds the ability to put batched migrations on hold if the Patroni apdex drops below the defined SLO.
Database Apdex settings
- Unlike the two existing indicators, which fetch data from the database, this one uses Prometheus (Thanos) as its source.
- We decided to keep the connection API URL and the SLI queries in the application settings so that they can be configured per environment, since a few query variables (such as env) differ between environments.
- The actual Apdex SLO is stored directly in the settings instead of being queried from Thanos, since this number shouldn't change often.
For GitLab.com (prod), database_apdex_settings will be:
{
  prometheus_api_url: 'http://thanos-query-frontend-internal.ops.gke.gitlab.net:9090',
  apdex_sli_query: {
    main: "avg_over_time(gitlab_service_apdex:ratio_5m{env='gprd',environment='gprd',monitor='global',type='patroni'}[5m])",
    ci: "avg_over_time(gitlab_service_apdex:ratio_5m{env='gprd',environment='gprd',monitor='global',type='patroni-ci'}[5m])"
  },
  apdex_slo: {
    main: 0.999,
    ci: 0.999
  }
}
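As a rough illustration of how settings of this shape could be consumed, the sketch below looks up the SLI query and SLO for one database. The helper name `apdex_config_for`, the constant `EXAMPLE_APDEX_SETTINGS`, and the use of symbol keys are assumptions for illustration only; this is not the code added by this MR.

```ruby
# Hypothetical helper (not part of this MR): pick the per-database query
# and SLO out of a database_apdex_settings-shaped hash.
def apdex_config_for(settings, schema)
  return nil if settings.nil?

  query = settings.dig(:apdex_sli_query, schema)
  slo = settings.dig(:apdex_slo, schema)
  # Treat a partially configured database as "not configured".
  return nil if query.nil? || slo.nil?

  { url: settings[:prometheus_api_url], query: query, slo: slo }
end

# Same values as the GitLab.com (prod) settings shown above.
EXAMPLE_APDEX_SETTINGS = {
  prometheus_api_url: 'http://thanos-query-frontend-internal.ops.gke.gitlab.net:9090',
  apdex_sli_query: {
    main: "avg_over_time(gitlab_service_apdex:ratio_5m{env='gprd',environment='gprd',monitor='global',type='patroni'}[5m])",
    ci: "avg_over_time(gitlab_service_apdex:ratio_5m{env='gprd',environment='gprd',monitor='global',type='patroni-ci'}[5m])"
  },
  apdex_slo: { main: 0.999, ci: 0.999 }
}.freeze

config = apdex_config_for(EXAMPLE_APDEX_SETTINGS, :ci)
puts config[:query]
```

Keeping the query and the SLO keyed by database means each database can be tuned independently without touching the indicator code.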
Thanos responses on fetching Apdex SLIs:
[ gprd ] production> client = Gitlab::PrometheusClient.new('http://thanos-query-frontend-internal.ops.gke.gitlab.net:9090', allow_local_requests: true, verify: true)
=> #<Gitlab::PrometheusClient:0x00007fd34a21a668 @api_url="http://thanos-query-frontend-internal.ops.gke.gitlab.net:9090", @options={:allow_local_requests=>true, :verify=>true}>
[ gprd ] production> client.query('avg_over_time(gitlab_service_apdex:ratio_5m{env="gprd",environment="gprd",monitor="global",type="patroni"}[5m])')
=> [{"metric"=>{"env"=>"gprd", "environment"=>"gprd", "monitor"=>"global", "ruler_cluster"=>"thanos", "stage"=>"main", "tier"=>"db", "type"=>"patroni"}, "value"=>[1677849262.085, "0.9977587091684327"]}]
[ gprd ] production> client.query('avg_over_time(gitlab_service_apdex:ratio_5m{env="gprd",environment="gprd",monitor="global",type="patroni-ci"}[5m])')
=> []
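The responses above are arrays of samples whose `"value"` is a `[timestamp, "number as string"]` pair, and the array can be empty when no series matches (as for `patroni-ci`). A minimal sketch of extracting the SLI from such a response follows; `extract_sli` and the trimmed sample constant are hypothetical names, not the MR's implementation.

```ruby
# Hypothetical helper: pull the numeric SLI out of an instant-query result
# shaped like the responses above. Returns nil when no series matched.
def extract_sli(result)
  sample = Array(result).first
  return nil if sample.nil?

  raw = sample.dig("value", 1) # "value" is [timestamp, "number as string"]
  raw.nil? ? nil : Float(raw)
end

# Trimmed copy of the patroni response shown above.
PATRONI_RESULT = [
  { "metric" => { "type" => "patroni" }, "value" => [1677849262.085, "0.9977587091684327"] }
].freeze

puts extract_sli(PATRONI_RESULT)
puts extract_sli([]).inspect # nil, the empty-response case seen for patroni-ci
```

A `nil` SLI cannot be compared against the SLO, so a caller has to handle that case explicitly; how the indicator in this MR treats it is not shown in this description.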
CR issue to update the settings:
How to set up and validate locally
Prerequisite: Thanos can't be accessed from a local machine, so we have to mock some data locally and skip the client.ready? check.
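One way to mock the SLI without editing the source file is to override the method on a single instance in the console. The sketch below demonstrates the technique on a stand-in class; `FakePatroniApdexIndicator` and its method bodies are hypothetical, and whether the real indicator can be stubbed the same way depends on its implementation.

```ruby
# Stand-in for the real indicator; class name and bodies are hypothetical.
class FakePatroniApdexIndicator
  def initialize(slo)
    @slo = slo
  end

  def evaluate
    fetch_sli < @slo ? :stop : :normal
  end

  private

  # Would normally query Thanos, which is unreachable locally.
  def fetch_sli
    raise "Thanos is not reachable from a local machine"
  end
end

def stubbed_indicator(slo, fake_sli)
  indicator = FakePatroniApdexIndicator.new(slo)
  # Override fetch_sli on this one instance only; the class stays untouched.
  indicator.define_singleton_method(:fetch_sli) { fake_sli }
  indicator
end

puts stubbed_indicator(0.999, 0.995).evaluate  # prints "stop"
puts stubbed_indicator(0.999, 0.9991).evaluate # prints "normal"
```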
Scenario 1: Returns Signals::NotAvailable
when database_apdex_settings is not configured
> application_setting = ApplicationSetting.last
> application_setting.database_apdex_settings
=> nil
> context = OpenStruct.new(gitlab_schema: :main)
> indicator = Gitlab::Database::BackgroundMigration::HealthStatus::Indicators::PatroniApdex.new(context)
> indicator.evaluate
=> #<Gitlab::Database::BackgroundMigration::HealthStatus::Signals::NotAvailable:0x0000000134b37100
@indicator_class=Gitlab::Database::BackgroundMigration::HealthStatus::Indicators::PatroniApdex,
@reason="indicator disabled">
Scenario 2: Returns Signals::NotAvailable
without the required feature flag
> application_setting.update(database_apdex_settings: {
    prometheus_api_url: 'http://thanos-query-frontend-internal.ops.gke.gitlab.net:9090',
    apdex_sli_query: {
      main: "avg_over_time(gitlab_service_apdex:ratio_5m{env='gprd',environment='gprd',monitor='global',type='patroni'}[5m])",
      ci: "avg_over_time(gitlab_service_apdex:ratio_5m{env='gprd',environment='gprd',monitor='global',type='patroni'}[5m])"
    },
    apdex_slo: {
      main: 0.999,
      ci: 0.999
    }
  })
> Feature.enabled?(:batched_migrations_health_status_patroni_apdex, type: :ops)
=> false
> indicator.evaluate
=> #<Gitlab::Database::BackgroundMigration::HealthStatus::Signals::NotAvailable:0x0000000151a8bdb0
@indicator_class=Gitlab::Database::BackgroundMigration::HealthStatus::Indicators::PatroniApdex,
@reason="indicator disabled">
Scenario 3: Returns Signals::Stop when the Apdex SLI is below the Apdex SLO
> Feature.enable(:batched_migrations_health_status_patroni_apdex)
=> true
# Manually change Indicators::PatroniApdex#fetch_sli method to return 0.995 (below the SLO)
> indicator.evaluate
=> #<Gitlab::Database::BackgroundMigration::HealthStatus::Signals::Stop:0x0000000166ddc518
@indicator_class=Gitlab::Database::BackgroundMigration::HealthStatus::Indicators::PatroniApdex,
@reason="Patroni service apdex is below SLO">
Scenario 4: Signals::Normal when SLI is above SLO
# Manually change Indicators::PatroniApdex#fetch_sli method to return 0.9991 (above the SLO)
> indicator.evaluate
=> #<Gitlab::Database::BackgroundMigration::HealthStatus::Signals::Normal:0x0000000143452628
@indicator_class=Gitlab::Database::BackgroundMigration::HealthStatus::Indicators::PatroniApdex,
@reason="Patroni service apdex is above SLO">
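The four scenarios can be summarized as a small decision function. This is a sketch under stated assumptions, not the indicator's actual code: `ApdexSignal`, `evaluate_apdex`, and the `:unknown` branch with its reason text are hypothetical, and treating an SLI exactly equal to the SLO as normal is also an assumption. Only the "indicator disabled", "below SLO", and "above SLO" reasons come from the outputs above.

```ruby
# Hypothetical stand-in for the Signals classes used by the real indicator.
ApdexSignal = Struct.new(:kind, :reason)

def evaluate_apdex(settings:, flag_enabled:, schema:, sli:)
  # Scenarios 1 and 2: missing settings or a disabled feature flag.
  if settings.nil? || !flag_enabled
    return ApdexSignal.new(:not_available, "indicator disabled")
  end

  slo = settings.dig(:apdex_slo, schema)
  # Assumption: behavior for a missing SLO or SLI is not shown in this MR description.
  if slo.nil? || sli.nil?
    return ApdexSignal.new(:unknown, "apdex SLI or SLO missing (hypothetical reason)")
  end

  if sli < slo
    ApdexSignal.new(:stop, "Patroni service apdex is below SLO")   # Scenario 3
  else
    ApdexSignal.new(:normal, "Patroni service apdex is above SLO") # Scenario 4
  end
end

SCENARIO_SETTINGS = { apdex_slo: { main: 0.999, ci: 0.999 } }.freeze

signal = evaluate_apdex(settings: SCENARIO_SETTINGS, flag_enabled: true, schema: :main, sli: 0.995)
puts "#{signal.kind}: #{signal.reason}" # prints "stop: Patroni service apdex is below SLO"
```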
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #357250 (closed)