Long-term saturation forecasting

Addresses https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7217

EXPERIMENTAL

Long-term saturation forecasting focuses on long-term trends rather than immediate changes. It uses week-on-week growth to predict several metrics:

Average Values:

  1. What will the average value of the saturation metric be in two weeks' time?
  2. What will the average value of the saturation metric be in 30 days' time?
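
As a rough illustration of the two questions above, here is a minimal sketch that projects a weekly average forward. It assumes the week-on-week change stays constant (simple linear extrapolation), which may not be how the actual recording rules do it. The function name `forecast_average` and the sample values are hypothetical.

```python
def forecast_average(previous_week_avg, current_week_avg, days_ahead):
    """Project the weekly average forward by days_ahead days.

    Assumption: the week-on-week change stays constant (linear extrapolation).
    """
    weekly_change = current_week_avg - previous_week_avg
    projected = current_week_avg + weekly_change * (days_ahead / 7.0)
    # Saturation is a ratio, so keep the projection within [0, 1].
    return min(max(projected, 0.0), 1.0)


# Hypothetical weekly averages for a saturation metric: 60% last week, 63% this week.
two_weeks = forecast_average(0.60, 0.63, days_ahead=14)
thirty_days = forecast_average(0.60, 0.63, days_ahead=30)
print(f"two weeks: {two_weeks:.3f}, 30 days: {thirty_days:.3f}")
```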

Sapdex Trends:

"Sapdex is for saturation what apdex is for latency." -- @andrewn

Averages don't work very well for saturation because it can be spiky. Instead, we introduce the idea of sapdex.

The sapdex score for a metric depends on the range its value falls in:

  1. From 0% up to the soft saturation threshold: score of 100%
  2. From the soft threshold up to the hard saturation threshold: score of 50%
  3. From the hard saturation threshold up to 100%: score of 0%

The thresholds are defined per saturation metric. For example, for CPU, we might use 80% and 90% as our thresholds. We can then calculate the amount of time the CPU for a service exceeds these thresholds and measure this as sapdex over the week.
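
The scoring rules and the CPU example can be sketched roughly as follows. This is only an illustration: the function names and sample values are hypothetical, and treating a value exactly at a threshold as belonging to the lower band is an assumption.

```python
def sapdex_score(value, soft, hard):
    """Score one saturation sample (a ratio in [0, 1]) against its thresholds."""
    if value <= soft:
        return 1.0  # at or below the soft threshold
    if value <= hard:
        return 0.5  # between the soft and hard thresholds
    return 0.0      # above the hard threshold


def sapdex(samples, soft, hard):
    """Average the per-sample scores over a window, for example one week."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(sapdex_score(v, soft, hard) for v in samples) / len(samples)


# Hypothetical CPU samples, scored with the 80% / 90% thresholds from the text.
cpu_samples = [0.50, 0.70, 0.85, 0.95]
weekly_sapdex = sapdex(cpu_samples, soft=0.80, hard=0.90)
print(weekly_sapdex)
```

Because every sample contributes a bounded score, a short spike above the hard threshold lowers the sapdex in proportion to how long it lasts, not how high it goes.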

We measure the sapdex score for each metric, then calculate weekly averages and trends to produce a long-term forecast of when the metric will become saturated.

Using week-on-week sapdex trends, we can predict when a particular resource (for example single_core_cpu on the Redis primary) will become exhausted.
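
One way to picture this prediction is a linear extrapolation of the weekly sapdex. The sketch below is an assumption-laden illustration, not the actual implementation: the function name, the `floor` value at which a resource counts as exhausted, and the sample scores are all hypothetical.

```python
def weeks_until_exhausted(previous_sapdex, current_sapdex, floor=0.5):
    """Estimate how many weeks remain until the weekly sapdex drops to floor.

    Assumptions: the week-on-week change stays constant, and floor is a
    hypothetical cut-off below which the resource counts as exhausted.
    Returns None when the sapdex is flat or improving.
    """
    if current_sapdex <= floor:
        return 0.0
    weekly_change = current_sapdex - previous_sapdex
    if weekly_change >= 0:
        return None  # no downward trend, so no exhaustion is forecast
    return (current_sapdex - floor) / -weekly_change


# Hypothetical weekly sapdex scores: 0.98 last week, 0.95 this week.
remaining = weeks_until_exhausted(0.98, 0.95)
print(remaining)
```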

Demo Graph

This graph shows the sapdex scores for single_core_cpu for Redis, Redis-Cache and pgbouncer over the past 12 hours.

Redis-Cache and pgbouncer are doing well, Redis-Persistent not so much.

image

https://prometheus.gprd.gitlab.net/graph?g0.range_input=12h&g0.expr=avg_over_time(%0A(%0Aclamp_min(gitlab_component_saturation%3Aratio%20%3C%3D%20on(component)%20group_left%20slo%3Amax%3Asoft%3Agitlab_component_saturation%3Aratio%2C%201)%0A%20%20%20%20%20%20or%0A%20%20%20%20%20%20clamp_min(clamp_max(gitlab_component_saturation%3Aratio%20%3E%20on(component)%20group_left%20slo%3Amax%3Asoft%3Agitlab_component_saturation%3Aratio%2C%200.5)%2C%200.5)%0A%20%20%20%20%20%20or%0A%20%20%20%20%20%20clamp_max(gitlab_component_saturation%3Aratio%20%3E%20on(component)%20group_left%20slo%3Amax%3Ahard%3Agitlab_component_saturation%3Aratio%2C%200)%0A)%5B30m%3A%5D%0A)&g0.tab=0

Edited by Andrew Newdigate
