Long-term saturation forecasting
Addresses https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7217
EXPERIMENTAL
Long-term saturation forecasting focuses on long-term trends rather than immediate changes. It uses week-on-week growth to predict several metrics:
Average Values:
- What will the average value of the saturation metric be in two weeks' time?
- What will the average value of the saturation metric be in 30 days' time?
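As a rough illustration of the two questions above, the sketch below projects a weekly average forward using the mean week-on-week change. The function name `project_average`, the linear (additive) growth model and the clamping to the 0..1 range are all assumptions made for this example; the actual forecast may be calculated differently.

```python
from typing import Sequence


def project_average(weekly_averages: Sequence[float], days_ahead: float) -> float:
    """Project a saturation ratio (0..1) forward by `days_ahead` days.

    Hypothetical helper: assumes growth is linear and equal to the mean
    week-on-week change across the supplied weekly averages (oldest first).
    """
    if len(weekly_averages) < 2:
        raise ValueError("need at least two weekly averages")

    # Week-on-week change between each pair of consecutive weeks.
    deltas = [b - a for a, b in zip(weekly_averages, weekly_averages[1:])]
    growth_per_week = sum(deltas) / len(deltas)

    # Extend the most recent average by the expected growth.
    projected = weekly_averages[-1] + growth_per_week * (days_ahead / 7)

    # A saturation ratio cannot leave the 0..1 range.
    return min(1.0, max(0.0, projected))


history = [0.50, 0.52, 0.54]  # illustrative values, not real measurements
in_two_weeks = project_average(history, 14)
in_thirty_days = project_average(history, 30)
```

A compounding (percentage) growth model would be an equally plausible reading of "week-on-week growth"; the additive form is used here only because it is the simplest to follow.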
Sapdex Trends:
"Sapdex is for saturation what apdex is for latency." -- @andrewn
Averages don't work very well for saturation because it can be spiky: short bursts above a threshold are hidden by the mean. Instead, we introduce the idea of sapdex.
The sapdex score for a metric is as follows:
- From 0% to the soft saturation threshold: score of 100%
- From the soft to the hard saturation threshold: score of 50%
- From the hard saturation threshold to 100%: score of 0%
The thresholds are defined per saturation metric. For example, for CPU, we might use 80% and 90% as our soft and hard thresholds respectively. We can then calculate the amount of time the CPU for a service exceeds these thresholds and measure this as sapdex over the week.
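The scoring rules can be sketched as follows. This is a minimal illustration and not the production implementation: the function names are hypothetical, samples are assumed to be evenly spaced ratios in the range 0..1, and a sample exactly on a threshold is assumed not to exceed it.

```python
from typing import Iterable


def sample_score(value: float, soft: float, hard: float) -> float:
    """Score a single saturation sample against the two thresholds."""
    if value > hard:
        return 0.0  # above the hard threshold
    if value > soft:
        return 0.5  # between the soft and hard thresholds
    return 1.0      # at or below the soft threshold


def sapdex(samples: Iterable[float], soft: float, hard: float) -> float:
    """Average the per-sample scores over a period, for example one week."""
    scores = [sample_score(v, soft, hard) for v in samples]
    if not scores:
        raise ValueError("no samples to score")
    return sum(scores) / len(scores)


# CPU example using the 80% and 90% thresholds mentioned above.
cpu_samples = [0.50, 0.85, 0.95, 0.70]  # illustrative values only
score = sapdex(cpu_samples, soft=0.80, hard=0.90)  # (1 + 0.5 + 0 + 1) / 4 = 0.625
```

Because the samples are evenly spaced, averaging the per-sample scores is equivalent to weighting each score by the time spent in that band.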
We measure the sapdex score for each metric, then calculate weekly averages and trends to produce a long-term forecast of when the metric will become saturated.
Using week-on-week sapdex trends, we can predict when a particular resource (for example single_core_cpu on the Redis primary) will become exhausted.
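One way such a prediction could work is to fit a straight line through the weekly sapdex scores and extrapolate to the point where the score reaches a chosen floor. The sketch below assumes an ordinary least-squares fit and a hypothetical `floor` parameter that defines "exhausted"; neither is confirmed as the method used here.

```python
from typing import Optional, Sequence


def weeks_until_exhausted(
    weekly_sapdex: Sequence[float], floor: float = 0.0
) -> Optional[float]:
    """Estimate weeks from the latest score until sapdex falls to `floor`.

    Hypothetical helper: fits a least-squares line to the weekly scores
    (oldest first) and returns None when the trend is flat or improving.
    """
    n = len(weekly_sapdex)
    if n < 2:
        raise ValueError("need at least two weekly scores")

    # Least-squares slope and intercept, with week indices 0..n-1 as x.
    mean_x = (n - 1) / 2
    mean_y = sum(weekly_sapdex) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(weekly_sapdex))
    slope = sxy / sxx

    if slope >= 0:
        return None  # no downward trend, so no exhaustion is forecast

    intercept = mean_y - slope * mean_x
    week_of_exhaustion = (floor - intercept) / slope

    # Report time remaining relative to the most recent week.
    return max(0.0, week_of_exhaustion - (n - 1))


# Illustrative scores falling by 0.1 per week: roughly 7 weeks remain.
remaining = weeks_until_exhausted([1.0, 0.9, 0.8, 0.7])
```

A straight-line fit is only reasonable while the trend is roughly linear, so with just a few weeks of data the result is better read as a rough indication than a firm date.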
Demo Graph
This graph shows the sapdex scores for single_core_cpu for Redis, Redis-Cache and pgbouncer over the past 12 hours.
Redis-Cache and pgbouncer are doing well; Redis-Persistent, not so much.
