Add saturation as a general metric
Addresses https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7217
Over the past few days, for example in the following incidents, we have reached saturation on a resource.
Saturation is one of the "golden signals" described in the Google SRE book: https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
Currently we don't monitor this as a key metric. The past few weeks have shown that we should.
Saturation is modelled as utilisation measured against a known, finite upper limit for a given resource. Each service can have multiple saturation components.
For example, saturation components can include memory, CPU, and single cores (for single-threaded services such as Redis).
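As a minimal sketch of this model, assuming each component reports a current usage and a known limit, a component's saturation could be expressed as a ratio clamped to the range 0 to 1. The function name `component_saturation` and the connection counts are hypothetical, chosen for illustration only; they are not taken from the actual implementation.

```python
import math


def component_saturation(used: float, limit: float) -> float:
    """Return usage as a fraction of a known finite limit, clamped to [0, 1].

    Hypothetical helper for illustration; not part of the real change.
    """
    if not math.isfinite(limit) or limit <= 0:
        raise ValueError("limit must be a positive, finite value")
    return max(0.0, min(used / limit, 1.0))


# Hypothetical figures: 190 of 200 database connections in use.
print(component_saturation(190, 200))
```

Clamping is a design choice in this sketch: it keeps every component on the same 0 to 1 scale, so components measured in different units can be compared directly.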
The saturation metric for a service is aggregated as the maximum saturation point of any of the components of that service.
For example, if the widget service has the following saturation metrics:
- cpu: 80%
- memory: 50%
- database_connections: 95%
Then, the saturation of the service is 95%.
This is because saturation can be thought of as a bottleneck. A service is as saturated as its most saturated component.
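The aggregation rule can be sketched in a few lines of Python. This is an illustration of the max-based aggregation only, not the implementation in this change; the function name `service_saturation` and the dictionary layout are assumptions, and the values are those of the widget service example.

```python
from typing import Mapping


def service_saturation(components: Mapping[str, float]) -> float:
    """Aggregate component saturations (0 to 1) by taking the maximum.

    Hypothetical helper for illustration; not part of the real change.
    """
    if not components:
        raise ValueError("a service needs at least one saturation component")
    return max(components.values())


# Values from the widget service example, expressed as fractions.
widget = {"cpu": 0.80, "memory": 0.50, "database_connections": 0.95}

# Identify which component is the bottleneck.
bottleneck = max(widget, key=widget.get)
print(f"{service_saturation(widget):.0%} ({bottleneck})")  # 95% (database_connections)
```

Taking the maximum rather than an average means a single exhausted resource is never hidden by headroom elsewhere.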