Saturation monitoring framework should support "medium-term" linear-interpolation alerting for imminent saturation
Corrective action for https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/49.
Introduction
Currently, saturation monitoring uses soft and hard alerting thresholds, which generate an alert when breached.
This covers the immediate case, where saturation is occurring.
For longer-term monitoring (days, weeks, months), we rely on Tamland for forecasting potential saturation.
For certain resources, we should also consider the intermediate case, for example a disk filling up very rapidly. Because Tamland runs periodically (daily or weekly), a disk may saturate rapidly between two Tamland runs.
This means that operators will only be alerted once the disk is very close to capacity.
An example of a resource saturating too fast for Tamland forecasting.
In some cases, a disk might be filling up at 20% per hour. If the saturation threshold is set to 90%, an operator has 30 minutes or less from the time of receiving the first alert to the time that the disk is completely filled.
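To spell out the arithmetic: at the 90% threshold there is 10% of headroom left, and at 20% per hour that headroom lasts 10 / 20 = 0.5 hours. As a rough sketch only, the same estimate could be expressed in PromQL, assuming the existing gitlab_component_saturation:ratio recording rule is a 0-1 gauge and that the last hour of growth is representative (the 1h window is an illustrative choice, not something the framework defines):

```promql
# Estimated seconds until the resource is full:
# remaining headroom divided by the per-second growth rate.
# deriv() fits a least-squares line over the window; the "> 0" filter
# keeps only series that are actually growing.
(1 - gitlab_component_saturation:ratio{env="gprd", component="disk_space"})
/
(deriv(gitlab_component_saturation:ratio{env="gprd", component="disk_space"}[1h]) > 0)
```

For a disk at 0.90 growing by 0.20 per hour, this evaluates to 0.10 / (0.20 / 3600) = 1800 seconds.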
Proposal
Add a new attribute to the resource definition that enables linear interpolation over periods shorter than those useful for Tamland, for example a new linear_prediction_warning attribute.
{
  disk_space: resourceSaturationPoint({
    title: 'Disk Space Utilization per Device per Node',
    severity: 's2',
    horizontallyScalable: true,
    appliesTo: metricsCatalog.findVMProvisionedServices(first='gitaly'),
    description: |||
      Disk space utilization per device per node.
    |||,
    // New attribute for linear interpolation...
    // Generate an alert if saturation is expected to occur within the next 6h
    // based on linear interpolation of current growth
    linear_prediction_warning: '6h',
    grafana_dashboard_uid: 'sat_disk_space',
    resourceLabels: [labelTaxonomy.getLabelFor(labelTaxonomy.labels.node), 'device'],
    // We filter on `fqdn!=""` to filter out any nameless workers. This is done mostly for the ci-runner fleet
    query: |||
      (
        1 - node_filesystem_avail_bytes{fstype=~"ext.|xfs", %(selector)s} / node_filesystem_size_bytes{fstype=~"ext.|xfs", %(selector)s}
      )
    |||,
    // .. Other attributes
  }),
}
Implementation
Under the hood, the alert could be based on the predict_linear function applied to the existing recording rule, for example:
predict_linear(gitlab_component_saturation:ratio{env="gprd", component="disk_space"}[6h], 21600)
>= on (env, component, type)
max by (env, component, type) (slo:max:soft:gitlab_component_saturation:ratio{env="gprd"})
An inhibitor or additional clause could be used to silence the alert when the threshold is already exceeded.
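As one possible shape for the "additional clause" option, PromQL's unless operator can drop any series whose current saturation already meets the soft threshold, leaving that case to the existing threshold alert. This is a sketch under the same assumptions as the expression above (same recording rules, same label matching), not a tested rule:

```promql
# Fire only when saturation is predicted to cross the soft threshold within 6h...
(
  predict_linear(gitlab_component_saturation:ratio{env="gprd", component="disk_space"}[6h], 21600)
  >= on (env, component, type)
  max by (env, component, type) (slo:max:soft:gitlab_component_saturation:ratio{env="gprd"})
)
# ...and the current value has not already crossed it.
# "unless" removes left-hand series that have a match with identical labels on the right.
unless
(
  gitlab_component_saturation:ratio{env="gprd", component="disk_space"}
  >= on (env, component, type)
  max by (env, component, type) (slo:max:soft:gitlab_component_saturation:ratio{env="gprd"})
)
```

Keeping the suppression inside the expression means the prediction alert never fires alongside the threshold alert; an Alertmanager inhibit rule would achieve a similar effect at notification time while still recording both alerts.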