Saturation monitoring framework should support "medium-term" linear-interpolation alerting for imminent saturation

Corrective action for https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/49.

Introduction

Currently, saturation monitoring uses soft and hard alerting thresholds, each of which generates an alert when breached.

This covers the immediate case, where saturation is occurring.

For longer-term monitoring (days, weeks, months), we rely on Tamland for forecasting potential saturation.

For certain resources, we should also consider the intermediate case, for example a disk filling up very rapidly. Because Tamland runs periodically (daily or weekly), a disk can saturate rapidly between two Tamland runs.

This means that operators will only be alerted once the disk is very close to capacity.

[Screenshot: An example of a resource saturating too fast for Tamland forecasting.]

In some cases, a disk might be filling up at 20% of its capacity per hour. If the saturation threshold is set to 90%, the remaining 10% is consumed in 30 minutes at that rate, so an operator has 30 minutes or less between receiving the first alert and the disk being completely full.


Proposal

Add a new attribute to the resource definition that enables linear interpolation over periods shorter than those Tamland is useful for, for example a new linear_prediction_warning attribute.

{
  disk_space: resourceSaturationPoint({
    title: 'Disk Space Utilization per Device per Node',
    severity: 's2',
    horizontallyScalable: true,
    appliesTo: metricsCatalog.findVMProvisionedServices(first='gitaly'),
    description: |||
      Disk space utilization per device per node.
    |||,

    // New attribute for linear interpolation...
    // Generate an alert if saturation is expected to occur within the next 6h
    // based on linear interpolation of current growth
    linear_prediction_warning: '6h',

    grafana_dashboard_uid: 'sat_disk_space',
    resourceLabels: [labelTaxonomy.getLabelFor(labelTaxonomy.labels.node), 'device'],
    // We filter on `fqdn!=""` to filter out any nameless workers. This is done mostly for the ci-runner fleet
    query: |||
      (
        1 - node_filesystem_avail_bytes{fstype=~"ext.|xfs", %(selector)s} / node_filesystem_size_bytes{fstype=~"ext.|xfs", %(selector)s}
      )
    |||,
    // .. Other attributes
  }),
}

Implementation

Under the hood, the alert could be based on the predict_linear function applied to the existing recording rule. For example, with a 6h lookback window and a prediction horizon of 21600 seconds (6h):

predict_linear(gitlab_component_saturation:ratio{env="gprd", component="disk_space"}[6h], 21600) 
>= on (env, component, type) 
max by (env, component, type) (slo:max:soft:gitlab_component_saturation:ratio{env="gprd"})

An inhibitor or an additional clause in the expression could be used to silence this alert when the threshold is already exceeded, since the existing threshold alert covers that case.
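
As a minimal sketch of the additional-clause option, assuming the same recording rule and soft-threshold series as the query above, a PromQL `unless` clause could drop any prediction whose current saturation has already reached the soft threshold. The matching labels are copied from the example above and are an assumption about how these series are labelled, not a tested rule.

```promql
# Sketch only: fire when the 6h linear prediction reaches the soft threshold...
(
  predict_linear(gitlab_component_saturation:ratio{env="gprd", component="disk_space"}[6h], 21600)
  >= on (env, component, type)
  max by (env, component, type) (slo:max:soft:gitlab_component_saturation:ratio{env="gprd"})
)
# ...unless the current value is already at or above the soft threshold,
# in which case the existing threshold alert is assumed to cover it.
unless on (env, component, type)
(
  gitlab_component_saturation:ratio{env="gprd", component="disk_space"}
  >= on (env, component, type)
  max by (env, component, type) (slo:max:soft:gitlab_component_saturation:ratio{env="gprd"})
)
```

`unless` keeps only the left-hand series that have no match on the right, so the prediction alert would resolve as soon as the threshold alert takes over. An inhibitor would instead leave the expression unchanged and suppress the notification at the alert-routing layer.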

cc @o-lluch @cmiskell @reprazent @abrandl

Edited by Andrew Newdigate