Infrastructure Alerting: Critical if disk usage > 40%
Regular change
Summary
Following the incident umami#503 we saw that the node's disk utilization increases in spikes that reach roughly double its regular size, and if the peak utilization reaches 100% of the disk, the node's storage gets corrupted and cannot recover.
This also relates to nomadic-labs/umami-wallet/umami#293, which suggested doubling the size of the disks. Since nomadic-labs/umami-wallet/umami#293 implied changing the infrastructure, its recommendations were taken into consideration as part of the new servers. /relate nomadic-labs/umami-wallet/umami#293
Area of the system
Umami-stack: infra, monitoring
How does this currently work?
Alerts are set in https://gitlab.com/nomadic-labs/umami-wallet/umami-stack/-/blob/monitoring/dockprom/prometheus/alert-rules.yml#L158:
```yaml
- name: Server metrics (from node-exporter)
  rules:
  - alert: NodeFilesystemSpaceFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left and is filling up.
      summary: Filesystem is predicted to run out of space within the next 24 hours.
    expr: |
      (
        node_filesystem_avail_bytes{job="nodeexporter",fstype!=""} / node_filesystem_size_bytes{job="nodeexporter",fstype!=""} * 100 < 40
      and
        predict_linear(node_filesystem_avail_bytes{job="nodeexporter",fstype!=""}[6h], 24*60*60) < 0
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemSpaceFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left and is filling up fast.
      summary: Filesystem is predicted to run out of space within the next 4 hours.
    expr: |
      (
        node_filesystem_avail_bytes{job="nodeexporter",fstype!=""} / node_filesystem_size_bytes{job="nodeexporter",fstype!=""} * 100 < 20
      and
        predict_linear(node_filesystem_avail_bytes{job="nodeexporter",fstype!=""}[6h], 4*60*60) < 0
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: critical
  - alert: NodeFilesystemAlmostOutOfSpace
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left.
      summary: Filesystem has less than 5% space left.
    expr: |
      (
        node_filesystem_avail_bytes{job="nodeexporter",fstype!=""} / node_filesystem_size_bytes{job="nodeexporter",fstype!=""} * 100 < 5
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemAlmostOutOfSpace
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left.
      summary: Filesystem has less than 3% space left.
    expr: |
      (
        node_filesystem_avail_bytes{job="nodeexporter",fstype!=""} / node_filesystem_size_bytes{job="nodeexporter",fstype!=""} * 100 < 3
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: critical
  - alert: NodeFilesystemFilesFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left and is filling up.
      summary: Filesystem is predicted to run out of inodes within the next 24 hours.
    expr: |
      (
        node_filesystem_files_free{job="nodeexporter",fstype!=""} / node_filesystem_files{job="nodeexporter",fstype!=""} * 100 < 40
      and
        predict_linear(node_filesystem_files_free{job="nodeexporter",fstype!=""}[6h], 24*60*60) < 0
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemFilesFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left and is filling up fast.
      summary: Filesystem is predicted to run out of inodes within the next 4 hours.
    expr: |
      (
        node_filesystem_files_free{job="nodeexporter",fstype!=""} / node_filesystem_files{job="nodeexporter",fstype!=""} * 100 < 20
      and
        predict_linear(node_filesystem_files_free{job="nodeexporter",fstype!=""}[6h], 4*60*60) < 0
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: critical
  - alert: NodeFilesystemAlmostOutOfFiles
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left.
      summary: Filesystem has less than 5% inodes left.
    expr: |
      (
        node_filesystem_files_free{job="nodeexporter",fstype!=""} / node_filesystem_files{job="nodeexporter",fstype!=""} * 100 < 5
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemAlmostOutOfFiles
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left.
      summary: Filesystem has less than 3% inodes left.
    expr: |
      (
        node_filesystem_files_free{job="nodeexporter",fstype!=""} / node_filesystem_files{job="nodeexporter",fstype!=""} * 100 < 3
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: critical
```
What is the desired way of working?
- Critical if regular usage > 40%
- Use `min_over_time(node_filesystem_avail_bytes{mountpoint="/opt"}[12h]) < 40%` instead of the instantaneous `node_filesystem_avail_bytes{mountpoint="/opt"} < 40%`, i.e. base the alert on the last 12 hours of data rather than on the current sample alone (a sketch of such a rule follows below)
Expected outcome: no more incidents where a peak corrupts the node's filesystem (or at least, if it happens again, it will be because we received the alerts and did not act on them...).
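As an illustration, here is a minimal sketch of what such a rule could look like in alert-rules.yml, reusing the available/size ratio form of the existing rules. The alert name is hypothetical, and the `/opt` mountpoint filter and 40% threshold are taken from the bullet above for illustration only; none of these are the final values.

```yaml
- alert: NodeFilesystemRegularUsageHigh   # hypothetical alert name
  annotations:
    description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} had
      only {{ printf "%.2f" $value }}% available space left at its lowest point
      over the last 12 hours.
    summary: Available space dipped below 40% within the last 12 hours.
  expr: |
    (
      min_over_time(node_filesystem_avail_bytes{job="nodeexporter",fstype!="",mountpoint="/opt"}[12h])
        / node_filesystem_size_bytes{job="nodeexporter",fstype!="",mountpoint="/opt"} * 100 < 40
    and
      node_filesystem_readonly{job="nodeexporter",fstype!="",mountpoint="/opt"} == 0
    )
  for: 1h
  labels:
    severity: critical
```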
Change Procedure
- Change procedure has been tested successfully
- edit https://gitlab.com/nomadic-labs/umami-wallet/umami-stack/-/blob/monitoring/dockprom/prometheus/alert-rules.yml
- restart prometheus (either fix the curl reload or trigger a restart by `kill -9 <PID>`); see the sketch below
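For reference, a sketch of what the two reload/restart options mentioned above can look like. The port 9090 and the availability of the reload endpoint (it requires Prometheus to run with `--web.enable-lifecycle`) are assumptions about the dockprom setup that should be verified.

```sh
# Option 1: hot-reload the configuration and rule files without restarting.
# Only works if Prometheus was started with --web.enable-lifecycle.
curl -X POST http://localhost:9090/-/reload

# Option 2: send SIGHUP to the Prometheus process, which also reloads the
# configuration; this is gentler than kill -9, which forces a full restart.
kill -HUP <PID>
```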
Rollback plan
- restore the previous version of `monitoring/dockprom/prometheus/alert-rules.yml` from git (see the sketch after this list)
- restart prometheus (either fix the curl reload or trigger a restart by `kill -9 <PID>`)
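A sketch of the restore step, assuming the change was applied as a single commit touching only the rule file (adjust the git reference otherwise):

```sh
# Revert the rule file to its state before the change, then commit the rollback.
# HEAD~1 assumes the change was a single commit; use the appropriate ref if not.
git checkout HEAD~1 -- monitoring/dockprom/prometheus/alert-rules.yml
git commit -m "Rollback: restore previous alert-rules.yml"
```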
Metadata
Approvals checklist (all required)
- Approval from Development
- Approval from Operations
- Approval from Business
@picdc (cc: @remyzorg) Please approve this regular change on development aspects
@comeh (cc: @philippewang) Please approve this regular change on operations aspects
@SamREye Please approve this regular change on business aspects