Infrastructure Alerting: Critical if disk usage > 40%
Regular change
Summary
Following the incident umami#503 we saw that the node's disk utilization increases in spikes that reach roughly double its regular size, and if the peak utilization reaches 100% of the disk, the node's storage gets corrupted and cannot recover.
This also relates to nomadic-labs/umami-wallet/umami#293, which suggested doubling the size of the disks. Since nomadic-labs/umami-wallet/umami#293 implied changing the infrastructure, its recommendations were taken into consideration as part of the new servers. /relate nomadic-labs/umami-wallet/umami#293
Area of the system
Umami-stack: infra, monitoring
How does this currently work?
Alerts are set in https://gitlab.com/nomadic-labs/umami-wallet/umami-stack/-/blob/monitoring/dockprom/prometheus/alert-rules.yml#L158:
```yaml
- name: Server metrics (from node-exporter)
  rules:
  - alert: NodeFilesystemSpaceFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left and is filling up.
      summary: Filesystem is predicted to run out of space within the next 24 hours.
    expr: |
      (
        node_filesystem_avail_bytes{job="nodeexporter",fstype!=""} / node_filesystem_size_bytes{job="nodeexporter",fstype!=""} * 100 < 40
      and
        predict_linear(node_filesystem_avail_bytes{job="nodeexporter",fstype!=""}[6h], 24*60*60) < 0
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemSpaceFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left and is filling up fast.
      summary: Filesystem is predicted to run out of space within the next 4 hours.
    expr: |
      (
        node_filesystem_avail_bytes{job="nodeexporter",fstype!=""} / node_filesystem_size_bytes{job="nodeexporter",fstype!=""} * 100 < 20
      and
        predict_linear(node_filesystem_avail_bytes{job="nodeexporter",fstype!=""}[6h], 4*60*60) < 0
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: critical
  - alert: NodeFilesystemAlmostOutOfSpace
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left.
      summary: Filesystem has less than 5% space left.
    expr: |
      (
        node_filesystem_avail_bytes{job="nodeexporter",fstype!=""} / node_filesystem_size_bytes{job="nodeexporter",fstype!=""} * 100 < 5
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemAlmostOutOfSpace
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left.
      summary: Filesystem has less than 3% space left.
    expr: |
      (
        node_filesystem_avail_bytes{job="nodeexporter",fstype!=""} / node_filesystem_size_bytes{job="nodeexporter",fstype!=""} * 100 < 3
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: critical
  - alert: NodeFilesystemFilesFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left and is filling up.
      summary: Filesystem is predicted to run out of inodes within the next 24 hours.
    expr: |
      (
        node_filesystem_files_free{job="nodeexporter",fstype!=""} / node_filesystem_files{job="nodeexporter",fstype!=""} * 100 < 40
      and
        predict_linear(node_filesystem_files_free{job="nodeexporter",fstype!=""}[6h], 24*60*60) < 0
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemFilesFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left and is filling up fast.
      summary: Filesystem is predicted to run out of inodes within the next 4 hours.
    expr: |
      (
        node_filesystem_files_free{job="nodeexporter",fstype!=""} / node_filesystem_files{job="nodeexporter",fstype!=""} * 100 < 20
      and
        predict_linear(node_filesystem_files_free{job="nodeexporter",fstype!=""}[6h], 4*60*60) < 0
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: critical
  - alert: NodeFilesystemAlmostOutOfFiles
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left.
      summary: Filesystem has less than 5% inodes left.
    expr: |
      (
        node_filesystem_files_free{job="nodeexporter",fstype!=""} / node_filesystem_files{job="nodeexporter",fstype!=""} * 100 < 5
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemAlmostOutOfFiles
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left.
      summary: Filesystem has less than 3% inodes left.
    expr: |
      (
        node_filesystem_files_free{job="nodeexporter",fstype!=""} / node_filesystem_files{job="nodeexporter",fstype!=""} * 100 < 3
      and
        node_filesystem_readonly{job="nodeexporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: critical
```
What is the desired way of working?
- Critical if regular usage > 40%
- Use `min_over_time(node_filesystem_avail_bytes{mountpoint="/opt"}[12h]) < 40%` instead of the instantaneous `node_filesystem_avail_bytes{mountpoint="/opt"} < 40%`, i.e. base the alert on the last 12 hours of data rather than on the current sample alone (a sketch of such a rule follows below)
Expected outcome: no more incidents where a peak corrupts the node's filesystem (or at least, if it happens again, it will be because we received the alerts and did not act on them...).
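As an illustration, here is a minimal sketch of what such a rule could look like in alert-rules.yml, reusing the available/size ratio form of the existing rules. The alert name is hypothetical, and the `/opt` mountpoint filter and 40% threshold are taken from the bullet above for illustration only; none of these are the final values.

```yaml
- alert: NodeFilesystemRegularUsageHigh   # hypothetical alert name
  annotations:
    description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} had
      only {{ printf "%.2f" $value }}% available space left at its lowest point
      over the last 12 hours.
    summary: Available space dipped below 40% within the last 12 hours.
  expr: |
    (
      min_over_time(node_filesystem_avail_bytes{job="nodeexporter",fstype!="",mountpoint="/opt"}[12h])
        / node_filesystem_size_bytes{job="nodeexporter",fstype!="",mountpoint="/opt"} * 100 < 40
    and
      node_filesystem_readonly{job="nodeexporter",fstype!="",mountpoint="/opt"} == 0
    )
  for: 1h
  labels:
    severity: critical
```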
Change Procedure
- Change procedure has been tested successfully
- edit https://gitlab.com/nomadic-labs/umami-wallet/umami-stack/-/blob/monitoring/dockprom/prometheus/alert-rules.yml
- restart prometheus (either fix the curl reload or trigger a restart by `kill -9 <PID>`); see the sketch below
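For reference, a sketch of what the two reload/restart options mentioned above can look like. The port 9090 and the availability of the reload endpoint (it requires Prometheus to run with `--web.enable-lifecycle`) are assumptions about the dockprom setup that should be verified.

```sh
# Option 1: hot-reload the configuration and rule files without restarting.
# Only works if Prometheus was started with --web.enable-lifecycle.
curl -X POST http://localhost:9090/-/reload

# Option 2: send SIGHUP to the Prometheus process, which also reloads the
# configuration; this is gentler than kill -9, which forces a full restart.
kill -HUP <PID>
```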
Rollback plan
- restore the previous version of `monitoring/dockprom/prometheus/alert-rules.yml` from git (see the sketch after this list)
- restart prometheus (either fix the curl reload or trigger a restart by `kill -9 <PID>`)
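A sketch of the restore step, assuming the change was applied as a single commit touching only the rule file (adjust the git reference otherwise):

```sh
# Revert the rule file to its state before the change, then commit the rollback.
# HEAD~1 assumes the change was a single commit; use the appropriate ref if not.
git checkout HEAD~1 -- monitoring/dockprom/prometheus/alert-rules.yml
git commit -m "Rollback: restore previous alert-rules.yml"
```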
Metadata
Approvals checklist (all required)
- Approval from Development
- Approval from Operations
- Approval from Business
@picdc (cc: @remyzorg) Please approve this regular change on development aspects
@comeh (cc: @philippewang) Please approve this regular change on operations aspects
@SamREye Please approve this regular change on business aspects