High disk %util on /, /var/log on production Postgres master
Disks for the OS (`sda`) and logs (`sdc`) are quite busy on patroni-02 right now, and I think this needs investigation.
Symptoms:

- SSHed to patroni-02 and noticed that starting `gitlab-psql` takes many seconds; the same goes for running other commands such as `iostat`.
- Checked disk I/O with `iostat`:
```
$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sdb      8:16   0  6.9T  0 disk /var/opt/gitlab
sdc      8:32   0   50G  0 disk /var/log
sda      8:0    0  100G  0 disk
└─sda1   8:1    0  100G  0 part /
```
```
patroni-02-db-gprd.c.gitlab-production.internal:~$ sudo iostat -x 5
Linux 4.10.0-1009-gcp (patroni-02-db-gprd) 10/25/2019 _x86_64_ (96 CPU)

avg-cpu:  %user %nice %system %iowait %steal %idle
           6.14  0.01    1.70    1.61   0.00 90.54

Device:  rrqm/s wrqm/s    r/s     w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
loop0      0.00   0.00   0.00    0.00      0.00      0.00     3.20     0.00    0.00    0.00    0.00   0.00   0.00
sda        0.00   2.99   4.65   31.60    116.22   7537.64   422.29     0.34    9.31   33.99    5.68   9.41  34.13
sdb        0.00  62.13 565.17 2028.00  33387.36  35623.60    53.23     0.53    0.21    0.87    0.02   0.17  44.58
sdc        0.01   1.63   0.38    1.25     40.74    155.97   241.54     0.46  282.59  131.37  329.04  81.32  13.25

avg-cpu:  %user %nice %system %iowait %steal %idle
           9.88  0.00    2.41    4.86   0.00 82.86

Device:  rrqm/s wrqm/s    r/s     w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
loop0      0.00   0.00   0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda        0.00   0.00   0.00   66.20      0.00  16947.20   512.00    47.48  690.33    0.00  690.33  15.11 100.00
sdb        0.00  49.40 427.40 2175.80   4150.40  20119.20    18.65     2.05    0.79    0.87    0.77   0.20  52.24
sdc        0.00   0.20   3.60    3.40    216.80    796.80   289.60     3.80  539.54  708.44  360.71 142.86 100.00

avg-cpu:  %user %nice %system %iowait %steal %idle
           9.09  0.00    3.02    5.09   0.00 82.80

Device:  rrqm/s wrqm/s    r/s     w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
loop0      0.00   0.00   0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda        0.00   0.00   0.00   69.80      0.00  17868.80   512.00    53.57  688.61    0.00  688.61  14.33 100.00
sdb        0.00  47.20 564.60 3911.80   5222.40  42401.60    21.28     3.64    0.81    0.87    0.81   0.14  62.32
sdc        0.00   0.00   1.20    0.60     47.20    153.60   223.11     2.83 1412.89 1412.00 1414.67 555.56 100.00

avg-cpu:  %user %nice %system %iowait %steal %idle
          10.19  0.00    2.35    4.99   0.00 82.47

Device:  rrqm/s wrqm/s    r/s     w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
loop0      0.00   0.00   0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda        0.00   0.00   0.00   69.60      0.00  17817.60   512.00    48.55  748.92    0.00  748.92  14.37 100.00
sdb        0.00 103.40 372.40 2556.40   3267.20  26540.00    20.35     2.32    0.79    0.97    0.77   0.18  54.08
sdc        0.00   0.00   1.80    0.80    172.80    136.80   238.15     3.37 1388.62 1380.00 1408.00 384.62 100.00
```
Notice the %util of 100 for `sda` (`/`) and `sdc` (`/var/log`), and, what is worse, the high latencies, at levels of ~1 second.
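For reference, `await` and `%util` are derived from deltas of `/proc/diskstats` counters between samples. A minimal sketch of that math (the field names and sample numbers below are made up, shaped to resemble the `sda` readings above):

```python
def await_and_util(prev, curr, interval_s):
    """Recompute iostat-style await (ms) and %util from two
    /proc/diskstats-like snapshots. Keys (hypothetical names):
      'ios'      - reads + writes completed
      'ticks'    - total ms requests spent waiting (read + write ticks)
      'io_ticks' - ms the device had at least one I/O in flight
    """
    d_ios = curr['ios'] - prev['ios']
    d_ticks = curr['ticks'] - prev['ticks']
    d_io_ticks = curr['io_ticks'] - prev['io_ticks']
    await_ms = d_ticks / d_ios if d_ios else 0.0
    util_pct = 100.0 * d_io_ticks / (interval_s * 1000.0)
    return await_ms, min(util_pct, 100.0)

# Made-up numbers resembling the sda sample: 331 I/Os completed over a
# 5 s interval, with the device busy the entire time.
prev = {'ios': 1000, 'ticks': 50_000, 'io_ticks': 20_000}
curr = {'ios': 1331, 'ticks': 278_000, 'io_ticks': 25_000}
aw, util = await_and_util(prev, curr, 5)
print(round(aw, 1), round(util, 1))  # 688.8 100.0
```

The takeaway: `%util` at 100 with `await` near 700 ms means requests are queueing behind a saturated device, not merely that it is busy.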
Main concerns:

- Significant writes to `sda`. What is causing this? Not sure -- `iotop` didn't help (Postgres writes are at the top), and `dstat` is not installed.
- `sdc` is also not looking good. This might affect Postgres performance at some point (we migrated to the logging collector, which is good, but the numbers above still don't look good).
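Since `iotop` was inconclusive and `dstat` is missing, one fallback is to diff per-process `write_bytes` counters from `/proc/<pid>/io` over an interval, which is roughly what `iotop` does under the hood. A sketch (the helper names are mine; needs root to see other users' processes, and note `write_bytes` counts writes to any filesystem, so the top writer still has to be correlated with which mount its files live on):

```python
import os
import time


def snapshot_write_bytes():
    """Collect write_bytes from /proc/<pid>/io for every visible process."""
    out = {}
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open(f'/proc/{pid}/io') as f:
                for line in f:
                    if line.startswith('write_bytes:'):
                        out[pid] = int(line.split()[1])
        except OSError:
            continue  # process exited or permission denied
    return out


def top_writers(before, after, n=5):
    """Rank pids by bytes written between two snapshots."""
    deltas = {pid: after[pid] - before.get(pid, 0) for pid in after}
    return sorted(deltas.items(), key=lambda kv: -kv[1])[:n]


if __name__ == '__main__':
    before = snapshot_write_bytes()
    time.sleep(5)
    after = snapshot_write_bytes()
    for pid, delta in top_writers(before, after):
        print(pid, delta)
```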
Also, I had difficulties finding good numbers for the disk latency graphs in Grafana. I found this (internal); it seems to be broken, showing minutes instead of seconds -- this is something worth fixing separately. I suspect that the `min` unit on this graph is actually `sec`.
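A quick sanity check of that suspicion: if the panel's values actually carry seconds but are labeled minutes, a reading around 0.7 "min" would line up with the ~690 ms `w_await` iostat reported for `sda`, whereas taking the label at face value gives an implausible 60x larger latency. The graph reading below is a hypothetical number for illustration:

```python
# iostat reported w_await ~= 690 ms for sda during the busy samples.
iostat_await_s = 690 / 1000.0      # 0.69 s

graph_reading = 0.69               # hypothetical panel value, labeled "min"
as_labeled_s = graph_reading * 60  # if the "min" label were correct
as_seconds = graph_reading         # if the values are really seconds

# The "values are seconds" reading matches iostat; the labeled reading
# is off by a factor of 60.
print(abs(as_seconds - iostat_await_s) < 0.05)
print(round(as_labeled_s / iostat_await_s))
```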
Then we can check the history and see that, obviously, this is not a new problem. It started Oct-02, when the failover from patroni-01 happened. Patroni-01 showed the same behavior before, though with smaller numbers:
I've checked replicas and they look fine.
Passing this to @Finotto to decide on further actions.