High disk %util on /, /var/log on production Postgres master
Disks for the OS (`sda`) and logs (`sdc`) are quite busy on patroni-02 right now, and I think this needs investigation.
Symptoms:

- SSHed to patroni-02 and noticed that starting `gitlab-psql` takes many seconds; the same goes for running other commands such as `iostat`.
- Checked disk I/O with `iostat`:
```
$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sdb      8:16   0  6.9T  0 disk /var/opt/gitlab
sdc      8:32   0   50G  0 disk /var/log
sda      8:0    0  100G  0 disk
└─sda1   8:1    0  100G  0 part /
```
```
patroni-02-db-gprd.c.gitlab-production.internal:~$ sudo iostat -x 5
Linux 4.10.0-1009-gcp (patroni-02-db-gprd) 10/25/2019 _x86_64_ (96 CPU)

avg-cpu:  %user %nice %system %iowait %steal %idle
           6.14  0.01    1.70    1.61   0.00 90.54

Device:  rrqm/s wrqm/s    r/s     w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
loop0      0.00   0.00   0.00    0.00      0.00      0.00     3.20     0.00    0.00    0.00    0.00   0.00   0.00
sda        0.00   2.99   4.65   31.60    116.22   7537.64   422.29     0.34    9.31   33.99    5.68   9.41  34.13
sdb        0.00  62.13 565.17 2028.00  33387.36  35623.60    53.23     0.53    0.21    0.87    0.02   0.17  44.58
sdc        0.01   1.63   0.38    1.25     40.74    155.97   241.54     0.46  282.59  131.37  329.04  81.32  13.25

avg-cpu:  %user %nice %system %iowait %steal %idle
           9.88  0.00    2.41    4.86   0.00 82.86

Device:  rrqm/s wrqm/s    r/s     w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
loop0      0.00   0.00   0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda        0.00   0.00   0.00   66.20      0.00  16947.20   512.00    47.48  690.33    0.00  690.33  15.11 100.00
sdb        0.00  49.40 427.40 2175.80   4150.40  20119.20    18.65     2.05    0.79    0.87    0.77   0.20  52.24
sdc        0.00   0.20   3.60    3.40    216.80    796.80   289.60     3.80  539.54  708.44  360.71 142.86 100.00

avg-cpu:  %user %nice %system %iowait %steal %idle
           9.09  0.00    3.02    5.09   0.00 82.80

Device:  rrqm/s wrqm/s    r/s     w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
loop0      0.00   0.00   0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda        0.00   0.00   0.00   69.80      0.00  17868.80   512.00    53.57  688.61    0.00  688.61  14.33 100.00
sdb        0.00  47.20 564.60 3911.80   5222.40  42401.60    21.28     3.64    0.81    0.87    0.81   0.14  62.32
sdc        0.00   0.00   1.20    0.60     47.20    153.60   223.11     2.83 1412.89 1412.00 1414.67 555.56 100.00

avg-cpu:  %user %nice %system %iowait %steal %idle
          10.19  0.00    2.35    4.99   0.00 82.47

Device:  rrqm/s wrqm/s    r/s     w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
loop0      0.00   0.00   0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda        0.00   0.00   0.00   69.60      0.00  17817.60   512.00    48.55  748.92    0.00  748.92  14.37 100.00
sdb        0.00 103.40 372.40 2556.40   3267.20  26540.00    20.35     2.32    0.79    0.97    0.77   0.18  54.08
sdc        0.00   0.00   1.80    0.80    172.80    136.80   238.15     3.37 1388.62 1380.00 1408.00 384.62 100.00
```
Notice the %util of 100 for `sda` (`/`) and `sdc` (`/var/log`), and, what is worse, the high latencies, at levels of ~1 second.
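For reference, `await` and `%util` are derived from deltas of `/proc/diskstats` counters between samples. A minimal sketch of that math (the field names and sample numbers below are made up, shaped to resemble the `sda` readings above):

```python
def await_and_util(prev, curr, interval_s):
    """Recompute iostat-style await (ms) and %util from two
    /proc/diskstats-like snapshots. Keys (hypothetical names):
      'ios'      - reads + writes completed
      'ticks'    - total ms requests spent waiting (read + write ticks)
      'io_ticks' - ms the device had at least one I/O in flight
    """
    d_ios = curr['ios'] - prev['ios']
    d_ticks = curr['ticks'] - prev['ticks']
    d_io_ticks = curr['io_ticks'] - prev['io_ticks']
    await_ms = d_ticks / d_ios if d_ios else 0.0
    util_pct = 100.0 * d_io_ticks / (interval_s * 1000.0)
    return await_ms, min(util_pct, 100.0)

# Made-up numbers resembling the sda sample: 331 I/Os completed over a
# 5 s interval, with the device busy the entire time.
prev = {'ios': 1000, 'ticks': 50_000, 'io_ticks': 20_000}
curr = {'ios': 1331, 'ticks': 278_000, 'io_ticks': 25_000}
aw, util = await_and_util(prev, curr, 5)
print(round(aw, 1), round(util, 1))  # 688.8 100.0
```

The takeaway: `%util` at 100 with `await` near 700 ms means requests are queueing behind a saturated device, not merely that it is busy.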
Main concerns:

- Significant writes to `sda`. What is causing this? Not sure -- `iotop` didn't help (Postgres writes are at the top), and `dstat` is not installed.
- `sdc` is also not looking good. This might affect Postgres performance at some point (we migrated to the logging collector, which is good, but the numbers above still don't look good).
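Since `iotop` was inconclusive and `dstat` is missing, one fallback is to diff per-process `write_bytes` counters from `/proc/<pid>/io` over an interval, which is roughly what `iotop` does under the hood. A sketch (the helper names are mine; needs root to see other users' processes, and note `write_bytes` counts writes to any filesystem, so the top writer still has to be correlated with which mount its files live on):

```python
import os
import time


def snapshot_write_bytes():
    """Collect write_bytes from /proc/<pid>/io for every visible process."""
    out = {}
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open(f'/proc/{pid}/io') as f:
                for line in f:
                    if line.startswith('write_bytes:'):
                        out[pid] = int(line.split()[1])
        except OSError:
            continue  # process exited or permission denied
    return out


def top_writers(before, after, n=5):
    """Rank pids by bytes written between two snapshots."""
    deltas = {pid: after[pid] - before.get(pid, 0) for pid in after}
    return sorted(deltas.items(), key=lambda kv: -kv[1])[:n]


if __name__ == '__main__':
    before = snapshot_write_bytes()
    time.sleep(5)
    after = snapshot_write_bytes()
    for pid, delta in top_writers(before, after):
        print(pid, delta)
```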
Also, I had difficulties finding good numbers for the disk latency graphs in Grafana. I found this (internal); it seems to be broken, showing minutes instead of seconds -- this is something worth fixing separately. I suspect that the `min` unit on this graph is actually `sec`.
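A quick sanity check of that suspicion: if the panel's values actually carry seconds but are labeled minutes, a reading around 0.7 "min" would line up with the ~690 ms `w_await` iostat reported for `sda`, whereas taking the label at face value gives an implausible 60x larger latency. The graph reading below is a hypothetical number for illustration:

```python
# iostat reported w_await ~= 690 ms for sda during the busy samples.
iostat_await_s = 690 / 1000.0      # 0.69 s

graph_reading = 0.69               # hypothetical panel value, labeled "min"
as_labeled_s = graph_reading * 60  # if the "min" label were correct
as_seconds = graph_reading         # if the values are really seconds

# The "values are seconds" reading matches iostat; the labeled reading
# is off by a factor of 60.
print(abs(as_seconds - iostat_await_s) < 0.05)
print(round(as_labeled_s / iostat_await_s))
```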
Then we can check the history and see that, obviously, this is not a new problem. It started Oct-02, when the failover from patroni-01 happened. Patroni-01 showed the same behavior before, though with smaller numbers:
I've checked replicas and they look fine.
Passing this to @Finotto to decide on further actions.