Analyze current PostgreSQL disk I/O usage in Azure and compare with what's available in GCP
Goal
The main goal of this ticket is to understand whether the (vendor-advertised) disk performance in GCP is adequate given the current, observed I/O load on Azure (PostgreSQL master).
It is not a goal to characterize GCP performance or to compare it with Azure performance, but rather to determine whether the theoretical GCP disk performance is sufficient, based on a detailed understanding of the performance measured on Azure.
TL;DR: there's a Conclusions section towards the end.
Method
First, real I/O needs to be captured on the Azure master server (postgres-02.db.prd.gitlab.com). To do this, iotop was run between August 1st, 22:28:37 and August 2nd, 08:45:21 (UTC). iotop was asked to sample I/O every second, filtering for PostgreSQL-only processes, and the stats were captured to a file.
Later, the file was imported into an offline PostgreSQL database, where data cleanup was performed. Based on this data, a series of materialized views, functions and queries were created to extract the results. The graphs shown below were generated with LibreOffice Calc.
The code used to perform all of this is included at the end of the ticket description, so that the results may be reproduced.
Results
Read I/O
A GCP SSD persistent disk (PD) of 5TB, on an instance with more than 32 vCPUs, is supposed to deliver up to 800MB/s of sustained reads.
At a 1-second granularity, read I/O performance, measured in MB/s, observed in Azure over the 10h+ measured period was:
The first observation is that there are a few peaks that exceed GCP's disk capability (800MB/s). However, most of the workload fits very comfortably under this limit. A zoomed-in view of the region above 800MB/s:
Visual inspection reveals a dozen or two peaks. Detailed inspection via SQL yields a total of 101 seconds (spread across the 10+ hours, i.e. 0.28% of the time) where the read I/O was above 800MB/s. The maximum read I/O on any given second was 1045MB/s.
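For reference, these two figures can be reproduced with a query along these lines against the iotop_reads materialized view defined in the source code at the end (total is the per-second read rate in MB/s):

```sql
-- Sketch: seconds where read I/O exceeded the 800MB/s cap, and the worst single second.
select count(*) filter (where total > 800) as seconds_over_cap,
       max(total)                          as max_read_mb_per_s
from iotop_reads;
```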
The question is: does this matter? I/O that cannot be immediately fulfilled by the disk is queued, so if at a given second the requested I/O is greater than what the disk can deliver, the excess will need to be absorbed "within the next second(s)". So let's run some SQL queries based on window functions to aggregate and average I/O over different periods of time, to see how quickly (or slowly) these peaks are absorbed.
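For instance, a 2-second moving average over the per-second totals can be computed with a sketch like the one below (against the iotop_reads view from the code at the end; the generalized reads_avg(n) and writes_avg(n) functions used for the graphs are also included there):

```sql
-- Sketch: 2-second moving average of read I/O; a 1045MB/s spike followed by a
-- quiet second shows up as ~522MB/s, i.e. the excess is absorbed by the next second.
select t,
       avg(total) over (order by t rows between 1 preceding and current row) as read_mb_2s_avg
from iotop_reads
order by t;
```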
2-second averaged I/O:
Visual inspection reveals that the peaks over 800MB/s (averaged over 2 seconds) have been reduced to just 9. This is good news: it means that peaks over 800MB/s are brief, and can thus be accommodated after a wait of one or a few seconds. Indeed, the peaks have very clear delta shapes.
If we average over a 10-second period:
We can see that no peaks over 800MB/s remain, and all the I/O would be handled perfectly by the GCP disk.
What all this means is that:
- Most of the read I/O is handled well below the 800MB/s sustained advertised rate.
- There are a few peaks (representing 0.28% of the time) where I/O is slightly above that read cap. However, these peaks are very spiky: most are absorbed within the next second, and a few require somewhere between 3 and 9 seconds to be absorbed. This means that, theoretically, queries hitting PostgreSQL at these exact moments (0.28% of the time) could experience higher latency, up to a maximum of 9 seconds in the worst case.
Now: what is, on average, the disk utilization (i.e., aggregated disk usage vs. the theoretical capacity of 800MB/s sustained)? It turns out to be a surprisingly low value: 0.5% (the graphs seem to indicate higher utilization, but this is just an artifact of plotting a very high-frequency series over limited graph real estate).
A second observation: who is responsible for the above read I/O?
| | total | query | autovacuum | checkpoint | bgwriter | wale | pgbouncer | other |
|---|---|---|---|---|---|---|---|---|
| % | 100.00 | 1.57 | 75.22 | 0.00 | 0.00 | 23.18 | 0.00 | 0.03 |
| GB | 6966.06 | 109.62 | 5240.09 | 0.02 | 0.00 | 1614.55 | 0.00 | 1.78 |
Wow! So autovacuum is consuming 75% of all the read I/O! That means that, approximately, autovacuum is re-reading the whole DB every 3h (it read about 5.2TB over the ~10-hour window, i.e. roughly 500GB/h).
This looks like an excessive amount of read I/O to spend on autovacuum alone. The second contributor to read I/O is... WAL-e! And this has a significant impact on write I/O too, as we will see below. Actually, apart from autovacuum and WAL-e, queries and other read activities don't even add up to 2% of all the read I/O. There is clearly room for optimization here to significantly reduce the read I/O.
A simulation was also performed, assuming that autovacuum activity could be cut down to 1/3 of what it currently is. This is the resulting read I/O, at a 1-second interval (no aggregation or averaging):
All the read I/O would fit, perfectly, under the 800MB/s cap.
Write I/O
The same GCP SSD persistent disk (5TB, more than 32 vCPUs) is supposed to deliver up to 400MB/s of sustained writes.
At a 1-second granularity, write I/O performance, measured in MB/s, observed in Azure over the 10h+ measured period was:
In this case, there's no peak over 400MB/s, so all write I/O fits perfectly under the GCP disk performance cap.
Indeed, if we aggregate and average over 10 seconds, as was done for the reads, we can see that all I/O fits very comfortably:
Measured write disk utilization, following the same technique as for the reads, is a paltry 0.1%.
The main observation here is which write processes consume the most I/O:
| | total | query | autovacuum | checkpoint | bgwriter | wale | gpg | pgbouncer | stats | other |
|---|---|---|---|---|---|---|---|---|---|---|
| % | 100.00 | 14.77 | 2.63 | 14.66 | 0.11 | 0.00 | 66.31 | 0.00 | 0.34 | 0.67 |
| GB | 1170.45 | 172.87 | 30.81 | 171.64 | 1.28 | 0.01 | 776.13 | 0.01 | 4.00 | 7.84 |
Again, wow! Here gpg (driven by WAL-e) is eating 66% of all the write I/O! The cause has not been determined, as it seems quite excessive, but it requires further investigation, and it is again a significant performance improvement area if it could be reduced or eliminated (ideally, offloaded to other servers, since WAL archiving should not take this many resources from the master, where disk I/O is precious).
Actually, gpg activity seems to be confined to a precise window, starting on August 2nd at 00:06:55 UTC and finishing at 05:28:50 UTC.
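The window bounds can be extracted with a query roughly like the following sketch against the cleaned-up iotop table (t is a time-of-day column, which is unambiguous here because the whole gpg window falls after midnight):

```sql
-- Sketch: bounds of the wal-e gpg write activity.
select min(t) as gpg_start, max(t) as gpg_end
from iotop
where what = 'wal-e gpg' and write > 0;
```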
Another simulation was performed, to see what write I/O would look like if gpg were completely offloaded to an external server:
In this case, it can be seen that write I/O fits extremely comfortably within the write cap.
Conclusions
- Read I/O only exceeds GCP disk capacity 0.28% of the time, and those are short-lived peaks. Most of those peaks are absorbed within 2 seconds, and some last between 3 and 9 seconds. Average disk utilization is 0.5% of the capacity. Queries happening during those peaks could experience higher latencies due to read I/O contention.
- Read capacity is mostly dominated by a very aggressive autovacuum, which takes 75% of the read I/O. If autovacuum activity is reduced to 1/3, no peaks surpass GCP disk capacity and all I/O fits comfortably within the GCP disk cap. Autovacuum is currently scanning the whole DB approximately every 3h. Issue gitlab-com/database#129 (closed) has been created to track this.
- Write I/O never exceeded GCP disk capacity. Average disk utilization is 0.1%.
- Two thirds of the write I/O is driven by WAL-e encrypting the files. This could potentially be offloaded to an external server. WAL-e also generates a lot of read I/O, required to read the files that need to be encrypted. In total, WAL-e accounts for 29% of the total read + write I/O (roughly 2.4TB out of the 8.1TB read and written during the measured window). Issue https://gitlab.com/gitlab-com/infrastructure/issues/4654 has been created to track this.
Source code used to derive these results (YMMV)
First, run the iotop capture on the Azure master:
# @Azure master: sample I/O of processes owned by gitlab-psql every second for ~10 hours
# (36000 samples), keep the timestamp, the read/write figures and the command fields,
# and collapse query command lines to just "gitlab gitlabhq_production"
sudo iotop -k -b -n $(( 3600 * 10 )) -o -t -u gitlab-psql -qq -qqq |awk '{print $1, $5, $7, $14, $15, $16, $17, $18}' |sed 's-gitlab gitlabhq_production\ .*-gitlab gitlabhq_production-' |gzip > iotop.csv.gz
# @local env: scp iotop.csv.gz, then turn the first three spaces of every line into
# commas to get a 4-column CSV (time, read, write, command)
gunzip -dc iotop.csv.gz |sed 's- -,-' |sed 's- -,-' |sed 's- -,-' > /tmp/iotop.csv
Then, with /tmp/iotop.csv generated locally, create a local PostgreSQL database and have fun with SQL:
create table iotop(t time, read double precision, write double precision, what text);
\copy iotop from '/tmp/iotop.csv' csv
-- data cleaning
delete from iotop where read=0 and write=0;
update iotop set what = 'wal-e' where what like '%wal-e%';
update iotop set what = 'wal-e gpg' where what = '-e -z 0 -r 66B9829C';
update iotop set what = 'pgbouncer' where what like 'pgbouncer gitlabhq_production %';
update iotop set what = 'gitlab-monitor' where what like 'gitlab-monitor%';
update iotop set what = 'archiver' where what like 'archiver%';
update iotop set what = 'wal sender' where what like 'wal sender%';
update iotop set what = 'repmgr' where what like 'gitlab_repmgr gitlab_repmgr%';
update iotop set what = 'pgbouncer' where what like '/var/opt/gitlab/pgbouncer/pgbouncer.ini%';
update iotop set what = 'query' where what = 'gitlab gitlabhq_production';
update iotop set what = 'bgwriter' where what like 'writer process%';
update iotop set what = 'autovacuum' where what like 'autovacuum worker process%';
update iotop set what = 'stats' where what like 'stats collector process%';
update iotop set what = 'wal writer' where what like 'wal writer process%';
update iotop set what = 'checkpoint' where what like 'checkpointer process%';
update iotop set what = 'other' where what not in ('wal-e', 'checkpoint', 'repmgr', 'pgbouncer', 'wal writer', 'wal sender', 'bgwriter', 'archiver', 'stats', 'autovacuum', 'query', 'wal-e gpg') or what is null;
alter table iotop add column t2 integer;
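-- t2: seconds elapsed since the capture started (22:28:37 UTC); samples taken after
-- midnight have an earlier time-of-day, hence the extra 24h in that case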
update iotop set t2 = extract(epoch from t - '22:28:37'::time) + (case when t >= '22:28:37'::time then 0 else 3600 * 24 end);
create materialized view iotop_agg as
select t2, sum(read) as read, sum(write) as write, what from iotop group by t2, what order by t2 asc;
-- Reads
create materialized view iotop_reads as select t2 as t
,sum(read)/1024 as total
,coalesce(sum(read) filter (where what = 'query')/1024, 0) as query
,coalesce(sum(read) filter (where what = 'autovacuum')/1024, 0) as autovacuum
,coalesce(sum(read) filter (where what = 'checkpoint')/1024, 0) as checkpoint
,coalesce(sum(read) filter (where what = 'bgwriter')/1024, 0) as bgwriter
,coalesce(sum(read) filter (where what = 'wal-e')/1024, 0) as wale
,coalesce(sum(read) filter (where what = 'pgbouncer')/1024, 0) as pgbouncer
,coalesce(sum(read) filter (where what = 'other')/1024, 0) as other
from iotop_agg group by t2;
\copy (select * from iotop_reads order by t asc) to '/tmp/reads-01s.csv' csv header
\copy (
with totals_gb as (
select sum(total)/1024 as total, sum(query)/1024 query, sum(autovacuum)/1024 autovacuum,
sum(checkpoint)/1024 "checkpoint", sum(bgwriter)/1024 bgwriter, sum(wale)/1024 wale,
sum(pgbouncer)/1024 pgbouncer, sum(other)/1024 other
from iotop_reads
)
select total, query, autovacuum, "checkpoint", bgwriter, wale, pgbouncer, other from totals_gb
union select total / total, query / total, autovacuum/ total, "checkpoint"/ total, bgwriter/ total, wale/ total,
pgbouncer/ total, other / total from totals_gb
) to '/tmp/read_totals.csv' csv header
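-- reads_avg(n): moving average of the per-second read figures over n-second windows,
-- emitting one averaged row per n-second bucket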
create function reads_avg(int) returns setof record AS $$
with
io2 as (select t/$1 as t2, * from iotop_reads),
io2_agg as (
select t2, row_number() over w as row
, sum(total) over w / $1 as total
, sum(query) over w / $1 as query
, sum(autovacuum) over w / $1 as autovacuum
, sum(checkpoint) over w / $1 as checkpoint
, sum(bgwriter) over w / $1 as bgwriter
, sum(wale) over w / $1 as wale
, sum(pgbouncer) over w / $1 as pgbouncer
, sum(other) over w / $1 as other
from io2
window w as (order by t2 rows between $1 - 1 preceding and current row)
)
select t2, total, query, autovacuum, checkpoint, bgwriter, wale, pgbouncer, other
from io2_agg where row % $1 = ($1 - 1) order by t2 asc
$$ language sql;
\copy (select * from reads_avg(2) as (t integer, total double precision, query double precision, autovacuum double precision, checkpoint double precision, bgwriter double precision, wale double precision, pgbouncer double precision, other double precision)) to '/tmp/reads-02s.csv' csv header
\copy (select * from reads_avg(10) as (t integer, total double precision, query double precision, autovacuum double precision, checkpoint double precision, bgwriter double precision, wale double precision, pgbouncer double precision, other double precision)) to '/tmp/reads-10s.csv' csv header
create materialized view iotop_writes as select t2 as t
,sum(write)/1024 as total
,coalesce(sum(write) filter (where what = 'query')/1024, 0) as query
,coalesce(sum(write) filter (where what = 'autovacuum')/1024, 0) as autovacuum
,coalesce(sum(write) filter (where what = 'checkpoint')/1024, 0) as checkpoint
,coalesce(sum(write) filter (where what = 'bgwriter')/1024, 0) as bgwriter
,coalesce(sum(write) filter (where what = 'wal-e')/1024, 0) as wale
,coalesce(sum(write) filter (where what = 'wal-e gpg')/1024, 0) as gpg
,coalesce(sum(write) filter (where what = 'wal writer')/1024, 0) as walwriter
,coalesce(sum(write) filter (where what = 'pgbouncer')/1024, 0) as pgbouncer
,coalesce(sum(write) filter (where what = 'stats')/1024, 0) as stats
,coalesce(sum(write) filter (where what = 'other')/1024, 0) as other
from iotop_agg group by t2;
\copy (select * from iotop_writes order by t asc) to '/tmp/writes-01s.csv' csv header
-- Percentage of read I/O on spikes over 800:
-- read I/O on spikes
with read_totals as (
    select sum(read) as read from iotop group by t2
), read_totals_over_800 as (
    select case when read > 800*1024 then read - 800*1024 else 0 end as read
    from read_totals where read >= 800*1024
)
select sum(read)/1024 from read_totals_over_800;
-- Total read I/O
select sum(read)/1024 from iotop;
-- Writes
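-- writes_avg(n): same moving-average approach as reads_avg(n), applied to the write columns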
create function writes_avg(int) returns setof record AS $$
with
io2 as (select t/$1 as t2, * from iotop_writes),
io2_agg as (
select t2, row_number() over w as row
, sum(total) over w / $1 as total
, sum(query) over w / $1 as query
, sum(autovacuum) over w / $1 as autovacuum
, sum(checkpoint) over w / $1 as checkpoint
, sum(bgwriter) over w / $1 as bgwriter
, sum(wale) over w / $1 as wale
, sum(gpg) over w / $1 as gpg
, sum(walwriter) over w / $1 as walwriter
, sum(pgbouncer) over w / $1 as pgbouncer
, sum(stats) over w / $1 as stats
, sum(other) over w / $1 as other
from io2
window w as (order by t2 rows between $1 - 1 preceding and current row)
)
select t2, total, query, autovacuum, checkpoint, bgwriter, wale, gpg, walwriter, pgbouncer, stats, other
from io2_agg where row % $1 = ($1 - 1) order by t2 asc
$$ language sql;
\copy (select * from writes_avg(10) as (t integer, total double precision, query double precision, autovacuum double precision, checkpoint double precision, bgwriter double precision, wale double precision, gpg double precision, walwriter double precision, pgbouncer double precision, stats double precision, other double precision)) to '/tmp/writes-10s.csv' csv header
\copy (
with totals_gb as (
select sum(total)/1024 as total, sum(query)/1024 query, sum(autovacuum)/1024 autovacuum,
sum(checkpoint)/1024 "checkpoint", sum(bgwriter)/1024 bgwriter, sum(wale)/1024 wale,
sum(gpg)/1024 gpg, sum(walwriter)/1024 walwriter,
sum(pgbouncer)/1024 pgbouncer, sum(stats)/1024 stats, sum(other)/1024 other
from iotop_writes
)
select total, query, autovacuum, "checkpoint", bgwriter, wale, gpg, pgbouncer, stats, other from totals_gb
union select total / total, query / total, autovacuum/ total, "checkpoint"/ total, bgwriter/ total, wale/ total,
gpg / total, pgbouncer/ total, stats / total, other / total from totals_gb
) to '/tmp/write_totals.csv' csv header
-- Simulate reads with autovacuum to 1/3 of current load
begin;
update iotop set read=read/3 where what = 'autovacuum';
refresh materialized view iotop_agg;
refresh materialized view iotop_reads;
\copy (select * from iotop_reads order by t asc) to '/tmp/reads-01s-autovacuum_33p.csv' csv header
rollback;
-- Simulate writes with no gpg
begin;
update iotop set write=0 where what = 'wal-e gpg';
refresh materialized view iotop_agg;
refresh materialized view iotop_writes;
\copy (select * from iotop_writes order by t asc) to '/tmp/writes-01s-no_gpg.csv' csv header
rollback;