We are taking on this capacity planning effort again. The main reason for that is the DB sharding conversations: we are discussing how to shard our main DB cluster in production, which could take some time to turn into agreements, an implementation, and a production rollout. Meanwhile, we'd like to be sure that our DB will not hit a wall any time soon and impact GitLab.com.
Some relevant information we'd want to take into account (some of it comes from the above issue):
We know there is still some vertical growth that we could go for on this cluster, particularly on our leader node (read/write), which is involved in many of the operations that eventually reach our replicas too. More performant CPUs, more memory... there is still some headroom.
Questions we wanted to answer at that time (3 months ago):
How much more load could we add to the primary database without degrading performance?
We are reaching 60k connections per second. How much could we grow, safely?
Should we increase disk or memory to optimize performance? Any suggestions on the shared_buffers setup?
Estimate what percentage of the load is related to the merge_requests table.
Measure the hot set of data that we currently keep in memory. How often are we writing to disk? (pg_buffercache and the LRU/usage counts; see the sketch after this list.)
Evaluate architectures with a higher number of vCPUs.
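A minimal sketch of how the hot set in memory could be measured with pg_buffercache (this assumes the extension is installed on the primary, an 8 KiB block size, and a database named gitlabhq_production; it is an illustration, not a production-ready query):

```bash
# Hypothetical sketch: summarize shared_buffers contents per relation,
# including the usage count (the "LRU count" mentioned above) and dirty buffers.
# Assumes pg_buffercache is installed and 8 KiB blocks.
psql -d gitlabhq_production -c "
  SELECT c.relname,
         count(*) * 8 / 1024           AS buffered_mib,
         round(avg(b.usagecount), 2)   AS avg_usagecount,
         sum((b.isdirty)::int)         AS dirty_buffers
  FROM pg_buffercache b
  JOIN pg_class c
    ON b.relfilenode = pg_relation_filenode(c.oid)
  WHERE b.reldatabase = (SELECT oid FROM pg_database
                         WHERE datname = current_database())
  GROUP BY c.relname
  ORDER BY buffered_mib DESC
  LIMIT 20;"
```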
There has been a somewhat related benchmarking effort, led by Ongres, here. Its goal was to ascertain whether our DB would remain as healthy once migrated to v11.7 (change carried out here).
This could be a rough idea/initial approach to this work:
Rationale: as we increase the query load (req/s) on this leader server, the response times will start increasing progressively, but the service times will remain similar for a good while, until they eventually increase dramatically.
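One simple way to frame this rationale, assuming an open single-queue model (only a rough approximation of the real pooled, multi-core system): response time is service time plus queue wait, so it grows sharply as utilization approaches 1, while the service time itself stays flat until saturation.

```math
R \;=\; S + W \;\approx\; \frac{S}{1 - \rho}, \qquad \rho = \lambda \, S
```

Here R is response time, S is service time, W is time spent waiting in the queue, λ is the request rate (req/s), and ρ is utilization.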
I know there is a lot of work already done by Ongres in the benchmarking issue that we can reuse here.
@albertoramos I think we should consider adding pgbouncers to the topology we test, between JMeter and the PostgreSQL database. It will reflect the production environment more realistically.
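For what it's worth, a minimal sketch of what that pgbouncer layer in the test topology could look like (the host, ports, pool sizes, and auth settings below are placeholders, not our production values):

```bash
# Illustrative only: a minimal pgbouncer.ini sitting between the load generator
# and PostgreSQL; all values are placeholders, not production settings.
cat > /etc/pgbouncer/pgbouncer.ini <<'EOF'
[databases]
gitlabhq_production = host=10.0.0.10 port=5432

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 100
max_client_conn = 5000
EOF
```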
The other main concern we just spoke about is getting the right data (as close to production as possible) for this capacity assessment, which will take some time (compliance/security concerns).
Alberto Ramos changed title from Capacity planning of our main production DB cluster to Capacity assessment for our main production DB cluster
@sytses thank you for the ping. Yes we should align on the tooling.
@albertoramos for context, Quality has evaluated and used multiple performance tools (JMeter, Artillery, K6). We are currently using K6 for our performance tests and also for validating our reference architecture. In addition, we are building a load testing feature into GitLab that will provide K6 capabilities out of the box. Could we please evaluate using K6 in the infrastructure workstream here?
As an addition to @Finotto's analysis https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10340, I've quickly checked what the options for vertical scaling are on GCP. Currently, the PostgreSQL machines are n1-highmem-96. Below, I analyze only the primary node, since it's the most difficult to scale.
The data is very well cached: the effectiveness of the buffer pool (which can easily be increased in size, because it's currently just 110 GiB) is >99%. The main limiting factor (as analyzed in detail in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10340) is CPU.
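For reference, a quick way to double-check that cache effectiveness figure from the database side (cumulative stats since the last reset; the database name is an assumption):

```bash
# Buffer cache hit ratio for the current database, from pg_stat_database.
psql -d gitlabhq_production -c "
  SELECT datname,
         round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_pct
  FROM pg_stat_database
  WHERE datname = current_database();"
```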
There are options for vertical scaling on GCP. For the n1-* family (which is based on Intel Skylake/Broadwell/Haswell), the most powerful instance is already in use, with 96 vCPUs. But GCP also provides the n2d-* and m2-* families.
To check whether we should expect a gain for Postgres workloads if we switch to AMD EPYC Rome and 224 vCPUs, I've performed a couple of small benchmarks with the standard pgbench utility, comparing n1 with 96 vCPUs (like the current master) to n2d with 224 vCPUs. (Interestingly, all of them are available in "preemptible" mode, with prices like $1-2 per hour, a very convenient option for benchmarks.)
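For reproducibility, such a throwaway preemptible instance can be created with something along these lines (the machine type, zone, image, and disk size are illustrative, and the instance name is hypothetical):

```bash
# Sketch: spin up a preemptible n2d instance for short pgbench runs.
gcloud compute instances create nik-pgbench-n2d-tmp \
  --machine-type=n2d-standard-224 \
  --zone=us-central1-a \
  --image-family=ubuntu-1804-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=500GB \
  --preemptible
```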
The results (SELECT-only workload, which is CPU-bound; details are below):
n1 with 96 vCPUs: 0.63M TPS
n2d with 224 vCPUs: 1.04M TPS.
This means that, indeed, we could get more CPU power through vertical scaling if/when needed, and handle a much higher workload on n2d compared to the current n1.
m2 instances are even more powerful: they advertise up to 416 vCPUs (new Intel) and 11,776 GiB of RAM (announcement on the GCP blog: https://cloud.google.com/blog/products/compute/introducing-compute-and-memory-optimized-vms-for-google-compute-engine). Unfortunately, it looks like they are not directly available in the GCP console in any region, so I couldn't test them. But based on the blog post, this is a lot of CPU power, and current Postgres versions scale quite well to large numbers of CPU cores (unlike a few years ago), so these options could help handle the increasing workload and buy some more time if needed. With 416 vCPUs, we could expect to handle 3-4x more TPS than now without degradation. Of course, if this path seems important, it makes sense to check it thoroughly.
Below are the tech details of the benchmarks performed.
Scale and config adjustments:
```
100000000 of 100000000 tuples (100%) done (elapsed 104.98 s, remaining 0.00 s)

alter system set max_connections = 3000;
alter system set shared_buffers = '150GB';
alter system set effective_cache_size = '200GB';
alter system set max_wal_size = '100GB';
alter system set checkpoint_timeout = '100min';
alter system set work_mem = '100MB';
```
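The results below report a scaling factor of 1000 (100M rows in pgbench_accounts), so the initialization step was presumably something like:

```bash
# Sketch: initialize the pgbench tables at scale factor 1000 (100M account rows),
# matching the "100000000 of 100000000 tuples" line above.
pgbench -i -s 1000
```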
Detailed results for both the read-only and the default (read/write) pgbench workloads:
```
# n2d with 224 vCPUs
$ pgbench -n -P30 -T120 -r -S -j224 -c224
progress: 30.0 s, 871030.5 tps, lat 0.253 ms stddev 0.383
progress: 60.0 s, 1096054.3 tps, lat 0.204 ms stddev 0.113
progress: 90.0 s, 1104329.0 tps, lat 0.203 ms stddev 0.077
progress: 120.0 s, 1105450.1 tps, lat 0.203 ms stddev 0.075
transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 224
number of threads: 224
duration: 120 s
number of transactions actually processed: 125306168
latency average = 0.214 ms
latency stddev = 0.193 ms
tps = 1041235.609955 (including connections establishing)
tps = 1044645.775367 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
         0.002  \set aid random(1, 100000 * :scale)
         0.229  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

$ pgbench -n -P30 -T120 -r -j224 -c224
progress: 30.0 s, 12720.9 tps, lat 17.568 ms stddev 13.966
progress: 60.0 s, 16389.5 tps, lat 13.668 ms stddev 12.058
progress: 90.0 s, 20807.6 tps, lat 10.764 ms stddev 10.643
progress: 120.0 s, 26994.3 tps, lat 8.295 ms stddev 8.056
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 224
number of threads: 224
duration: 120 s
number of transactions actually processed: 2307595
latency average = 11.645 ms
latency stddev = 11.296 ms
tps = 19219.693787 (including connections establishing)
tps = 19223.221581 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
         0.004  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.149  BEGIN;
         0.511  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.200  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.418  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         1.322  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.218  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         8.833  END;

# n1 with 96 vCPUs
postgres@nik-pgbench-n1-tmp:~$ pgbench -n -P30 -T120 -r -j224 -c224 -S
progress: 30.0 s, 525707.1 tps, lat 0.425 ms stddev 0.176
progress: 60.0 s, 540858.6 tps, lat 0.414 ms stddev 0.112
progress: 90.0 s, 545726.0 tps, lat 0.410 ms stddev 0.111
progress: 120.0 s, 543147.8 tps, lat 0.412 ms stddev 0.111
transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 224
number of threads: 224
duration: 120 s
number of transactions actually processed: 64663454
latency average = 0.415 ms
latency stddev = 0.130 ms
tps = 538645.227077 (including connections establishing)
tps = 538939.256459 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.420  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

postgres@nik-pgbench-n1-tmp:~$ pgbench -n -P30 -T120 -r -j224 -c224
progress: 30.0 s, 38775.7 tps, lat 5.553 ms stddev 7.192
progress: 60.0 s, 50323.5 tps, lat 4.451 ms stddev 4.137
progress: 90.0 s, 50674.2 tps, lat 4.420 ms stddev 4.547
progress: 120.0 s, 51195.9 tps, lat 4.375 ms stddev 3.713
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 224
number of threads: 224
duration: 120 s
number of transactions actually processed: 5729300
latency average = 4.647 ms
latency stddev = 4.941 ms
tps = 47721.160314 (including connections establishing)
tps = 48029.317259 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
         0.002  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.108  BEGIN;
         0.270  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.194  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.253  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.549  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.178  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         3.092  END;

postgres@nik-pgbench-n1-tmp:~$ pgbench -n -P30 -T120 -r -j96 -c96 -S
progress: 30.0 s, 578306.3 tps, lat 0.164 ms stddev 0.074
progress: 60.0 s, 630212.3 tps, lat 0.152 ms stddev 0.052
progress: 90.0 s, 653320.6 tps, lat 0.147 ms stddev 0.051
progress: 120.0 s, 657760.7 tps, lat 0.146 ms stddev 0.051
transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 96
number of threads: 96
duration: 120 s
number of transactions actually processed: 75588114
latency average = 0.152 ms
latency stddev = 0.058 ms
tps = 629179.944865 (including connections establishing)
tps = 630381.217198 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.154  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

postgres@nik-pgbench-n1-tmp:~$ pgbench -n -P30 -T120 -r -j96 -c96
progress: 30.0 s, 36004.5 tps, lat 2.653 ms stddev 2.218
progress: 60.0 s, 35011.6 tps, lat 2.742 ms stddev 2.324
progress: 90.0 s, 36708.7 tps, lat 2.615 ms stddev 2.314
progress: 120.0 s, 36217.3 tps, lat 2.648 ms stddev 2.389
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 96
number of threads: 96
duration: 120 s
number of transactions actually processed: 4318364
latency average = 2.665 ms
latency stddev = 2.322 ms
tps = 35975.841646 (including connections establishing)
tps = 36009.209318 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
         0.002  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.073  BEGIN;
         0.199  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.140  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.156  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.217  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.118  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         1.758  END;
```
@NikolayS The N2's are problematic because in us-east1 they are only available in zones c and d, not us-east1-b. Our patroni servers are spread through those three zones in a round-robin fashion (01 => c, 02 => d, 03 => b, 04 => c, and so on); if we wanted to migrate to N2s we'd need to change our terraform config to avoid b (the simplest option being to put all the patroni nodes in one zone, which carries the obvious risks for availability). Rebuilds would be required at a minimum for the ones in b. To be clear, it's not impossible or even terribly difficult, it's just not as simple as shutting down a node, changing the machine type, and starting up again.
Sadly, n2d and m2 instances are not available in any zone in us-east1.
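A quick way to verify that zone-by-zone availability (the machine type name here is just an example):

```bash
# List the us-east1 zones where a given machine type is offered.
gcloud compute machine-types list --filter="name=n2d-standard-224 AND zone:us-east1"
```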