# Evaluate new N2/N2D hardware with real production workload
## Overview
To evaluate the hardware under real production workload, we'll deploy one node of each proposed new machine type as a Read Replica in our production environments (GSTG, then GPRD). This will allow us to compare performance and resource usage metrics between nodes on the old hardware and the new hardware while they serve the same real traffic.
The target hardware for our Patroni Main cluster in GPRD is `n2-highmem-128`, and for the Patroni CI cluster we'll test `n2-highmem-96` (see &851 (comment 1319975492)). We should also evaluate `n2d-standard-224` in the Main cluster for comparison purposes.

In GSTG the Patroni hardware is currently `n1-standard-8` for all clusters, so the target hardware for GSTG will be `n2-standard-8`.
## Plan
### Deployment and evaluation plan for the Main and CI clusters
- Deploy 1x `n2-standard-8` and 1x `n2d-standard-8` node in the `patroni-main-2004` cluster, and 1x `n2-standard-8` node in the `patroni-ci-2004` cluster, in the GSTG environment, to validate the deployment process;
- Deploy 1x `n2-highmem-128` and 1x `n2d-standard-224` node in the `patroni-main-2004` cluster, and 1x `n2-highmem-96` node in the `patroni-ci-2004` cluster, in the GPRD environment;
- Remove 2x `n1-highmem-96` nodes from the `patroni-main-2004` cluster and 1x `n1-highmem-96` node from the `patroni-ci-2004` cluster in GPRD:
  - CR1: production#8648 (closed) - rolled back for `patroni-main-2004`
  - CR2: production#8694 (closed) - rolled back for `patroni-main-2004`, as it is not possible to have uneven workload load balancing and there's a risk of saturating the old N1 nodes
- Compare performance metrics between `n1-highmem-96`, `n2-highmem-128` and `n2d-standard-224` nodes in GPRD (for at least 1 week).
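The comparison step above boils down to grouping per-node metric samples by machine type and comparing aggregates side by side. A minimal sketch, assuming the samples have already been exported from monitoring (the node names, mapping, and utilisation values below are hypothetical, not real data):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical node -> machine-type mapping, mirroring the node types in the plan.
machine_type = {
    "patroni-main-2004-101": "n1-highmem-96",
    "patroni-main-2004-102": "n2-highmem-128",
    "patroni-main-2004-103": "n2d-standard-224",
}

# Hypothetical CPU utilisation samples: (node, fraction busy).
samples = [
    ("patroni-main-2004-101", 0.71),
    ("patroni-main-2004-102", 0.52),
    ("patroni-main-2004-103", 0.47),
    ("patroni-main-2004-101", 0.75),
    ("patroni-main-2004-102", 0.50),
]

# Group samples by machine type, then average for a side-by-side comparison.
by_type = defaultdict(list)
for node, value in samples:
    by_type[machine_type[node]].append(value)

averages = {mtype: mean(values) for mtype, values in by_type.items()}
print(averages)
```

The same grouping applies to any of the evaluation metrics below; only the sampled value changes.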
## Evaluation Metrics
The following metrics will be evaluated:
- TPS
  - For the comparison to be fair, the compared nodes should be receiving the same amount of TPS;
- Avg query response time
- Volume of slow queries being logged
- CPU usage
  - Also check CPU usage by state (user, system, iowait, idle)
- CPU load
- Memory footprint
  - Mem free/used/cached
  - Swapping
  - PG cache hits
  - PG cache misses
- I/O footprint
  - I/O throughput
  - IOPS
  - I/O wait
- PG wait events (if we can sample them)
PS: other metrics might be added to the above list.
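Two of the metrics above (TPS and PG cache hits/misses) are derived from PostgreSQL's cumulative `pg_stat_database` counters rather than read directly. A minimal sketch of the arithmetic, assuming two snapshots of the counters taken a known interval apart (the snapshot values here are illustrative, not production numbers):

```python
# Derive TPS and cache hit ratio from two snapshots of PostgreSQL's
# cumulative pg_stat_database counters.

def tps(prev, curr, interval_s):
    """Transactions per second between two snapshots (commits + rollbacks)."""
    delta = (curr["xact_commit"] + curr["xact_rollback"]) - (
        prev["xact_commit"] + prev["xact_rollback"])
    return delta / interval_s

def cache_hit_ratio(prev, curr):
    """Fraction of block reads served from shared buffers between snapshots."""
    hits = curr["blks_hit"] - prev["blks_hit"]
    reads = curr["blks_read"] - prev["blks_read"]
    total = hits + reads
    return hits / total if total else 1.0

# Illustrative snapshots taken 60 seconds apart.
prev = {"xact_commit": 1_000_000, "xact_rollback": 5_000,
        "blks_hit": 90_000_000, "blks_read": 1_000_000}
curr = {"xact_commit": 1_090_000, "xact_rollback": 5_600,
        "blks_hit": 98_900_000, "blks_read": 1_100_000}

print(tps(prev, curr, 60))          # -> 1510.0 transactions/second
print(cache_hit_ratio(prev, curr))  # hits / (hits + reads) over the interval
```

Computing both metrics as deltas over the same interval keeps the node-to-node comparison consistent, since the raw counters are cumulative since the last stats reset.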