# Evaluate new N2/N2D hardware with real production workload
## Overview
To evaluate the hardware under real production workload, we'll deploy one node of each proposed new machine type as a Read Replica in our production environments (GSTG, then GPRD). This will allow us to compare performance and resource usage metrics between nodes on the old hardware and the new hardware while they serve the same real traffic.
The target hardware for our Patroni Main cluster in GPRD is `n2-highmem-128`, and for the Patroni CI cluster we'll test `n2-highmem-96` (see &851 (comment 1319975492)). We should also evaluate `n2d-standard-224` in the Main cluster for comparison purposes.

In GSTG the Patroni hardware is currently `n1-standard-8` for all clusters, so the target hardware for GSTG will be `n2-standard-8`.
## Plan
### Deployment and evaluation plan for the Main and CI clusters
- Deploy 1x `n2-standard-8` and 1x `n2d-standard-8` node in the `patroni-main-2004` cluster, and 1x `n2-standard-8` node in the `patroni-ci-2004` cluster, in the GSTG environment, to validate the deployment process;
- Deploy 1x `n2-highmem-128` and 1x `n2d-standard-224` node in the `patroni-main-2004` cluster, and 1x `n2-highmem-96` node in the `patroni-ci-2004` cluster, in the GPRD environment;
- Remove 2x `n1-highmem-96` nodes from the `patroni-main-2004` cluster and 1x `n1-highmem-96` node from the `patroni-ci-2004` cluster in GPRD:
  - CR1: production#8648 (closed) - rolled back for `patroni-main-2004`
  - CR2: production#8694 (closed) - rolled back for `patroni-main-2004`, as it is not possible to have uneven workload load balancing and there's a risk of saturating the old N1 nodes
- Compare performance metrics between `n1-highmem-96`, `n2-highmem-128` and `n2d-standard-224` nodes in GPRD (for at least 1 week).
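The comparison step above boils down to grouping per-node metric samples by machine type and comparing aggregates side by side. A minimal sketch, assuming the samples have already been exported from monitoring (the node names, mapping, and utilisation values below are hypothetical, not real data):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical node -> machine-type mapping, mirroring the node types in the plan.
machine_type = {
    "patroni-main-2004-101": "n1-highmem-96",
    "patroni-main-2004-102": "n2-highmem-128",
    "patroni-main-2004-103": "n2d-standard-224",
}

# Hypothetical CPU utilisation samples: (node, fraction busy).
samples = [
    ("patroni-main-2004-101", 0.71),
    ("patroni-main-2004-102", 0.52),
    ("patroni-main-2004-103", 0.47),
    ("patroni-main-2004-101", 0.75),
    ("patroni-main-2004-102", 0.50),
]

# Group samples by machine type, then average for a side-by-side comparison.
by_type = defaultdict(list)
for node, value in samples:
    by_type[machine_type[node]].append(value)

averages = {mtype: mean(values) for mtype, values in by_type.items()}
print(averages)
```

The same grouping applies to any of the evaluation metrics below; only the sampled value changes.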
## Evaluation Metrics
The following metrics will be evaluated:
- TPS
  - For the comparison to be fair, the compared nodes should be receiving the same amount of TPS;
- Avg query response time
- Volume of slow queries being logged
- CPU usage
  - Also check CPU usage by state (user, system, iowait, idle)
- CPU load
- Memory footprint
  - Mem free/used/cached
  - Swapping
  - PG cache hits
  - PG cache misses
- I/O footprint
  - I/O throughput
  - IOPS
  - I/O wait
- PG wait events (if we can sample them)
PS: other metrics might be added to the above list.
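Two of the metrics above (TPS and PG cache hits/misses) are derived from PostgreSQL's cumulative `pg_stat_database` counters rather than read directly. A minimal sketch of the arithmetic, assuming two snapshots of the counters taken a known interval apart (the snapshot values here are illustrative, not production numbers):

```python
# Derive TPS and cache hit ratio from two snapshots of PostgreSQL's
# cumulative pg_stat_database counters.

def tps(prev, curr, interval_s):
    """Transactions per second between two snapshots (commits + rollbacks)."""
    delta = (curr["xact_commit"] + curr["xact_rollback"]) - (
        prev["xact_commit"] + prev["xact_rollback"])
    return delta / interval_s

def cache_hit_ratio(prev, curr):
    """Fraction of block reads served from shared buffers between snapshots."""
    hits = curr["blks_hit"] - prev["blks_hit"]
    reads = curr["blks_read"] - prev["blks_read"]
    total = hits + reads
    return hits / total if total else 1.0

# Illustrative snapshots taken 60 seconds apart.
prev = {"xact_commit": 1_000_000, "xact_rollback": 5_000,
        "blks_hit": 90_000_000, "blks_read": 1_000_000}
curr = {"xact_commit": 1_090_000, "xact_rollback": 5_600,
        "blks_hit": 98_900_000, "blks_read": 1_100_000}

print(tps(prev, curr, 60))          # -> 1510.0 transactions/second
print(cache_hit_ratio(prev, curr))  # hits / (hits + reads) over the interval
```

Computing both metrics as deltas over the same interval keeps the node-to-node comparison consistent, since the raw counters are cumulative since the last stats reset.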