Postgres 12: observed performance regressions

Currently, the target database is set up as we designed it for production:

  • PG12
  • PG12 Setup
  • OS Ubuntu 18
  • Hardware AMD EPYC Rome (currently 128 vCPUs)

Our source cluster is similar to current production.

We had the following results:

jfinotto@jmeter-01-inf-db-benchmarking.c.gitlab-db-benchmarking.internal:~/db-migration/benchmark/bin$ ./run-bench.sh -h pgbouncer.service.consul -d gitlabhq_production_pg12ute_source -U gitlab-superuser -p 6432 -e prd -t postgres-benchmark-bytime-final.jmx -j 30 -T 60 -r result0406_source_time_testf005.csv
Creating summariser <summary>
Created the tree successfully using plan/postgres-benchmark-bytime-final.jmx
Starting standalone test @ Tue Apr 06 16:53:05 UTC 2021 (1617727985939)
Waiting for possible Shutdown/StopTestNow/HeapDump/ThreadDump message on port 4445
summary +   4141 in 00:00:24 =  176.1/s Avg:   133 Min:     0 Max: 17734 Err:     0 (0.00%) Active: 30 Started: 30 Finished: 0
summary +   5121 in 00:00:30 =  170.7/s Avg:   176 Min:     0 Max: 23040 Err:     0 (0.00%) Active: 30 Started: 30 Finished: 0
summary =   9262 in 00:00:54 =  173.0/s Avg:   156 Min:     0 Max: 23040 Err:     0 (0.00%)
summary +   1266 in 00:00:16 =   77.2/s Avg:   336 Min:     0 Max: 28910 Err:     1 (0.08%) Active: 0 Started: 30 Finished: 30
summary =  10528 in 00:01:10 =  150.5/s Avg:   178 Min:     0 Max: 28910 Err:     1 (0.01%)
Tidying up ...    @ Tue Apr 06 16:54:16 UTC 2021 (1617728056411)
... end of run
jfinotto@jmeter-01-inf-db-benchmarking.c.gitlab-db-benchmarking.internal:~/db-migration/benchmark/bin$ ./run-bench.sh -h pgbouncer.service.consul -d gitlabhq_production_pg12ute_target -U gitlab-superuser -p 6432 -e prd -t postgres-benchmark-bytime-final.jmx -j 30 -T 60 -r result0406_target_time_testf005.csv
Creating summariser <summary>
Created the tree successfully using plan/postgres-benchmark-bytime-final.jmx
Starting standalone test @ Tue Apr 06 16:54:37 UTC 2021 (1617728077815)
Waiting for possible Shutdown/StopTestNow/HeapDump/ThreadDump message on port 4445
summary +   2628 in 00:00:22 =  121.2/s Avg:   150 Min:     0 Max: 18850 Err:     0 (0.00%) Active: 30 Started: 30 Finished: 0
summary +   1463 in 00:00:31 =   47.8/s Avg:   283 Min:     0 Max: 37592 Err:     0 (0.00%) Active: 30 Started: 30 Finished: 0
summary =   4091 in 00:00:52 =   78.2/s Avg:   197 Min:     0 Max: 37592 Err:     0 (0.00%)
summary +    191 in 00:01:05 =    3.0/s Avg:  3160 Min:     0 Max: 95057 Err:     0 (0.00%) Active: 17 Started: 30 Finished: 13
summary =   4282 in 00:01:57 =   36.6/s Avg:   330 Min:     0 Max: 95057 Err:     0 (0.00%)
summary +      7 in 00:01:28 =    0.1/s Avg: 149608 Min: 93354 Max: 201242 Err:     0 (0.00%) Active: 10 Started: 30 Finished: 20
summary =   4289 in 00:03:25 =   20.9/s Avg:   573 Min:     0 Max: 201242 Err:     0 (0.00%)
summary +      5 in 00:00:27 =    0.2/s Avg: 191537 Min: 163211 Max: 231412 Err:     0 (0.00%) Active: 5 Started: 30 Finished: 25
summary =   4294 in 00:03:52 =   18.5/s Avg:   796 Min:     0 Max: 231412 Err:     0 (0.00%)
summary +      3 in 00:01:05 =    0.0/s Avg: 276280 Min: 265850 Max: 286316 Err:     0 (0.00%) Active: 2 Started: 30 Finished: 28
summary =   4297 in 00:04:57 =   14.5/s Avg:   988 Min:     0 Max: 286316 Err:     0 (0.00%)
summary +      1 in 00:02:49 =    0.0/s Avg: 454381 Min: 454381 Max: 454381 Err:     0 (0.00%) Active: 0 Started: 30 Finished: 30
summary =   4298 in 00:07:47 =    9.2/s Avg:  1093 Min:     0 Max: 454381 Err:     0 (0.00%)
Tidying up ...    @ Tue Apr 06 17:02:24 UTC 2021 (1617728544968)
... end of run

On the target, the test designed to run for 1 minute took almost 8 minutes (00:07:47, versus 00:01:10 on the source), which points to a possible performance regression.
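
To make the comparison easier to repeat, here is a minimal Python sketch that pulls the last cumulative "summary =" line out of a JMeter console log and compares elapsed time between source and target. The helper name final_summary and the regular expression are my own assumptions and not part of the benchmark repo; the two sample lines are copied from the runs above.

```python
import re

# Matches JMeter summariser cumulative lines such as:
# summary =  10528 in 00:01:10 =  150.5/s Avg: 178 Min: 0 Max: 28910 Err: 1 (0.01%)
SUMMARY_RE = re.compile(
    r"summary =\s+(?P<count>\d+) in (?P<h>\d+):(?P<m>\d+):(?P<s>\d+)"
    r" =\s+(?P<rate>[\d.]+)/s.*?Max:\s+(?P<max_ms>\d+)"
)


def final_summary(log_text):
    """Return the last cumulative summary found in a JMeter console log."""
    last = None
    for match in SUMMARY_RE.finditer(log_text):
        last = match  # keep overwriting so we end with the final total
    if last is None:
        raise ValueError("no 'summary =' line found")
    elapsed = int(last["h"]) * 3600 + int(last["m"]) * 60 + int(last["s"])
    return {
        "samples": int(last["count"]),
        "elapsed_s": elapsed,
        "rate_per_s": float(last["rate"]),
        "max_ms": int(last["max_ms"]),
    }


# Final cumulative lines from the source and target runs above.
source_log = "summary =  10528 in 00:01:10 =  150.5/s Avg:   178 Min:     0 Max: 28910 Err:     1 (0.01%)"
target_log = "summary =   4298 in 00:07:47 =    9.2/s Avg:  1093 Min:     0 Max: 454381 Err:     0 (0.00%)"

source = final_summary(source_log)
target = final_summary(target_log)
slowdown = target["elapsed_s"] / source["elapsed_s"]

print("source:", source)
print("target:", target)
print("target elapsed / source elapsed:", round(slowdown, 1))
```

The same function can be fed the full console output of a run, since it only keeps the last "summary =" line it finds.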

Another test was executed with statement_timeout set to 15 seconds (as it is in production).

jfinotto@jmeter-01-inf-db-benchmarking.c.gitlab-db-benchmarking.internal:~/db-migration/benchmark/bin$ ./run-bench.sh -h pgbouncer.service.consul -d gitlabhq_production_pg12ute_target -U gitlab-superuser -p 6432 -e prd -t postgres-benchmark-bytime-final.jmx -j 30 -T 60 -r result0406_target_time_testf004.csv
Creating summariser <summary>
Created the tree successfully using plan/postgres-benchmark-bytime-final.jmx
Starting standalone test @ Tue Apr 06 16:44:50 UTC 2021 (1617727490488)
Waiting for possible Shutdown/StopTestNow/HeapDump/ThreadDump message on port 4445
summary +   1407 in 00:00:09 =  157.4/s Avg:   112 Min:     0 Max:  6204 Err:     0 (0.00%) Active: 30 Started: 30 Finished: 0
summary +   2858 in 00:00:30 =   94.6/s Avg:   304 Min:     0 Max: 15607 Err:    29 (1.01%) Active: 30 Started: 30 Finished: 0
summary =   4265 in 00:00:39 =  108.9/s Avg:   240 Min:     0 Max: 15607 Err:    29 (0.68%)
summary +   1350 in 00:00:30 =   44.7/s Avg:   614 Min:     0 Max: 15852 Err:    30 (2.22%) Active: 7 Started: 30 Finished: 23
summary =   5615 in 00:01:09 =   81.0/s Avg:   330 Min:     0 Max: 15852 Err:    59 (1.05%)
summary +      6 in 00:00:06 =    1.1/s Avg: 15337 Min: 15001 Max: 15836 Err:     6 (100.00%) Active: 0 Started: 30 Finished: 30
summary =   5621 in 00:01:15 =   75.0/s Avg:   346 Min:     0 Max: 15852 Err:    65 (1.16%)
Tidying up ...    @ Tue Apr 06 16:46:06 UTC 2021 (1617727566048)
... end of run
jfinotto@jmeter-01-inf-db-benchmarking.c.gitlab-db-benchmarking.internal:~/db-migration/benchmark/bin$ ./run-bench.sh -h pgbouncer.service.consul -d gitlabhq_production_pg12ute_source -U gitlab-superuser -p 6432 -e prd -t postgres-benchmark-bytime-final.jmx -j 30 -T 60 -r result0406_source_time_testf004.csv
Creating summariser <summary>
Created the tree successfully using plan/postgres-benchmark-bytime-final.jmx
Starting standalone test @ Tue Apr 06 16:46:47 UTC 2021 (1617727607006)
Waiting for possible Shutdown/StopTestNow/HeapDump/ThreadDump message on port 4445
summary +   2379 in 00:00:12 =  191.3/s Avg:   110 Min:     0 Max:  9099 Err:     0 (0.00%) Active: 30 Started: 30 Finished: 0
summary +   5134 in 00:00:30 =  171.5/s Avg:   166 Min:     0 Max: 15729 Err:    17 (0.33%) Active: 30 Started: 30 Finished: 0
summary =   7513 in 00:00:42 =  177.3/s Avg:   148 Min:     0 Max: 15729 Err:    17 (0.23%)
summary +   3274 in 00:00:29 =  112.6/s Avg:   230 Min:     0 Max: 15354 Err:    18 (0.55%) Active: 0 Started: 30 Finished: 30
summary =  10787 in 00:01:11 =  151.0/s Avg:   173 Min:     0 Max: 15729 Err:    35 (0.32%)
Tidying up ...    @ Tue Apr 06 16:47:59 UTC 2021 (1617727679074)
... end of run

The error rate is higher on the target (65 errors, 1.16%) than on the source (35 errors, 0.32%), due to timeout disconnections.
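
As a quick check of the error-rate gap, the arithmetic can be sketched in Python as below. The totals are copied from the final "summary =" lines of the two statement_timeout runs above; the names runs and error_rate_pct are illustrative only.

```python
# (samples, errors) from the final "summary =" line of each
# statement_timeout run above.
runs = {
    "target": (5621, 65),
    "source": (10787, 35),
}


def error_rate_pct(samples, errors):
    """Error rate as a percentage of all samples."""
    return 100.0 * errors / samples


rates = {name: error_rate_pct(*totals) for name, totals in runs.items()}
ratio = rates["target"] / rates["source"]

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}% errors")
print(f"target/source error-rate ratio: {ratio:.1f}x")
```

Note that the target also completed roughly half as many samples in the same window, so the absolute error counts understate the difference.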

I will add more details and the reports.

The repo used with the info from the queries and the workloads is the following:

https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/benchmark
Edited by Gerardo Lopez-Fernandez