Database Capacity and Saturation Analysis (Iteration 1)
In this issue we evaluate the current database capacity and saturation. Providing a time estimate for reaching saturation is difficult, since we cannot predict future usage, but the analysis does give us guidance on headroom and lets us make rough estimates of time frames at our current growth rate. We will continue to monitor, analyze, and update this report regularly, sharing our findings at the Infra Call. We also intend to run a detailed load benchmark against the current database setup to fully understand where the saturation points are.
Conclusions
Considering the analyzed metrics, the primary database is executing between 40k and 50k TPS during weekly peak times, with a load average ranging from 20 to 35. We have experienced some higher spikes, and these need to be addressed, as they are a real concern for the reliability and stability of the database. The next iteration of this analysis will highlight these peaks and aim to bring them to Performance and Availability.
Given the 3-month timeframe we have analyzed, we have a high degree of confidence that the current architecture is in good shape to handle our needs over the next 12 months. This does not imply that the saturation wall is exactly 12 months away: the current architecture could support an average load of up to 70 and a throughput of 70k-80k TPS, and growth in both areas has been under 10% during the last 3 months.
Additionally, we still have tuning options at our disposal to push beyond the current saturation points: we can expect some gains from tuning the shared_buffers memory setting (at the moment the gain would be minimal, but in case of degraded memory access it could still be tuned), and in some cases a hardware upgrade would be another option (the idea would be to consider n2 machine types, which can provide up to 224 vCPUs). Some of these actions have not been executed because they would require a cluster restart or would represent an extra cost that is not required at the moment, e.g., increasing the number of CPU cores or increasing the maximum number of connections. A more detailed explanation of the possible problem scenarios and their mitigations is provided in the Risks section of this issue.
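For reference, a minimal read-only sketch of how these settings could be checked on the primary. It assumes psycopg2 and a reachable host; the DSN below is a placeholder, not our real connection string.

```python
# Minimal sketch: list the memory- and connection-related settings mentioned above.
# Assumes psycopg2 is installed; the DSN is a placeholder.
import psycopg2

SETTINGS = (
    "shared_buffers",
    "effective_cache_size",
    "work_mem",
    "max_connections",
)

def show_settings(dsn: str) -> None:
    """Print current value, unit, and whether a change needs a restart."""
    query = """
        SELECT name, setting, unit, context
        FROM pg_settings
        WHERE name = ANY(%s)
        ORDER BY name;
    """
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(query, (list(SETTINGS),))
            for name, setting, unit, context in cur.fetchall():
                # context = 'postmaster' means the change requires a restart,
                # which is why some of these actions have been deferred.
                print(f"{name:22} {setting:>12} {unit or '':5} (change context: {context})")

if __name__ == "__main__":
    show_settings("host=primary.db.internal dbname=gitlabhq_production user=readonly")
```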
Regarding the disk I/O metrics, even with high utilization saturation, the read and write IOPS are low compared with the GCP hardware limits. The volume of data received and transmitted on our network interfaces also does not represent a risk.
Our PgBouncer single-core saturation is under 30% (excluding peaks), and we could increase the number of nodes and the pool size, since we have spare capacity on the PostgreSQL side.
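To keep an eye on the PgBouncer side of this, a small sketch that polls the PgBouncer admin console with SHOW POOLS; the host, port, and user below are assumptions, and sustained cl_waiting would be the signal to grow the pools or add nodes.

```python
# Minimal sketch: poll the PgBouncer admin console for pool usage.
# Assumes psycopg2; host/port/user are placeholders.
import psycopg2

def show_pools(host: str, port: int = 6432, user: str = "pgbouncer") -> None:
    # The admin console does not support transactions, so use autocommit.
    conn = psycopg2.connect(host=host, port=port, user=user, dbname="pgbouncer")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("SHOW POOLS;")
        cols = [d[0] for d in cur.description]
        for row in cur.fetchall():
            pool = dict(zip(cols, row))
            # cl_waiting > 0 for sustained periods suggests the pool size
            # (or the number of PgBouncer processes) needs to grow.
            print(pool["database"], pool["user"],
                  "cl_active:", pool["cl_active"],
                  "cl_waiting:", pool["cl_waiting"],
                  "sv_active:", pool["sv_active"])
    conn.close()

if __name__ == "__main__":
    show_pools("pgbouncer-01.db.internal")
```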
Disk usage keeps growing. The initiative to pg_repack the database should reduce the bloat level, and combined with migrating the external_merge_requests table to different data storage, we could achieve a reduction of 25% to 35% of the data volume present today.
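A minimal sketch of how pg_repack candidates could be shortlisted, using the dead-tuple counters in pg_stat_user_tables as a rough proxy for bloat. The DSN and the 20% threshold are illustrative assumptions, not values from our setup.

```python
# Minimal sketch: rank tables by dead-tuple ratio as pg_repack candidates.
# n_dead_tup is only a rough proxy for bloat; pgstattuple gives exact numbers.
import psycopg2

QUERY = """
    SELECT relname,
           n_live_tup,
           n_dead_tup,
           pg_total_relation_size(relid) AS total_bytes
    FROM pg_stat_user_tables
    WHERE n_live_tup + n_dead_tup > 0
    ORDER BY n_dead_tup DESC
    LIMIT 20;
"""

def repack_candidates(dsn: str, min_dead_ratio: float = 0.20) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        for relname, live, dead, total_bytes in cur.fetchall():
            ratio = dead / (live + dead)
            if ratio >= min_dead_ratio:
                print(f"{relname:40} dead ratio {ratio:5.1%} "
                      f"size {total_bytes / 1024**3:6.1f} GiB")

if __name__ == "__main__":
    repack_candidates("host=primary.db.internal dbname=gitlabhq_production user=readonly")
```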
Considering the saturation metrics, resource usage, and the average growth during the last 3 months:
- The TPS is in the range of 20k-45k; the current hardware should support between 70k and 80k.
- [NOTE] The TPS is correlated with CPU utilization / average load: the more queries arrive in the same millisecond, the more cores are used, and once cores are exhausted queries start queuing. We are estimating this from a pg_stat_statements analysis of the number of calls, blocks read, query types, and time consumed. Caveats: if the query mix changes, the evaluation changes too, and the result also depends substantially on the hardware, i.e., on how much we can process now. If we have a load of ~30 at 40k TPS, a linear extrapolation puts us at ~60 at 80k TPS, which leaves some headroom for error (a sketch of this extrapolation follows this list).
- The average load is between 15 and 30 points and could grow to 70-80. Similar numbers apply (anecdotally) to the DB CPU utilization.
- [NOTE] Beyond an average load of ~100 the database gets into heavy context switching and is unable to process most queries (we have seen instances where lookups by ID that normally resolve in milliseconds took minutes). In general, around 100 is considered a risk in PostgreSQL "culture", hence we are conservative and estimate that we can reach 70-80. This again depends on the hardware and OS.
- The disk I/O saturation is over 80%, but analyzing the metrics in detail:
  - Disk read OPS reach an average of 6k; the maximum supported by our disks is 75k.
  - Disk write OPS reach an average of 6k-8k; the maximum supported by our disks is 30k.
- The PostgreSQL memory cache hit ratio is over 99%. That indicates the memory setup is adequate.
- [NOTE] Based on the pg_stat_statements documentation (https://www.postgresql.org/docs/11/pgstatstatements.html), we summed shared_blks_hit (blocks read from the shared buffer cache) and shared_blks_read (blocks read from disk) across all the SELECT statements executed on the primary (the sketch after this list reproduces this calculation).
- The network capacity is used at 20-30% on average.
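For transparency, a minimal sketch of the estimations referenced in the notes above: the SELECT cache-hit ratio summed from pg_stat_statements, and the naive linear extrapolation from load ~30 at 40k TPS to ~60 at 80k TPS. The DSN is a placeholder, and the linearity assumption only holds while we stay away from queuing.

```python
# Minimal sketch of the headroom estimates above.
# Assumes psycopg2 and the pg_stat_statements extension; the DSN is a placeholder.
import psycopg2

def select_cache_hit_ratio(dsn: str) -> float:
    """Sum shared_blks_hit / (hit + read) over all SELECT statements,
    mirroring the pg_stat_statements-based calculation described above."""
    query = """
        SELECT sum(shared_blks_hit)::float,
               sum(shared_blks_read)::float
        FROM pg_stat_statements
        WHERE query ILIKE 'SELECT%';
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query)
        hit, read = cur.fetchone()
        return hit / (hit + read)

def extrapolate_load(current_load: float, current_tps: float, target_tps: float) -> float:
    """Naive linear extrapolation: load ~30 at 40k TPS -> ~60 at 80k TPS.
    Only a rough guide; it stops being valid once cores saturate and queries queue."""
    return current_load * (target_tps / current_tps)

if __name__ == "__main__":
    print(f"projected load at 80k TPS: {extrapolate_load(30, 40_000, 80_000):.0f}")
    # print(f"SELECT cache hit ratio: {select_cache_hit_ratio('host=... dbname=...'):.2%}")
```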
Peaks
The spikes are events that happen randomly and are triggered by different sources, such as abuse incidents, bugs, or our solution reaching a high level of saturation that in some scenarios generates a high consumption of resources across the database ecosystem.
The following issue tracks an investigation, involving several teams, that is trying to address the root cause: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9396
Risks
| Risk | Details | Realized? |
|---|---|---|
| High processor load saturation / TPS reaching limits / higher context switching | If any of these issues arises, a solution would be to upgrade the database cluster hardware to an architecture with more cores (these could be the n2 machine types on GCP); our current setup has 96 cores. Initially it is also possible to increase the number of parallel connections allowed to the database: the current setting is 300 and it could be increased to 500, giving us room to improve our processing capacity (see the sketch after this table). | |
| PgBouncer saturation | If we experience PgBouncer saturation, since this component is single-threaded, it would be possible to rebalance the pools, add more instances to use more cores, or increase the number of cores per instance (available in the current PgBouncer version). We could also combine an increase of max_connections on the PostgreSQL side with an increase in the number of instances that are part of the pool. | |
| Memory saturation or increased hits on disk | If our hot data set (the data used frequently) starts to be larger than what the current buffer cache can hold, it would be possible to increase the buffer cache to improve efficiency, or to add more memory to our database instances. | |
| The performance of the disks used for logging is degraded | In case of further performance degradation of the sda and sdc disks, it would be possible to improve performance by using SSD disks. Increasing the log level would capture a more realistic workload for the database cluster. | |
| Some queries perform badly, consuming a high amount of resources from the database cluster | These queries should be addressed with the help of the engineering teams to improve their performance. | |
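The first rows of the table come down to configuration changes that can be staged with ALTER SYSTEM and verified against pg_settings, but that only take effect after a cluster restart, which is why they have been deferred. A minimal sketch, with a placeholder DSN and an illustrative shared_buffers value:

```python
# Minimal sketch: stage the connection/memory mitigations from the Risks table.
# The DSN and the shared_buffers value are placeholders; max_connections and
# shared_buffers only take effect after a cluster restart.
import psycopg2

def stage_setting(dsn: str, name: str, value: str) -> None:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
    with conn.cursor() as cur:
        # The setting name cannot be a bind parameter, so only pass trusted names here.
        cur.execute(f"ALTER SYSTEM SET {name} = %s;", (value,))
        cur.execute("SELECT context FROM pg_settings WHERE name = %s;", (name,))
        print(f"{name} staged to {value}; change context: {cur.fetchone()[0]}")
    conn.close()

if __name__ == "__main__":
    dsn = "host=primary.db.internal dbname=postgres user=postgres"
    stage_setting(dsn, "max_connections", "500")   # per the mitigation above
    stage_setting(dsn, "shared_buffers", "112GB")  # illustrative value only
```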
Metrics from the last 3 months
PostgreSQL TPS on the primary during the last 3 months:
The current average load:
Context switches:
I/O bytes Reads per second:
Disk reads OPS per second:
I/O bytes Writes per second:
Disk Writes OPS per second:
Disk reads IO wait:
Disk writes IO wait:
Disk IO utilization:
Pgbouncer single-core saturation:
Memory by usage type:
Disk usage: