Consider upscaling VMs with a single vCPU in production to decrease sensitivity to saturation
We've recently seen at least two incidents where single-core VMs showed periodic saturation patterns roughly matching the cadence of chef-repo runs:
- production#6238 (closed) sd-exporter-01-inf-gprd is down: This caused a myriad of alerts from the exporter going dark. We already upscaled this VM to mitigate the incident.
- production#6230 (closed) Flaky git service SLI: This caused intermittent interruptions to feature flag changes and deployments. In this case it's not obvious that a simple upscale would fix the latency issue, but we did notice the periodic pattern of CPU stress. The VMs remain single-core.
One avenue of investigation is the saturation that chef-client runs are (perhaps unnecessarily) causing on the nodes (see https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15083). Regardless of that outcome, we might want to consider discouraging single-core VMs and upscaling our current ones, so that they are not left with so little headroom that otherwise routine CPU work causes monitoring noise.
Currently the following machines are single-core (n1-standard-1: 1 vCPU, 3.75 GB of RAM):
- camoproxy-01-sv-gprd
- camoproxy-02-sv-gprd
- redis-cache-sentinel-01-db-gprd
- redis-cache-sentinel-02-db-gprd
- redis-cache-sentinel-03-db-gprd
Upscaling these machines would give us more headroom, so that a resource-intensive task, even something as routine as a chef-client run, does not have an outsized effect on them.
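
To make the headroom argument concrete, here is a rough back-of-the-envelope sketch. The baseline and chef-client CPU figures are hypothetical placeholders, not measurements from these hosts, and n1-standard-2 (2 vCPUs) is only an assumed upscale target; the function name is also made up for illustration.

```python
def peak_utilization(baseline_cores: float, task_cores: float, vcpus: int) -> float:
    """Fraction of total CPU capacity in use while the task runs, capped at 1.0."""
    demand = baseline_cores + task_cores
    return min(demand / vcpus, 1.0)


# Hypothetical figures (assumptions, not measured on the hosts listed above):
BASELINE_CORES = 0.2   # steady-state service load, in cores
CHEF_RUN_CORES = 0.9   # a mostly single-threaded chef-client run, in cores

# n1-standard-1 is the current type; n1-standard-2 is an assumed upscale target.
for machine_type, vcpus in [("n1-standard-1", 1), ("n1-standard-2", 2)]:
    utilization = peak_utilization(BASELINE_CORES, CHEF_RUN_CORES, vcpus)
    print(f"{machine_type}: {utilization:.0%} peak CPU during the run")
```

Under these assumed numbers the single-core machine is fully saturated for the duration of the run, while the two-core machine peaks at a little over half its capacity. Real figures should come from the hosts' own CPU metrics before deciding on a target size.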