Consider upscaling VMs with a single vCPU in production to decrease sensitivity to saturation

We've recently seen at least two incidents where single-core VMs have shown periodic patterns of saturation roughly matching the cadence of chef-repo runs:

  • production#6238 (closed) sd-exporter-01-inf-gprd is down: This caused a myriad of alerts when the exporter went dark. We already upscaled this VM to mitigate the incident.
  • production#6230 (closed) Flaky git service SLI: This caused intermittent interruptions to feature flag changes and deployments. In this case it's not obvious that a simple upscale would fix the latency issue, but we did notice the periodic pattern of CPU stress. The VMs remain single-core.

One avenue of investigation is the saturation that chef-client runs are (perhaps unnecessarily) causing on the nodes (see https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15083). Independently of that, we might want to discourage single-core VMs and upscale our current ones, so that otherwise routine CPU work has enough headroom and doesn't cause monitoring noise.

Currently the following machines are single-core (n1-standard-1: 1 vCPU, 3.75 GB of RAM):

  • camoproxy-01-sv-gprd
  • camoproxy-02-sv-gprd
  • redis-cache-sentinel-01-db-gprd
  • redis-cache-sentinel-02-db-gprd
  • redis-cache-sentinel-03-db-gprd
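
As a rough illustration of how the list above could be kept current, here is a minimal Python sketch that filters an inventory for single-vCPU machine types and proposes a target size. The hostnames and the n1-standard-1 type come from this issue; the `INVENTORY` dict, the function names, and the `n1-standard-2` target are assumptions made for the example, not an agreed plan or an existing tool.

```python
# vCPU counts for a few GCP N1 standard machine types.
N1_STANDARD_VCPUS = {
    "n1-standard-1": 1,
    "n1-standard-2": 2,
    "n1-standard-4": 4,
}

# Hypothetical inventory: hostnames and machine type are taken from the
# list in this issue; a real one would be pulled from the cloud provider.
INVENTORY = {
    "camoproxy-01-sv-gprd": "n1-standard-1",
    "camoproxy-02-sv-gprd": "n1-standard-1",
    "redis-cache-sentinel-01-db-gprd": "n1-standard-1",
    "redis-cache-sentinel-02-db-gprd": "n1-standard-1",
    "redis-cache-sentinel-03-db-gprd": "n1-standard-1",
}


def single_vcpu_hosts(inventory, vcpus_by_type=N1_STANDARD_VCPUS):
    """Return the sorted hostnames whose machine type has exactly one vCPU."""
    return sorted(
        host
        for host, machine_type in inventory.items()
        if vcpus_by_type.get(machine_type) == 1
    )


def upscale_plan(inventory, target="n1-standard-2"):
    """Map each single-vCPU host to a proposed machine type.

    The default target is an assumption for illustration only.
    """
    return {host: target for host in single_vcpu_hosts(inventory)}


if __name__ == "__main__":
    for host, target in upscale_plan(INVENTORY).items():
        print(f"{host}: {INVENTORY[host]} -> {target}")
```

Keeping the vCPU lookup separate from the inventory means the same check would keep working if other machine families were added to the table later.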

Upscaling these machines would give us more wiggle room, so that a resource-intensive task, even something as routine as a chef-client run, doesn't have an exaggerated effect on them.

Edited by Alejandro Rodríguez