Investigate Kubernetes workload reliability

During a full cluster nodepool rotation in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/12414#note_938765380, the following potential issues were found:

  • Regional:

    • HPA MinPods < 3
    • Missing PDBs (e.g. thanos)
    • PDB maxUnavailable too small (e.g. 1 for 150 pods); use a percentage >= 4% instead
    • gitlab-sidekiq-urgent-other-v2 HPA MinPods = 175 out of 175 max
    • gitlab-sidekiq-imports-v2 HPA MinPods = 10 out of 10 max
    • gitlab-sidekiq-catchall-v2 HPA MinPods = 56 out of 300 max
    • gitlab-mailroom HPA MaxPods = 2 (is this intentional?)
    • gitlab-sidekiq-database-throttled-v2 HPA MinPods = 1 out of 1 max
  • Zonal:

    • Nginx PDB minAvailable of 2 while the HPA runs >= 20 pods
    • PDB maxUnavailable too small (e.g. 1 for 150 pods); use a percentage >= 4% instead
    • gitlab-shell HPA MinPods = 100 out of 150 max
    • gitlab-registry HPA MinPods = 50 out of 90 max
    • gitlab-webservice-websockets HPA MinPods = 2
    • gitlab-gitlab-pages HPA MinPods = 2
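
To make the "1 for 150 pods" example concrete, the rough sketch below compares how many pods a PDB lets a node drain evict at once under an absolute maxUnavailable versus a percentage. The function names are hypothetical, and the model is deliberately simplified: it assumes all pods are healthy and treats a rotation as sequential eviction rounds, ignoring scheduling and readiness delays.

```python
def allowed_disruptions(replicas: int, max_unavailable) -> int:
    """Pods that may be unavailable at once, for maxUnavailable given as an int or 'N%'."""
    if isinstance(max_unavailable, str) and max_unavailable.endswith("%"):
        percent = int(max_unavailable[:-1])
        # Kubernetes rounds percentage values up to a whole pod;
        # integer ceiling division avoids float rounding surprises.
        return -(-replicas * percent // 100)
    return int(max_unavailable)


def eviction_rounds(replicas: int, max_unavailable) -> int:
    """Simplified lower bound on eviction rounds needed to move every pod (assumption)."""
    per_round = allowed_disruptions(replicas, max_unavailable)
    if per_round <= 0:
        raise ValueError("PDB allows no disruptions; a drain would block")
    return -(-replicas // per_round)


if __name__ == "__main__":
    for setting in (1, "4%"):
        print(setting, allowed_disruptions(150, setting), eviction_rounds(150, setting))
```

Under these assumptions, a 150-pod deployment with maxUnavailable 1 needs 150 rounds, while 4% allows 6 pods per round and needs 25.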

Desired outcome

  • Review all HPAs
  • Review all PDBs
  • Create MRs to improve behavior
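
As a starting point for the HPA review, the following sketch flags the two patterns called out in the list above: a minimum below 3 pods, and a minimum equal to the maximum. The `Hpa` class and `review_hpa` function are hypothetical helpers, the thresholds are taken from this issue rather than from any Kubernetes default, and the values are copied by hand from the findings instead of being read from the cluster.

```python
from dataclasses import dataclass


@dataclass
class Hpa:
    name: str
    min_pods: int
    max_pods: int


def review_hpa(hpa: Hpa, min_floor: int = 3) -> list[str]:
    """Return findings for one HPA; an empty list means nothing was flagged."""
    findings = []
    if hpa.min_pods < min_floor:
        findings.append(f"MinPods {hpa.min_pods} < {min_floor}")
    if hpa.min_pods == hpa.max_pods:
        # With min == max the HPA is pinned and cannot scale in either direction.
        findings.append(f"MinPods equals MaxPods ({hpa.max_pods})")
    return findings


# Values copied from the findings in this issue (only entries with both min and max).
HPAS = [
    Hpa("gitlab-sidekiq-urgent-other-v2", 175, 175),
    Hpa("gitlab-sidekiq-imports-v2", 10, 10),
    Hpa("gitlab-sidekiq-catchall-v2", 56, 300),
    Hpa("gitlab-sidekiq-database-throttled-v2", 1, 1),
    Hpa("gitlab-shell", 100, 150),
    Hpa("gitlab-registry", 50, 90),
]

if __name__ == "__main__":
    for hpa in HPAS:
        for finding in review_hpa(hpa):
            print(f"{hpa.name}: {finding}")
```

A real review would load the HPA specs from the cluster or the chart values; the checks themselves would stay the same.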
Edited by Filipe Santos