Investigate Kubernetes workload reliability
During a full cluster nodepool rotation in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/12414#note_938765380, the following potential issues were found:
- Regional:
  - HPA MinPods < 3
  - Missing PDBs (e.g. thanos)
  - PDB maxUnavailable too small (e.g. 1 for 150 pods), use percentage >= 4%
  - gitlab-sidekiq-urgent-other-v2 HPA MinPods = 175 out of 175 max
  - gitlab-sidekiq-imports-v2 HPA MinPods = 10 out of 10 max
  - gitlab-sidekiq-catchall-v2 HPA MinPods = 56 out of 300 max
  - gitlab-mailroom HPA MaxPods = 2?
  - gitlab-sidekiq-database-throttled-v2 HPA MinPods = 1 out of 1 max
- Zonal:
  - Nginx minAvailable of 2 with HPA running >= 20 pods
  - PDB maxUnavailable too small (e.g. 1 for 150 pods), use percentage >= 4%
  - gitlab-shell HPA MinPods = 100 out of 150 max
  - gitlab-registry HPA MinPods = 50 out of 90 max
  - gitlab-webservice-websockets HPA MinPods = 2
  - gitlab-gitlab-pages HPA MinPods = 2

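To make the PDB findings above concrete, here is a minimal Python sketch of the arithmetic. It assumes that a percentage `maxUnavailable` is rounded up to a whole pod and that a drain can evict at most that many pods at once. The function names are hypothetical; the pod counts are the examples from the list.

```python
from typing import Union


def max_unavailable_pods(replicas: int, max_unavailable: Union[int, str]) -> int:
    """Resolve a PDB maxUnavailable (integer or "N%") to a pod count."""
    if isinstance(max_unavailable, str) and max_unavailable.endswith("%"):
        percent = int(max_unavailable[:-1])
        # Ceiling division: a percentage is assumed to round up to a whole pod.
        return -(-replicas * percent // 100)
    return int(max_unavailable)


def eviction_rounds(replicas: int, max_unavailable: Union[int, str]) -> int:
    """Lower bound on sequential eviction batches needed to replace every pod."""
    per_round = max_unavailable_pods(replicas, max_unavailable)
    if per_round < 1:
        raise ValueError("maxUnavailable resolves to 0 pods; nothing can be evicted")
    return -(-replicas // per_round)


def disruptions_allowed_by_min_available(replicas: int, min_available: int) -> int:
    """Pods that may be evicted at once under an integer minAvailable."""
    return max(replicas - min_available, 0)


# "1 for 150 pods": every pod must be evicted one at a time.
print(eviction_rounds(150, 1))
# ">= 4%" of 150 pods allows 6 pods per batch.
print(max_unavailable_pods(150, "4%"), eviction_rounds(150, "4%"))
# Nginx: minAvailable of 2 with 20 running pods permits 18 simultaneous evictions.
print(disruptions_allowed_by_min_available(20, 2))
```

Under these assumptions, a fixed `maxUnavailable` of 1 turns a rotation of 150 pods into at least 150 sequential evictions, while 4% cuts that to 25 batches. The Nginx case errs in the other direction: a `minAvailable` of 2 protects only 2 of 20 pods.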
Desired outcome
- Review all HPAs
- Review all PDBs
- Create MRs to improve behavior
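As a starting point for the HPA review, the checks could be sketched in Python as below. This is only an illustration: `review_hpa` is a hypothetical helper that works on plain numbers rather than on the Kubernetes API, the threshold of 3 is taken from the "HPA MinPods < 3" finding, and flagging MinPods equal to MaxPods is an interpretation of the "175 out of 175 max" style entries.

```python
from typing import List, Optional


def review_hpa(name: str, min_pods: int, max_pods: Optional[int] = None) -> List[str]:
    """Return the findings for one HPA; an empty list means nothing was flagged."""
    findings = []
    if min_pods < 3:
        # Threshold taken from the "HPA MinPods < 3" finding.
        findings.append(f"{name}: MinPods {min_pods} is below 3")
    if max_pods is not None and min_pods == max_pods:
        # Assumption: min == max leaves the autoscaler no room to scale.
        findings.append(f"{name}: MinPods equals MaxPods ({max_pods})")
    return findings


# Values copied from the findings; None means the issue does not state MaxPods.
hpas = [
    ("gitlab-sidekiq-database-throttled-v2", 1, 1),
    ("gitlab-shell", 100, 150),
    ("gitlab-gitlab-pages", 2, None),
]
for name, min_pods, max_pods in hpas:
    for finding in review_hpa(name, min_pods, max_pods):
        print(finding)
```

Keeping the rules in one small function makes it easy to adjust a threshold once the review settles on target values.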
Edited by Filipe Santos