Investigate Kubernetes workload reliability

During a full cluster nodepool rotation in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/12414#note_938765380, the following potential issues were found:

Regional:
- HPA MinPods < 3
- Missing PDBs (e.g. thanos)
- PDB maxUnavailable too small (e.g. 1 for 150 pods); use a percentage >= 4% instead
- gitlab-sidekiq-urgent-other-v2 HPA MinPods = 175 out of 175 max
- gitlab-sidekiq-imports-v2 HPA MinPods = 10 out of 10 max
- gitlab-sidekiq-catchall-v2 HPA MinPods = 56 out of 300 max
- gitlab-mailroom HPA MaxPods = 2?
- gitlab-sidekiq-database-throttled-v2 HPA MinPods = 1 out of 1 max

Zonal:
- Nginx minAvailable of 2 with HPA running >= 20 pods
- PDB maxUnavailable too small (e.g. 1 for 150 pods); use a percentage >= 4% instead
- gitlab-shell HPA MinPods = 100 out of 150 max
- gitlab-registry HPA MinPods = 50 out of 90 max
- gitlab-webservice-websockets HPA MinPods = 2
- gitlab-gitlab-pages HPA MinPods = 2
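
To make the "1 for 150 pods" concern concrete, here is a rough sketch of the arithmetic. The function names are hypothetical, and the model is deliberately simplified: it assumes every pod must be evicted once during a rotation, that evictions happen in waves of at most the PDB budget, and that percentage values of maxUnavailable are rounded up, as the Kubernetes PDB documentation describes.

```python
def allowed_disruptions(replicas, max_unavailable):
    """Pods that may be unavailable at once for a given maxUnavailable.

    max_unavailable is either an int (absolute count) or a string
    such as "4%". Percentages are rounded up to a whole pod.
    """
    if isinstance(max_unavailable, str) and max_unavailable.endswith("%"):
        percent = int(max_unavailable[:-1])
        # Integer ceiling division avoids floating-point rounding.
        return (replicas * percent + 99) // 100
    return int(max_unavailable)


def eviction_waves(replicas, max_unavailable):
    """Simplified count of sequential eviction rounds needed to
    replace every pod without exceeding the disruption budget."""
    budget = allowed_disruptions(replicas, max_unavailable)
    if budget <= 0:
        raise ValueError("a budget of 0 blocks all voluntary evictions")
    return -(-replicas // budget)  # ceiling division


# 150 pods with maxUnavailable: 1 are replaced one at a time.
print(eviction_waves(150, 1))
# 150 pods with maxUnavailable: 4% allow 6 pods per wave.
print(allowed_disruptions(150, "4%"), eviction_waves(150, "4%"))
```

Under these assumptions, a fixed budget of 1 needs 150 sequential rounds for 150 pods, while 4% needs 25. Real drain times also depend on pod readiness and node layout, which this sketch ignores.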

Desired outcome

- Review all HPAs
- Review all PDBs
- Create MRs to improve behavior
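
As a starting point for the review, the checks listed in this issue could be expressed as a small audit function. This is only a sketch: the `Workload` record, its field names, and the `audit` function are hypothetical, the input would have to be filled in from the real HPA and PDB manifests, and only an absolute maxUnavailable is modelled (minAvailable and percentage values are not). The thresholds of 3 pods and 4% are taken from the findings above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Workload:
    # Hypothetical record; values would come from the HPA and PDB specs.
    name: str
    hpa_min: int
    hpa_max: int
    # Absolute maxUnavailable from the PDB, or None if no PDB exists.
    pdb_max_unavailable: Optional[int] = None


def audit(w: Workload) -> List[str]:
    """Return the findings from this issue that apply to one workload."""
    findings = []
    if w.hpa_min < 3:
        findings.append("HPA MinPods < 3")
    if w.hpa_min == w.hpa_max:
        findings.append("HPA MinPods equals MaxPods")
    if w.pdb_max_unavailable is None:
        findings.append("missing PDB")
    elif w.pdb_max_unavailable * 100 < 4 * w.hpa_min:
        # Compare against MinPods, the smallest pod count the HPA allows.
        findings.append("PDB maxUnavailable below 4% of MinPods")
    return findings


# Made-up workloads for illustration, not real cluster data.
for w in (
    Workload("example-a", hpa_min=1, hpa_max=1),
    Workload("example-b", hpa_min=150, hpa_max=300, pdb_max_unavailable=1),
    Workload("example-c", hpa_min=150, hpa_max=300, pdb_max_unavailable=6),
):
    print(w.name, audit(w))
```

Keeping the checks as plain data in, findings out makes it easy to run them over an exported list of workloads before opening MRs.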

Edited Jun 16, 2022 by Filipe Santos
