Investigate Review App Failures
Description
Review apps in review-apps-ce and review-apps-ee have had a success rate of under 10% for the last few business days.
Issue summary
CPU usage on nodes is maxing out at 100%, which indicates that pod resource requests do not align with what the pods actually need. This would also limit the effectiveness of autoscaling on GKE, since the autoscaler also looks at requests.
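To illustrate the mismatch described above, here is a minimal sketch that compares the CPU a node's pods request with the CPU they actually use. The pod names are taken from this issue, but every number, the `Pod` type, and the `node_cpu_summary` helper are hypothetical and only meant to show the shape of the problem, not real measurements from the cluster.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    cpu_request_m: int  # CPU request in millicores (hypothetical value)
    cpu_usage_m: int    # observed CPU usage in millicores (hypothetical value)


def node_cpu_summary(pods, allocatable_m):
    """Compare requested CPU with used CPU for one node."""
    requested = sum(p.cpu_request_m for p in pods)
    used = sum(p.cpu_usage_m for p in pods)
    return {
        # What a request-based scheduler or autoscaler sees.
        "requested_pct": 100 * requested / allocatable_m,
        # What the node actually experiences (capped at 100%).
        "used_pct": min(100.0, 100 * used / allocatable_m),
        # Pods whose usage exceeds what they asked for.
        "under_requested": [p.name for p in pods if p.cpu_usage_m > p.cpu_request_m],
    }


# Illustrative numbers only: a 4-core node that looks nearly empty by requests.
pods = [
    Pod("gitaly", cpu_request_m=100, cpu_usage_m=1500),
    Pod("gitlab-exporter", cpu_request_m=100, cpu_usage_m=1800),
    Pod("nginx-ingress-controller", cpu_request_m=200, cpu_usage_m=900),
]
summary = node_cpu_summary(pods, allocatable_m=4000)
print(summary)
```

In this made-up case the node appears only 10% allocated by requests while its CPU is saturated, so a component that decides based on requests has no reason to add capacity.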
Contributing factors
A single root cause has not been identified, but these factors contributed to the spike in resource usage:
- https://gitlab.com/gitlab-org/gitlab-ee/issues/26893 - This caused a normally daily cleanup task to fail silently and not run, which is how we ended up with orphaned pods and stale releases consuming nodes. I would consider this the primary root cause.
- https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/32783 - Limit adjustments based on usage for apps within the chart. As load on the nodes increased, we saw regular health check failures and very high load averages. This occurred with gitlab-exporter, gitaly, and nginx-ingress-controller nodes.
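The first factor above is a cleanup job that stopped running without anyone noticing. As a rough sketch of the two checks involved, the snippet below selects releases older than a cutoff and flags a cleanup run that is overdue. It is not the actual cleanup task: the function names, the release names, the five-day cutoff, and the one-day interval are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def find_stale_releases(releases, now, max_age):
    """Return the names of releases last deployed more than max_age ago.

    `releases` maps a release name to its last deployment time.
    """
    return sorted(
        name for name, deployed_at in releases.items()
        if now - deployed_at > max_age
    )


def cleanup_is_overdue(last_success, now, expected_interval=timedelta(days=1)):
    """Flag a cleanup that has never succeeded or has not succeeded recently."""
    return last_success is None or now - last_success > expected_interval


# Hypothetical data for illustration.
now = datetime(2019, 9, 6, tzinfo=timezone.utc)
releases = {
    "review-old-branch": datetime(2019, 8, 28, tzinfo=timezone.utc),
    "review-new-branch": datetime(2019, 9, 5, tzinfo=timezone.utc),
}
stale = find_stale_releases(releases, now, max_age=timedelta(days=5))
overdue = cleanup_is_overdue(now - timedelta(days=3), now)
print(stale, overdue)
```

Checking the age of the last successful run, rather than only whether the job reported an error, is one way a silent failure like this could surface sooner.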
Timeline
CPU usage appears to have gone above 90% on 2019-09-05, whereas it normally starts to go down on Thursdays at that time:
Group size followed the same trend:
Edited by Kyle Wiebers