Closed
Issue created Sep 06, 2019 by Kyle Wiebers (@kwiebers, Owner)

Investigate Review App Failures

Description

Review apps in review-apps-ce and review-apps-ee have had a success rate of under 10% for the last few business days.

Issue summary

CPU usage on nodes is maxing out at 100%, which indicates that the pods' resource requests do not align with what the pods actually need. This also limits the effectiveness of autoscaling on GKE, since the autoscaler looks at requests as well.
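To make the mismatch concrete, here is a minimal Python sketch. The `Pod` type, the pod names, and all CPU figures are hypothetical and chosen for illustration; they are not measurements from the cluster. It compares how full a node looks by requests with how full it is by actual usage.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    cpu_request_m: int  # CPU request in millicores (what scheduling decisions are based on)
    cpu_usage_m: int    # observed CPU usage in millicores (what the node actually bears)


def node_summary(pods, node_capacity_m):
    """Return (requested fraction, used fraction) of a node's CPU capacity."""
    requested = sum(p.cpu_request_m for p in pods)
    used = sum(p.cpu_usage_m for p in pods)
    return requested / node_capacity_m, used / node_capacity_m


# Hypothetical numbers for illustration only.
pods = [
    Pod("pod-a", cpu_request_m=100, cpu_usage_m=900),
    Pod("pod-b", cpu_request_m=200, cpu_usage_m=700),
    Pod("pod-c", cpu_request_m=100, cpu_usage_m=400),
]
requested_frac, used_frac = node_summary(pods, node_capacity_m=2000)
print(f"requested: {requested_frac:.0%}, used: {used_frac:.0%}")
# prints "requested: 20%, used: 100%"
```

In this made-up case the node is saturated while only a fifth of its CPU is requested, so anything that decides based on requests would still treat the node as mostly free.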

Contributing factors

A single root cause has not been identified, but these are the factors that contributed to the spike in resource usage:

  • https://gitlab.com/gitlab-org/gitlab-ee/issues/26893 - This caused a cleanup task that normally runs daily to fail silently without running. That is how we ended up with orphaned pods and stale releases consuming nodes. I would consider this the primary root cause.
  • https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/32783 - Limit adjustments based on usage for apps within the chart. As load on the nodes increased, we saw regular healthcheck failures and very high load averages. This occurred with gitlab-exporter, gitaly, and nginx-ingress-controller nodes.
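The silent failure described in the first item suggests that a cleanup job should surface errors rather than swallow them. The following is a minimal Python sketch of that idea, not the actual cleanup task: `Release`, `find_stale`, `cleanup`, the `delete` callback, and the 5-day threshold are all hypothetical names and values assumed for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Release:
    name: str
    last_deployed: datetime


def find_stale(releases, now, max_age=timedelta(days=5)):
    """Return releases last deployed more than max_age ago (assumed threshold)."""
    return [r for r in releases if now - r.last_deployed > max_age]


def cleanup(releases, delete, now=None):
    """Delete stale releases; raise if any deletion fails so the job cannot fail silently."""
    now = now or datetime.now(timezone.utc)
    deleted, failures = [], []
    for release in find_stale(releases, now):
        try:
            delete(release)  # hypothetical callback wrapping the real deletion step
            deleted.append(release.name)
        except Exception as exc:
            failures.append((release.name, exc))
    if failures:
        names = ", ".join(name for name, _ in failures)
        raise RuntimeError(f"cleanup failed for: {names}")
    return deleted
```

Raising at the end, after attempting every stale release, means one bad release does not block the rest, while the job still exits with an error that a scheduler or pipeline can report.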

Timeline

CPU usage appears to have gone above 90% on 2019-09-05, whereas it normally starts to go down on Thursdays at that time:

Screen_Shot_2019-09-11_at_17.04.39

Group size followed the same trend:

Screen_Shot_2019-09-11_at_17.05.04

Edited Sep 13, 2019 by Kyle Wiebers