
Closed · Opened Nov 19, 2020 by John Skarbek (@skarbek) · 2 of 5 tasks completed

Investigate git deployment problems leading to 502's and 503's

During two consecutive deployments to production, we ran into degradation of the error-rate SLI. We violated the 1-hour SLO for the entire time that the git fleet was being deployed to. This was captured by the following chart:

[Chart: error-rate SLI degradation during the git fleet deployment (source: linked dashboard)]

We can see this also impacts our Kubernetes deployments, but not as severely. The difference here is that Kubernetes deployments operate in a significantly different manner.

[Chart: error-rate SLI during the Kubernetes deployments, showing a smaller impact (source: linked dashboard)]

Deployments where this questionable behavior was observed:

  • https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/348891
  • https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/349356

Use this issue to determine why we may be receiving such a high rate of HTTP 502s and 503s during a deployment. This behavior is new; we haven't experienced deployments this rough for quite a while.

Milestones

  • Determine why the deployment is "rough"
    • Have there been any recent configuration changes that set different expectations for Workhorse?
    • Are the nodes being properly drained from haproxy? (See the sketch after this list.)
    • Are existing connections not being closed properly?
  • ...
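
To help answer the haproxy draining question above, one option is to watch backend server states over the HAProxy runtime API while a deploy is in flight. The sketch below is illustrative only: the admin socket path and the assumption that the git backends contain "git" in their names are not taken from our configuration.

```python
#!/usr/bin/env python3
"""Quick check of backend server states via the HAProxy runtime API.

Hypothetical socket path and backend naming; adjust to the local HAProxy setup.
"""
import socket

HAPROXY_SOCKET = "/run/haproxy/admin.sock"  # assumption: stats socket enabled here


def runtime_api(command: str) -> str:
    """Send one command to the HAProxy admin socket and return the full reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(HAPROXY_SOCKET)
        sock.sendall(command.encode() + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()


if __name__ == "__main__":
    # "show servers state" lists the admin/operational state per backend server;
    # during a deploy the git nodes being worked on should show DRAIN/MAINT, not READY.
    for line in runtime_api("show servers state").splitlines():
        if "git" in line:  # assumption: git backends/servers are named accordingly
            print(line)
```

If nodes currently being deployed to still report a ready state rather than drain/maintenance, draining would be the first thing to dig into.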

Results of work

During the investigation of this issue we learned about the behavior of nginx-ingress. NGINX was hanging onto connections to Pods that were about to be terminated, allowing those Pods to continue to serve traffic. Details are in this thread: #1358 (comment 459061429)

We worked to tune nginx to drop those connections quickly. This resolved our issue as noted here: #1358 (comment 461826610)
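
For context on what "tune nginx" means in practice, here is a rough sketch of the kind of change involved, expressed with the Kubernetes Python client against the ingress-nginx ConfigMap. The keys are real ingress-nginx options, but the ConfigMap name, namespace, and values are assumptions for illustration; the exact settings we applied are in the linked comment above.

```python
"""Illustrative sketch: tighten ingress-nginx keepalive/shutdown behavior so
connections to terminating Pods are dropped sooner.

The ConfigMap name/namespace and the values below are assumptions, not the
exact change applied in this issue (see the linked comments for that).
"""
from kubernetes import client, config


def tune_ingress_nginx(name: str = "ingress-nginx-controller",
                       namespace: str = "ingress-nginx") -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside a Pod
    v1 = client.CoreV1Api()
    patch = {
        "data": {
            # Recycle upstream keepalive connections quickly so NGINX stops
            # reusing idle sockets to Pods that have begun shutting down.
            "upstream-keepalive-timeout": "5",
            "upstream-keepalive-requests": "100",
            # Bound how long old NGINX workers linger after a config reload
            # replaces the upstream list.
            "worker-shutdown-timeout": "30s",
        }
    }
    v1.patch_namespaced_config_map(name=name, namespace=namespace, body=patch)


if __name__ == "__main__":
    tune_ingress_nginx()
```

Shorter upstream keepalive timeouts mean NGINX stops reusing idle sockets to terminating Pods sooner, at the cost of slightly more connection churn.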

/cc @gitlab-org/delivery

Edited Dec 09, 2020 by John Skarbek
Reference: gitlab-com/gl-infra/delivery#1358