  • #228954
Closed
Issue created Jul 14, 2020 by Jennifer Louie (@jennielouie), Contributor

Geo multi-node deployment upgrade: investigate order when upgrading non-deploy nodes

The zero-downtime upgrade instructions for multi-node Geo deployments currently do not specify an order for upgrading non-deploy (non-Gitaly) nodes. This issue investigates how that order affects downtime, measured as 500 errors, readiness check failures, and failed end-to-end tests. Specifically, check whether upgrading Sidekiq nodes before Rails web nodes reduces or eliminates the errors seen when the opposite order is used.
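One way to compare the two orders is to record the HTTP status of a probe request at regular intervals during each upgrade run, then tally server errors per upgrade step. The following is a minimal sketch of that tally only. The `Sample` struct, the step labels, and the data are hypothetical and are not taken from the upgrade runs described in this issue.

```ruby
# Hypothetical record of one probe result taken during an upgrade run.
# `step` is a free-form label for the upgrade step in progress when the probe ran.
Sample = Struct.new(:step, :status)

# Count 5xx responses per upgrade step, keeping steps in the order they first appear.
def server_errors_by_step(samples)
  samples.each_with_object(Hash.new(0)) do |sample, counts|
    counts[sample.step] += 0 # ensure every step is listed, even with zero errors
    counts[sample.step] += 1 if (500..599).cover?(sample.status)
  end
end

# Illustrative data only, not measurements.
rails_first_run = [
  Sample.new('rails-1 upgraded', 200),
  Sample.new('rails-1 upgraded', 500),
  Sample.new('sidekiq upgraded', 200)
]

puts server_errors_by_step(rails_first_run)
```

Running the same tally over a run that upgrades Sidekiq first would give a step-by-step comparison of where errors appear, although a single pair of runs would still not establish causation.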

This investigation is separate from looking at downtime during reconfigure and hot reload of Web nodes, which is covered here.

Background: During an upgrade of a multi-node Geo deployment from 12.10.12 to 13.0.10, with nodes upgraded one by one, we observed 500 errors on the Primary site after one of the two Rails nodes had been upgraded and reconfigured but before the one online Sidekiq node was upgraded. (The other Sidekiq node was the deploy node and was not handling requests.) We did not observe failures in the readiness checks.
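For reference, a readiness check of the kind mentioned above can be sketched as a GET against GitLab's `/-/readiness` health endpoint, treating any non-200 answer or connection error as "not ready". This is an assumption about how such a check might be scripted, not the check used in this test run. The `node_ready?` helper and the injectable `http_get` parameter are hypothetical, and the probe host is assumed to be allowed to reach the endpoint.

```ruby
require 'net/http'
require 'uri'

# Default fetcher: performs a real GET and returns the numeric HTTP status.
DEFAULT_GET = ->(url) { Net::HTTP.get_response(URI(url)).code.to_i }

# Returns true when the node's readiness endpoint answers 200.
# `base_url` is a placeholder; pass the address of the node under test.
# `http_get` is injectable so the helper can be exercised without a network.
def node_ready?(base_url, http_get: DEFAULT_GET)
  http_get.call("#{base_url}/-/readiness") == 200
rescue StandardError
  false # connection refused, timeouts, and similar errors count as "not ready"
end
```

Because the 500 errors above appeared while readiness checks kept passing, a probe like this would need to be paired with real API requests to catch the failure described here.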

The 500 failures were related to creating a project via API:

 Failure/Error:
       @project = Resource::Project.fabricate_via_api! do |project|
         project.name = 'project'
       end
     
     QA::Resource::ApiFabricator::InternalServerError:
       Failed to GET http://gjsl9-primary.gogitlab.ml/api/v4/groups/looping-pipeline?private_token=[****] - (500)

For the Secondary site upgrade, after upgrading the Gitaly node and the deploy node, we upgraded the online Sidekiq node first and then the Rails nodes in tandem. We did not observe any 500 errors or readiness check failures.

These observations do not prove causation, but they prompted this issue to explore the question further.

If the order of upgrading non-deploy nodes does impact downtime, we should update our instructions accordingly.

Edited Jul 14, 2020 by Jennifer Louie