
Change: reattempt running Network Gitaly in Production

Zoom Call: https://gitlab.zoom.us/j/882946101

Change Log: https://docs.google.com/document/d/1CSLx115Tt36ceBZ5EU-ZMDYv4-Qq978QcCUuiRP5XD4/edit

PLANNING THE CHANGE

For more background on when this template should be used, see the infrastructure handbook.

  • Context: What is the background of the change? Relevant links?

    • The Gitaly team is attempting to run Gitaly over the network.
    • Full details are available in gitlab-org/gitaly#181 (closed)
    • This is a reattempt at https://gitlab.com/gitlab-com/infrastructure/issues/1777, which failed on Friday as the result of the NFS servers running the wrong version of Gitaly
    • A quick summary of the action is:
      • Start Gitaly on NFS servers
      • Reconfigure one Git host git01 to forward Gitaly requests to NFS servers
      • Watch git01 for errors and roll back if anomalies are detected
      • Allow half an hour's worth of data to be collected
      • Revert git01 back to using a local Gitaly configuration, rolling back the change
      • Analyse the data
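    • To confirm the first step of the summary (Gitaly running on the NFS servers), a quick check could look like the sketch below; this assumes Gitaly is managed by the omnibus runit supervisor on those hosts, and the chef role name is a placeholder:

        # Check the Gitaly service status across the NFS fleet (role name is a placeholder)
        knife ssh 'role:gitlab-nfs-cluster' 'sudo gitlab-ctl status gitaly'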
  • Downtime: Will the change introduce downtime, and if so, how much?

    • We do not foresee any downtime.

    • What options were considered to avoid downtime?

      • We are only deploying this change to a single git worker, git01.fe, so that rollback can be performed very quickly
      • Additionally, this will limit the load on the NFS servers, since only a single host will be sending traffic to the 10 backend NFS servers.
    • What is the downtime estimate based on? Can it be tested in some way?

      • We experienced a limited outage in our previous test
      • This downtime was the result of the NFS servers running the wrong version of Gitaly
      • Prior to our tests, we will ensure that the correct version of Gitaly is running on all hosts
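      • As a sketch of that pre-flight check (the role name is a placeholder, and the manifest path assumes Gitaly ships with the omnibus GitLab package):

          # Print the Gitaly version recorded in the omnibus manifest on every NFS host (role name is a placeholder)
          knife ssh 'role:gitlab-nfs-cluster' 'grep gitaly /opt/gitlab/version-manifest.txt'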
  • People:

    • @ahmadsherif will execute the chef changes and will be standing by for rollback
    • @andrewn will handle communications and will monitor dashboards
    • @jacobvosmaer-gitlab will monitor dashboards and will run some manual tests to ensure that the change is working
  • Pre-checks: What should we check before starting with the change? Consider dashboards, metrics, limits of current infrastructure, etc.

    • We will be monitoring the following dashboards during our test:

    • Check that you have all the correct versions of the required software installed on the affected hosts.

    • Check that you have the right access level to the required resources.

    • Does the change alter how we use Azure or how many of Azure's resources we use? If so, consider opening an "advisory ticket" in the Azure portal to get input from their team.

      • This change does not alter how we use Azure
  • Change Procedure:

    • High-level overview of the change:
      1. Install the latest version of Gitaly on the NFS servers: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/671
      2. Check versions: using knife, confirm that the version of Gitaly running on the NFS servers is up-to-date and correct.
      3. Update git01.fe.gitlab.com with a network Gitaly configuration
      4. Check that smart info refs are working: git -c http.sslVerify=false ls-remote https://git01.fe.gitlab.com/gitlab-org/gitlab-ce.git. On failure, roll back immediately.
      5. Monitor all the dashboards listed above: on any anomalies, roll back immediately
      6. Monitor for alerts from the alerting systems: on alerts, roll back immediately
      7. Continue to monitor progress for 30 minutes
      8. Roll back the configuration change on git01.fe.gitlab.com
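    • A sketch of the commands behind steps 3 and 4 (the knife invocations assume the change is applied by adding the gitaly-over-network-production role named in the rollback section to git01's run list; adjust to however the role is actually managed):

        # Step 3: apply the network-Gitaly role to git01 and converge (assumes role-based run_list management)
        knife node run_list add git01.fe.gitlab.com 'role[gitaly-over-network-production]'
        ssh git01.fe.gitlab.com 'sudo chef-client'

        # Step 4: smoke-test smart info/refs through the reconfigured host
        git -c http.sslVerify=false ls-remote https://git01.fe.gitlab.com/gitlab-org/gitlab-ce.git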
  • Preparatory Steps: What can be done ahead of time? How far ahead?

  • Post-checks: What should we check after the change has been applied?

    • Monitor the dashboards listed above for increased error rates
    • Monitor alerting for Gitaly related alerts
    • Monitor Kibana for increased error rates and request durations in Workhorse
  • Rollback procedure: In case things go wrong, what do we need to do to recover?

    • The change will be rolled back at the conclusion of the test, or earlier if the test is not successful.
    • Remove the gitaly-over-network-production role from git01's run list, followed by a chef-client run (see the sketch below).
    • It's highly unlikely that updating the GitLab app version on the NFS node will cause a problem, but if it does, the version will be rolled back to 9.1.4-ee (the version currently live there).
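    • A sketch of that role removal (assuming the role was applied via the node's run list, as in the change procedure above):

        # Remove the network-Gitaly role from git01 and converge to restore the local Gitaly configuration
        knife node run_list remove git01.fe.gitlab.com 'role[gitaly-over-network-production]'
        ssh git01.fe.gitlab.com 'sudo chef-client'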
  • Create an invite using a 4 hr block of time on the "GitLab Production" calendar (link in handbook), inviting the ops-contact group. Include a link to the issue. (Many times you will not expect to need - or actually need - all 4 hrs, but past experience has shown that delays and unexpected events are more likely than having things go faster than expected.)

  • Ping the Production Lead in this issue to coordinate who should be present from the Production team, and to confirm scheduling.

  • When will this occur? 08h00 to 12h00 UTC on 31 May 2017

  • Communication plan:

DOING THE CHANGE

Preparatory steps

  • Copy/paste items here from the Preparatory Steps listed above.

Initial Tasks

  • Create a Google doc to track the progress. In the event of an outage, Google docs allow for real-time collaboration and don't depend on GitLab.com being available.
    • Add a link to the issue where it comes from, copy and paste the content of the issue, the description, and the steps to follow.
    • Title the steps as "timeline". Use UTC time (no daylight saving); we are all in the same timezone in UTC.
    • Link the document in the on-call log so it's easy to find later.
    • Right before starting the change, paste the link to the google doc in the #production chat channel and "pin" it.
  • Discuss with the person who is introducing the change, and go through the plan to fill any gaps in understanding before starting.
  • Final check of the rollback plan and communication plan.
  • Set PagerDuty maintenance window before starting the change.
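    • A sketch of opening the window via the PagerDuty REST API v2 (the API token, From address and service ID are placeholders; creating it through the PagerDuty web UI works just as well):

        # Open a maintenance window covering the 08:00-12:00 UTC change slot
        # Placeholders: PAGERDUTY_API_TOKEN, oncall@example.com, SERVICE_ID
        curl -s -X POST 'https://api.pagerduty.com/maintenance_windows' \
          -H 'Authorization: Token token=PAGERDUTY_API_TOKEN' \
          -H 'Accept: application/vnd.pagerduty+json;version=2' \
          -H 'Content-Type: application/json' \
          -H 'From: oncall@example.com' \
          -d '{"maintenance_window": {
                "type": "maintenance_window",
                "start_time": "2017-05-31T08:00:00Z",
                "end_time": "2017-05-31T12:00:00Z",
                "description": "Network Gitaly test on git01",
                "services": [{"id": "SERVICE_ID", "type": "service_reference"}]}}'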

The Change

  • Before starting the Change

    • Tweet to publicly notify that you are performing a change in production following the guidelines.
  • Start running the changes. While this happens, one person makes the change and the other takes notes of when each step happens. Make it explicit who will do what.

  • When the change is finished, whether successfully or not

    • Tweet again to notify that the change is finished and point to the change issue.
    • Copy the content of the document back into the issue, redacting any data necessary to keep it blameless, then deprecate the doc.
    • Perform a quick postmortem in a new issue, following the Blameless Postmortem guideline in the infrastructure handbook.
    • If the issue caused an outage or service degradation, label the issue as "outage".