
Test Gitaly over the Network in Production

Video Call: https://gitlab.zoom.us/j/662245460

What is the Gitaly Team doing

The Gitaly team would like to perform a simple experiment, documented in gitlab-org/gitaly#181 (closed).

The test involves:

  1. Start an instance of Gitaly on each NFS server.
  2. At this point, the NFS-hosted Gitaly instances will not be receiving any traffic.
  3. Next, reconfigure a single Git host to route its Gitaly traffic to the NFS servers instead of the local Gitaly instance (see the configuration sketch below).
  4. In this configuration, that host's repository access will use gRPC traffic instead of NFS traffic.
  5. Observe the performance effects, comparing timings from the host that is using Network Gitaly with the other hosts that are using a local instance of Gitaly and accessing their volumes over NFS.
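
For orientation, the sketch below shows roughly what the two halves of the routing change could look like in configuration. The host name, port, and paths are illustrative assumptions, not values taken from this issue, and the exact setting names depend on the GitLab/Gitaly version in use.

```
# On each NFS server -- Gitaly's config.toml listens on a TCP address
# (illustrative values; the port is an assumption):
listen_addr = "0.0.0.0:8075"

[[storage]]
name = "default"
path = "/var/opt/gitlab/git-data/repositories"

# On the Git host under test -- the repository storage entry is pointed at the
# NFS server's Gitaly address instead of the local instance, e.g. in gitlab.yml:
#   repositories:
#     storages:
#       default:
#         path: /var/opt/gitlab/git-data/repositories
#         gitaly_address: tcp://nfs-01.example.internal:8075
```

Only the single Git host's storage configuration changes; the Gitaly instances on the NFS servers receive no traffic until that host is repointed, which also keeps the rollback limited to reverting that one configuration change.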

PLANNING THE CHANGE

  • Context: In GitLab 9.2, we'll be testing running Gitaly on the NFS servers (gitlab-org/gitaly#181 (closed)). Currently, GitLab is deployed onto the NFS servers on an ad hoc basis. We need to ensure that GitLab is deployed to them with each release.
  • Downtime: No downtime is expected.
    • What options were considered to avoid downtime?
    • What is the downtime estimate based on? Can it be tested in some way?
  • People:
  • Pre-checks: What should we check before starting with the change? Consider dashboards, metrics, limits of current infrastructure, etc.
    • Dashboards: we have created multiple dashboards to monitor what we're doing: gitaly-nfs-metrics, gitaly-host-metrics, gitaly, and gitaly-features (see the example queries after this list)
    • Does the change alter how we use Azure or how many of Azure's resources we use? If so, consider opening an "advisory ticket" in the Azure portal to get input from their team.
  • Change Procedure:
    • The change procedure is documented in the next section; see below.
    • Did you do a dry run to test / measure performance and timings?
      • Yes: on staging. The test was successful.
  • Preparatory Steps: What can be done ahead of time? How far ahead?
    • Dashboards built
  • Post-checks: What should we check after the change has been applied?
    • We will monitor the dashboards to ensure that systems are running well
    • After the test is complete, we will ensure that no traffic is flowing to the Network Gitaly instances (see the example query after this list).
    • Should any alerts be modified as a consequence of this change? No.
  • Rollback procedure: In case things go wrong, what do we need to do to recover?
    • The only rollback step necessary is reverting the Git host configuration once it has been changed; until that point the Network Gitaly instances receive no traffic, so rolling back from an intermediate step does not change the procedure.
  • Create an invite using a 4 hr block of time on the "GitLab Production" calendar (link in handbook), inviting the ops-contact group. Include a link to the issue. (Many times you will not expect to need - or actually need - all 4 hrs, but past experience has shown that delays and unexpected events are more likely than having things go faster than expected.)
  • Ping the Production Lead in this issue to coordinate who should be present from the Production team, and to confirm scheduling.
  • When will this occur? 12h30 UTC Friday 26 May 2017
  • Communication plan:
    • Tweet: default to tweeting when schedule is known, then again 12 hrs before, 1 hr before, when starting, during if there are delays, and after when complete.
    • Deploy banner: display warning 1 hr before
    • Other?
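
As a rough illustration of the dashboard comparison and the "no traffic" post-check above, Prometheus queries along the following lines could be used against Gitaly's metrics. The metric names are the standard gRPC server metrics Gitaly exposes, but the job and instance labels are assumptions about our Prometheus setup, not values taken from this issue.

```
# Compare Gitaly RPC timings per host (95th percentile over 5 minutes),
# assuming the gRPC handling-time histogram is enabled:
histogram_quantile(0.95,
  sum(rate(grpc_server_handling_seconds_bucket{job="gitaly"}[5m])) by (instance, le))

# Post-check: the NFS-hosted Gitaly instances should be handling no RPCs
# once the test is over (the instance matcher is a placeholder):
sum(rate(grpc_server_handled_total{job="gitaly", instance=~"nfs.*"}[5m])) by (instance)
```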

DOING THE CHANGE

Steps

There are three separate changes involved in this request:


  • Copy/paste items here from the Preparatory Steps listed above.

Initial Tasks

  • Create a google doc to track the progress. This is because in the event of an outage, Google docs allow for real-time collaboration, and don't depend on GitLab.com being available. https://docs.google.com/document/d/1j-SgncxH3dCySid8OgGvdykhivKS3L5Smi70R6Zf-9s/edit
    • Add a link to the issue where it comes from, copy and paste the content of the issue, the description, and the steps to follow.
    • Title the steps as "timeline". Use UTC time; we are all in the same timezone, and UTC, by definition, does not have daylight saving.
    • Link the document in the on-call log so it's easy to find later.
    • Right before starting the change, paste the link to the google doc in the #production chat channel and "pin" it.
  • Discuss with the person who is introducing the change, and go through the plan to fill the gaps of understanding before starting.
  • Final check of the rollback plan and communication plan.
  • Set PagerDuty maintenance window before starting the change.

The Change

  • Start running the changes. When this happens, one person is making the change, the other person is taking notes of when the different steps are happening. Make it explicit who will do what.
  • When the change is finished, whether successful or not, copy the content of the document back into the issue and deprecate the doc (and close the issue if possible).
  • Retrospective: answer the following three questions:
    • What went well?
    • What should be improved?
    • Specific action items / recommendations.
  • If the change caused an outage or service degradation, label the issue as "outage".