Test Gitaly over the Network in Production
Video Call: https://gitlab.zoom.us/j/662245460
What is the Gitaly Team doing
The Gitaly team would like to perform a simple experiment, documented in gitlab-org/gitaly#181 (closed).
The test involves:
- Starting an instance of Gitaly on each NFS server
- At this point, the NFS Gitaly instances will not be receiving any traffic
- Next, reconfigure a single Git host to route its Gitaly traffic to the NFS servers instead of the local Gitaly instance.
- In this configuration, we'll be sending gRPC traffic over the network instead of NFS traffic (see the connection sketch below).
- We will observe the performance effects of doing this, comparing timings from the host that is using network Gitaly with those from the other hosts, which still use a local instance of Gitaly and access their volumes over NFS.
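To make the routing change concrete: instead of talking to a Gitaly process on the same machine, the reconfigured host will open a gRPC connection to a Gitaly listener on an NFS server. The connection sketch below is a minimal way to confirm such a listener is reachable and serving before any traffic is routed to it. The hostname and port are placeholders, not the production values, and the sketch assumes the Gitaly server registers the standard gRPC health service.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Placeholder address for a Gitaly listener on an NFS server; workers
	// normally talk to a Gitaly instance on the same host instead.
	addr := "nfs-01.example.com:9999"

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Dial over TCP; this is the gRPC connection that replaces NFS access.
	conn, err := grpc.DialContext(ctx, addr, grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		log.Fatalf("dial %s: %v", addr, err)
	}
	defer conn.Close()

	// Assumes the Gitaly server registers the standard gRPC health service.
	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		log.Fatalf("health check against %s failed: %v", addr, err)
	}
	fmt.Printf("gitaly at %s reports %s\n", addr, resp.Status)
}
```

Running the same check from the worker host that will be reconfigured surfaces connectivity problems before the configuration change rather than after it.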
PLANNING THE CHANGE
- Context: In GitLab 9.2, we'll be testing running Gitaly on the NFS servers (gitlab-org/gitaly#181 (closed)). Currently, GitLab is deployed onto the NFS servers on an ad hoc basis. We need to ensure that GitLab is deployed with each release.
- Downtime: No downtime is expected.
  - What options were considered to avoid downtime?
  - What is the downtime estimate based on? Can it be tested in some way?
- People:
- Pre-checks: What should we check before starting with the change? Consider dashboards, metrics, limits of current infrastructure, etc.
  - Dashboards: we have created multiple dashboards to monitor what we're doing: gitaly-nfs-metrics, gitaly-host-metrics, gitaly and gitaly-features.
  - Does the change alter how we use Azure or how many of Azure's resources we use? If so, consider opening an "advisory ticket" in the Azure portal to get input from their team.
- Change Procedure:
  - The change procedure is documented in the next section; see below.
  - Did you do a dry run to test / measure performance and timings? Yes: on staging. The test was successful.
- Preparatory Steps: What can be done ahead of time? How far ahead?
  - Dashboards built
- Post-checks: What should we check after the change has been applied?
  - We will monitor the dashboards to ensure that systems are running well.
  - After the test is complete, we will ensure that no traffic is flowing to the network Gitaly instances (see the metrics snapshot sketch at the end of this section).
  - Should any alerts be modified as a consequence of this change? No.
- Rollback procedure: _In case things go wrong, what do we need to do to recover?_
  - The only rollback step is needed after the git worker configuration has been changed: revert that configuration so the host goes back to its local Gitaly instance.
  - _Also consider rolling back from an intermediate step: does the procedure change depending on how far along the procedure is?_
- Create an invite using a 4 hr block of time on the "GitLab Production" calendar (link in handbook), inviting the ops-contact group. Include a link to the issue. (Many times you will not expect to need - or actually need - all 4 hrs, but past experience has shown that delays and unexpected events are more likely than having things go faster than expected.)
- Ping the Production Lead in this issue to coordinate who should be present from the Production team, and to confirm scheduling.
- When will this occur? 12h30 UTC, Friday 26 May 2017
- Communication plan:
  - Tweet: default to tweeting when the schedule is known, then again 12 hrs before, 1 hr before, when starting, during if there are delays, and after when complete.
  - Deploy banner: display a warning 1 hr before.
  - Other?
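For the "no traffic is flowing to the network Gitaly instances" post-check above, one option besides the dashboards is to snapshot the request counters exposed by each Gitaly process directly. This metrics snapshot sketch assumes the NFS-host Gitaly instances expose a Prometheus metrics endpoint (prometheus_listen_addr in Gitaly's config.toml) and export the standard grpc_server_* counters; the hostnames and the port are placeholders.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Placeholder hostnames for the NFS servers running Gitaly.
	hosts := []string{"nfs-01.example.com", "nfs-02.example.com"}

	for _, h := range hosts {
		// Placeholder metrics port; requires prometheus_listen_addr in config.toml.
		url := fmt.Sprintf("http://%s:9236/metrics", h)
		resp, err := http.Get(url)
		if err != nil {
			log.Printf("%s: %v", h, err)
			continue
		}

		// Print the counters of RPCs this Gitaly has received. Two snapshots a
		// few minutes apart with identical values mean no traffic is arriving.
		scanner := bufio.NewScanner(resp.Body)
		for scanner.Scan() {
			line := scanner.Text()
			if strings.HasPrefix(line, "grpc_server_started_total") {
				fmt.Printf("%s: %s\n", h, line)
			}
		}
		resp.Body.Close()
	}
}
```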
DOING THE CHANGE
Steps
There are three separate configuration changes involved in this request, followed by the test itself:
- Ensure that Gitaly is always deployed to the NFS servers, and that this deploy happens before it happens on any of the worker hosts. At this point Gitaly will be deployed, but not running, on the NFS servers. (https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/651)
- Ensure that Gitaly is turned on on the NFS servers. At this point the Gitaly process will be accepting requests, but no requests will be sent from clients. (https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/671; the changes are live, only a merge is needed to seal the deal)
- Create a new role which can be assigned to worker hosts to configure them to communicate with network Gitaly. (https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/672)
- Assign the role to the git01 node for testing, so that we can compare the performance with and without network Gitaly.
- Monitor traffic in Grafana using https://performance.gitlab.net/dashboard/db/gitaly?orgId=1, https://performance.gitlab.net/dashboard/db/gitaly-host-metrics?orgId=1 and https://performance.gitlab.net/dashboard/db/gitaly-features?orgId=1 (see the query sketch after this list).
- After a period of time decided during the test, disable the new config on git01.
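Alongside the Grafana dashboards, the same comparison can be pulled programmatically from Prometheus, which makes it easy to paste before/after numbers into the timeline doc. The query sketch below shows one way to do that; the Prometheus address and the PromQL expression are assumptions, not the exact queries behind the dashboards listed above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder Prometheus address and an assumed metric name; adjust both
	// to whatever the dashboards above are actually built on.
	prometheus := "https://prometheus.example.com"
	query := `sum by (instance) (rate(grpc_server_handled_total{job="gitaly"}[5m]))`

	resp, err := http.Get(prometheus + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Decode just enough of the standard Prometheus query API response to
	// print one request rate per Gitaly instance.
	var result struct {
		Data struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
				Value  []interface{}     `json:"value"` // [timestamp, "value"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		log.Fatal(err)
	}
	for _, r := range result.Data.Result {
		fmt.Printf("%-45s %v req/s\n", r.Metric["instance"], r.Value[1])
	}
}
```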
Copy/paste items here from the Preparatory Steps listed above.
Initial Tasks
- Create a google doc to track the progress. This is because, in the event of an outage, Google docs allow for real-time collaboration and don't depend on GitLab.com being available. https://docs.google.com/document/d/1j-SgncxH3dCySid8OgGvdykhivKS3L5Smi70R6Zf-9s/edit
- Add a link to the issue it comes from, and copy and paste the content of the issue, the description, and the steps to follow.
- Title the steps as "timeline". Use UTC time, so we are all in the same timezone (UTC, by definition, does not have daylight saving).
- Link the document in the on-call log so it's easy to find later.
- Right before starting the change, paste the link to the google doc in the #production chat channel and "pin" it.
- Discuss with the person who is introducing the change, and go through the plan to fill the gaps of understanding before starting.
- Final check of the rollback plan and communication plan.
- Set a PagerDuty maintenance window before starting the change.
The Change
- Start running the changes. When this happens, one person is making the change and the other person is taking notes of when the different steps are happening. Make it explicit who will do what.
- When the change is done and finished, either successfully or not, copy the content of the document back into the issue and deprecate the doc (and close the issue if possible).
- Retrospective: answer the following three questions:
  - What went well?
  - What should be improved?
  - Specific action items / recommendations.
- If the issue caused an outage, or service degradation, label the issue as "outage".