NFS Server Storage Migration
Per #1649 (closed), this issue is to migrate the NFS servers from legacy Azure storage groups to managed storage.
Execution timeline
Timeline - All times are in UTC:
- 5:00 - John starts
- 5:21 - We tweeted that we are starting the conversion process
- Bring up deploy page - DOWNTIME
- Stop services on the git ssh hosts.
- Shut down NFS 1-8, LFS, uploads, shared.
- 5:26 - We take all the NFS hosts down
- 5:32 - Ahmad gets paged because we forgot to schedule a maintenance window
- 5:57 - We have NFS01 converted with the disks attached and up - We start validating the file system by sampling random data (see the validation sketch after the timeline).
- 6:03 - We tweeted to let people know that we are taking a bit longer
- 6:05 - We have NFS-file-02 up and ready.
- 6:10 - We have NFS-file-03 - validated
- 6:12 - We have NFS-file-04 - validated
- 6:14 - All disk conversions are done now.
- 6:15 - We have NFS-file-05 - and it’s validated
- 6:19 - We tweeted “We already have 5/12 done and verified, apologies for the inconvenience”
- 6:19 - We have NFS-file-06 - validated
- 6:24 - We have NFS-file-07 - validated
- 6:30 - We have NFS-file-08 - validated
- 6:38 - We have uploads-01 - validated
- 6:40 - We have lfs-01 - validated
- 6:45 - We have share-01 - validated
- 6:46 - We have artifacts-01 - validated
- 6:50 - We tweeted “We have 12/12 filesystems attached, we are validating the last 3 of them”
- 6:55 - Daniele’s daughter is making a mess and not letting her dad work.
- 6:58 - We tweeted “All filesystems have passed validation and are being mounted on the hosts”
- 7:00 - We are validating the list of new IPs
- 7:03 - We have the list of new IPs set up.
- 7:15 - We are in the process of updating all the fstab files on the hosts - it looks like we have stale file handles.
- 7:24 - We unmount the NFS shares across the fleet (see the remount sketch after the timeline).
- 7:25 - Tweet "We are changing the NFS mount points across the fleet, apologies for the delay"
- 7:37 - It took a solid 20 minutes for worker-web01 to reboot - we thought it had locked up because of a mount point failure; happily, we were wrong.
- 7:40 - We can’t mount NFS partitions on the workers
- 7:41 - We trace this to an issue with the network peering agreement - the workers are not included in it.
- 7:43 - We mount the first NFS partition on the workers
- 7:55 - We are remounting all the NFS partitions across the fleet.
- 7:57 - !tweet "We are currently remounting all the drives after dealing with some stale NFS issues. Apologies for the delay"
- 8:03 - NFS remount
  - Workers-web 01-04 - done
  - Workers-ssh 01-10 - done
  - Workers-api 01-08 - done
  - Workers-sidekiq 01-10 - done
  - All done.
- 8:14 - Double checking everything (see the remount sketch after the timeline)
  - Number of mount points per host = 16
  - Git = 16 * 10
  - Sidekiq = 16 * 10
  - Api = 16 * 8
  - Web = 16 * 5
  - Ready to roll
- 8:20 - bringing GitLab.com back up
- Starting services all around
- 8:23 - We are back up
- 8:23 - tweet "GitLab.com is back up with a new fresh managed disks fleet"
- 8:25 - Total downtime: roughly 3 hours (maintenance page up shortly after 5:21, back up at 8:23)
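The "sampling random data" validation mentioned at 5:57 is not spelled out above. A minimal sketch of such a spot check, assuming the converted disk is mounted at a hypothetical path like /mnt/nfs-file-01 and a sample size of 200 files (both placeholders):

```sh
#!/usr/bin/env bash
# Hypothetical spot check for a converted volume. Mount point and sample
# size are placeholders; the timeline only says "sampling random data".
MOUNT=/mnt/nfs-file-01   # assumed mount point of the converted data disk
SAMPLE=200               # number of random files to read

find "$MOUNT" -type f | shuf -n "$SAMPLE" | while read -r f; do
  # Any read error here points at a problem with the converted disk.
  sha256sum "$f" || echo "READ FAILURE: $f" >&2
done > /tmp/$(basename "$MOUNT").sample.sha256

# If checksums were captured from the original volume before the shutdown,
# the sampled entries can be diffed against that list.
```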
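The stale file handles at 7:15, the fleet-wide unmount at 7:24, the remounts at 8:03, and the 16-mounts-per-host double check at 8:14 boil down to roughly the following per-host procedure. This is a sketch under the assumption that /etc/fstab has already been updated (by Chef) to point at the new server IPs:

```sh
# Lazily force-unmount every currently mounted NFS share; this is the
# usual way to get rid of stale file handles pointing at the old servers.
grep -E '\snfs4?\s' /proc/mounts | awk '{print $2}' \
  | xargs -r -n1 umount -f -l

# Remount everything that fstab now points at the new servers.
mount -a -t nfs,nfs4

# Double check: per the timeline, each host should end up with 16 NFS
# mount points.
COUNT=$(grep -cE '\snfs4?\s' /proc/mounts)
if [ "$COUNT" -eq 16 ]; then
  echo "OK: $COUNT NFS mounts"
else
  echo "WARNING: expected 16 NFS mounts, found $COUNT" >&2
fi
```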
Plan
Pre-Migration:
At 23:00 UTC, April 29th, the following actions will be performed:
- Create the new NFS nodes with Terraform (see the sketch below)
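The Terraform configuration itself lives in the infrastructure repo and is not part of this issue; the commands below are only a sketch of how the new nodes might be planned and applied, with a placeholder resource address:

```sh
# Plan and apply only the new NFS node resources; the resource address
# is a placeholder, not the real name from the configuration.
terraform plan -target=azurerm_virtual_machine.nfs-file-new -out=nfs-nodes.plan
terraform apply nfs-nodes.plan
```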
Migration:
Starting at 04:30 UTC, April 30th, the following actions will be performed:
- Send a tweet to remind people that GitLab.com will be down at 05:00 UTC
- Deploy the Maintenance Page across GitLab
- Stop all running GitLab services
- Shut down all NFS servers
- Convert the NFS servers' data disks to managed disks using an Azure CLI script (see the sketch after this list)
- Attach the managed disks to the new NFS servers
- Update the NFS Chef recipe with the new server IP addresses
- Verify data reachability from all 'worker' nodes (see the check after this list)
- Start GitLab services
- Remove the Maintenance Page
- Send a tweet to confirm service is back
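The conversion step is described only as "using Azure CLI script". Since the original servers stay untouched, one plausible shape for that script is to create a managed disk from each existing data-disk VHD blob and attach it to the corresponding new server; every name below is a placeholder, not a value from the real script:

```sh
# Placeholder names; the real values live in the migration script.
RG=gitlab-production
VHD_URI='https://<storage-account>.blob.core.windows.net/vhds/nfs-file-01-data.vhd'
DISK=nfs-file-01-data-managed
VM=nfs-file-01-new

# Create a managed disk from the existing unmanaged VHD blob...
az disk create --resource-group "$RG" --name "$DISK" --source "$VHD_URI"

# ...and attach it to the new NFS server.
az vm disk attach --resource-group "$RG" --vm-name "$VM" --name "$DISK"
```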
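For the data reachability check, a quick fleet-wide probe could be pushed out with knife; the role query and the path are assumptions about how the worker nodes and mounts are actually named:

```sh
# Hypothetical role query and mount path; adjust to the real Chef roles
# and NFS mount locations.
knife ssh 'roles:worker' \
  'grep -c " nfs" /proc/mounts && ls /var/opt/gitlab/git-data >/dev/null && echo reachable'
```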
Rollback Plan:
Because we are creating new NFS servers and using the Azure disk migration, the original NFS servers and their disks will remain untouched in this exercise. The rollback plan is to revert the gitlab-nfs-cluster cookbook modifications to point back at the existing servers and start them back up; a rough sketch follows below.
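A sketch of that rollback, assuming the IP change lands as a single cookbook commit and the original servers are only powered off rather than deleted; the resource group and VM names are placeholders:

```sh
# Revert the cookbook change that pointed clients at the new servers
# (assumes the IP change was the most recent commit).
cd gitlab-nfs-cluster
git revert HEAD
# Roll the revert out through the normal Chef workflow, then re-converge
# the client nodes, e.g.: knife ssh 'roles:worker' 'sudo chef-client'

# Power the original NFS servers back on (placeholder names).
for vm in nfs-01 nfs-02 nfs-03 nfs-04 nfs-05 nfs-06 nfs-07 nfs-08; do
  az vm start --resource-group gitlab-production --name "$vm"
done
```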