[CHANGE] Move the front-end nodes to the new production virtual network
Planning the change
- Context: The front-end nodes need to be moved to the new network. Issue here: https://gitlab.com/gitlab-com/infrastructure/issues/1708
- Downtime: No downtime is expected.
- People:
- Pre-checks: Make sure that Terraform can provision the new nodes autonomously.
- Change Procedure (a rough script-style sketch of these steps follows at the end of this section):
  - Stop `chef-client` on the old node.
  - Create a new node with Terraform. This replaces it in Chef.
  - Verify the new node.
  - Replace the old node with the new one on the load balancer.
  - (After a day or two) Delete the old node.
- Preparatory Steps: We successfully tested a full provisioning of a web node.
- Post-checks: GitLab.com should be serving the same traffic as if nothing had happened.
  - We should run `chef-client` on prometheus to update the monitoring.
- Rollback procedure: We simply move back to the old nodes, which will be kept alive for a while after the move.
- I have created an invite in the production calendar as per template.
- The change will take place on May 6th from 12:00 to 16:00 UTC.
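For readability, here is a hedged sketch of how the per-node steps of the change procedure fit together, written as a Python dry-run script. The node names, the Terraform directory, the load-balancer swap, and the SSH-from-workstation model are illustrative assumptions, not our actual tooling; only `chef-client`, `terraform`, and `knife` correspond to tools the plan actually names.

```python
#!/usr/bin/env python3
"""Hedged sketch of the per-node replacement flow described in the plan above.

This is not the actual runbook: the node names, the Terraform directory, the
load-balancer swap, and SSH access from an operator workstation are all
assumptions. Only chef-client, terraform and knife are tools named in the plan.
"""
import subprocess

DRY_RUN = True  # keep True so the script only documents the commands


def run(cmd, cwd=None):
    """Echo a command and, unless dry-running, execute it (failing loudly)."""
    print("->", " ".join(cmd), f"[cwd={cwd}]" if cwd else "")
    if not DRY_RUN:
        subprocess.run(cmd, check=True, cwd=cwd)


def replace_node(old_node, new_node, terraform_dir):
    # 1. Stop chef-client on the old node so it stops converging
    #    (service manager/name is an assumption; adjust per host OS).
    run(["ssh", old_node, "sudo", "service", "chef-client", "stop"])

    # 2. Provision the replacement with Terraform; bootstrapping registers
    #    the new node in Chef, which is what the pre-check validates.
    run(["terraform", "plan"], cwd=terraform_dir)
    run(["terraform", "apply"], cwd=terraform_dir)

    # 3. Verify the new node before sending traffic to it: here just a manual
    #    chef-client run; real verification would include smoke tests.
    run(["ssh", new_node, "sudo", "chef-client"])

    # 4. Swap the old node for the new one on the load balancer. How this is
    #    done depends on the LB setup, so it is left as a reminder here.
    print(f"-> swap {old_node} for {new_node} on the load balancer")

    # 5. After a day or two, remove the old node from Chef entirely.
    run(["knife", "node", "delete", old_node, "-y"])
    run(["knife", "client", "delete", old_node, "-y"])


if __name__ == "__main__":
    # Hypothetical names purely for illustration.
    replace_node("web02-old", "web02", "environments/production")
```

With `DRY_RUN = True` the script only echoes the commands, so it doubles as a checklist for whoever is shadowing the change.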
Timeline of the change (UTC)
- 12:00 - Started to rebuild web02, web03 and web04.
- 12:09 - Switched web01 on the load balancer.
- 12:32 - Found a minor bug in the gitlab-fe-web role, quickly fixed by @jtevnan.
- 12:36 - Started to rebuild all the git nodes.
- 13:09 - Started to rebuild all the api nodes.
- 13:28 - Fixed a glitch in the new front-end that was sporadically causing 500 errors from Workhorse.
- 13:30 - Switched the web front-end to the new fleet.
- 14:00 - Switched the git front-end to the new fleet.
- 14:30 - Switched the api front-end to the new fleet.
- 15:00 - Finished validating the fleet.
Retrospective
What went well
- The front-end fleet was migrated successfully with minimal impact on users.
- The team reacted quickly to resolve all the issues found in the process.
- Having @ilyaf shadowing for the entire change was great from the onboarding perspective.
- We successfully field-tested some of the lessons freshly learned in https://gitlab.com/gitlab-com/infrastructure/issues/1691.
What should be improved
- We found a race condition in the rake task in charge of updating the Chef vaults for the new nodes. As a possible workaround we added a random wait time before performing the actual `knife` command (roughly as in the sketch after this list), but that didn't work. Since debugging this issue would require a significant time investment and we won't be changing the size of our fleet much anytime soon, we decided to focus on https://gitlab.com/gitlab-com/infrastructure/issues/1212 and eradicate the issue for good.
- We wasted quite a bit of time reverse engineering a few aspects of the infrastructure. For example, we weren't aware that pages had been moved to `fe-lb07` and `fe-lb08`, and since `lb10` and `lb11` were still up we assumed they were still serving it. We should strive to further simplify and document our infrastructure, in addition to deleting unused resources. For this particular example I'd also like to rename those load balancers to something more explicit, like `pages-lb01` and `pages-lb02`.
- Since the plan was to leave the old nodes online in order to have a fast rollback path, we ended up hitting the maximum connections limit on the database, resulting in some requests returning an error. We quickly overcame this by stopping all the GitLab services on the old nodes, but it's still something extremely important to remember since our infrastructure is likely to grow in the future.
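For reference, the random-wait workaround mentioned in the first bullet looked roughly like the sketch below. It is only illustrative: the `knife vault refresh` subcommand, the vault and item names, and the retry bound are assumptions, and as noted above the jitter did not reliably avoid the race between parallel provisioning runs.

```python
import random
import subprocess
import time


def update_vault_for_node(node_name, vault="gitlab", item="secrets", attempts=5):
    """Hypothetical wrapper around the vault update for a freshly built node.

    The vault/item names and the knife subcommand are assumptions for this
    sketch; the point is the randomized delay (plus retry) used to dodge the
    race between concurrent provisioning runs, which did not prove reliable.
    """
    for attempt in range(1, attempts + 1):
        # Random jitter so concurrent runs are less likely to read and write
        # the same vault item at the same moment.
        time.sleep(random.uniform(1, 30))
        result = subprocess.run(
            ["knife", "vault", "refresh", vault, item],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return
        print(f"attempt {attempt} failed for {node_name}: {result.stderr.strip()}")
    raise RuntimeError(f"could not update vault for {node_name}")
```

At best this papers over the symptom; serializing the vault updates, or moving away from the pattern entirely as planned in https://gitlab.com/gitlab-com/infrastructure/issues/1212, is the actual fix.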
/cc @gl-infra