Deploy Gitter ws to production with the new AMI
What are we going to do?
Deploy Gitter websocket servers (ws) to production using the most recent AMIs built as part of #3723 (moved). These ASGs are provisioned by Terraform. We are going to update the Terraform config and deploy the change, which should result in 8 new Gitter ws-0x services running.
Why are we doing it?
- We are running on an AMI built in February 2018. The Ansible task dependencies and Ansible itself have changed since then, so if an EC2 instance dies, the ASG might not be able to provision a new one.
- ws-06 and ws-07 are getting stopped for maintenance by AWS https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7538 https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7537
When are we going to do it?
Deploying new ASG Launch configuration
- Start time: 2019/08/19 6:00 UTC
- Duration: 15 minutes
- Estimated end time: 2019/08/19 6:15 UTC
Replacing first 4 old instances
- Start time: 2019/08/20 6:00 UTC
- Duration: 15 minutes
- Estimated end time: 2019/08/20 6:15 UTC
Replacing remaining 4 old instances
- Start time: 2019/08/21 6:00 UTC
- Duration: 15 minutes
- Estimated end time: 2019/08/21 6:15 UTC
How are we going to do it?
Deploying new ASG Launch configuration
- Merge https://gitlab.com/gitlab-com/gl-infra/gitter-infrastructure/merge_requests/137
- Update ASG by running terraform apply in accordance with the README.
- Add one instance (ws-09) to the group and test the new Launch Configuration.
- If you need to add/remove more instances while debugging, use Scale in protection so we won't accidentally terminate old instances.
- found https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7593 during this step
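The Scale in protection step can also be scripted. Below is a minimal sketch, assuming a boto3 Auto Scaling client is passed in; the group name in the usage comment is a placeholder, not the real ASG name from the Terraform config.

```python
def protect_all_instances(autoscaling, group_name):
    """Enable scale-in protection on every instance currently in the ASG.

    `autoscaling` is assumed to be a boto3 Auto Scaling client,
    e.g. boto3.client("autoscaling"). Returns the protected instance IDs.
    """
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    groups = response["AutoScalingGroups"]
    if not groups:
        raise ValueError(f"ASG not found: {group_name}")

    instance_ids = [i["InstanceId"] for i in groups[0]["Instances"]]
    if instance_ids:
        autoscaling.set_instance_protection(
            AutoScalingGroupName=group_name,
            InstanceIds=instance_ids,
            ProtectedFromScaleIn=True,
        )
    return instance_ids


# Usage (hypothetical group name):
#   import boto3
#   protect_all_instances(boto3.client("autoscaling"), "gitter-ws")
```

Running this before adding ws-09 means a later drop in desired capacity should not pick one of the old instances for termination.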
Replacing first 4 old instances
- terminating ws-09 and returning the desired capacity to 8
- terminating ws-01 to ws-04
- wait for 4 new instances to spin up
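The two kinds of termination above differ only in whether the desired capacity drops. A minimal sketch, assuming a boto3 Auto Scaling client; the instance IDs in the usage comment are hypothetical, and mapping hostnames such as ws-01 to EC2 instance IDs is not shown.

```python
def terminate_instances(autoscaling, instance_ids, decrement_capacity=False):
    """Terminate instances through the ASG and return the scaling activity IDs.

    With decrement_capacity=False the desired capacity stays the same, so the
    ASG launches a replacement for each terminated instance.
    With decrement_capacity=True the group shrinks by one per instance.
    """
    activity_ids = []
    for instance_id in instance_ids:
        response = autoscaling.terminate_instance_in_auto_scaling_group(
            InstanceId=instance_id,
            ShouldDecrementDesiredCapacity=decrement_capacity,
        )
        activity_ids.append(response["Activity"]["ActivityId"])
    return activity_ids


# Usage (hypothetical instance IDs):
#   import boto3
#   asg = boto3.client("autoscaling")
#   terminate_instances(asg, ["i-0aaa"], decrement_capacity=True)   # ws-09, back to 8
#   terminate_instances(asg, ["i-0bbb", "i-0ccc"])                  # old instances, replaced
```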
Replacing remaining 4 old instances
- terminating ws-05 to ws-08
- wait for 4 new instances to spin up
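Once the last batch is replaced, it is worth confirming that no instance is left on the old Launch Configuration. A minimal sketch, assuming a boto3 Auto Scaling client; the group and Launch Configuration names are placeholders. Instances whose Launch Configuration has been deleted are reported without a `LaunchConfigurationName`, so they are treated as stale here.

```python
def instances_needing_attention(autoscaling, group_name, launch_config_name):
    """Return IDs of instances that are not InService on the expected launch config."""
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    group = response["AutoScalingGroups"][0]

    flagged = []
    for instance in group["Instances"]:
        on_new_config = instance.get("LaunchConfigurationName") == launch_config_name
        in_service = instance["LifecycleState"] == "InService"
        if not (on_new_config and in_service):
            flagged.append(instance["InstanceId"])
    return flagged


# Usage (hypothetical names); an empty list means the rollout is complete:
#   import boto3
#   instances_needing_attention(boto3.client("autoscaling"), "gitter-ws", "gitter-ws-lc-new")
```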
How are we preparing for it?
- create new AMI
- validate whether the old AMI can still be booted up (if yes, that lowers the risk because we can roll back); this can be done by increasing the ASG size from 8 to 9 and checking whether the new instance boots up successfully
- understand how we will make sure that newly deployed EC2 instances are running the latest version of webapp
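The old-AMI boot check (8 to 9 instances) could be scripted roughly as follows. This is a sketch under assumptions: a boto3 Auto Scaling client is passed in, the group's maximum size allows one more instance, and the timeout and polling interval are arbitrary values, not taken from the Gitter setup.

```python
import time


def scale_up_by_one(autoscaling, group_name, timeout_s=600, poll_s=15, sleep=time.sleep):
    """Raise desired capacity by one and wait for the extra instance.

    Returns the new instance ID once it reaches InService,
    or None if that does not happen within timeout_s seconds.
    """
    def describe():
        response = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[group_name]
        )
        return response["AutoScalingGroups"][0]

    group = describe()
    known_ids = {i["InstanceId"] for i in group["Instances"]}
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=group["DesiredCapacity"] + 1,
        HonorCooldown=False,
    )

    waited = 0
    while waited <= timeout_s:
        for instance in describe()["Instances"]:
            is_new = instance["InstanceId"] not in known_ids
            if is_new and instance["LifecycleState"] == "InService":
                return instance["InstanceId"]
        sleep(poll_s)
        waited += poll_s
    return None
```

InService only shows that the instance passed the ASG health checks; whether the ws service itself came up still needs a manual look.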
What can we check before starting?
- make sure that no production/staging deployments are happening
What can we check afterwards to ensure that it's working?
- validate that Gitter works by accessing https://gitter.im
- check Sentry and Datadog monitoring for errors
- Keep an eye on PagerDuty
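The first check above can be automated as a basic smoke test. A minimal sketch using only the Python standard library; it only confirms that the page responds over HTTP and does not exercise the websocket connections served by the ws instances, so it complements the Sentry and Datadog checks rather than replacing them.

```python
import urllib.request


def site_is_up(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if the URL answers with a 2xx/3xx status, False otherwise.

    `opener` is injectable so the check can be exercised without network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # urllib.error.URLError, HTTPError and socket timeouts are OSError subclasses
        return False


# Usage:
#   site_is_up("https://gitter.im")
```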
Impact
- Type of impact: If the change fails, the impact could be Gitter downtime
- What will happen: Nothing negative
- Do we expect downtime? (set the override in pagerduty): None
How are we communicating this to our customers?
- Nothing required
What is the rollback plan?
- revert https://gitlab.com/gitlab-com/gl-infra/gitter-infrastructure/merge_requests/137 and deploy using terraform apply
Edited by Tomas Vik