Roll out Cloud NAT to CI shared runners
C2
Production Change - Criticality 2

Change Objective | Roll out Cloud NAT to CI shared runners |
---|---|
Change Type | Type described above |
Services Impacted | List services |
Change Team Members | @hphilipps and @craigf |
Change Severity | C2 |
Buddy check | A colleague will review the change |
Tested in staging | Not tested in a non-prod environment, but private runners are already running without public IPs, behind Cloud NAT. |
Schedule of the change | Probably the week beginning 30th September 2019. |
Duration of the change | Quick to execute; rolls out slowly over several hours. The same applies to rollback. |
Steps
- Communicate the behavioural change in CI to customers ahead of time: jobs will no longer be able to open an arbitrarily large number of concurrent connections to the same address (IP:port). Initially, this is limited to 256 connections per address over a 2-minute period, but we reserve the right to decrease the limit further. Concurrent connections to different addresses are unaffected.
- Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1857 and run the manual apply-to-prod step in the master pipeline.
- Once the shared runner managers have completed a Chef run (up to 30 minutes later), all new docker-machine runners will be provisioned with no public IP, and their outbound traffic will automatically exit via the Cloud NAT.
- Check the NAT dashboard every few hours. No alerting is in place, and close monitoring is not strictly necessary: elevated error rates (dropped packets) may occur if jobs make many concurrent connections to the same address, but this is by design.
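
To help explain the connection limit from the first step to customers, here is a minimal sketch of the behaviour as a sliding-window limit per destination address. It is only a toy model built from the figures quoted above (256 connections, 2 minutes); it does not describe how Cloud NAT actually allocates ports. The class name `SlidingWindowLimit`, the `demo` function, and the example IP addresses (taken from the documentation range 203.0.113.0/24) are hypothetical.

```python
from collections import defaultdict, deque

LIMIT = 256            # connections per destination, as quoted in the steps
WINDOW_SECONDS = 120   # the "2-minute period" quoted in the steps


class SlidingWindowLimit:
    """Toy model (an assumption, not Cloud NAT's real algorithm)."""

    def __init__(self, limit=LIMIT, window=WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        # (ip, port) -> timestamps of connections opened within the window
        self._opened = defaultdict(deque)

    def try_open(self, ip, port, now):
        """Return True if a new connection to ip:port is allowed at time `now` (seconds)."""
        recent = self._opened[(ip, port)]
        # Forget connections that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # in production this would show up as dropped packets
        recent.append(now)
        return True


def demo():
    nat = SlidingWindowLimit()
    return {
        # 300 attempts to one address at the same instant
        "same_address_allowed": sum(
            nat.try_open("203.0.113.10", 443, now=0) for _ in range(300)
        ),
        # a different address has its own, independent budget
        "other_address_allowed": sum(
            nat.try_open("203.0.113.11", 443, now=0) for _ in range(10)
        ),
        # once the window has elapsed, the first address is usable again
        "after_window": nat.try_open("203.0.113.10", 443, now=120),
    }


if __name__ == "__main__":
    print(demo())
```

In this model, a job that opens 300 connections to one address at once gets 256 through and has the rest refused, while connections to any other address are unaffected.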
Rollback
- Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1857 and run the manual apply-to-prod step in the resulting master pipeline.
Edited by Craig Furman