Migration of GitLab.com to Cloudflare
C1
Production Change - Criticality 1

Change Objective | Describe the objective of the change |
---|---|
Change Type | ConfigurationChange|Operation |
Services Impacted | Service::API, Service::CI Runners, Service::Customers, Service::Container Registry, Service::Forum, Service::Git, Service::GitLab Rails, Service::Grafana, Service::HAProxy, Service::Infrastructure, Service::License |
Change Team Members | @T4cC0re -- we will be on the #incident-management zoom |
Change Criticality | C1 |
Change Reviewer | @hphilipps (Acting Engineer on Call) |
Tested in staging | the change was tested on staging |
Dry-run output | If the change is done through a script, it is mandatory to have a dry-run capability in the script, run the change in dry-run mode and output the result |
Due Date | Saturday, 28th of March 2020, 11:00 UTC |
Time tracking | Scheduled window of 1 hour for the actual change |
Downtime Component | None expected. There might be a disruption in DNS resolution for a few minutes. After the migration there is a non-zero chance of undetermined downtime. |
Detailed steps for the change
Pre execution conditions:
- `gitlab.com` is in pending state in the Cloudflare UI.
- Under `gitlab.com` / `SSL/TLS` / `Edge Certificates` there is a `gitlab.com, gprd.gitlab.com, www.gitlab.com` `Custom` certificate in `Active` state.
- Under `gitlab.com` / `Spectrum` there are 4 Spectrum apps (`altssh.gitlab.com:443` and `gitlab.com:22`/`80`/`443`).
- Under `gitlab.com` / `Overview` these 2 NS servers are listed as target NS servers: `diva.ns.cloudflare.com` and `jermaine.ns.cloudflare.com`.
- A `tf plan` on `gprd` and `ops` does not yield any diffs in DNS records or Cloudflare resources.
- Verify that the `bin/checkTraffic.sh` script in the chef repo yields a numeric output > 0 when executed like this:

      $ bin/checkTraffic.sh gprd | sort -Vu | wc -l
      61748
- Verify:
  - `dig +short gitlab.com` shows `35.231.145.151`
  - `dig +short altssh.gitlab.com` shows `35.190.168.187`
  - `dig +short gitlab.com @diva.ns.cloudflare.com` does not return anything
  - `dig +short NS gitlab.com` shows (order might change):

        ns-1373.awsdns-43.org.
        ns-1644.awsdns-13.co.uk.
        ns-505.awsdns-63.com.
        ns-705.awsdns-24.net.
- Verify that `openssl s_client -connect gitlab.com:443 </dev/null 2>/dev/null | openssl x509 -fingerprint -noout` shows:

      SHA1 Fingerprint=F3:85:85:A1:04:F3:77:26:CF:CC:E5:CE:E2:23:ED:63:A1:8F:54:DC
- Verify that `curl -IXGET https://gitlab.com/cdn-cgi/trace` yields a `302 Found` with a redirect to login.
- Open the traffic overview dashboard and keep it open.
  - During the migration you should see a new zone (`gitlab.com`) appear on the graphs. If it does not, even after the zone has been activated, refresh the page.
  - The `HAProxy gprd 2xx` number should not drop sharply; a sharp drop means traffic is being lost. A reduction in the curve, however, might happen.
  - Eventually you should see `Cloudflare gitlab.com 2xx` gaining traffic. The number will not match HAProxy, as the latter also includes internal traffic.
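
The pre-execution checks above all compare command output against known values. A minimal sketch of a helper for that, assuming a POSIX shell; `same_set` and `check` are hypothetical names and not part of the chef repo, while the `dig` invocation in the usage comment is taken from the list above:

```shell
# same_set EXPECTED ACTUAL: succeed if both newline-separated lists hold the
# same entries, ignoring order and a trailing dot on each entry.
same_set() {
  ss_a=$(printf '%s\n' "$1" | sed 's/\.$//' | sort -u)
  ss_b=$(printf '%s\n' "$2" | sed 's/\.$//' | sort -u)
  [ "$ss_a" = "$ss_b" ]
}

# check DESCRIPTION EXPECTED ACTUAL: print OK or FAIL for one condition.
check() {
  if same_set "$2" "$3"; then
    echo "OK: $1"
  else
    echo "FAIL: $1"
    return 1
  fi
}

# Usage (run manually; these lines query live DNS):
# check "gitlab.com A record" "35.231.145.151" "$(dig +short gitlab.com)"
# check "no answer from Cloudflare yet" "" "$(dig +short gitlab.com @diva.ns.cloudflare.com)"
```

Comparing as sets keeps the NS check stable even though `dig` may return the nameservers in a different order on each run.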
Execution steps
- Log into AWS console and into Route 53.
  - Visit the domain settings for `gitlab.com`.
  - Under `Name Servers` click `Add or edit name servers`.
  - Remove all nameserver entries.
  - Enter `diva.ns.cloudflare.com` and `jermaine.ns.cloudflare.com` and hit `Update`.
- Go to the Cloudflare dashboard for `gitlab.com`.
  - Under `gitlab.com` / `Overview` hit `Re-check now` to have Cloudflare re-check the DNS records.

NOTE: In the period between AWS setting the new nameservers and Cloudflare registering the change, gitlab.com might not resolve for some clients for a brief time. AWS' DNS servers will continue to answer queries throughout. What can happen is that a client asks Cloudflare's DNS servers before they are activated for the zone. However, the TTL of 5 minutes should reduce the impact of this.
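
The propagation window described in the note can be watched with a small polling loop. This is only a sketch: `wait_until` and `ns_is_cloudflare` are hypothetical helpers, and the interval of 30 seconds and the limit of 20 attempts in the usage comment are assumed values, not part of the change plan.

```shell
# wait_until TRIES DELAY CMD...: run CMD until it succeeds or TRIES attempts
# have been used, sleeping DELAY seconds between attempts.
wait_until() {
  wu_tries=$1
  wu_delay=$2
  shift 2
  wu_i=1
  while [ "$wu_i" -le "$wu_tries" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$wu_i" -lt "$wu_tries" ]; then
      sleep "$wu_delay"
    fi
    wu_i=$((wu_i + 1))
  done
  return 1
}

# Succeeds once the resolver in use returns a Cloudflare nameserver for the zone.
ns_is_cloudflare() {
  dig +short NS gitlab.com | grep -q 'ns\.cloudflare\.com\.$'
}

# Usage (run manually; polls live DNS every 30 seconds for up to 10 minutes):
# wait_until 20 30 ns_is_cloudflare && echo "Cloudflare delegation visible"
```

The result only reflects the resolver the script happens to use; other resolvers may still serve the cached AWS delegation.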
Post execution validation steps
- Verify:
  - `dig +short gitlab.com @diva.ns.cloudflare.com` yields an IP in the Cloudflare range
  - `dig +short gitlab.com` shows the same IP (this might depend on the DNS caches along the way and take a few minutes)
  - `dig +short AAAA gitlab.com` resolves to an IPv6 address
  - `dig +short altssh.gitlab.com` shows a CNAME to `<hexcharacters>.pacloudflare.com`
  - `dig +short NS gitlab.com` shows (order might change, and again depends on DNS caches):

        diva.ns.cloudflare.com.
        jermaine.ns.cloudflare.com.
- Verify that `openssl s_client -connect gitlab.com:443 </dev/null 2>/dev/null | openssl x509 -fingerprint -noout` still shows:

      SHA1 Fingerprint=F3:85:85:A1:04:F3:77:26:CF:CC:E5:CE:E2:23:ED:63:A1:8F:54:DC
- Verify that `curl -i https://gitlab.com/cdn-cgi/trace` yields a `200 OK` with content similar to this:

      HTTP/2 200
      date: Thu, 26 Mar 2020 14:50:18 GMT
      content-type: text/plain
      [...]
      fl=<HEX CHARS>
      h=gitlab.com
      ip=<YOUR IP>
      ts=1585234218.893
      visit_scheme=https
      uag=curl/7.69.1
      colo=FRA
      http=http/2
      loc=DE
      tls=TLSv1.3
      sni=plaintext
      warp=off
- Clone a project via SSH and HTTPS:
  - `GIT_SSH_COMMAND="ssh -v" git clone git@gitlab.com:T4cC0re/linux-snapshot.git /tmp/clone_ssh`
  - `GIT_CURL_VERBOSE=1 git clone https://gitlab.com/T4cC0re/linux-snapshot.git /tmp/clone_https`
  - `GIT_SSH_COMMAND="ssh -v" git clone ssh://git@altssh.gitlab.com:443/T4cC0re/linux-snapshot.git /tmp/clone_altssh`
- Every 5-10 minutes, verify that the `bin/checkTraffic.sh` script outputs lower and lower numbers when executed like this:

      $ bin/checkTraffic.sh gprd | sort -Vu | wc -l
      61748
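
Two of the checks above lend themselves to small helpers: reading single fields out of the `cdn-cgi/trace` body, and confirming that successive `bin/checkTraffic.sh` counts keep falling. A sketch assuming a POSIX shell, with hypothetical function names `trace_field` and `non_increasing`; apart from 61748, the counts in the usage comment are made up for illustration.

```shell
# trace_field KEY: read a cdn-cgi/trace response on stdin and print the value
# of the KEY=value line. Header lines and other keys are ignored.
trace_field() {
  sed -n "s/^$1=//p"
}

# non_increasing N1 [N2 ...]: succeed if every count is less than or equal to
# the one before it. Needs at least one argument.
non_increasing() {
  ni_prev=$1
  shift
  for ni_n in "$@"; do
    [ "$ni_n" -le "$ni_prev" ] || return 1
    ni_prev=$ni_n
  done
  return 0
}

# Usage (run manually; the first line talks to the live site):
# [ "$(curl -s https://gitlab.com/cdn-cgi/trace | trace_field h)" = "gitlab.com" ]
# non_increasing 61748 40210 1200 && echo "traffic on the old path is draining"
```

Anchoring the pattern at the start of the line keeps `trace_field h` from also matching the `http=` line.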
Rollback steps
NOTE: Because this change is primarily DNS driven, a rollback should only be the last option, as its effect can be delayed for up to 48 hours depending on DNS resolver caches between the customer and us. If at all possible, try rolling forward and document the steps taken. A rollback might make things worse.
- Log into AWS console and into Route 53.
  - Visit the domain settings for `gitlab.com`.
  - Under `Name Servers` click `Add or edit name servers`.
  - Remove all nameserver entries.
  - Enter these nameservers and hit `Update`:

        ns-1373.awsdns-43.org
        ns-1644.awsdns-13.co.uk
        ns-505.awsdns-63.com
        ns-705.awsdns-24.net
- Traffic should gradually flow away from Cloudflare as DNS resolver caches expire.
Changes checklist
- Detailed steps and rollback steps have been filled prior to commencing work
- Person on-call has been informed prior to change being rolled out
/label C1 changeunscheduled
Edited by Hendrik Meyer (xLabber)