
Migration of GitLab.com to Cloudflare

Production Change - Criticality 1 C1

Change Objective: Describe the objective of the change
Change Type: ConfigurationChange|Operation
Services Impacted: Service::API, Service::CI Runners, Service::Customers, Service::Container Registry, Service::Forum, Service::Git, Service::GitLab Rails, Service::Grafana, Service::HAProxy, Service::Infrastructure, Service::License
Change Team Members: @T4cC0re -- we will be on the #incident-management zoom
Change Criticality: C1
Change Reviewer: @hphilipps (Acting Engineer on Call)
Tested in staging: the change was tested on staging
Dry-run output: If the change is done through a script, it is mandatory to have a dry-run capability in the script, run the change in dry-run mode and output the result
Due Date: Saturday, 28th of March 2020, 11:00 UTC
Time tracking: Scheduled window of 1 hour for the actual change
Downtime Component: None expected. There might be a disruption in DNS resolution for a few minutes. After the migration there is a non-zero chance of undetermined downtime.

Detailed steps for the change

Pre execution conditions:

  • gitlab.com is in pending state in the Cloudflare UI
  • Under gitlab.com / SSL/TLS / Edge Certificates there is a Custom certificate for gitlab.com, gprd.gitlab.com and www.gitlab.com in Active state.
  • Under gitlab.com / Spectrum there are 4 Spectrum apps (altssh.gitlab.com:443 and gitlab.com:22/80/443).
  • Under gitlab.com / Overview these 2 NS servers are listed as target NS servers: diva.ns.cloudflare.com and jermaine.ns.cloudflare.com.
  • A tf plan on gprd and ops does not yield any diffs in DNS records or Cloudflare resources.
  • Verify that the bin/checkTraffic.sh script in the chef repo yields a number greater than 0 when executed like this:
$ bin/checkTraffic.sh gprd | sort -Vu | wc -l
61748
  • Verify,
    • dig +short gitlab.com shows 35.231.145.151
    • dig +short altssh.gitlab.com shows 35.190.168.187
    • dig +short gitlab.com @diva.ns.cloudflare.com should not return anything
    • dig +short NS gitlab.com shows (order might change)
ns-1373.awsdns-43.org.
ns-1644.awsdns-13.co.uk.
ns-505.awsdns-63.com.
ns-705.awsdns-24.net.
  • Verify that openssl s_client -connect gitlab.com:443 </dev/null 2>/dev/null | openssl x509 -fingerprint -noout shows
SHA1 Fingerprint=F3:85:85:A1:04:F3:77:26:CF:CC:E5:CE:E2:23:ED:63:A1:8F:54:DC
  • Verify that curl -IXGET https://gitlab.com/cdn-cgi/trace yields a 302 Found with a redirect to login.
  • Open the traffic overview dashboard and keep it open
    • During the migration you should see a new zone popping up on the graphs (gitlab.com). If it does not, even after the zone has been activated, refresh the page.
    • The HAProxy gprd 2xx number should not drop to (or near) zero, as that would mean traffic is being lost. A moderate reduction in the curve, however, might happen.
    • Eventually you should see Cloudflare gitlab.com 2xx gaining traffic. The number will not match HAProxy, as the latter also includes internal traffic.
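
The DNS checks in the list above lend themselves to a small script. The following is a minimal sketch and not part of the chef repo: the function names (normalize_names, same_name_set, check_equal, preflight) are made up for illustration, and the expected values are copied from the list above. It assumes dig is installed, and it compares the NS answer as a set because the order might change.

```shell
#!/bin/sh
# Sketch of the pre-execution DNS checks. All function names are hypothetical;
# the expected values are copied from the checklist.

# Normalize a list of DNS names read from stdin: one name per line,
# lowercased, trailing dot removed, sorted, duplicates dropped.
normalize_names() {
    tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | sed -e 's/\.$//' -e '/^$/d' | sort -u
}

# Succeed when both arguments contain the same set of names (order ignored).
same_name_set() {
    [ "$(printf '%s\n' "$1" | normalize_names)" = "$(printf '%s\n' "$2" | normalize_names)" ]
}

# Report a single check. Arguments: label, expected value, actual value.
check_equal() {
    if [ "$2" = "$3" ]; then
        echo "OK   $1"
    else
        echo "FAIL $1: expected '$2', got '$3'"
        return 1
    fi
}

EXPECTED_NS="ns-1373.awsdns-43.org ns-1644.awsdns-13.co.uk ns-505.awsdns-63.com ns-705.awsdns-24.net"

# Run the live checks (needs dig and network access).
# Returns non-zero if any check failed.
preflight() {
    rc=0
    check_equal "A gitlab.com" "35.231.145.151" "$(dig +short gitlab.com)" || rc=1
    check_equal "A altssh.gitlab.com" "35.190.168.187" "$(dig +short altssh.gitlab.com)" || rc=1
    check_equal "Cloudflare NS returns nothing yet" "" "$(dig +short gitlab.com @diva.ns.cloudflare.com)" || rc=1
    if same_name_set "$EXPECTED_NS" "$(dig +short NS gitlab.com)"; then
        echo "OK   NS gitlab.com"
    else
        echo "FAIL NS gitlab.com"
        rc=1
    fi
    return "$rc"
}
```

After sourcing the file, running preflight prints one OK or FAIL line per check. The fingerprint, curl and checkTraffic.sh checks are left out of the sketch and still need to be run by hand.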

Execution steps

  • Log into AWS console and into Route 53.
  • Visit the domain settings for gitlab.com
  • Under Name Servers click Add or edit name servers
    • Remove all nameserver entries
    • Enter diva.ns.cloudflare.com and jermaine.ns.cloudflare.com and hit Update
  • Go to the Cloudflare dashboard for gitlab.com
    • Under gitlab.com / Overview hit Re-check now to have Cloudflare re-check the DNS records.

NOTE: Between AWS setting the new nameservers and Cloudflare registering the change, gitlab.com might fail to resolve for some clients for a brief period. AWS' DNS servers will continue to answer queries during this time, but a client may already ask Cloudflare's DNS servers before they are activated for the zone. The TTL of 5 minutes should reduce the impact of this.
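
Instead of re-running dig by hand during this window, the check can be polled. This is a minimal sketch under stated assumptions: retry and ns_answers are hypothetical helper names, not existing scripts, and the attempt count and delay are arbitrary values to adjust.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds, sleeping DELAY_SECONDS between attempts.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt/$max_attempts failed: $*" >&2
        attempt=$((attempt + 1))
        # Only sleep if another attempt follows.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Succeed once nameserver $2 returns any answer for name $1.
ns_answers() {
    [ -n "$(dig +short "$1" "@$2")" ]
}
```

For example, retry 30 10 ns_answers gitlab.com diva.ns.cloudflare.com polls every 10 seconds for roughly five minutes and returns as soon as the Cloudflare nameserver starts answering for the zone.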

Post execution validation steps

  • Verify,
    • dig +short gitlab.com @diva.ns.cloudflare.com should yield an IP in the Cloudflare range
    • dig +short gitlab.com shows the same IP (this might be dependent on the DNS caches along the way and take a few minutes)
    • dig +short AAAA gitlab.com resolves to an IPv6 address.
    • dig +short altssh.gitlab.com shows a CNAME to <hexcharacters>.pacloudflare.com
    • dig +short NS gitlab.com shows (order might change, and again depends on DNS caches)
diva.ns.cloudflare.com.
jermaine.ns.cloudflare.com.
  • Verify that openssl s_client -connect gitlab.com:443 </dev/null 2>/dev/null | openssl x509 -fingerprint -noout still shows
SHA1 Fingerprint=F3:85:85:A1:04:F3:77:26:CF:CC:E5:CE:E2:23:ED:63:A1:8F:54:DC
  • Verify that curl -i https://gitlab.com/cdn-cgi/trace yields a 200 OK with content similar to this
HTTP/2 200
date: Thu, 26 Mar 2020 14:50:18 GMT
content-type: text/plain
[...]

fl=<HEX CHARS>
h=gitlab.com
ip=<YOUR IP>
ts=1585234218.893
visit_scheme=https
uag=curl/7.69.1
colo=FRA
http=http/2
loc=DE
tls=TLSv1.3
sni=plaintext
warp=off
  • Clone a project via SSH and HTTPS
    • GIT_SSH_COMMAND="ssh -v" git clone git@gitlab.com:T4cC0re/linux-snapshot.git /tmp/clone_ssh
    • GIT_CURL_VERBOSE=1 git clone https://gitlab.com/T4cC0re/linux-snapshot.git /tmp/clone_https
    • GIT_SSH_COMMAND="ssh -v" git clone ssh://git@altssh.gitlab.com:443/T4cC0re/linux-snapshot.git /tmp/clone_altssh
  • Every 5-10 minutes, verify that the bin/checkTraffic.sh script outputs steadily decreasing numbers when executed like this:
$ bin/checkTraffic.sh gprd | sort -Vu | wc -l
61748
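
Two of the validation steps above can be checked mechanically: whether the resolved address lies in a Cloudflare range, and whether the trace output reports the expected host. The following is a minimal sketch. The helper names (ip_to_int, in_cidr, trace_field) are hypothetical, and the CIDR in the usage note is only an example value. The authoritative list should be taken from the IP ranges Cloudflare publishes.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to its integer value.
# The body runs in a subshell so the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# in_cidr IP NETWORK/PREFIX
# Succeed when the IPv4 address IP falls inside the given CIDR block.
in_cidr() {
    ip_int=$(ip_to_int "$1")
    net_int=$(ip_to_int "${2%/*}")
    prefix=${2#*/}
    # Build the netmask for the prefix length, limited to 32 bits.
    mask=$(( (0xFFFFFFFF << (32 - $prefix)) & 0xFFFFFFFF ))
    [ $(( $ip_int & $mask )) -eq $(( $net_int & $mask )) ]
}

# trace_field KEY
# Print the value of KEY from /cdn-cgi/trace output read on stdin.
trace_field() {
    sed -n "s/^$1=//p"
}
```

For example, in_cidr "$(dig +short gitlab.com @diva.ns.cloudflare.com | head -n 1)" 104.16.0.0/13 succeeds only if the first returned address is inside that block, and curl -s https://gitlab.com/cdn-cgi/trace | trace_field h should print gitlab.com once traffic is served through Cloudflare.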

Rollback steps

NOTE: Because this change is primarily DNS-driven, a rollback should only be the last option, as its effect can be delayed for up to 48 hours depending on DNS resolver caches between the customer and us. If at all possible, try rolling forward and document the steps taken. A rollback might make things worse.

  • Log into AWS console and into Route 53.
  • Visit the domain settings for gitlab.com
  • Under Name Servers click Add or edit name servers
    • Remove all nameserver entries
    • Enter these nameservers and hit Update
      • ns-1373.awsdns-43.org
      • ns-1644.awsdns-13.co.uk
      • ns-505.awsdns-63.com
      • ns-705.awsdns-24.net
  • Traffic should gradually flow away from Cloudflare as DNS resolver caches expire.

Changes checklist

  • Detailed steps and rollback steps have been filled prior to commencing work
  • Person on-call has been informed prior to change being rolled out

/label C1 changeunscheduled

Edited by Hendrik Meyer (xLabber)