Incident Review: GitLab.com certificate changed

Incident Review

The DRI for the incident review is the issue assignee.

  • If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
  • If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
  • Fill out the relevant sections below or link to the meeting review notes that cover these topics.

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)

    1. External SaaS customers using self-hosted runners.
    2. Users with outdated systems (e.g. old Docker images, old Java versions) that have outdated certificate stores, and therefore were unable to validate the Let's Encrypt CA chain.
    3. Users with AWS OIDC integration, since AWS requires providing the SSL certificate fingerprint, which changed with the new SSL certificate.
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)

    1. GitLab Runner jobs failing: We saw multiple issues during job execution, such as "unable to get local issuer certificate" and "self-signed certificate in certificate chain". The first time GitLab Runner opens a connection to GitLab.com it saves the certificate chain and reuses it for all the Git commands it runs. Because these connections are long-lived, there was a high chance that a runner captured the old chain when the connection opened and that, by the time it ran its Git commands, GitLab.com's chain had changed, so Git rejected the certificate.
    2. Let's Encrypt CA not trusted: The Let's Encrypt root CA (ISRG Root X1) was issued in 2015, and some legacy systems still don't trust it. To make things harder, there are also bugs in old OpenSSL versions, so users were forced to update without any heads-up or guidance.
    3. AWS OIDC thumbprint: When using AWS OIDC you are required to provide a thumbprint derived from the provider's certificate. Since the certificate changed, users needed to update the thumbprint (a derivation sketch follows this list).
  3. How many customers were affected?

    At our peak in AMER, we had around 100 tickets in the queue for this incident. EMEA and APAC saw similar numbers. A graph covering 2023-12-13 to 2023-12-14 shows an estimated 278 tickets in queue for SaaS CI/CD related issues. [screenshot: support ticket queue]

    Twitter views peaked at around 1.3k. [screenshot: Twitter post views]

  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?

    A significant drop in traffic occurred after the second SSL certificate was issued and workarounds were in place. [screenshot: traffic graph]
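
    As a concrete illustration of the AWS OIDC thumbprint issue above, here is a minimal sketch (assuming Python 3 and its standard ssl and hashlib modules; the host name is only an example) of how a SHA-1 thumbprint is derived from the certificate a server presents. AWS documents which certificate in the chain the thumbprint must be taken from, so this shows the derivation only, not a drop-in replacement for that procedure.

```python
import hashlib
import ssl

HOST = "gitlab.com"  # example: the OIDC provider host whose certificate changed
PORT = 443

# Fetch the certificate the server presents (PEM), then convert it to DER,
# since the thumbprint is computed over the DER-encoded bytes.
pem = ssl.get_server_certificate((HOST, PORT))
der = ssl.PEM_cert_to_DER_cert(pem)

# SHA-1 thumbprint in the hex form AWS expects for OIDC identity providers.
print(f"SHA-1 thumbprint for {HOST}: {hashlib.sha1(der).hexdigest()}")
```

    After a rotation like this one, re-running such a check and updating the thumbprint registered with the AWS IAM OIDC identity provider is what affected users needed to do.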

What were the root causes?

The old certificate (before 2023-12-13 19:30 UTC) was based on the DigiCert CA and managed by Cloudflare. The DigiCert CA was deprecated, and renewals were no longer going to work. The certificate was set to expire on 2024-01-25, so we had to change the certificate either way.

We came across this problem while we were attempting to add new FQDNs to the certificate as part of gitlab-org/architecture/gitlab-gcp-integration/glgo#3, which failed due to the DigiCert CA deprecation.

Ahead of the production change, we updated the Certificate Authority to Let's Encrypt on Staging. We then updated the Production certificate's Certificate Authority from DigiCert to Let's Encrypt, and the new FQDNs were also added.

The Let's Encrypt certificate can be served with different CA chains:

[diagram of the Let's Encrypt CA chains]

There are two root CAs in use, ISRG Root X1 and ISRG Root X2, each with its own intermediate.

  • X1 -> R3 is used for RSA (issued in 2015)
  • X2 -> E1 is used for ECDSA (issued in 2020)

This means any legacy system whose CA store has not been updated since 2015 will no longer trust this certificate. There are also some systems with older versions of OpenSSL that have bugs with Let's Encrypt certificates.

We had multiple user-facing issues (described below), but the biggest one was the legacy systems: users would have had to update those systems with a new trust store and a new OpenSSL version to fully trust the GitLab.com certificate. Since we had given our users no notice or guidance, we decided to change the certificate one more time, this time using an older CA, Sectigo, via our other certificate provider, SSLMate. We made that change on 2023-12-14 09:26 UTC.
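
To make the trust-store problem concrete, here is a minimal sketch (assuming Python 3 and its standard ssl module; gitlab.com is used as the target) of the kind of check an affected user could run. It attempts a verified TLS handshake using the local CA bundle and reports either the issuer of the served certificate or the verification failure that outdated systems hit after the switch to Let's Encrypt.

```python
import socket
import ssl

HOST = "gitlab.com"
PORT = 443

def check_trust(host: str, port: int = 443) -> None:
    """Attempt a verified TLS handshake against the local trust store and
    report the issuer and expiry of the certificate that was served."""
    context = ssl.create_default_context()  # uses the system CA bundle
    try:
        with socket.create_connection((host, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
                issuer = dict(item[0] for item in cert["issuer"])
                print(f"Trusted by the local store. "
                      f"Issuer: {issuer.get('organizationName')}, "
                      f"expires: {cert['notAfter']}")
    except ssl.SSLCertVerificationError as err:
        # The local CA bundle does not trust the chain that was served;
        # this is the failure mode outdated systems hit after the switch.
        print(f"Verification failed: {err.verify_message}")

if __name__ == "__main__":
    check_trust(HOST, PORT)
```

On a system with an up-to-date CA bundle this prints the issuer of the currently served certificate; on a legacy system it reproduces the same class of verification failures described above.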

Incident Response Analysis

  1. How was the incident detected?: Support received notifications from customers whose self-hosted runners on GitLab.com could not verify the SSL certificate in their jobs.
  2. How could detection time be improved?: External probes fingerprinting the SSL certificate and alerting when it changes, for example https://sslmate.com/certspotter/ (see the probe sketch after this list).
  3. How was the root cause diagnosed?: The team was aware of the SSL change, and since it happened shortly before the incident it was fairly straightforward to correlate.
  4. How could time to diagnosis be improved?: N/A
  5. How did we reach the point where we knew how to mitigate the impact?: We decided to go with a custom SSL certificate from a more established CA chain that was more likely to be supported by older systems. We knew this meant asking customers to restart their self-hosted runners, but it was a fair trade-off: it was easier than having customers update the CA chain of outdated systems, and fewer customers were affected.
  6. How could time to mitigation be improved?:
    1. Have a documented place where private keys for SSLMate certificates are stored 👉 production-engineering#24892
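
A minimal sketch of the external probe idea from item 2 above, assuming Python 3 and a known-good fingerprint recorded whenever the certificate is rotated deliberately (the pinned value and host below are placeholders); a production probe such as Cert Spotter would of course be more robust:

```python
import hashlib
import ssl
import sys

HOST = "gitlab.com"
PORT = 443
# Hypothetical pinned value, recorded when the certificate was last rotated
# deliberately; a real probe would load this from configuration.
EXPECTED_SHA256 = "replace-with-known-good-fingerprint"

def current_fingerprint(host: str, port: int = 443) -> str:
    """Return the SHA-256 fingerprint of the certificate currently served."""
    pem = ssl.get_server_certificate((host, port))
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()

if __name__ == "__main__":
    observed = current_fingerprint(HOST, PORT)
    if observed != EXPECTED_SHA256:
        # In a real probe this would page on-call instead of printing.
        print(f"ALERT: certificate for {HOST} changed (now {observed})")
        sys.exit(1)
    print("Certificate unchanged.")
```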

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?: No
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?: No
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.

What went well?

  1. Easy path to escalate to leadership to take business decisions.
  2. Good collaboration between SREs and the Support team.
  3. Having multiple SSL providers set up and ready to go.

Corrective Actions

Guidelines
