2021-01-28 CI build tokens are not decrypting properly
Summary
The latest release to canary included a change that turned out not to be backward-compatible with the code in main-stage production. Consequently, auth tokens generated by the production environment's canary stage could not be decrypted by the main stage.
Draining canary has partially mitigated the problem. No new unusable tokens are being generated, but CI jobs that started during the affected timespan are still failing.
We are currently evaluating a couple of options for repairing those tokens.
Update as of 2021-01-29 03:40 UTC:
We believe this issue is now resolved.
As of 2021-01-29 01:10 UTC, an updated build has been tested and deployed that can decrypt both flavors of auth token. Prior to that release, many of the previously unusable tokens had already been repaired, but this release is expected to cover all known cases. Monitoring over the last couple of hours has not turned up any new cases.
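To make the shape of that fix concrete, here is a minimal sketch assuming two hypothetical token schemes (the module, key handling, and cipher choices below are invented for illustration and are not GitLab's actual token code): a decrypter that tolerates both flavors simply tries the newer scheme first and falls back to the older one when OpenSSL::Cipher::CipherError is raised.

```ruby
require "openssl"
require "base64"

# Hypothetical sketch only: the real token schemes, key handling, and the
# actual fix are not shown here. The point is the shape of a decrypter
# that tolerates both token "flavors" instead of raising CipherError.
module TokenCipher
  KEY = OpenSSL::Digest.digest("SHA256", "dummy-secret-for-illustration")

  # Older ("main-stage") flavor: AES-256-CBC with a key-derived IV.
  def self.encrypt_v1(plaintext)
    cipher = OpenSSL::Cipher.new("aes-256-cbc").encrypt
    cipher.key = KEY
    cipher.iv  = KEY[0, 16]
    Base64.strict_encode64(cipher.update(plaintext) + cipher.final)
  end

  def self.decrypt_v1(token)
    cipher = OpenSSL::Cipher.new("aes-256-cbc").decrypt
    cipher.key = KEY
    cipher.iv  = KEY[0, 16]
    cipher.update(Base64.strict_decode64(token)) + cipher.final
  end

  # Newer ("canary") flavor: AES-256-GCM with a random IV and auth tag
  # prepended to the ciphertext.
  def self.encrypt_v2(plaintext)
    cipher = OpenSSL::Cipher.new("aes-256-gcm").encrypt
    cipher.key = KEY
    iv = cipher.random_iv
    ciphertext = cipher.update(plaintext) + cipher.final
    Base64.strict_encode64(iv + cipher.auth_tag + ciphertext)
  end

  def self.decrypt_v2(token)
    raw = Base64.strict_decode64(token)
    # Tokens too short to carry an IV and tag cannot be v2 tokens.
    raise OpenSSL::Cipher::CipherError, "token too short" if raw.bytesize <= 28

    cipher = OpenSSL::Cipher.new("aes-256-gcm").decrypt
    cipher.key = KEY
    cipher.iv  = raw[0, 12]
    cipher.auth_tag = raw[12, 16]
    cipher.update(raw[28..-1]) + cipher.final
  end

  # Backward-compatible entry point: try the newer flavor first and fall
  # back to the older one, so neither flavor surfaces a CipherError.
  def self.decrypt(token)
    decrypt_v2(token)
  rescue OpenSSL::Cipher::CipherError
    decrypt_v1(token)
  end
end
```

The design point is that the fallback lives inside the decrypter, so callers never see a CipherError merely because a token was written by the other release.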
Timeline
All times UTC.
2021-01-27
- 2021-01-27 22:17 - Start of errors decrypting auth tokens.
- This rise in error rate concurrently affected both the canary and main stages of the production environment.
- It coincides with the deploy to canary of GitLab release 13.9.202101271903-670aa594500.4226fdbd636. (See annotations of the Prometheus metrics in the Grafana screenshots below.)
- These errors were for HTTP requests that failed while trying to decrypt tokens created by the canary release that the previous release in main-stage could not decrypt.
- 99% of these requests were recurring attempts by CI jobs to publish their job trace output via API calls.
- 2021-01-27 23:08 - We begin investigating a tangentially related problem with the latest canary deploy: #3440 (closed)
- That investigation led to the discovery of this issue's problem.
2021-01-28
- 2021-01-28 00:20 - @stanhu declares incident in Slack.
- 2021-01-28 01:11 - Disabled canary api fleet. This partially prevented new auth tokens from being created by canary.
- 2021-01-28 01:26 - Disabled canary web fleet. This finished preventing canary from making new auth tokens.
- 2021-01-28 18:40 - Partial mitigation: ad hoc repair of most of the auth tokens for CI jobs whose tokens had been created by the canary release and therefore could not be decrypted by the stable release in main-stage. See #3441 (comment 496528048); a sketch of the general idea follows this timeline.
- 2021-01-28 20:19 - End of production's distinctive CipherError exceptions associated with failing to decrypt auth tokens.
- 2021-01-28 21:04 - Repaired additional tokens.
2021-01-29
- 2021-01-29 01:10 - Generic fix is fully applied to production. Either flavor of token can now be successfully decrypted.
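For context on the ad hoc repair and follow-up token fixes above, the sketch below shows one plausible shape of such a repair, reusing the illustrative TokenCipher module from the Summary sketch; the actual procedure is the one described in #3441 (comment 496528048), and the file name and helpers here are assumptions.

```ruby
# Hypothetical sketch only; the real repair is described in
# #3441 (comment 496528048). Assumes the illustrative TokenCipher module
# from the Summary sketch is saved alongside this file as token_cipher.rb.
require_relative "token_cipher"

# Rewrite canary-flavor (v2) tokens -- identified from the CipherError
# logs -- into the stable (v1) format so main-stage can read them again.
def repair_tokens(affected_tokens)
  affected_tokens.map do |token|
    TokenCipher.encrypt_v1(TokenCipher.decrypt_v2(token))
  end
end
```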
Corrective Actions
- Add backwards-compatibility testing of CI tokens: gitlab-com/www-gitlab-com#10546 (closed). (A sketch of such a test follows this list.)
- Because this component is security-related, additional corrective actions are being discussed in the confidential issue. They relate to improving capacity requirements, adding referential metadata, avoiding future cross-compatibility corner cases, and other forward-looking design improvements. A partial topic list is here (internal only): https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3443#note_500935057
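A rough sketch of what such a backwards-compatibility test could look like, exercising the illustrative TokenCipher module from the Summary rather than GitLab's real token code (the spec below is an assumption, not the test added in the linked issue): encrypt with each scheme and assert the shared decrypt path still reads both.

```ruby
# Hypothetical spec, not the test added in gitlab-com/www-gitlab-com#10546.
# Assumes the illustrative TokenCipher sketch is saved as token_cipher.rb.
require "rspec/autorun"
require_relative "token_cipher"

RSpec.describe TokenCipher do
  let(:plaintext) { "ci-job-token-123" }

  it "still decrypts tokens produced by the previous (v1) scheme" do
    expect(described_class.decrypt(described_class.encrypt_v1(plaintext))).to eq(plaintext)
  end

  it "decrypts tokens produced by the current (v2) scheme" do
    expect(described_class.decrypt(described_class.encrypt_v2(plaintext))).to eq(plaintext)
  end
end
```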
Metrics
Prometheus metrics: API error rates
From the perspective of Prometheus metrics, the fact that API error rates had spiked was clear, but the nature of the spike was not. The feature category was unknown. The PATCH HTTP method was the most frequently affected. This method is used by CI runners publishing their job trace output to the GitLab API service.
The rise in error rate corresponds with a deploy to the canary stage of production.
For reference, the canary stage also shows a similar rise in API error rate:
Kibana: Details of the decryption error log events
When an OpenSSL::Cipher::CipherError exception occurs, we often get multiple log events: one for the HTTP request and another with supplemental data such as the CI job's build_id, pipeline_id, and project_id. These log events are linked by sharing the same correlation_id. In the following notes, we mostly focus on the log events that contain the HTTP-related metadata, but for completeness, at the end we also show the other set of log events that have no HTTP fields but do have CI job metadata in the json.extra.* fields.
Here is the overall timeline of API HTTP calls that failed with the distinctive OpenSSL::Cipher::CipherError exceptions.
99% of these are from CI jobs repeatedly trying to send their job output to the /api/:version/jobs/:id/trace endpoint.
https://log.gprd.gitlab.net/goto/5dc050a06b2a618f7667c7e5ca6f774f
Excluding the job trace errors (/api/:version/jobs/:id/trace), most of the remaining 1% of failed API calls were CI jobs updating their status via the /api/:version/jobs/:id endpoint.
All remaining API errors excluding job trace uploads:
https://log.gprd.gitlab.net/goto/334088751d9d44b869e170c8fab47b90
Most of the above are job status updates:
https://log.gprd.gitlab.net/goto/a4f532f30653e605ad0c70aad63725fb
Excluding all of the CI job API calls, the handful of remaining API calls were for project and group endpoints.
https://log.gprd.gitlab.net/goto/a4f532f30653e605ad0c70aad63725fb
The above examples all focus on the log events containing fields describing the HTTP request and response.
Next we look at another group of log events that have no fields directly related to HTTP requests. Many of these log events give information about the CI job that issued the HTTP requests we examined above, but others contain no CI job metadata either.
Here are all of the log events that lacked HTTP-related fields:
https://log.gprd.gitlab.net/goto/642a08ab2985abde4ece5dbd6f4cc237
Counting these log events by the field for CI job id, we find that many of them are missing that field.
https://log.gprd.gitlab.net/goto/d4c7c0b989dd8e07feb0255b4ad83ca2
https://log.gprd.gitlab.net/goto/9b87727d0d43ad3b726cc1bc6495acb4
For the log events that do specify a CI job id (json.extra.build_id), it seems that by 01:30 UTC we had stopped accumulating newly affected CI jobs.
https://log.gprd.gitlab.net/goto/c6334327eca32cd37116daf8857a60ae
How many distinct CI jobs logged CipherError exceptions?
https://log.gprd.gitlab.net/goto/9eacd57f302b7108113d0eb2750f785a
How many distinct projects logged CipherError exceptions for CI jobs?
https://log.gprd.gitlab.net/goto/33d7c6b70f01014885a6a491202c722f
So why do we have so many exception events that have neither HTTP nor CI job metadata?
My guess is that the exception was thrown too early in the event handling to have that extra metadata available. As supporting (but not definitive) evidence, we can see that prior to the point in time when we started deploying a new release candidate, all of the exceptions with no json.extra.* fields came from the older code in the main stage, whereas the exceptions that did include json.extra.* fields came from the cny stage.
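As a toy illustration of that hypothesis (the handler, field names, and lookup below are invented for the example and are not GitLab's request pipeline): if the token is decrypted before the CI job record is looked up, an exception raised at that point is logged before any json.extra.* context exists.

```ruby
require "json"
require "openssl"

# Toy illustration of the hypothesis above, not GitLab's code: CI job
# metadata is only added to the log context after the job lookup succeeds,
# so an exception raised while decrypting the token is logged with an
# empty extra hash.
def handle_job_request(token, lookup_job)
  extra = {}                               # context accumulated as we go
  job = lookup_job.call(token)             # token decryption happens in here
  extra[:build_id]   = job[:build_id]
  extra[:project_id] = job[:project_id]
  { status: 202 }
rescue StandardError => e
  puts JSON.generate("exception.class" => e.class.name, "extra" => extra)
  { status: 500 }
end

# A lookup that fails mid-decryption produces a log event with no extra.*
# fields, matching the metadata-less CipherError events described above.
failing_lookup = ->(_token) { raise OpenSSL::Cipher::CipherError, "bad decrypt" }
handle_job_request("opaque-token", failing_lookup)
# prints: {"exception.class":"OpenSSL::Cipher::CipherError","extra":{}}
```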
For visual comparison, the following pair of graphs shows the number of exception log events that lack both json.extra.* fields and HTTP-related fields. The first graph shows the main stage and the second shows the canary stage. Canary only started lacking the json.extra.* fields around 2021-01-28 10:00 UTC.
Main stage:
https://log.gprd.gitlab.net/goto/020c1589c066cdedd3567969da8e32b3
Canary stage:
https://log.gprd.gitlab.net/goto/08af5a27438a71eae1fdd26d620dc8c2
We can pick a few examples and look for other log events that share their correlation_id, to gain extra context. These consistently matched corresponding HTTP log events for the job trace endpoint. Since that was by far the most common endpoint in the above review of HTTP exceptions, it is possible that other endpoints also have correlated extra log events that add little additional information.
https://log.gprd.gitlab.net/goto/edc73f886d4f770378df08115ab74f2f
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Time to detection:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - ...
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - ...
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
What were the root causes?
Incident Response Analysis
- How was the incident detected?
  - ...
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
Lessons Learned
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)