Token rotation and revocation code change caused DB lock contention and performance degradation
Incident Review
On July 14th at 20:28 UTC, a Postgres transaction timeout alert was raised, resulting in a severity 3 incident. The investigation identified requests to /jwt/auth as the primary contributors to the timed-out queries, and correlated them with a large number of personal_access_token revocation UPDATE queries suffering row-level lock contention. The code change that introduced this query had been deployed recently (approx. 19:49 UTC, July 14th), and it was proposed to either revert it or place it behind a feature flag, given the risk of it growing into a more serious DDoS issue over the weekend. At 11:30 UTC Release Management indicated that the incident should be raised to severity 2 in order to deploy a change during the change-lock period. An MR was created and scheduled for the subsequent RM deployment at 00:15 UTC.
At 00:40 UTC on July 15th, another backend engineer reviewing the change identified a potential scoping issue within the revocation query and the risk of it revoking far more tokens than intended. This led to expediting the feature-gating MR via a hotpatch, and to a review of the delayed DB replica to identify the impact of the revocations.
Due to the lock contention, a dip in Web apdex was also observed, so UPDATE queries that had already timed out but were still holding the lock were purged at regular intervals to provide relief to users hitting slow responses. This work also impacted the ability to deploy the hotpatch. At 02:40 UTC the incident was elevated to severity 1 due to the risk that data may have been lost, plus the clearly visible impact on gitlab.com.
At 02:53 UTC release managers were available to roll back the last deployment, and the rollback was completed at 03:15 UTC. Full availability was restored, and the remainder of the time was spent verifying that no tokens had been incorrectly revoked by the offending query. This was done by validating row addresses, update timestamps, and the total number of tokens in each state between the delayed replica and the primary DB.
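As a rough illustration of that verification step, assuming direct SQL access to both the delayed replica and the primary (the exact queries used during the incident are not captured here, and the time window below is an example):

```sql
-- Token counts per state; run on the delayed replica and on the primary, then diff.
SELECT revoked, count(*) AS tokens
FROM personal_access_tokens
GROUP BY revoked;

-- Tokens marked revoked during the incident window, for row-by-row comparison.
SELECT id, updated_at
FROM personal_access_tokens
WHERE revoked = TRUE
  AND updated_at BETWEEN '2023-07-14 19:49' AND '2023-07-15 03:15'
ORDER BY id;
```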
The DRI for the incident review is the issue assignee.

- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Timeline
From incoming EOC

Jul-14-2023

- 19:49 - gprd deployment introducing the regression started: 16.2.202307141500-dd83dffcf1c.4170f54b059
- 20:10 - increase in row-level lock contention on the DB
- 20:24 - gprd deployment completed
- 20:28 - Postgres transactions showing a high rate of statement timeouts; incident auto-opened
- 21:03 - WebServiceLoadbalancerErrorSLOViolation with error rate 0.1733% alerted EOC
- 21:08 - incident declared
Jul-15-2023

- 00:15 - MR to place the path behind a feature flag prepared 👉 gitlab-org/gitlab!126600 (merged); severity increased to severity 2
- 00:33 - risk of data loss identified
- 00:55 - user-facing errors increase
- 01:13 - initial status page incident created
- 01:19 - user-facing errors stop
- 01:32 - hotpatch initiated
- 01:45 - first Status.io update
- 02:11 - user-facing errors increase
- 02:19 - user-facing errors stop
- 02:32 - user-facing errors increase
- 02:44 - severity increased to severity 1 on account of performance degradation and data risk
- 02:46 - user-facing errors stop
- 02:49 - canary is drained: /chatops run canary --disable --production
- 02:53 - deploy rollback started 👉 https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/2115906
- 03:01 - Web service apdex recovers
- 03:15 - deploy rollback is finished
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  External customers and internal users.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  Users saw reduced responsiveness from the GitLab UI, pipeline failures, and intermittent 503s.
- How many customers were affected?
  We are still collecting the scope of users affected; current estimates are that 600-800 users saw a 5XX error during a 25-minute period.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  See above; in addition to the users who saw a distinct error, an estimated 52,472 users saw delayed responses to their requests.
What were the root causes?
- The root cause is attributed to the recursive CTE not being applied to the update_all call used during revocation. As a result, the query attempted to revoke a far greater number of tokens (~4 million). Being a transactional operation, it timed out before completing (and was therefore a no-op, i.e. no tokens were actually revoked).
```ruby
# Builds a recursive CTE that collects the "family" of a token: the tokens
# rotated from it, plus the tokens rotated from those, and so on.
def pat_family(token)
  # rubocop: disable CodeReuse/ActiveRecord
  cte = Gitlab::SQL::RecursiveCTE.new(:personal_access_tokens_cte)
  personal_access_token_table = Arel::Table.new(:personal_access_tokens)

  cte << PersonalAccessToken
    .where(personal_access_token_table[:previous_personal_access_token_id].eq(token.id))
  cte << PersonalAccessToken
    .from([personal_access_token_table, cte.table])
    .where(personal_access_token_table[:previous_personal_access_token_id].eq(cte.table[:id]))

  PersonalAccessToken.with.recursive(cte.to_arel).from(cte.alias_to(personal_access_token_table))
  # rubocop: enable CodeReuse/ActiveRecord
end

token = PersonalAccessToken.last
pat_family(token).update_all(revoked: true)
# Generated SQL - note the CTE is absent, so the UPDATE is not limited to the token family:
# UPDATE "personal_access_tokens" SET "revoked" = TRUE WHERE ("personal_access_tokens"."revoked" = FALSE OR "personal_access_tokens"."revoked" IS NULL) AND NOT (("personal_access_tokens"."revoked" = FALSE OR "personal_access_tokens"."revoked" IS NULL) AND "personal_access_tokens"."expires_at" IS NOT NULL AND "personal_access_tokens"."expires_at" <= '2023-07-14')
```
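For contrast, a rough sketch (not the deployed fix) of the UPDATE the code intended to produce, with the recursive CTE limiting the statement to the token's family rather than the whole table; the example token id and the exact SQL shape are assumptions for illustration:

```sql
WITH RECURSIVE personal_access_tokens_cte AS (
    -- tokens rotated directly from the given token (example id 42)
    SELECT id
    FROM personal_access_tokens
    WHERE previous_personal_access_token_id = 42
  UNION
    -- plus tokens rotated from any token already in the family
    SELECT pat.id
    FROM personal_access_tokens pat
    JOIN personal_access_tokens_cte cte
      ON pat.previous_personal_access_token_id = cte.id
)
UPDATE personal_access_tokens
SET revoked = TRUE
WHERE id IN (SELECT id FROM personal_access_tokens_cte);
```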
Incident Response Analysis
- How was the incident detected?
  The incident was detected via alerts on transaction timeouts: #16041 (closed)
- How could detection time be improved?
  Detection time for the alert was quick. It is a severity 3 alert that jumped from 1.4 timeouts/s to 8 timeouts/s; while that is a large jump, it is not on its own treated as an immediate concern.
- How was the root cause diagnosed?
  The root cause was diagnosed by first looking at the endpoints with the lion's share of the timed-out statements, i.e. /jwt/auth. This was then correlated with bursts of row-level lock contention on UPDATE statements against the personal_access_tokens table, which led us to review the recently deployed change to token revocation. As part of the revert review, a backend engineer identified the root cause within the query: the CTE was not being taken into account by the update_all operation, so the operation ran against the whole personal_access_tokens table.
- How could time to diagnosis be improved?
  There were two primary levels of diagnosis in this incident. The first was identifying the offending change, which took 47 minutes from the declaration of the severity 3 incident. The second was diagnosing that the CTE was not being included, which took 3.5 hours from the start of the incident; the focus at that point was on mitigating the incident itself, so less time was spent diagnosing the query. For the purposes of this review we focus on the time to identify the offending change; the code-level root cause will be remedied via corrective actions.
- How did we reach the point where we knew how to mitigate the impact?
  Through a manual review of the lock contention in https://log.gprd.gitlab.net/goto/5f7c0670-2291-11ee-8580-8bba22f9f3dd for the following query:

  ```sql
  UPDATE "personal_access_tokens" SET "revoked" = $1 WHERE ("personal_access_tokens"."revoked" = $2 OR "personal_access_tokens"."revoked" IS NULL) AND NOT (("personal_access_tokens"."revoked" = $3 OR "personal_access_tokens"."revoked" IS NULL) AND "personal_access_tokens"."expires_at" IS NOT NULL AND "personal_access_tokens"."expires_at" <= $4)
  ```

  For example:

  ```sql
  UPDATE "personal_access_tokens" SET "revoked" = TRUE WHERE ("personal_access_tokens"."revoked" = FALSE OR "personal_access_tokens"."revoked" IS NULL) AND NOT (("personal_access_tokens"."revoked" = FALSE OR "personal_access_tokens"."revoked" IS NULL) AND "personal_access_tokens"."expires_at" IS NOT NULL AND "personal_access_tokens"."expires_at" <= '2023-07-14')
  ```

  From this we knew that the mitigation would require reverting the change or placing it behind a feature flag.
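  For future reference, one generic way to surface the same blocking relationships directly from Postgres rather than from the logs (an illustrative query, not the exact search used during the incident):

  ```sql
  -- Sessions currently waiting on a lock, together with the sessions blocking them.
  SELECT blocked.pid                  AS blocked_pid,
         blocked.query                AS blocked_query,
         blocker.pid                  AS blocking_pid,
         blocker.query                AS blocking_query,
         now() - blocked.query_start  AS waiting_for
  FROM pg_stat_activity AS blocked
  JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
  JOIN pg_stat_activity AS blocker ON blocker.pid = b.pid
  ORDER BY waiting_for DESC;
  ```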
- How could time to mitigation be improved?
  The primary impact of this incident came from the time spent on mitigation. This could have been expedited by:
  - The use of feature flags on any key database operations (in particular finder operations).
  - Release guidance on the options available to deploy a change faster, e.g. a rollback, which was only considered much later in the investigation. This would also need a release-manager on-call process for approvals.
  - Improving documentation and EOC familiarity with the hot patch process.
  - Ensuring the hot patcher works (currently broken - see corrective action: delivery#19482 (closed)).
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  We are investigating similar incidents, though we have had regular query timeout events due to slow queries.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  There aren't any ~"group::authentication and authorization" items in the backlog that would have directly reduced the impact of this incident.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  This was triggered by a code change for the issue: Support Automatic Reuse Detection within the to... (gitlab-org/gitlab#395352 - closed)
What went well?
- The EOC was able to fairly quickly isolate the endpoints and the query that was causing lock contention.
- We were also able to quickly identify the code change that introduced the offending query.
- The EOC had expertise in using the Postgres CLI to purge locking operations and in comparing and validating data against the delayed replica DB (see the sketch after this list). They also stayed well past their shift to assist with the incident mitigation; there are action items to prevent this from being necessary in the future, but their effort greatly helped mitigate the incident faster.
- Similarly, backend engineers from Auth and other teams were able to assist with feature-gating the change and with further debugging of the root cause within the query.
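On the purging of stuck locking operations mentioned above, a rough sketch of the kind of statement this can involve (illustrative only; the exact commands used during the incident are not captured here, and the runtime threshold is an example):

```sql
-- Cancel UPDATEs against personal_access_tokens that have been running long enough
-- to be past their statement timeout and are likely only holding row locks.
SELECT pid,
       now() - query_start AS runtime,
       pg_cancel_backend(pid) AS cancelled  -- pg_terminate_backend(pid) if cancelling is not enough
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE 'UPDATE "personal_access_tokens"%'
  AND now() - query_start > interval '15 seconds';
```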