Token rotation and revocation code change caused DB lock contention and performance degradation
Incident Review
On July 14th at 20:28 UTC, a Postgres transaction timeout alert was raised, resulting in a severity 3 incident. The investigation identified requests to /jwt/auth as the primary contributors to the timed-out queries, and correlated them with a large number of personal_access_token revocation UPDATE queries suffering row-level lock contention. The code change that introduced this query had been deployed recently (approx. 19:49 UTC, July 14th), and it was proposed to either revert it or place it behind a feature flag, given the risk of it growing into a more serious DDoS issue over the weekend. At 11:30 UTC Release Management indicated that the incident should be raised to severity 2 in order to deploy a change during the change-lock period. An MR was created and scheduled for the subsequent RM deployment at 00:15 UTC.
At 00:40 UTC on July 15th, another backend engineer reviewing the change identified a potential scoping issue within the revocation query and the risk of it revoking far more tokens than intended. This led to expediting the feature-gating MR via a hotpatch, and to a review of the delayed DB replica to identify the impact of the revocations.
Due to the lock contention, a dip in Web apdex was also observed, so UPDATE queries that had already timed out but were still holding the lock were purged at regular intervals to provide relief to users hitting slow responses. This work also impacted the ability to deploy the hotpatch. At 02:40 UTC the incident was elevated to severity 1 due to the risk that data may have been lost, plus the clearly visible impact on gitlab.com.
At 02:53 UTC release managers were available to roll back the last deployment, and the rollback was completed at 03:15 UTC. Full availability was restored, and the remainder of the time was spent verifying that no tokens had been incorrectly revoked by the offending query. This was done by validating row addresses, update timestamps, and the total number of tokens in each state between the delayed replica and the primary DB.
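As a rough illustration of that verification step, assuming direct SQL access to both the delayed replica and the primary (the exact queries used during the incident are not captured here, and the time window below is an example):

```sql
-- Token counts per state; run on the delayed replica and on the primary, then diff.
SELECT revoked, count(*) AS tokens
FROM personal_access_tokens
GROUP BY revoked;

-- Tokens marked revoked during the incident window, for row-by-row comparison.
SELECT id, updated_at
FROM personal_access_tokens
WHERE revoked = TRUE
  AND updated_at BETWEEN '2023-07-14 19:49' AND '2023-07-15 03:15'
ORDER BY id;
```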
The DRI for the incident review is the issue assignee.

- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Timeline
From incoming EOC

Jul-14-2023

- 19:49 - gprd deployment introducing the regression started: 16.2.202307141500-dd83dffcf1c.4170f54b059
- 20:10 - increase in row-level lock contention on the DB
- 20:24 - gprd deployment completed
- 20:28 - Postgres transactions showing a high rate of statement timeouts; incident auto-opened
- 21:03 - WebServiceLoadbalancerErrorSLOViolation with error rate 0.1733% alerted EOC
- 21:08 - incident declared
Jul-15-2023

- 00:15 - MR to place the path behind a feature flag prepared 👉 gitlab-org/gitlab!126600 (merged); severity increased to severity 2
- 00:33 - risk of data loss identified
- 00:55 - user-facing errors increase
- 01:13 - initial status page incident created
- 01:19 - user-facing errors stop
- 01:32 - hotpatch initiated
- 01:45 - first Status.io update
- 02:11 - user-facing errors increase
- 02:19 - user-facing errors stop
- 02:32 - user-facing errors increase
- 02:44 - severity increased to severity 1 on account of performance degradation and data risk
- 02:46 - user-facing errors stop
- 02:49 - canary is drained: /chatops run canary --disable --production
- 02:53 - deploy rollback started 👉 https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/2115906
- 03:01 - Web service apdex recovers
- 03:15 - deploy rollback is finished
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  External customers and internal users.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  Users saw reduced responsiveness from the GitLab UI, pipeline failures, and intermittent 503s.
- How many customers were affected?
  We are still collecting the scope of users affected; current estimates are that 600-800 users saw a 5XX error during a 25-minute period.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  See above; in addition to the users who saw a distinct error, an estimated 52,472 users saw delayed responses to their requests.
What were the root causes?
- The root cause is attributed to the recursive CTE not being applied to the update_all call used during revocation. As a result, the query attempted to revoke a far greater number of tokens (~4 million). Being a transactional operation, it timed out before completing (and was therefore a no-op, i.e. no tokens were actually revoked).
```ruby
# Builds a recursive CTE that collects the "family" of a token: the tokens
# rotated from it, plus the tokens rotated from those, and so on.
def pat_family(token)
  # rubocop: disable CodeReuse/ActiveRecord
  cte = Gitlab::SQL::RecursiveCTE.new(:personal_access_tokens_cte)
  personal_access_token_table = Arel::Table.new(:personal_access_tokens)

  cte << PersonalAccessToken
    .where(personal_access_token_table[:previous_personal_access_token_id].eq(token.id))
  cte << PersonalAccessToken
    .from([personal_access_token_table, cte.table])
    .where(personal_access_token_table[:previous_personal_access_token_id].eq(cte.table[:id]))

  PersonalAccessToken.with.recursive(cte.to_arel).from(cte.alias_to(personal_access_token_table))
  # rubocop: enable CodeReuse/ActiveRecord
end

token = PersonalAccessToken.last
pat_family(token).update_all(revoked: true)
# Generated SQL - note the CTE is absent, so the UPDATE is not limited to the token family:
# UPDATE "personal_access_tokens" SET "revoked" = TRUE WHERE ("personal_access_tokens"."revoked" = FALSE OR "personal_access_tokens"."revoked" IS NULL) AND NOT (("personal_access_tokens"."revoked" = FALSE OR "personal_access_tokens"."revoked" IS NULL) AND "personal_access_tokens"."expires_at" IS NOT NULL AND "personal_access_tokens"."expires_at" <= '2023-07-14')
```
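For contrast, a rough sketch (not the deployed fix) of the UPDATE the code intended to produce, with the recursive CTE limiting the statement to the token's family rather than the whole table; the example token id and the exact SQL shape are assumptions for illustration:

```sql
WITH RECURSIVE personal_access_tokens_cte AS (
    -- tokens rotated directly from the given token (example id 42)
    SELECT id
    FROM personal_access_tokens
    WHERE previous_personal_access_token_id = 42
  UNION
    -- plus tokens rotated from any token already in the family
    SELECT pat.id
    FROM personal_access_tokens pat
    JOIN personal_access_tokens_cte cte
      ON pat.previous_personal_access_token_id = cte.id
)
UPDATE personal_access_tokens
SET revoked = TRUE
WHERE id IN (SELECT id FROM personal_access_tokens_cte);
```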
Incident Response Analysis
- How was the incident detected?
  The incident was detected via alerts on transaction timeouts: #16041 (closed)
- How could detection time be improved?
  Detection time for the alert was quick. It is a severity 3 alert that jumped from 1.4 timeouts/s to 8 timeouts/s; while that is a large jump, it is not on its own treated as an immediate concern.
- How was the root cause diagnosed?
  The root cause was diagnosed by first looking at the endpoints with the lion's share of the timed-out statements, i.e. /jwt/auth. This was then correlated with bursts of row-level lock contention on UPDATE statements against the personal_access_tokens table, which led us to review the recently deployed change to token revocation. As part of the revert review, a backend engineer identified the root cause within the query: the CTE was not being taken into account by the update_all operation, so the operation ran against the whole personal_access_tokens table.
- How could time to diagnosis be improved?
  There were two primary levels of diagnosis in this incident. The first was identifying the offending change, which took 47 minutes from the declaration of the severity 3 incident. The second was diagnosing that the CTE was not being included, which took 3.5 hours from the start of the incident; the focus at that point was on mitigating the incident itself, so less time was spent diagnosing the query. For the purposes of this review we focus on the time to identify the offending change; the code-level root cause will be remedied via corrective actions.
- How did we reach the point where we knew how to mitigate the impact?
  Through a manual review of the lock contention in https://log.gprd.gitlab.net/goto/5f7c0670-2291-11ee-8580-8bba22f9f3dd for the following query:

  ```sql
  UPDATE "personal_access_tokens" SET "revoked" = $1 WHERE ("personal_access_tokens"."revoked" = $2 OR "personal_access_tokens"."revoked" IS NULL) AND NOT (("personal_access_tokens"."revoked" = $3 OR "personal_access_tokens"."revoked" IS NULL) AND "personal_access_tokens"."expires_at" IS NOT NULL AND "personal_access_tokens"."expires_at" <= $4)
  ```

  For example:

  ```sql
  UPDATE "personal_access_tokens" SET "revoked" = TRUE WHERE ("personal_access_tokens"."revoked" = FALSE OR "personal_access_tokens"."revoked" IS NULL) AND NOT (("personal_access_tokens"."revoked" = FALSE OR "personal_access_tokens"."revoked" IS NULL) AND "personal_access_tokens"."expires_at" IS NOT NULL AND "personal_access_tokens"."expires_at" <= '2023-07-14')
  ```

  From this we knew that the mitigation would require reverting the change or placing it behind a feature flag.
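  For future reference, one generic way to surface the same blocking relationships directly from Postgres rather than from the logs (an illustrative query, not the exact search used during the incident):

  ```sql
  -- Sessions currently waiting on a lock, together with the sessions blocking them.
  SELECT blocked.pid                  AS blocked_pid,
         blocked.query                AS blocked_query,
         blocker.pid                  AS blocking_pid,
         blocker.query                AS blocking_query,
         now() - blocked.query_start  AS waiting_for
  FROM pg_stat_activity AS blocked
  JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
  JOIN pg_stat_activity AS blocker ON blocker.pid = b.pid
  ORDER BY waiting_for DESC;
  ```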
- How could time to mitigation be improved?
  The primary impact of this incident came from the time spent on mitigation. This could have been expedited by:
  - The use of feature flags on any key database operations (in particular finder operations).
  - Release guidance on the options available to deploy a change faster, e.g. a rollback, which was only considered much later in the investigation. This would also need a release-manager on-call process for approvals.
  - Improving documentation and EOC familiarity with the hot patch process.
  - Ensuring the hot patcher works (currently broken - see corrective action: delivery#19482 (closed)).
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  We are investigating similar incidents, though we have had regular query timeout events due to slow queries.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  There aren't any ~"group::authentication and authorization" items in the backlog that would have directly reduced the impact of this incident.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  This was triggered by a code change for the issue: Support Automatic Reuse Detection within the to... (gitlab-org/gitlab#395352 - closed)
What went well?
- The EOC was able to fairly quickly isolate the endpoints and the query that was causing lock contention.
- We were also able to quickly identify the code change that introduced the offending query.
- The EOC had expertise in using the Postgres CLI to purge locking operations and in comparing and validating data against the delayed replica DB (see the sketch after this list). They also stayed well past their shift to assist with the incident mitigation; there are action items to prevent this from being necessary in the future, but their effort greatly helped mitigate the incident faster.
- Similarly, backend engineers from Auth and other teams were able to assist with feature-gating the change and with further debugging of the root cause within the query.
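On the purging of stuck locking operations mentioned above, a rough sketch of the kind of statement this can involve (illustrative only; the exact commands used during the incident are not captured here, and the runtime threshold is an example):

```sql
-- Cancel UPDATEs against personal_access_tokens that have been running long enough
-- to be past their statement timeout and are likely only holding row locks.
SELECT pid,
       now() - query_start AS runtime,
       pg_cancel_backend(pid) AS cancelled  -- pg_terminate_backend(pid) if cancelling is not enough
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE 'UPDATE "personal_access_tokens"%'
  AND now() - query_start > interval '15 seconds';
```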