# 2021-09-03: Users getting 422 errors when trying to log in
## Current Status

We are receiving reports of users receiving a 422 error when attempting to log in to GitLab.com.
## Timeline

View recent production deployment and configuration events / gcp events (internal only)

All times UTC.
2021-09-03

- 04:08 - First support ticket comes in
- 05:40 - Support notices an influx of support tickets beginning to come in and asks in #production
- 05:50 - @manojmj discovers multiple errors in Kibana and identifies the relation to a recently deployed MR
- 05:59 - @ggillies declares incident in Slack
- 06:07 - @manojmj submits a revert MR
- 08:36 - Revert MR is picked into autodeploy
- 15:00 - The revert MR is in production and the incident is resolved
## Corrective Actions

Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Update guidelines for contributors and code reviewers regarding validations
- Create a Rubocop rule for new validations (a possible sketch follows below)
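For the Rubocop corrective action, a cop along the following lines could flag newly added model validations for extra reviewer attention. This is only a sketch: the cop name, namespace, and offense message are hypothetical, and the rule GitLab actually builds may look quite different.

```ruby
# Hypothetical cop: flags validation DSL calls so that reviewers confirm
# existing rows already satisfy any newly added rule before it ships.
module RuboCop
  module Cop
    module Gitlab
      class NewModelValidation < Base
        MSG = 'New model validations can break saves of existing records; ' \
              'confirm existing data passes, or scope the rule with `if:`/`on:`.'

        # Match DSL-style validation calls with an implicit receiver.
        def_node_matcher :validation?, <<~PATTERN
          (send nil? {:validates :validate :validates_each :validates_with} ...)
        PATTERN

        def on_send(node)
          add_offense(node) if validation?(node)
        end
      end
    end
  end
end
```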
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, laid out in our handbook page. This might include the summary, timeline, or any other bits of information. Any such confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
# Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
## Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Users that have an invalid URL in the `website url` field in their profile upon login (including URLs which were not preceded by `https://` or `http://`)
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Users were unable to log in, and no workaround was available to them
- How many customers were affected?
  - We had 2,603 failed login attempts over 2 days: https://log.gprd.gitlab.net/goto/c68d55cfcb29b365ab499345378b368c
  - We don't know whether all of those were unique users attempting to log in, because we don't know the user when the exception occurs. Counting by unique IPs, we see many IPs with multiple attempts, which could of course be many users behind the same NAT IP: https://log.gprd.gitlab.net/goto/cfc796953511e01a18e46291316d7b8f. So we can't tell exactly how many users were affected, but it must have been fewer than 2,603.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
## What were the root causes?
- A new validation was added to the `User` model which would fail for users that already had an invalid website URL in their profile. When such a user tried to log in, `user.save` would fail, resulting in a 422 error (see the sketch below).
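The failure mode is sketched below. The validator shape and the login code path shown here are illustrative assumptions on my part; the actual one-line change is in gitlab-org/gitlab!69436.

```ruby
require 'uri'

class User < ApplicationRecord
  # The new validation runs on every save, not only when website_url changes,
  # so it also rejects invalid values stored before the rule existed.
  validates :website_url,
            format: { with: URI::DEFAULT_PARSER.make_regexp(%w[http https]),
                      message: 'is not a valid URL' },
            allow_blank: true
end

# Login-time bookkeeping persists the user record:
user = User.find_by(email: 'user@example.com')
user.last_sign_in_at = Time.current
user.save! # raises ActiveRecord::RecordInvalid
           # ("Validation failed: Website url is not a valid URL")
           # for a pre-existing value like "example.com" (no scheme),
           # which the controller surfaces as HTTP 422
```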
## Incident Response Analysis
- How was the incident detected?
  - An influx of customer support tickets
- How could detection time be improved?
  - Possibly through alerting on an increased number of login errors, or on unusual login behaviour. According to the timeline, we responded to customer tickets rather than detecting the errors ourselves.
- How was the root cause diagnosed?
  - On checking Kibana for recent 422 errors, I noticed the error message `json.exception.message Validation failed: Website url is not a valid URL` had been reported for quite some time, and I wondered if this could be the root cause of the problems being reported by customers.
  - In one of the customer support tickets raised around "not being able to log in", the customer had attached a screenshot, and I observed that the page showed a validation error of `1 error prohibited the user from being saved: Website url is not a valid url` in the UI.
  - These two observations combined led me to believe that a validation on the `website_url` attribute was indeed the problem.
  - This led me to check whether any new validations had been added to the `website_url` attribute recently.
  - For this, I checked the history of `user.rb` from this page.
  - That led me to the MR gitlab-org/gitlab!69436 (merged), which added the validation recently.
- How could time to diagnosis be improved?
  - @manojmj: We were able to diagnose the problem pretty quickly because we could correlate the error messages from the customer tickets with the error messages we were seeing in Kibana, together with the fact that the MR gitlab-org/gitlab!69436 (merged) had hit production very recently. I cannot really think of anything that could have been done differently to drastically improve the time to diagnosis after the incident happened.
- How did we reach the point where we knew how to mitigate the impact?
  - Since the validation added in gitlab-org/gitlab!69436 (merged) was just a one-liner (plus its corresponding test), the change itself was very small. That meant a revert would also be small and low risk (because all the change did was add a new validation), and we knew that reverting the validation would make the particular error blocking customers' logins go away. That is how we quickly decided on reverting gitlab-org/gitlab!69436 (merged) to mitigate the problem.
- How could time to mitigation be improved?
  - ...
- What went well?
  - We responded quickly to revert the change. There are some good suggestions to mitigate the root cause going forwards: #5473 (comment 670089577). One such pattern is sketched below.
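As an illustration of that kind of mitigation, the sketch below scopes a validation to changes of the attribute, so legacy invalid values no longer block unrelated saves (such as login bookkeeping). This is a common Rails pattern and an assumption on my part, not necessarily the approach proposed in #5473.

```ruby
require 'uri'

class User < ApplicationRecord
  # Validate website_url only when it is being set or changed, so records
  # holding legacy invalid values can still be saved by other code paths.
  # `website_url_changed?` is the standard ActiveModel::Dirty helper.
  validates :website_url,
            format: { with: URI::DEFAULT_PARSER.make_regexp(%w[http https]),
                      message: 'is not a valid URL' },
            allow_blank: true,
            if: :website_url_changed?
end
```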
## Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes, this was a code change to resolve a security issue, worked on as part of the Access Security Burndown initiative. The issue was approved by AppSec to be worked on in the public repo.
## Lessons Learned

- ...
## Guidelines
## Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)