# 2021-09-03: Users getting 422 errors when trying to log in
## Current Status

We are receiving reports of users receiving a 422 error when attempting to log in to GitLab.com.
## Timeline

View recent production deployment and configuration events / gcp events (internal only)

All times UTC.
2021-09-03

- 04:08 - First support ticket comes in
- 05:40 - Support notices an influx of support tickets beginning to come in and asks in #production
- 05:50 - @manojmj discovers multiple errors in Kibana and identifies the relation to a recently deployed MR
- 05:59 - @ggillies declares incident in Slack
- 06:07 - @manojmj submits a revert MR
- 08:36 - Revert MR is picked into autodeploy
- 15:00 - The revert MR is in production and the incident is resolved
## Corrective Actions

Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Update guidelines for contributors and code reviewers regarding validations
- Create a Rubocop rule for new validations (a possible sketch follows below)
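For the Rubocop corrective action, a cop along the following lines could flag newly added model validations for extra reviewer attention. This is only a sketch: the cop name, namespace, and offense message are hypothetical, and the rule GitLab actually builds may look quite different.

```ruby
# Hypothetical cop: flags validation DSL calls so that reviewers confirm
# existing rows already satisfy any newly added rule before it ships.
module RuboCop
  module Cop
    module Gitlab
      class NewModelValidation < Base
        MSG = 'New model validations can break saves of existing records; ' \
              'confirm existing data passes, or scope the rule with `if:`/`on:`.'

        # Match DSL-style validation calls with an implicit receiver.
        def_node_matcher :validation?, <<~PATTERN
          (send nil? {:validates :validate :validates_each :validates_with} ...)
        PATTERN

        def on_send(node)
          add_offense(node) if validation?(node)
        end
      end
    end
  end
end
```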
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, laid out in our handbook page. This might include the summary, timeline, or any other bits of information. Any such confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
# Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
## Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Users that have an invalid URL in the `website url` field in their profile upon login (including URLs which were not preceded by `https://` or `http://`)
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Users were unable to log in, and no workaround was available to them
- How many customers were affected?
  - We had 2,603 failed login attempts over 2 days: https://log.gprd.gitlab.net/goto/c68d55cfcb29b365ab499345378b368c
  - We don't know whether all of those were unique users attempting to log in, because we don't know the user when the exception occurs. Counting by unique IPs, we see many IPs with multiple attempts, which could of course be many users behind the same NAT IP: https://log.gprd.gitlab.net/goto/cfc796953511e01a18e46291316d7b8f. So we can't tell exactly how many users were affected, but it must have been fewer than 2,603.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
## What were the root causes?
- A new validation was added to the `User` model which would fail for users that already had an invalid website URL in their profile. When such a user tried to log in, `user.save` would fail, resulting in a 422 error (see the sketch below).
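The failure mode is sketched below. The validator shape and the login code path shown here are illustrative assumptions on my part; the actual one-line change is in gitlab-org/gitlab!69436.

```ruby
require 'uri'

class User < ApplicationRecord
  # The new validation runs on every save, not only when website_url changes,
  # so it also rejects invalid values stored before the rule existed.
  validates :website_url,
            format: { with: URI::DEFAULT_PARSER.make_regexp(%w[http https]),
                      message: 'is not a valid URL' },
            allow_blank: true
end

# Login-time bookkeeping persists the user record:
user = User.find_by(email: 'user@example.com')
user.last_sign_in_at = Time.current
user.save! # raises ActiveRecord::RecordInvalid
           # ("Validation failed: Website url is not a valid URL")
           # for a pre-existing value like "example.com" (no scheme),
           # which the controller surfaces as HTTP 422
```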
## Incident Response Analysis
- How was the incident detected?
  - An influx of customer support tickets
- How could detection time be improved?
  - Possibly through alerting on an increased number of login errors, or on unusual login behaviour. According to the timeline, we responded to customer tickets rather than detecting the errors ourselves.
- How was the root cause diagnosed?
  - On checking Kibana for recent 422 errors, I noticed the error message `json.exception.message Validation failed: Website url is not a valid URL` had been reported for quite some time, and I wondered if this could be the root cause of the problems being reported by customers.
  - In one of the customer support tickets raised around "not being able to log in", the customer had attached a screenshot, and I observed that the page showed a validation error of `1 error prohibited the user from being saved: Website url is not a valid url` in the UI.
  - These two observations combined led me to believe that a validation on the `website_url` attribute was indeed the problem.
  - This led me to check whether any new validations had been added to the `website_url` attribute recently.
  - For this, I checked the history of `user.rb` from this page.
  - That led me to the MR gitlab-org/gitlab!69436 (merged), which added the validation recently.
- How could time to diagnosis be improved?
  - @manojmj: We were able to diagnose the problem pretty quickly because we could correlate the error messages from the customer tickets with the error messages we were seeing in Kibana, together with the fact that the MR gitlab-org/gitlab!69436 (merged) had hit production very recently. I cannot really think of anything that could have been done differently to drastically improve the time to diagnosis after the incident happened.
- How did we reach the point where we knew how to mitigate the impact?
  - Since the validation added in gitlab-org/gitlab!69436 (merged) was just a one-liner (plus its corresponding test), the change itself was very small. That meant a revert would also be small and low risk (because all the change did was add a new validation), and we knew that reverting the validation would make the particular error blocking customers' logins go away. That is how we quickly decided on reverting gitlab-org/gitlab!69436 (merged) to mitigate the problem.
- How could time to mitigation be improved?
  - ...
- What went well?
  - We responded quickly to revert the change. There are some good suggestions to mitigate the root cause going forwards: #5473 (comment 670089577). One such pattern is sketched below.
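As an illustration of that kind of mitigation, the sketch below scopes a validation to changes of the attribute, so legacy invalid values no longer block unrelated saves (such as login bookkeeping). This is a common Rails pattern and an assumption on my part, not necessarily the approach proposed in #5473.

```ruby
require 'uri'

class User < ApplicationRecord
  # Validate website_url only when it is being set or changed, so records
  # holding legacy invalid values can still be saved by other code paths.
  # `website_url_changed?` is the standard ActiveModel::Dirty helper.
  validates :website_url,
            format: { with: URI::DEFAULT_PARSER.make_regexp(%w[http https]),
                      message: 'is not a valid URL' },
            allow_blank: true,
            if: :website_url_changed?
end
```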
## Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes, this was a code change to resolve a security issue, worked on as part of the Access Security Burndown initiative. The issue was approved by AppSec to be worked on in the public repo.
## Lessons Learned

- ...
## Guidelines
## Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)