2022-05-24: Receiving 500 error when navigating to Security Configuration on projects
Incident Roles
The DRI for this incident is the incident issue assignee; see roles and responsibilities.
Roles when the incident was declared:
- Incident Manager (IMOC): @donaldcook
- Engineer on-call (EOC): @nnelson
Current Status
Revert MR in question: gitlab-org/gitlab!88459 (merged). The revert has resolved the issue.
Many projects are affected by this, but the scope of this incident does not appear to be widespread on GitLab.com: about 480 errors per hour at peak.
We are seeing reports of 500 errors when visiting the Security Configuration page within projects ([project]/-/security/configuration).
The error message is:
Use `License.feature_available?` for features that cannot be restricted to only a subset of projects or namespaces
Error incidence profile (graph omitted; source: Kibana).
More information will be added as we investigate the issue. For customers believed to be affected by this incident, please subscribe to this issue or monitor our status page for further updates.
Summary for CMOC notice / Exec summary:
- Customer Impact: likely most users
- Service Impact: GitLab Rails
- Impact Duration: 2022-05-23 05:00 UTC to 2022-05-25 03:14 UTC
- Root Cause: Software Change
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
- GitLab.com Latest Updates
All times UTC.
2022-05-24
- 17:54 - @tristan declares incident in Slack.
- 18:20 - We believe the root-cause MR has been found. A revert has been initiated: gitlab-org/gitlab!88459 (merged)
2022-05-25
- 03:00 - Revert has been deployed to production; no instances of the error in the past 35 minutes. Updated to IncidentResolved.
Create related issues
Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:
Takeaways
- ...
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Institute Feature Change Lock for Threat Insights Frontend team to evaluate and identify areas for improvement and future prevention. gitlab-org/gitlab#363602 (closed)
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, which might include the summary, timeline, or other details, as laid out in our handbook page. Any confidential data will be in a linked issue, visible only internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- GitLab.com external customers attempting to use the configuration UI to set up security scanning configuration.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- The page didn't load and displayed a 500 error.
- How many customers were affected?
- Unknown
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
- The scope of this incident does not appear to be widespread on GitLab.com: about 480 errors per hour at peak.
What were the root causes?
- MR gitlab-org/gitlab!86639 (merged) introduced a bug where `project.licensed_feature_available?(:security_training)` was incorrectly called in the GitLab.com environment.
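The failure mode can be sketched with a toy model. This is a hypothetical simplification, not GitLab's actual implementation (real class names and lookup logic differ); it only illustrates why a per-project check on an instance-wide feature raises the error quoted above:

```ruby
# Hypothetical simplification: features registered as instance-wide.
GLOBAL_FEATURES = %i[security_training].freeze

# Instance-wide check: safe for features that apply to the whole instance.
module InstanceLicense
  def self.feature_available?(feature)
    GLOBAL_FEATURES.include?(feature)
  end
end

class Project
  # Per-project check: only valid for features that can be restricted
  # to a subset of projects or namespaces. For instance-wide features
  # it raises, which is what surfaced to users as a 500 error.
  def licensed_feature_available?(feature)
    if GLOBAL_FEATURES.include?(feature)
      raise ArgumentError,
            'Use `License.feature_available?` for features that cannot be ' \
            'restricted to only a subset of projects or namespaces'
    end
    false
  end
end

project = Project.new
begin
  project.licensed_feature_available?(:security_training) # the buggy call path
rescue ArgumentError => e
  puts "Raises: #{e.message}"
end
puts InstanceLicense.feature_available?(:security_training) # the safe call path
```

Under this model, the revert removes the `project.licensed_feature_available?` call so the request no longer hits the raising code path.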
Incident Response Analysis
- How was the incident detected?
- Sentry stack trace error 2022-05-23 11:50pm UTC - bug issue created at gitlab-org/gitlab#363267 (closed)
- End to end test failing 2022-05-24 9:52am UTC - issue created at gitlab-org/gitlab#363313 (closed)
- Customer reported 2022-05-24 6:54pm UTC - incident created
- How could detection time be improved?
- Run E2E tests during merge, and not just as a downstream/scheduled pipeline.
- Broaden audience for Sentry errors.
- How was the root cause diagnosed?
- Timing and the backtrace identified MR gitlab-org/gitlab!86639 (merged) as the related change.
- How could time to diagnosis be improved?
- Time to diagnosis was <5 minutes after incident was created.
- How did we reach the point where we knew how to mitigate the impact?
- The stack trace specifically called out new code (`licensed_feature_available`) added in gitlab-org/gitlab!86639 (merged).
- How could time to mitigation be improved?
- Revert MR gitlab-org/gitlab!88459 (merged) was merged at 2022-05-24 7:24pm UTC and hit production/.com at 2022-05-25 3:14am UTC. Can we shorten this to <8 hours?
- What went well?
- Incident response team and IMOC responded well. A Slack channel and issue were opened quickly after the customer report. The causing MR was identified and reverted quickly.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
- No. This was an isolated issue.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
- No. This was caused by an iterative change that moves license-check logic to the backend.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
- Yes. Issue: gitlab-org/gitlab#358183 (closed), MR: gitlab-org/gitlab!86639 (merged)
What went well?
- We quickly discovered the cause MR. A revert MR was created promptly.
- Following the reversion, the team continued to discuss and determine gaps in fully testing the code change.
- Cross-functional collaboration between Development and Quality identified a shared sentiment that our Staging test window is short and that we are resource-constrained on running full tests earlier in the workflow.
Guidelines
Resources
- If the Situation Zoom room was utilized, the recording will be automatically uploaded to the Incident room Google Drive folder (private)