2022-08-23: locked projects warning showing unexpectedly
Incident Roles
The DRI for this incident is the incident issue assignee, see roles and responsibilities.
Roles when the incident was declared:
- Incident Manager (IMOC): @grzesiek, @splattael (Shadow), @mksionek (Shadow)
- Engineer on-call (EOC): @rehab
Current Status
This incident and all related impacts have been resolved.
During the incident some users may have noticed that the reported size of their repositories increased. These values were later corrected.
During the active impact time, some users were presented with a banner message indicating that they had "locked projects" which had exceeded their repository storage limits.
Summary for CMOC notice / Exec summary:
- Customer Impact: Users were presented with a banner warning if they had any projects in their namespace that exceeded the repository storage limit. Operations on these specific projects also failed because the projects were locked.
- Service Impact: Service::Gitaly
- Impact Duration: 1329 UTC - 1540 UTC (2 hrs 11 mins)
- Root cause: RootCause::Config-Change
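As a quick check of the impact duration stated above, the following minimal sketch recomputes it from the two reported timestamps. The variable names are illustrative only and are not part of any GitLab tooling.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Impact window (UTC) as reported in the summary above.
impact_start = datetime.strptime("2022-08-23 13:29", FMT)
impact_end = datetime.strptime("2022-08-23 15:40", FMT)

# Split the elapsed time into whole hours and remaining minutes.
duration = impact_end - impact_start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} hrs {minutes} mins")  # prints "2 hrs 11 mins"
```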
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
- Gitlab.com Latest Updates
All times UTC.
2022-08-23
- 04:54 UTC - projects_build_artifacts_size_refresh FF set to true.
- 13:29 UTC - gitaly_catfile_repo_size feature flag set to true 👉 https://gitlab.com/gitlab-com/gl-infra/feature-flag-log/-/issues/13379
- 13:52 UTC - First report of the issue shared in #production (internal link).
- 14:07 UTC - Rehab declares incident in Slack.
- 14:14 UTC - Severity raised to severity2
- 14:30 UTC - Grzegorz joined as IMOC.
- 14:31 UTC - Severity raised to severity1
- 14:35 UTC - Merge request that potentially introduced the change in how we calculate project storage size has been identified.
- 14:37 UTC - Decision made to turn off the projects_build_artifacts_size_refresh feature flag.
- 14:35 UTC - Communication through the gitlabstatus Twitter account started.
- 14:39 UTC - projects_build_artifacts_size_refresh FF disabled.
- 14:40 UTC - Grzegorz reached out for help in pipeline insights / authoring Slack channels.
- 14:45 UTC - SMEs from Verify area provided feedback about the artifacts refresh size worker.
- 15:07 UTC - Discussion about potential mitigation started.
- 15:07 UTC - Grzegorz to Steve IMOC handover.
- 15:17 UTC - Disabled the gitaly_catfile_repo_size feature flag.
- 15:40 UTC - Implemented temporary limit change as a mitigation.
- 20:50 UTC - completed reset of gitaly repo storage values
- 20:59 UTC - reverted limit mitigation.
2022-08-24
- 17:51 UTC - projects_build_artifacts_size_refresh was set to true (after it was determined to be unrelated to the cause of the incident).
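To put the timeline in terms of response intervals, the sketch below measures how long after the 13:29 UTC gitaly_catfile_repo_size flag change each key milestone occurred. The timestamps are copied from the timeline; the milestone labels and variable names are illustrative choices, not official incident metrics.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) copied from the timeline above.
flag_enabled = datetime.strptime("2022-08-23 13:29", FMT)
milestones = {
    "first report": "2022-08-23 13:52",
    "incident declared": "2022-08-23 14:07",
    "flag disabled": "2022-08-23 15:17",
    "limit mitigation in place": "2022-08-23 15:40",
}

# Whole minutes elapsed between the flag change and each milestone.
elapsed_minutes = {
    name: int((datetime.strptime(ts, FMT) - flag_enabled).total_seconds() // 60)
    for name, ts in milestones.items()
}

for name, minutes in elapsed_minutes.items():
    print(f"{name}: {minutes} min after the flag change")
```

Under these timestamps, the first user report arrived 23 minutes after the flag change, and the flag was disabled 108 minutes after it.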
Takeaways
- Production FF status can be altered from channels outside of #production.
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated. Ensure that all corrective actions mentioned in the notes below are included.
- https://gitlab.com/gitlab-com/Product/-/issues/4755
- https://gitlab.com/gitlab-org/gitlab/-/issues/371671
- https://gitlab.com/gitlab-org/gitlab/-/issues/371673
- https://gitlab.com/gitlab-org/gitlab/-/issues/371674
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16254
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16261
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary.
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Action" section.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External and internal customers and users of gitlab.com.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Users were locked out of some of their repositories; this included being unable to push commits or perform other write actions.
- How many customers were affected?
  - 81,236 projects were affected.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
What were the root causes?
- Logic update to the way repo size is calculated.
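The mechanism described in this report (a reported repository size above the storage limit locks the project) can be illustrated with a minimal sketch. This is not GitLab's implementation: the `Project` class, the `is_locked` function, and all sizes and limits below are hypothetical and chosen only to show how a change in the size calculation can lock a project whose contents did not change, and why temporarily raising the limit works as a mitigation.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class Project:
    name: str
    repository_size_bytes: int  # size as reported by the size calculation


def is_locked(project: Project, limit_bytes: int) -> bool:
    """Hypothetical check: a project is locked once its reported size exceeds the limit."""
    return project.repository_size_bytes > limit_bytes


limit = 10 * GIB  # hypothetical storage limit

# Same repository, measured with two different (hypothetical) calculations.
old_calculation = Project("example", 8 * GIB)
new_calculation = Project("example", 12 * GIB)

locked_before = is_locked(old_calculation, limit)      # under the limit
locked_after = is_locked(new_calculation, limit)       # over the limit, so locked
locked_mitigated = is_locked(new_calculation, 20 * GIB)  # temporarily raised limit

print(locked_before, locked_after, locked_mitigated)
```

In this sketch the repository contents never change; only the reported size does, which matches the observation that the reported sizes increased during the incident and were later corrected.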
Incident Response Analysis
- How was the incident detected?
  - User reports of banner warning.
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - Looking at recent FF toggles.
- How could time to diagnosis be improved?
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No (at least none that I'm aware of)
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Yes, see #corrective-actions
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)