2022-02-17: File uploads fail with a 404 response when using Replace method
Incident DRI
Current Status
When the feature flag :refactor_blob_viewer was enabled on Feb 14th, a bug in the application code was exposed that prevented users from using the "Replace file" feature. The first support ticket was submitted on Feb 16th at 10:47 UTC, and two more support tickets followed.
While this is not a very popular feature of the application, it resulted in several thousand failed requests over 7 days and 3 support tickets.
Customers who believe they are affected by this incident should subscribe to this issue or monitor our status page for further updates.
Summary for CMOC notice / Exec summary:
- Customer Impact: Users were not able to replace files using the "Replace file" feature because it was sending the request to the wrong URL.
- Service Impact: Service::Web
- Impact Duration: 2 days 10 hours 59 minutes (3539 minutes)
- Root cause: Rollout of refactor_blob_viewer
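The impact duration can be cross-checked against the timeline. The sketch below assumes the impact window runs from the moment the feature flag was set to true on production (2022-02-14 15:33 UTC) to the moment the incident was marked mitigated (2022-02-17 02:32 UTC); the helper name `impact_duration` is illustrative only.

```python
from datetime import datetime, timezone

# Timestamps taken from the incident timeline (all times UTC).
flag_enabled = datetime(2022, 2, 14, 15, 33, tzinfo=timezone.utc)
mitigated = datetime(2022, 2, 17, 2, 32, tzinfo=timezone.utc)


def impact_duration(start, end):
    """Return (days, hours, minutes, total_minutes) between two datetimes."""
    total_minutes = int((end - start).total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes, total_minutes


days, hours, minutes, total = impact_duration(flag_enabled, mitigated)
print(f"{days} days {hours} hours {minutes} minutes ({total} minutes)")
# prints "2 days 10 hours 59 minutes (3539 minutes)"
```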
Timeline
-- https://log.gprd.gitlab.net/goto/6e998750-9033-11ec-9dd2-93d354bef8e7
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
All times UTC.
2022-02-14
- 15:33 - refactor_blob_viewer feature flag set to true on production
2022-02-16
- 10:47 - Customer ticket raised #269040
- 13:33 - Customer ticket raised #269089
- 20:28 - Customer ticket raised #269225
2022-02-17
- 01:48 - @jamesreed declares incident in Slack.
- 02:03 - @cindy raises to Dev Escalation. @engwan responded
- 02:28 - @engwan disables the relevant feature flag refactor_blob_viewer on production
- 02:29 - Status page initial communication of the problem - investigating
- 02:32 - Incident mitigated
- 02:32 - Status page update - incident resolved
Takeaways
- ...
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated. Ensure that all corrective actions mentioned in the notes below are included.
- Create a comprehensive test plan that we can use to manually run through all blob viewers and blob-related functionality - gitlab-org/gitlab#353393 (closed) DRI @mlapierre
- Ensure that user_replaces_files_spec.rb and user_deletes_files_spec.rb are tested with the feature flag enabled - gitlab-org/gitlab#349953 (closed) DRI @jerasmus
- Ensure spec/features/projects/files/user_deletes_files_spec.rb runs with the feature enabled - gitlab-org/gitlab#349953 (closed) DRI @jerasmus
- Ensure spec/features/markdown/copy_as_gfm_spec.rb runs with the feature enabled - gitlab-org/gitlab#350454 (closed) DRI @jerasmus
- Revisit when incidents warrant external communication: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15289
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, laid out in our handbook page. This might include the summary, timeline, or any other bits of information. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Action" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External customers. From Kibana data, we see 586 users potentially affected
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - ...
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - 3,736 replace requests were routed to a 404 response
What were the root causes?
- Rollout of refactor_blob_viewer
- The bug was not caught in QA
- The feature flag is disabled in tests
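The last root cause can be illustrated with a small sketch of why a suite that only runs with a flag disabled cannot catch a regression on the flag-enabled path. Everything here is hypothetical: the function names, the URL shapes, and the simulated bug are stand-ins chosen for illustration and are not taken from GitLab's code or from the actual defect.

```python
# Hypothetical stand-in for flag-gated application code (not GitLab's implementation).
def replace_file_path(project, branch, file_path, refactor_blob_viewer):
    if refactor_blob_viewer:
        # Simulated bug on the new code path: the branch segment is dropped.
        return f"/{project}/-/update/{file_path}"
    return f"/{project}/-/update/{branch}/{file_path}"


# Hypothetical routing table: only the correct path resolves.
KNOWN_ROUTES = {"/group/app/-/update/main/README.md"}


def response_status(path):
    return 200 if path in KNOWN_ROUTES else 404


def run_replace_check(flag_states):
    """Return the status observed for each flag state the test run exercises."""
    return {
        state: response_status(
            replace_file_path("group/app", "main", "README.md", state)
        )
        for state in flag_states
    }


# A suite that only runs with the flag disabled sees no failure.
print(run_replace_check([False]))        # {False: 200}
# Exercising both states surfaces the 404 on the flag-enabled path.
print(run_replace_check([False, True]))  # {False: 200, True: 404}
```

Running the relevant specs with the flag both disabled and enabled, as the corrective actions propose, corresponds to the second call.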
Incident Response Analysis
- How was the incident detected?
  - Internal report from is-this-known channel, and customer tickets
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
What went well?
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)