2021-11-15: 404'ing JS file in staging
Current Status
An asset missing in staging is causing most pages to fail to load: https://staging.gitlab.com/assets/webpack/runtime.f038baf3.bundle.js
Revert is waiting to be deployed.
Summary for CMOC notice / Exec summary:
- Customer Impact: Service::Frontend on Staging
- Customer Impact Duration: 15:40 UTC to <end time> (duration in minutes)
- Current state: see the Incident::<state> label
- Root cause: RootCause::Software-Change
Timeline
All times UTC.
2021-11-15
- 15:02 - Package 14.5.202111151320-2153b254bc3.733f0be8b25 being deployed to staging
- 15:49 - First ask from an engineer wondering if the staging environment is okay
- 16:05 - First QA failures against staging begin to emerge: https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/891093
- 16:56 - @bwill declares an incident in Slack
- 17:20 - @skarbek stops the on-going deployment to production; note this was a different package
- 18:12 - Investigation discovers that asset builds are not consistent between our omnibus packaging and container image packaging
- 18:17 - Determined that this is only a problem in the proposed package destined for staging and not the production rollout, therefore the production deploy is allowed to continue
- 20:23 - Suspected commit identified and targeted for rollback
- 20:47 - Revert is reviewed and accepted; this incident marked as mitigated
2021-11-16
- 07:47 - Deployment to gstg succeeds: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/891937
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Deployer should pull a usable tag to deploy assets: delivery#2120 (closed)
- GitLab should create reproducible asset filenames regardless of when they are built: gitlab-org/gitlab#345874 (closed)
- GitLab should improve the documentation for handling frontend dependencies: gitlab-org/gitlab#345875 (closed)
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - GitLab Engineering when using staging.gitlab.com
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Large sections of the UI failed to load
- How many customers were affected?
  - 300ish
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - All calls to the UI of staging.gitlab.com would have been impacted
What were the root causes?
Asset Build
Autodeploy package 14.5.202111151320-2153b254bc3.733f0be8b25 contained some change that caused assets to be built differently between our container image and omnibus package build pipelines. The assets that were compiled with different names:
- sprockets-manifest-<sha>.json
- webpack/commons-pages.subscriptions.buy_minutes-pages.subscriptions.buy_storage.<sha>.chunk.js
- webpack/init_hand_raise_lead_button.<sha>.chunk.js
- webpack/runtime.<sha>.bundle.js
There are at least 4 reasons that would cause assets to be built differently despite the code base being the same:
- Any dependency change
- Any change in a file itself
- Any change in a linked file (module)
- A weak hash function in webpack 4, or a bad dependency that forces a new hash each time we do a build
We suspect the last one is what is biting us hard here.
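For intuition, here is a minimal Python sketch of content-addressed naming. It is not webpack's actual algorithm (webpack hashes chunk graphs, and webpack 4 defaulted to MD4), but it shows why any byte-level difference in the hashed input, whether from a dependency bump or from non-deterministic build output, produces a new filename:

```python
# Minimal illustration (not webpack's real hashing) of why content-addressed
# filenames drift: the digest covers every input byte, so any change in a
# module, a dependency, or non-deterministic compiler output renames the asset.
import hashlib

def hashed_filename(stem: str, content: bytes, ext: str, length: int = 8) -> str:
    digest = hashlib.sha256(content).hexdigest()[:length]
    return f"{stem}.{digest}.{ext}"

build_a = b"console.log('runtime');  // built against dep@1.2.3"
build_b = b"console.log('runtime');  // built against dep@1.2.4"

# Same logical code, one transitive dependency bumped: two different names,
# i.e. runtime.<digest-a>.bundle.js vs runtime.<digest-b>.bundle.js.
print(hashed_filename("runtime", build_a, "bundle.js"))
print(hashed_filename("runtime", build_b, "bundle.js"))
```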
Asset Build Procedures
We build assets multiple times. For this particular package, which first alerted us to this problem, here are the two places where we build them:
- gitlab: https://gitlab.com/gitlab-org/security/gitlab/-/jobs/1783226982
- CNG: https://dev.gitlab.org/gitlab/charts/components/images/-/jobs/11325710
Omnibus doesn't build them itself; instead, it yanks them from the assets image that gets created by the gitlab repo: https://dev.gitlab.org/gitlab/omnibus-gitlab/-/jobs/11325719
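For inspecting what each pipeline actually produced, the compiled assets can be copied out of a built image without running it. The sketch below is not GitLab tooling; the image reference and the in-image path /assets are assumptions used for illustration:

```python
# Sketch (not GitLab tooling) for copying compiled assets out of a built image
# so they can be inspected locally. The in-image path "/assets" and the image
# reference in the example are assumptions.
import subprocess

def extract_assets(image: str, dest: str) -> None:
    """Create a stopped container from the image, copy its assets out, clean up."""
    container = subprocess.run(
        ["docker", "create", image],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    try:
        subprocess.run(["docker", "cp", f"{container}:/assets", dest], check=True)
    finally:
        subprocess.run(["docker", "rm", container], check=True, capture_output=True)

if __name__ == "__main__":
    extract_assets("registry.example.com/gitlab-assets:14.5.202111151320", "cng-assets")
```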
Asset Deployment
Static assets are uploaded to object storage and are reverse proxied through our HAProxy. During a deploy, one of our first steps is to upload these assets to object storage. This is done through a deploy job called <env>-assets. Deployer runs a script called fetch-and-upload-assets, which will first look for the assets docker image discussed above; however, Deployer doesn't have the correct information, and therefore always falls back to its backup method of uploading assets, which is from the omnibus-built package. This in itself is a bug and should be fixed, so it is one of the corrective actions.
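A rough sketch of that source-selection step follows. The function names and structure are illustrative placeholders rather than the real fetch-and-upload-assets script, and the upload to object storage itself is omitted:

```python
# Hypothetical sketch of the source-selection step described above; not the
# real fetch-and-upload-assets script.
import subprocess

def assets_image_available(image: str) -> bool:
    """Prefer the assets docker image when it can be pulled."""
    return subprocess.run(["docker", "pull", image],
                          capture_output=True).returncode == 0

def choose_assets_source(assets_image: str, omnibus_assets_dir: str) -> str:
    """Return the location whose contents would be uploaded to object storage."""
    if assets_image_available(assets_image):
        return assets_image
    # Fallback path: per this incident, always taken because Deployer lacks the
    # information needed to locate the assets image (see delivery#2120).
    return omnibus_assets_dir
```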
Incident Response Analysis
- How was the incident detected?
  - An engineer was the first to flag it, though QA eventually started to show signs of failing as well
- How could detection time be improved?
  - Perhaps QA failures should alert us at the first sign of failure instead of after all configured retries?
- How was the root cause diagnosed?
  - By downloading the assets from both the CNG image and the omnibus package and comparing the ones whose names differed; the content of those files turned out to be the same (see the comparison sketch after this list)
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - A commit was identified as having introduced inconsistently built asset filenames and was chosen to be rolled back
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
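As referenced above, the diagnosis comparison can be approximated as follows: pair files from the two asset trees by content digest and report pairs whose bytes match but whose hashed filenames differ. The directory names below are illustrative:

```python
# Sketch of the diagnosis comparison: pair files from the CNG-image assets and
# the omnibus-packaged assets by content digest, then report files whose bytes
# match but whose hashed filenames differ.
import hashlib
from pathlib import Path

def digests(root: Path) -> dict:
    """Map content digest -> relative path for every file under root."""
    return {
        hashlib.sha256(p.read_bytes()).hexdigest(): p.relative_to(root).as_posix()
        for p in root.rglob("*") if p.is_file()
    }

def report_renames(cng_dir: str, omnibus_dir: str) -> None:
    cng, omnibus = digests(Path(cng_dir)), digests(Path(omnibus_dir))
    for digest, name in cng.items():
        other = omnibus.get(digest)
        if other is not None and other != name:
            # Identical bytes, different hashed name: the mismatch in this incident.
            print(f"content match, name mismatch: {name} <-> {other}")

if __name__ == "__main__":
    report_renames("cng-assets", "omnibus-assets")
```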
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Unknown
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
Lessons Learned
- We discovered that this may have been a problem for quite a while. Looking back at a few builds leading up to the first failure uncovered that at least one other file was different: sprockets-manifest-<sha>.json
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)