2021-11-15: 404'ing JS file in staging
Current Status
An asset missing in staging is causing most pages to fail to load: https://staging.gitlab.com/assets/webpack/runtime.f038baf3.bundle.js
Revert is waiting to be deployed.
Summary for CMOC notice / Exec summary:
- Customer Impact: Service::Frontend on Staging
- Customer Impact Duration: 15:40 UTC to <end time> (duration in minutes)
- Current state: see the Incident::<state> label
- Root cause: RootCause::Software-Change
Timeline
All times UTC.
2021-11-15
- 15:02 - Package 14.5.202111151320-2153b254bc3.733f0be8b25 being deployed to staging
- 15:49 - First ask from an engineer wondering if the staging environment is okay
- 16:05 - First QA failures against staging begin to emerge: https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/891093
- 16:56 - @bwill declares an incident in Slack
- 17:20 - @skarbek stops the on-going deployment to production; note this was a different package
- 18:12 - Investigation discovers that asset builds are not consistent between our omnibus packaging and container image packaging
- 18:17 - Determined that this is only a problem in the proposed package destined for staging and not the production rollout, therefore the production deploy is allowed to continue
- 20:23 - Suspected commit identified and targeted for rollback
- 20:47 - Revert is reviewed and accepted; this incident marked as mitigated
2021-11-16
- 07:47 - Deployment to gstg succeeds: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/891937
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Deployer should pull a usable tag to deploy assets: delivery#2120 (closed)
- GitLab should create reproducible asset filenames regardless of when they are built: gitlab-org/gitlab#345874 (closed)
- GitLab should improve the documentation for handling frontend dependencies: gitlab-org/gitlab#345875 (closed)
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - GitLab Engineering when using staging.gitlab.com
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Large sections of the UI failed to load
- How many customers were affected?
  - 300ish
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - All calls to the UI of staging.gitlab.com would have been impacted
What were the root causes?
Asset Build
Autodeploy package 14.5.202111151320-2153b254bc3.733f0be8b25 contained some change that caused assets to be built differently between our container image and omnibus package build pipelines. The assets that were compiled with different names:
- sprockets-manifest-<sha>.json
- webpack/commons-pages.subscriptions.buy_minutes-pages.subscriptions.buy_storage.<sha>.chunk.js
- webpack/init_hand_raise_lead_button.<sha>.chunk.js
- webpack/runtime.<sha>.bundle.js
There are at least 4 reasons that would cause assets to be built differently despite the code base being the same:
- Any dependency change
- Any change in a file itself
- Any change in a linked file (module)
- A weak hash function in webpack 4, or a bad dependency that forces a new hash each time we do a build
We suspect the last one is what is biting us hard here.
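For intuition, here is a minimal Python sketch of content-addressed naming. It is not webpack's actual algorithm (webpack hashes chunk graphs, and webpack 4 defaulted to MD4), but it shows why any byte-level difference in the hashed input, whether from a dependency bump or from non-deterministic build output, produces a new filename:

```python
# Minimal illustration (not webpack's real hashing) of why content-addressed
# filenames drift: the digest covers every input byte, so any change in a
# module, a dependency, or non-deterministic compiler output renames the asset.
import hashlib

def hashed_filename(stem: str, content: bytes, ext: str, length: int = 8) -> str:
    digest = hashlib.sha256(content).hexdigest()[:length]
    return f"{stem}.{digest}.{ext}"

build_a = b"console.log('runtime');  // built against dep@1.2.3"
build_b = b"console.log('runtime');  // built against dep@1.2.4"

# Same logical code, one transitive dependency bumped: two different names,
# i.e. runtime.<digest-a>.bundle.js vs runtime.<digest-b>.bundle.js.
print(hashed_filename("runtime", build_a, "bundle.js"))
print(hashed_filename("runtime", build_b, "bundle.js"))
```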
Asset Build Procedures
We build assets multiple times. For this particular package, which first alerted us to this problem, here are the two places where we build them:
- gitlab: https://gitlab.com/gitlab-org/security/gitlab/-/jobs/1783226982
- CNG: https://dev.gitlab.org/gitlab/charts/components/images/-/jobs/11325710
Omnibus doesn't build them itself; instead, it yanks them from the assets image that gets created by the gitlab repo: https://dev.gitlab.org/gitlab/omnibus-gitlab/-/jobs/11325719
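For inspecting what each pipeline actually produced, the compiled assets can be copied out of a built image without running it. The sketch below is not GitLab tooling; the image reference and the in-image path /assets are assumptions used for illustration:

```python
# Sketch (not GitLab tooling) for copying compiled assets out of a built image
# so they can be inspected locally. The in-image path "/assets" and the image
# reference in the example are assumptions.
import subprocess

def extract_assets(image: str, dest: str) -> None:
    """Create a stopped container from the image, copy its assets out, clean up."""
    container = subprocess.run(
        ["docker", "create", image],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    try:
        subprocess.run(["docker", "cp", f"{container}:/assets", dest], check=True)
    finally:
        subprocess.run(["docker", "rm", container], check=True, capture_output=True)

if __name__ == "__main__":
    extract_assets("registry.example.com/gitlab-assets:14.5.202111151320", "cng-assets")
```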
Asset Deployment
Static assets are uploaded to object storage and are reverse proxied through our HAProxy. During a deploy, one of our first steps is to upload these assets to object storage. This is done through a deploy job called <env>-assets. Deployer runs a script called fetch-and-upload-assets, which will first look for the assets docker image discussed above; however, Deployer doesn't have the correct information, and therefore always falls back to its backup method of uploading assets, which is from the omnibus-built package. This in itself is a bug and should be fixed, so it is one of the corrective actions.
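A rough sketch of that source-selection step follows. The function names and structure are illustrative placeholders rather than the real fetch-and-upload-assets script, and the upload to object storage itself is omitted:

```python
# Hypothetical sketch of the source-selection step described above; not the
# real fetch-and-upload-assets script.
import subprocess

def assets_image_available(image: str) -> bool:
    """Prefer the assets docker image when it can be pulled."""
    return subprocess.run(["docker", "pull", image],
                          capture_output=True).returncode == 0

def choose_assets_source(assets_image: str, omnibus_assets_dir: str) -> str:
    """Return the location whose contents would be uploaded to object storage."""
    if assets_image_available(assets_image):
        return assets_image
    # Fallback path: per this incident, always taken because Deployer lacks the
    # information needed to locate the assets image (see delivery#2120).
    return omnibus_assets_dir
```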
Incident Response Analysis
- How was the incident detected?
  - An engineer was the first to flag it, though QA eventually started to show signs of failing as well
- How could detection time be improved?
  - Perhaps QA failures should alert us at the first sign of failure instead of after all configured retries?
- How was the root cause diagnosed?
  - By downloading the assets from both the CNG image and the omnibus package and comparing the ones whose names differed; the content of those files turned out to be the same (see the comparison sketch after this list)
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - A commit was identified as having introduced inconsistently built asset filenames and was chosen to be rolled back
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
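As referenced above, the diagnosis comparison can be approximated as follows: pair files from the two asset trees by content digest and report pairs whose bytes match but whose hashed filenames differ. The directory names below are illustrative:

```python
# Sketch of the diagnosis comparison: pair files from the CNG-image assets and
# the omnibus-packaged assets by content digest, then report files whose bytes
# match but whose hashed filenames differ.
import hashlib
from pathlib import Path

def digests(root: Path) -> dict:
    """Map content digest -> relative path for every file under root."""
    return {
        hashlib.sha256(p.read_bytes()).hexdigest(): p.relative_to(root).as_posix()
        for p in root.rglob("*") if p.is_file()
    }

def report_renames(cng_dir: str, omnibus_dir: str) -> None:
    cng, omnibus = digests(Path(cng_dir)), digests(Path(omnibus_dir))
    for digest, name in cng.items():
        other = omnibus.get(digest)
        if other is not None and other != name:
            # Identical bytes, different hashed name: the mismatch in this incident.
            print(f"content match, name mismatch: {name} <-> {other}")

if __name__ == "__main__":
    report_renames("cng-assets", "omnibus-assets")
```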
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Unknown
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
Lessons Learned
- We discovered that this may have been a problem for quite a while. Looking back at a few builds leading up to the first failure uncovered that at least one other file was different: sprockets-manifest-<sha>.json
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)