
Recent broken master incidents blocking our ability to auto-deploy

Since Friday last week at 20:21 UTC, auto-deploy (build and deploy) has been repeatedly blocked by broken-master incidents.

Here is a compilation of the broken-master incidents that halted our ability to build packages and deploy, with a timeline (from the RMs' POV):


Friday (2025-10-17)

image.png

  • This graph shows the build pressure increasing from Friday until Monday.
  • Based on this graph, it looks like there were 2 broken-master incidents that day.
  • The first incident (1st), which I am still trying to locate, was fixed in late EMEA hours.
  • The second incident (2nd) started in late AMER hours and continued until Monday.
    • It was caused by a RuboCop failure.
  • Effect on auto-deploy: at 5:49 PM, our AMER RM noticed that the auto-deploy pressure was not decreasing even though we were deploying continuously. (Internal Slack thread).
  • It turned out that the alert pertains to unpackaged commits (not undeployed commits); the build pressure was increasing because master was broken.
  • The result: the packages being built contained no new changes from the GitLab Rails code, only changes from CNG and Omnibus. This is an example package: 18.6.202510171351
  • The RuboCop failure that broke master originated from this MR: gitlab-org/gitlab!205020 (merged).
  • It was later fixed in this MR: gitlab-org/gitlab!209462 (merged).
    • This was merged on 2025-10-20 13:26
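
To illustrate why deploying continuously did not reduce the pressure, here is a minimal sketch of the behaviour described above. It assumes the packaging tooling only targets the newest green commit on master; the `Commit` type, the function names, and the SHAs are hypothetical and are not taken from the actual release tooling.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    sha: str     # hypothetical short SHA, for illustration only
    green: bool  # True if the master pipeline for this commit passed


def latest_green_index(commits: list[Commit]) -> Optional[int]:
    """Return the index of the newest green commit, or None if there is none."""
    for i in range(len(commits) - 1, -1, -1):
        if commits[i].green:
            return i
    return None


def next_package_target(commits: list[Commit], last_packaged: int) -> Optional[int]:
    """Return the newest green commit that is newer than the last packaged one.

    Returns None while master has been red since the last package, which is
    the situation where new packages carry no new Rails changes.
    """
    idx = latest_green_index(commits)
    if idx is None or idx <= last_packaged:
        return None
    return idx


def unpackaged_count(commits: list[Commit], last_packaged: int) -> int:
    """Count commits merged after the last packaged commit (build pressure)."""
    return len(commits) - 1 - last_packaged


# Commits are ordered oldest to newest; master breaks at "ccc".
history = [
    Commit("aaa", True),
    Commit("bbb", True),   # last packaged commit (index 1)
    Commit("ccc", False),
    Commit("ddd", False),
    Commit("eee", False),
]
```

With this history, `next_package_target(history, 1)` returns `None` while `unpackaged_count(history, 1)` keeps growing with every merge, no matter how often the existing packages are deployed. Once a green commit lands on top, it becomes the next target and the backlog can be packaged in one go.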

Monday (2025-10-20)

image.png

  • This graph shows the long-running broken-master incident that started on Friday.
  • Fortunately, some packages were built on Friday (in between the two broken-master incidents), so we were able to deploy some changes.
  • Another broken-master incident (3rd) started on Sunday due to a failing RSpec test.
    • This overlapped with the ongoing (2nd) broken-master incident from Friday.
    • The fix for this (3rd) incident was merged on 2025-10-20 05:23
  • The fix for the second broken-master incident from Friday was finally merged on 2025-10-20 13:26
  • Note that Slack was down on the same day, which made it harder for us to notice the issue.
  • Another broken-master incident (4th) started at 2025-10-20 05:39, due to a failing RSpec test.

Tuesday (2025-10-21)

image.png


Notes:

  • We currently have a ReleaseManagementNumberOfUnpackagedCommits metric and alert set up for unpackaged commits (build pressure). There is also a ReleaseManagementNumberOfUndeployedCommits metric and alert set up for undeployed commits (deploy pressure). The threshold for both alerts is 100.

  • Graph of the ReleaseManagementNumberOfUnpackagedCommits alert (Source)

    image

  • These broken-master incidents blocked our tooling from packaging the already-merged commits, since our tooling requires a green master to build a package.

  • These also blocked us from merging the security MRs to be included in the patch release last Wednesday.

    • Fortunately, we were able to merge and deploy them in time, and the patch release was not delayed.
    • That patch release contained severity2 security fixes, which cannot be delayed.
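
As a quick reference for telling the two alerts apart, the sketch below evaluates both against the shared threshold of 100 mentioned in the notes. The alert names come from the notes above; the `firing_alerts` helper is hypothetical, and it assumes an alert fires only when the count is strictly above the threshold, which the real alert rule may define differently.

```python
THRESHOLD = 100  # shared threshold for both alerts, as stated in the notes

# Alert name -> which backlog it measures.
ALERTS = {
    "ReleaseManagementNumberOfUnpackagedCommits": "unpackaged",  # build pressure
    "ReleaseManagementNumberOfUndeployedCommits": "undeployed",  # deploy pressure
}


def firing_alerts(unpackaged: int, undeployed: int) -> list[str]:
    """Return the names of the alerts whose count exceeds the threshold.

    Assumption: "exceeds" means strictly greater than THRESHOLD.
    """
    counts = {"unpackaged": unpackaged, "undeployed": undeployed}
    return [name for name, key in ALERTS.items() if counts[key] > THRESHOLD]
```

For example, `firing_alerts(150, 10)` reports only the unpackaged-commits alert, which matches the Friday situation: the backlog was in building packages, not in deploying them.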
Edited by Maina Ng'ang'a