Hot-patcher feedback from recent incident

Some feedback from using the hot patcher today for a high priority/severity security incident, all regarding Kubernetes:

  1. https://gitlab.com/gitlab-org/release/docs/-/blob/master/runbooks/sev-1-incident.md#remediation-options says Kubernetes is out of scope for the hot patcher, but the patcher has jobs for k8s.
  2. However, those jobs did not succeed:
    1. On staging they reported: "Check https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/pipelines to see if there are any outstanding configuration changes, to override this failure set IMAGE_CHECK_OVERRIDE as a CI variable". The image had changed, as expected, but gitlabVersion in the gitlab-gitlab-chart-info ConfigMap had also changed, which the check didn't like. Setting IMAGE_CHECK_OVERRIDE=1 in https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/ had no effect and the job still failed; I have since removed that variable again. See https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/pipelines/367369; I retried https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/jobs/2500853 after setting the variable.
    2. https://ops.gitlab.net/gitlab-com/gl-infra/patcher/-/jobs/2500757 (gprd-cny) failed with "No cluster named 'gprd-cny-gitlab-gke' in gitlab-production." Probably not important, just noting it.

In the end we didn't need the k8s part for this fix; deploying to web + api was sufficient (empirical testing; I don't have a firm grasp of the code paths and where they execute, and am a little surprised but happy it worked).
