Roll out the Google Container Optimized OS image as a replacement for CoreOS on our Runners

A successor of gitlab-org/ci-cd/docker-machine#14 (closed)

As we've added all the needed product changes, we are now able to replace the image of our autoscaled VMs, which is based on the deprecated CoreOS (no longer available in GCP), with an image based on Google's Container Optimized OS (referred to as Google COS from here on).

The image build definition change is being prepared at https://dev.gitlab.org/cookbooks/packer-runner-machines/-/merge_requests/40.

As per the thread at gitlab-org/ci-cd/docker-machine#14 (comment 450364955) we already use this image on our private-runners-manager-X.gitlab.com runners.

This issue is to guide us through the incremental rollout across the rest of our fleet.

deploy Google COS to gitlab-shared-runners-manager-X.gitlab.com

https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5017

  • change the SSH username from core to cos in the MachineOptions section
  • drop the engine-opt=mtu=1460 setting (this is already hardcoded by Google in the Container Optimized OS image, in Docker's daemon.json file)
  • drop the IPv6 related settings (these are already hardcoded by us in the Container Optimized OS image, in Docker's daemon.json file)
  • replace /dummy-sys-class-dmi-id:/sys/class/dmi/id:ro with /tmp/dummy-sys-class-dmi-id:/sys/class/dmi/id:ro (using /tmp is not a problem: most of our Runners use a VM only once, and even when a VM is reused we never restart it; the data must persist only across subsequent job executions, not across VM restarts)
  • change the google-machine-image value to gitlab-ci-155816/global/images/runners-cos-stable-swtich-to-google-cos
  • add google-metadata=cos-update-strategy=update_disabled MachineOption
  • add google-metadata-from-file=user-data=/etc/gitlab-runner/cloud-config.conf MachineOption
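The checklist above can be sketched as a small transformation over a runner's MachineOptions list. This is only an illustration, not the actual chef-repo change. Several details are assumptions: the function names, google-username as the option carrying the SSH username, the engine-opt=ipv6 and engine-opt=fixed-cidr-v6 prefixes standing in for "IPv6 related settings", and the dmi-id mount living in a separate volumes list.

```python
# Assumed prefixes for the "IPv6 related settings"; the real option names may differ.
IPV6_PREFIXES = ("engine-opt=ipv6", "engine-opt=fixed-cidr-v6=")

OLD_DMI_MOUNT = "/dummy-sys-class-dmi-id:/sys/class/dmi/id:ro"
NEW_DMI_MOUNT = "/tmp/dummy-sys-class-dmi-id:/sys/class/dmi/id:ro"


def migrate_machine_options(options, image, cloud_config):
    """Return a new MachineOptions list with the CoreOS -> Google COS changes applied."""
    migrated = []
    for opt in options:
        # The MTU and IPv6 settings are already in the COS image's daemon.json.
        if opt == "engine-opt=mtu=1460" or opt.startswith(IPV6_PREFIXES):
            continue
        if opt == "google-username=core":  # assumed option name for the SSH user
            opt = "google-username=cos"
        elif opt.startswith("google-machine-image="):
            opt = "google-machine-image=" + image
        migrated.append(opt)

    # Options that are new with Google COS; added once, even on repeated runs.
    for extra in (
        "google-metadata=cos-update-strategy=update_disabled",
        "google-metadata-from-file=user-data=" + cloud_config,
    ):
        if extra not in migrated:
            migrated.append(extra)
    return migrated


def migrate_volumes(volumes):
    """Point the dummy dmi-id mount at its /tmp location; leave other volumes alone."""
    return [NEW_DMI_MOUNT if v == OLD_DMI_MOUNT else v for v in volumes]
```

Both functions return new lists and are idempotent, so running them against an already migrated configuration changes nothing.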

decide if we're ready to do the full switch

  • make the decision

prepare for merging

  • tag gitlab-ci-155816/global/images/runners-cos-stable-swtich-to-google-cos as gitlab-ci-155816/global/images/runners-cos-stable-beta

    Done with GCP's console terminal:

    gcloud compute images create runners-cos-stable-beta --source-image=runners-cos-stable-swtich-to-google-cos
  • update the configuration with the stable image name gitlab-ci-155816/global/images/runners-cos-stable-beta (https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5030) on:

    • private-runners-manager-X
    • gitlab-shared-runners-manager-X

merge the change in the packer repository

deploy Google COS to shared-runners-manager-X.gitlab.com

  • prepare
    • change the SSH username from core to cos in the MachineOptions section
    • drop the engine-opt=mtu=1460 setting (this is already hardcoded by Google in the Container Optimized OS image, in Docker's daemon.json file)
    • drop the IPv6 related settings (these are already hardcoded by us in the Container Optimized OS image, in Docker's daemon.json file)
    • replace /dummy-sys-class-dmi-id:/sys/class/dmi/id:ro with /tmp/dummy-sys-class-dmi-id:/sys/class/dmi/id:ro
    • change the google-machine-image value to gitlab-ci-155816/global/images/runners-cos-stable-beta
    • add google-metadata=cos-update-strategy=update_disabled MachineOption
    • add google-metadata-from-file=user-data=/etc/gitlab-runner/cloud-config.v2.conf MachineOption
  • Introduce the change through production#5184 (closed)

deploy Google COS to gitlab-docker-shared-runners-manager-X.gitlab.com

We'll do this at the very end, as we promised the gitlab-org/gitlab development team. There are some differences in how the GitLab Q&A pipeline is executed after switching from CoreOS to Google COS, and the team wanted more time to look into them.

  • prepare
    • change the SSH username from core to cos in the MachineOptions section
    • drop the engine-opt=mtu=1460 setting (this is already hardcoded by Google in the Container Optimized OS image, in Docker's daemon.json file)
    • drop the IPv6 related settings (these are already hardcoded by us in the Container Optimized OS image, in Docker's daemon.json file)
    • replace /dummy-sys-class-dmi-id:/sys/class/dmi/id:ro with /tmp/dummy-sys-class-dmi-id:/sys/class/dmi/id:ro
    • change the google-machine-image value to gitlab-ci-155816/global/images/runners-cos-stable-beta
    • add google-metadata=cos-update-strategy=update_disabled MachineOption
    • add google-metadata-from-file=user-data=/etc/gitlab-runner/cloud-config.v2.conf MachineOption
  • Introduce the change through production#5258 (closed)

Regressions/Incidents caused during the rollout