Automate the shutdown of runner managers
Introduction
Overview of the problem
When a machine running GitLab Runner is executing multiple jobs, it cannot be terminated or killed immediately, because the jobs that are running would either be orphaned or killed along with it. This is why we need to gracefully shut down the process so it can "drain" the jobs: it no longer picks up new jobs but keeps running the existing ones. Certain jobs have a timeout of 4 hours, so the draining process can take up to 4 hours.
The command we run to gracefully terminate GitLab Runner is `systemctl stop gitlab-runner` or `gitlab-runner stop`, which do the same thing. Usually, we run knife to terminate instances one after another, or all at once for a specific set of roles, for example: `knife ssh -afqdn 'roles:runners-manager-private-blue' -- 'sudo -i systemctl stop gitlab-runner'`. These commands are usually run from the laptop of the engineer doing the drain process, meaning that the engineer has to stay connected to the internet and babysit the command until it's done. The engineer can use a bastion server to trigger the command so it doesn't depend on their laptop, however it's still something manual the engineer has to trigger.
The current deployment steps are documented in the following runbook.
Glossary
- `drain`: The process of no longer picking up jobs and waiting for the currently running ones to finish.
- `graceful shutdown`: Stopping the `gitlab-runner` process without disturbing the jobs that are currently running, and doing the proper cleanup after the process has terminated.
- `bastion`: A machine located next to the runner managers where engineers don't have root access, so they can leave the `chef.pem` key there and run knife commands inside of a `tmux`/`screen` session so their laptop doesn't have to stay connected to the internet.
- `runner managers`: The machines that run the `gitlab-runner` process.
Context and Background
Having the engineer trigger the action manually increases toil and requires someone with knowledge of, and access to, the production systems. This raises the barrier to entry for who can deploy the system. It also makes onboarding harder, since we have to show new people how to set up and run the manual commands.
Goals
- Automate the graceful shutdown of GitLab Runner, so that an engineer just triggers a pipeline.
- The engineer doesn't have to babysit the commands or run them from their laptop.
- Have a GitLab CI pipeline to trigger this action when we want to drain a runner fleet.
- Have the drain process written in code, so that when a change needs to happen it's visible through Git.
Non-Goals
- Automate the full deployment process of `gitlab-runner`.
Future goals
- The pipeline created for the drain process will be used to build more automation around the deployment process for `gitlab-runner`.
Assumptions
- Only in the scope of GitLab.com, for the `gitlab-runner` process managed by GitLab Inc. employees.
Solutions
Current Solution
The engineer runs the following commands from a laptop or from a bastion server:
```shell
knife ssh -afqdn 'roles:runners-manager-private-blue' -- 'sudo -i chef-client-disable "Disable chef until the next deployment"'
knife ssh -afqdn 'roles:runners-manager-private-blue' -- 'sudo -i systemctl stop gitlab-runner'
knife ssh -afqdn 'roles:runners-manager-private-blue' -- 'sudo -i systemctl disable gitlab-runner'
```
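For illustration, the three commands above could be wrapped in a small helper script. This is a minimal sketch, assuming knife is already configured with a `chef.pem`; the script, its function names, and the `DRY_RUN` mode are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch of a drain helper. Assumes knife is configured on the machine
# (e.g. a bastion) where this runs.
set -euo pipefail

# Run one command on every node matching the given Chef role.
# With DRY_RUN=1 the knife invocation is printed instead of executed.
knife_ssh() {
  local role="$1"; shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "knife ssh -afqdn 'roles:${role}' -- $*"
  else
    knife ssh -afqdn "roles:${role}" -- "$@"
  fi
}

# Drain a fleet: stop Chef from re-enabling the service, then stop and
# disable gitlab-runner so it finishes its jobs without taking new ones.
drain() {
  local role="$1"
  knife_ssh "$role" 'sudo -i chef-client-disable "Disable chef until the next deployment"'
  knife_ssh "$role" 'sudo -i systemctl stop gitlab-runner'
  knife_ssh "$role" 'sudo -i systemctl disable gitlab-runner'
}

# Dry-run example: print the three commands for the private blue fleet.
DRY_RUN=1 drain runners-manager-private-blue
```

The dry-run mode makes the script safe to review before pointing it at a real fleet.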
Proposed Solution
Create a new project, https://gitlab.com/gitlab-com/gl-infra/gitlab-runner-deployer, to be mirrored on http://ops.gitlab.net/, following the infrastructure project template. This means that everyone can contribute to a public project hosted on GitLab.com, but the pipelines run inside of http://ops.gitlab.net/, so only certain engineers have access to the pipeline logs, and only certain people can trigger a pipeline and specify secret CI/CD variables. The pipeline will be triggered using a ChatOps command from Slack.
The pipeline will run the commands inside of an Ansible playbook. The playbook will create a dynamic inventory using the Google Cloud Compute Engine inventory source, filtering by the labels added to the runner managers. The labels that can be used are `runner_manager_group`, or a new `shard=private` label (at the moment we don't have this label; we tried to add it in https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2859, but this requires reprovisioning of machines, so we might have to address https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13829 first), and then we can use `deployment=$COLOR` to filter specific colors.
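As an illustration, the dynamic inventory could look something like the following sketch, assuming the `gcp_compute` inventory plugin and the label scheme above; the project ID and key file path are placeholders:

```yaml
# gcp.yml — hypothetical dynamic inventory for the runner managers.
plugin: google.cloud.gcp_compute
projects:
  - my-gcp-project                                    # placeholder project ID
auth_kind: serviceaccount
service_account_file: /path/to/service-account.json   # placeholder path
filters:
  # Select one shard and one color, e.g. private/blue.
  - labels.runner_manager_group = private
  - labels.deployment = blue
hostnames:
  - name
```

The color filter would be templated from the pipeline's CI/CD variables rather than hardcoded.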
Ansible will SSH into the runner managers using OS Login, since that makes SSH keys easier to manage: we would just require a service account inside of the CI pipeline. We are already using this pattern in db-provisioning, and there are multiple online resources on how to set this up (https://yuweisung.medium.com/applying-gcp-os-login-to-terraform-and-ansible-da1d78c016b5, https://alex.dzyoba.com/blog/gcp-ansible-service-account/). The service account should be dedicated to OS Login only, following the least-privilege model.
Using Ansible over knife and Chef gives us a few benefits. First, we don't need a dedicated `chef.pem` file and an SSH key available in the pipeline, which is more secure since we have fewer secrets. Second, we are already migrating from Chef to Ansible for our older infrastructure, so using the same tool here helps us avoid doing the work twice. Auto-discovery is also a bit simpler, since we'll be using GCP labels to filter compute instances.
Problems
- `circular dependency`: The jobs in http://ops.gitlab.net/ run on the `private` shard. When we want to drain the `private` shard, we need to be aware of which color we are going to drain, to make sure that we don't end up scheduling the job on the color we are draining, or we'll end up in a deadlock. The best way to solve this is to have the `private` shard runners registered in http://ops.gitlab.net/ tagged with the deployment they belong to, so `1-blue.private.runners-manager.gitlab.com` and `2-blue.private.runners-manager.gitlab.com` will have a `blue` tag, and `1-green.private.runners-manager.gitlab.com` and `2-green.private.runners-manager.gitlab.com` will have the `green` tag. We will then have a dynamic child pipeline which forces the jobs to be scheduled on a specific deployment. To do this, we're going to need to add the `green` and `blue` tags to the http://ops.gitlab.net/ shared runners (`private` shard) so that we can pick the correct deployment.
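The tagging scheme could be exercised with a dynamic child pipeline along the following lines. This is a rough sketch with hypothetical job, file, and variable names (e.g. `DRAIN_COLOR`, assumed to be passed in by ChatOps):

```yaml
# .gitlab-ci.yml (sketch). When draining blue, the drain job must run on
# green runners, and vice versa, to avoid the deadlock described above.
generate-drain-pipeline:
  stage: prepare
  script:
    - OTHER_COLOR=$([ "$DRAIN_COLOR" = "blue" ] && echo green || echo blue)
    - |
      cat > drain-pipeline.yml <<EOF
      drain-${DRAIN_COLOR}:
        tags: ["${OTHER_COLOR}"]
        script:
          - ansible-playbook drain.yml -e "color=${DRAIN_COLOR}"
      EOF
  artifacts:
    paths: [drain-pipeline.yml]

run-drain:
  stage: drain
  trigger:
    include:
      - artifact: drain-pipeline.yml
        job: generate-drain-pipeline
    strategy: depend
```

`strategy: depend` makes the parent pipeline wait for the child, so the pipeline only succeeds once the drain has actually finished.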
```mermaid
journey
    title Drain deployment
    section Trigger job
      Run chatops command in slack: 5: Me
      Trigger job in ops: 5: slack
      Create dynamic child pipeline: 5: GitLab CI
    section Run Job
      Tagged child pipeline: 5: GitLab CI
      Knife commands to drain machines: 5: rrhelper, GitLab CI
      Run job until the drain is finished: 5: GitLab CI
```