Automate the shutdown of runner managers
Introduction
Overview of the problem
When a machine running GitLab Runner is executing multiple jobs, it cannot be terminated or killed immediately, because the jobs that are running would either be orphaned or killed along with it. This is why we need to gracefully shut down the process so it can "drain" the jobs: it no longer picks up new jobs but keeps running the existing ones. Certain jobs have a timeout of 4 hours, so the draining process can take up to 4 hours.
The command we run to gracefully terminate GitLab Runner is `systemctl stop gitlab-runner` or `gitlab-runner stop`, which do the same thing. Usually, we run knife to terminate instances one after another, or all at once for a specific set of roles, for example: `knife ssh -afqdn 'roles:runners-manager-private-blue' -- 'sudo -i systemctl stop gitlab-runner'`. These commands are usually run from the laptop of the engineer doing the drain process, meaning that the engineer has to stay connected to the internet and babysit the command until it's done. The engineer can use a bastion server to trigger the command so it doesn't depend on their laptop, however it's still something manual the engineer has to trigger.
The current deployment steps are documented in the following runbook.
Glossary
- `drain`: The process of no longer picking up jobs and waiting for the currently running ones to finish.
- `graceful shutdown`: Stopping the `gitlab-runner` process without disturbing the jobs that are currently running, and doing the proper cleanup after the process has terminated.
- `bastion`: A machine located next to the runner managers where engineers don't have root access, so they can leave the `chef.pem` key there and run knife commands inside of a `tmux`/`screen` session so their laptop doesn't have to stay connected to the internet.
- `runner managers`: The machines that run the `gitlab-runner` process.
Context and Background
Having the engineer trigger the action manually increases toil and requires someone with knowledge of, and access to, the production systems. This raises the barrier to entry for who can deploy the system. It also makes onboarding harder, since we have to show new people how to set up and run the manual commands.
Goals
- Automate the graceful shutdown of GitLab Runner, so that an engineer just triggers a pipeline.
- The engineer doesn't have to babysit the commands or run them from their laptop.
- Have a GitLab CI pipeline to trigger this action when we want to drain a runner fleet.
- Have the drain process written in code, so that when a change needs to happen it's visible through Git.
Non-Goals
- Automate the full deployment process of `gitlab-runner`.
Future goals
- The pipeline created for the drain process will be used to build more automation around the deployment process for `gitlab-runner`.
Assumptions
- Only in the scope of GitLab.com, for the `gitlab-runner` process managed by GitLab Inc. employees.
Solutions
Current Solution
The engineer runs the following commands from a laptop or from a bastion server:
```shell
knife ssh -afqdn 'roles:runners-manager-private-blue' -- 'sudo -i chef-client-disable "Disable chef until the next deployment"'
knife ssh -afqdn 'roles:runners-manager-private-blue' -- 'sudo -i systemctl stop gitlab-runner'
knife ssh -afqdn 'roles:runners-manager-private-blue' -- 'sudo -i systemctl disable gitlab-runner'
```
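For illustration, the three commands above could be wrapped in a small helper script. This is a minimal sketch, assuming knife is already configured with a `chef.pem`; the script, its function names, and the `DRY_RUN` mode are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch of a drain helper. Assumes knife is configured on the machine
# (e.g. a bastion) where this runs.
set -euo pipefail

# Run one command on every node matching the given Chef role.
# With DRY_RUN=1 the knife invocation is printed instead of executed.
knife_ssh() {
  local role="$1"; shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "knife ssh -afqdn 'roles:${role}' -- $*"
  else
    knife ssh -afqdn "roles:${role}" -- "$@"
  fi
}

# Drain a fleet: stop Chef from re-enabling the service, then stop and
# disable gitlab-runner so it finishes its jobs without taking new ones.
drain() {
  local role="$1"
  knife_ssh "$role" 'sudo -i chef-client-disable "Disable chef until the next deployment"'
  knife_ssh "$role" 'sudo -i systemctl stop gitlab-runner'
  knife_ssh "$role" 'sudo -i systemctl disable gitlab-runner'
}

# Dry-run example: print the three commands for the private blue fleet.
DRY_RUN=1 drain runners-manager-private-blue
```

The dry-run mode makes the script safe to review before pointing it at a real fleet.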
Proposed Solution
Create a new project, https://gitlab.com/gitlab-com/gl-infra/gitlab-runner-deployer, to be mirrored on http://ops.gitlab.net/, following the infrastructure project template. This means that everyone can contribute to a public project hosted on GitLab.com, but the pipelines run inside of http://ops.gitlab.net/, so only certain engineers have access to the pipeline logs, and only certain people can trigger a pipeline and specify secret CI/CD variables. The pipeline will be triggered using a ChatOps command from Slack.
The pipeline will run the commands inside of an Ansible playbook. The playbook will create a dynamic inventory using the Google Cloud Compute Engine inventory source, filtering by the labels added to the runner managers. The labels that can be used are `runner_manager_group`, or a new `shard=private` label (at the moment we don't have this label; we tried to add it in https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2859, but this requires reprovisioning of machines, so we might have to address https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13829 first), and then we can use `deployment=$COLOR` to filter specific colors.
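As an illustration, the dynamic inventory could look something like the following sketch, assuming the `gcp_compute` inventory plugin and the label scheme above; the project ID and key file path are placeholders:

```yaml
# gcp.yml — hypothetical dynamic inventory for the runner managers.
plugin: google.cloud.gcp_compute
projects:
  - my-gcp-project                                    # placeholder project ID
auth_kind: serviceaccount
service_account_file: /path/to/service-account.json   # placeholder path
filters:
  # Select one shard and one color, e.g. private/blue.
  - labels.runner_manager_group = private
  - labels.deployment = blue
hostnames:
  - name
```

The color filter would be templated from the pipeline's CI/CD variables rather than hardcoded.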
Ansible will SSH into the runner managers using OS Login, since that makes SSH keys easier to manage: we would just require a service account inside of the CI pipeline. We are already using this pattern in db-provisioning, and there are multiple online resources on how to set this up (https://yuweisung.medium.com/applying-gcp-os-login-to-terraform-and-ansible-da1d78c016b5, https://alex.dzyoba.com/blog/gcp-ansible-service-account/). The service account should be dedicated to OS Login only, following the least-privilege model.
Using Ansible over knife and Chef gives us a few benefits. First, we don't need a dedicated `chef.pem` file and an SSH key available in the pipeline, which is more secure since we have fewer secrets. Second, we are already migrating from Chef to Ansible for our older infrastructure, so using the same tool here helps us avoid doing the work twice. Auto-discovery is also a bit simpler, since we'll be using GCP labels to filter compute instances.
Problems
- `circular dependency`: The jobs in http://ops.gitlab.net/ run on the `private` shard. When we want to drain the `private` shard, we need to be aware of which color we are going to drain, to make sure that we don't end up scheduling the job on the color we are draining, or we'll end up in a deadlock. The best way to solve this is to have the `private` shard runners registered in http://ops.gitlab.net/ tagged with the deployment they belong to, so `1-blue.private.runners-manager.gitlab.com` and `2-blue.private.runners-manager.gitlab.com` will have a `blue` tag, and `1-green.private.runners-manager.gitlab.com` and `2-green.private.runners-manager.gitlab.com` will have the `green` tag. We will then have a dynamic child pipeline which forces the jobs to be scheduled on a specific deployment. To do this, we're going to need to add the `green` and `blue` tags to the http://ops.gitlab.net/ shared runners (`private` shard) so that we can pick the correct deployment.
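The tagging scheme could be exercised with a dynamic child pipeline along the following lines. This is a rough sketch with hypothetical job, file, and variable names (e.g. `DRAIN_COLOR`, assumed to be passed in by ChatOps):

```yaml
# .gitlab-ci.yml (sketch). When draining blue, the drain job must run on
# green runners, and vice versa, to avoid the deadlock described above.
generate-drain-pipeline:
  stage: prepare
  script:
    - OTHER_COLOR=$([ "$DRAIN_COLOR" = "blue" ] && echo green || echo blue)
    - |
      cat > drain-pipeline.yml <<EOF
      drain-${DRAIN_COLOR}:
        tags: ["${OTHER_COLOR}"]
        script:
          - ansible-playbook drain.yml -e "color=${DRAIN_COLOR}"
      EOF
  artifacts:
    paths: [drain-pipeline.yml]

run-drain:
  stage: drain
  trigger:
    include:
      - artifact: drain-pipeline.yml
        job: generate-drain-pipeline
    strategy: depend
```

`strategy: depend` makes the parent pipeline wait for the child, so the pipeline only succeeds once the drain has actually finished.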
```mermaid
journey
    title Drain deployment
    section Trigger job
      Run chatops command in slack: 5: Me
      Trigger job in ops: 5: slack
      Create dynamic child pipeline: 5: GitLab CI
    section Run Job
      Tagged child pipeline: 5: GitLab CI
      Knife commands to drain machines: 5: rrhelper, GitLab CI
      Run job until the drain is finished: 5: GitLab CI
```