Implement generic ChatOps command to retry CI jobs, with application-level allowlist
Summary
Release supervisors are GitLab contractors who have limited permissions across several key projects and namespaces. I propose that we implement a new generic ChatOps command which takes a job URL as input and uses release-tools-bot's permissions to retry that job. We can use an allowlist to control which jobs can be retried this way.
This idea came out of a recent session with @akozin-ext, where we discovered that he cannot retry QA smoke CI jobs. (gitlab-org/release/tasks#21571 (comment 2833259621))
Command:
/chatops run retry <JOB_URL>
/chatops run trigger <MANUAL_JOB_URL>
Allowlist:
[
{
"project": "gitlab-org/quality/staging-canary",
"jobs": [
"qa-smoke 1/8",
"qa-smoke 2/8",
"qa-smoke 3/8",
"qa-smoke 4/8",
"qa-smoke 5/8",
"qa-smoke 6/8",
"qa-smoke 7/8",
"qa-smoke 8/8",
]
},
{
"project": "gitlab-com/gl-infra/k8s-workloads/gitlab-com",
"jobs": [
"gstg:auto-deploy",
"gstg-us-east1-b:auto-deploy"
]
}
]
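To make the mechanics concrete, here is a minimal sketch of how the command could resolve a job URL against the allowlist. It assumes the allowlist above is loaded from a config file, that the bot authenticates with its own token, and that job URLs may live on either gitlab.com or ops.gitlab.net; the helper names are illustrative, not actual release-tools code.

import re
import urllib.parse

import requests

# Matches job URLs on gitlab.com or ops.gitlab.net, e.g.
# https://ops.gitlab.net/gitlab-org/quality/staging-canary/-/jobs/20668726
JOB_URL_RE = re.compile(r"https://(?P<host>[^/]+)/(?P<project>.+)/-/jobs/(?P<job_id>\d+)$")

def parse_job_url(url):
    """Split a job URL into (host, project path, job ID)."""
    match = JOB_URL_RE.match(url)
    if match is None:
        raise ValueError(f"Not a recognisable job URL: {url}")
    return match["host"], match["project"], int(match["job_id"])

def fetch_job(host, project, job_id, token):
    """Look up the job via the GitLab API so we know its name."""
    encoded = urllib.parse.quote(project, safe="")
    response = requests.get(
        f"https://{host}/api/v4/projects/{encoded}/jobs/{job_id}",
        headers={"PRIVATE-TOKEN": token},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

def job_allowed(allowlist, project, job_name):
    """Return True if this project and job name appear on the allowlist."""
    return any(
        entry["project"] == project and job_name in entry["jobs"]
        for entry in allowlist
    )

The handler would refuse the command unless job_allowed returns true, and only then retry (or play) the job using the bot's permissions.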
This avoids having to implement bespoke commands for each CI job that we want contractors to be able to trigger or retry, as we are doing in https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/21565.
This gives us granular control over what release supervisors are allowed to retry. For instance, we could add a rule that retrying a job for Production in Kubernetes always requires the involvement of a release manager.
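As a purely illustrative example (the field name is hypothetical), such a rule could be expressed as an extra flag on an allowlist entry, with a variant of the check above taking it into account:

def job_allowed(allowlist, project, job_name, is_release_manager=False):
    """Allowlist check with a hypothetical per-entry restriction."""
    for entry in allowlist:
        if entry["project"] != project or job_name not in entry["jobs"]:
            continue
        # A hypothetical "require_release_manager": true field would make the
        # command refuse unless a release manager runs (or approves) it.
        if entry.get("require_release_manager") and not is_release_manager:
            return False
        return True
    return False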
On the other hand, this might be overkill: we would be re-implementing access control at the application level purely to work around a limitation of GitLab CI/CD's permissions model (only users who can push to a protected branch can retry CI jobs on that protected branch!)
If we could simply give contractors Maintainer permissions across the majority of our repositories, this implementation would be unnecessary. Since it is unclear at this point whether we can do that, I wanted to write down this option.
Description of the problem
One of the most common failure modes during auto-deploy is a failing smoke test job (https://ops.gitlab.net/gitlab-org/quality/staging-canary/-/jobs/20668726). The fix is usually to simply retry the CI job. However, contractors currently do not have permission to retry this CI job, making it impossible for them to independently steward the auto-deploy process.
The same is true for the monthly release process: once the underlying cause of a failing job is fixed, we retry the CI job in the release/tools Monthly release pipeline.
Contractors must be given at least Maintainer-level access on projects such as quality/{staging,staging-canary,...}, release/tools and k8s-workloads/gitlab-com in order to be able to retry failed jobs. Developer access is not sufficient due to the following restriction:
Run, rerun, or retry CI/CD pipeline or job for a protected branch: Developer and Maintainer (footnote 6)
Footnote 6: Developers and maintainers: only if the user is allowed to merge or push to the protected branch.
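For context, the retry itself is a single call to the GitLab REST API's job retry endpoint (or the play endpoint for manual jobs), and the restriction quoted above applies to whichever token makes that call. Routing the call through release-tools-bot, which already has the required access, is what would let contractors stay at Developer. A minimal sketch, assuming the bot's token is exposed as a hypothetical RELEASE_TOOLS_BOT_TOKEN environment variable:

import os
import urllib.parse

import requests

def retry_job(host, project, job_id, manual=False):
    """Retry a failed job, or play a manual one, with the bot's token."""
    encoded = urllib.parse.quote(project, safe="")
    action = "play" if manual else "retry"
    response = requests.post(
        f"https://{host}/api/v4/projects/{encoded}/jobs/{job_id}/{action}",
        headers={"PRIVATE-TOKEN": os.environ["RELEASE_TOOLS_BOT_TOKEN"]},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()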