Release Coordinated Pipeline needs some sort of active monitoring thing
Problem Statement
There are a number of bugs and situations that prevent the Release Coordinated pipeline from running efficiently, sometimes to great detriment. Until all of the bugs and scenarios that cause us pain are addressed, we should consider building some sort of pipeline monitor that lets us inject custom logic so that we can address potential problems sooner. Doing so should enable a smoother release management experience for auto-deployments overall.
Ideation
- create a rake task in release-tools that monitors our pipeline
- create a completely new tool, since we hope that this wouldn't need to be a permanent thing
Whatever the tool, the idea is that it is actively watching the Release Coordinated pipeline. It could run as a script inside a job in the same pipeline, or as an external app that constantly makes API calls to discover the desired information. Ideas for what to look out for:
- When a pipeline assumes the wrong owner, this tool could somehow fix the situation, which would allow us to bypass gitlab-org/gitlab#348465 (closed)
- Sometimes QA starts to fail, but since we auto-retry up to 3 times, we don't get notified until after QA has failed 3 times. We can leverage this tool to alert us the first time it fails, so that we can begin engaging sooner while watching retries. #2435 (comment 991504439)
- There are situations where jobs fail, we retry, and they fail again. We are only notified of the first failure. Unless we are actively watching the pipeline, we'll miss the second failure. We could enable a more streamlined experience by moving the notification logic out of CI jobs and into this tool. Reference feature request: gitlab-org/gitlab#30401
- Some jobs have long timeouts in case something takes longer than usual (think of an incident that is making things slow). In situations like this, we don't know anything is wrong until we are notified of a failure. We could apply some logic that takes into account the average runtime of a particular job. If that is exceeded, we can be notified and start investigating immediately to determine whether there's a big problem, rather than waiting for the failure notification, which sometimes arrives a few hours after the job started.
- ...
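To make two of the ideas above more concrete (alerting on every failure, including retries, and flagging jobs that run longer than their average), here is a rough sketch of what the checks could look like. Everything in it is an assumption for illustration: the `Job` class, the function names, the job names, and the `tolerance` multiplier are all hypothetical, and the sketch assumes job data (ID, name, status, start time) has already been fetched, for example from the GitLab jobs API. It also assumes a retried job shows up with a new job ID, which is what lets each retry be reported separately.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Job:
    """Hypothetical, trimmed-down view of a CI job (not a release-tools type)."""
    id: int
    name: str
    status: str  # e.g. "running", "failed", "success"
    started_at: Optional[datetime]


def overrunning_jobs(jobs, history, now, tolerance=1.5):
    """Return running jobs whose elapsed time exceeds tolerance x their average.

    `history` maps a job name to a list of past durations in seconds.
    Jobs with no recorded history are skipped rather than guessed at.
    """
    flagged = []
    for job in jobs:
        past = history.get(job.name)
        if job.status != "running" or job.started_at is None or not past:
            continue
        elapsed = (now - job.started_at).total_seconds()
        if elapsed > tolerance * mean(past):
            flagged.append(job)
    return flagged


def new_failures(jobs, already_reported):
    """Return failed jobs whose IDs have not been reported yet.

    Keying on the job ID (assumed to change on retry) means the second and
    third failures of a retried job each produce their own alert.
    """
    return [j for j in jobs if j.status == "failed" and j.id not in already_reported]


# Made-up sample data: one job that failed, was retried, and failed again,
# plus one job that has been running for 90 minutes.
now = datetime(2022, 6, 1, 12, 0, tzinfo=timezone.utc)
jobs = [
    Job(1, "qa:smoke", "failed", now - timedelta(minutes=20)),
    Job(2, "qa:smoke", "failed", now - timedelta(minutes=10)),
    Job(3, "deploy:staging", "running", now - timedelta(minutes=90)),
]
history = {"deploy:staging": [1800, 2400, 2100]}  # past durations in seconds

slow = overrunning_jobs(jobs, history, now)
unreported = new_failures(jobs, already_reported={1})
```

With this sample data, `slow` contains the `deploy:staging` job (90 minutes elapsed against a 35 minute average) and `unreported` contains only the retried `qa:smoke` job with ID 2. Whatever does the watching would call these on each poll and hand the results to the notification code; a plain average is used here because that is what the idea above describes, though a percentile might turn out to be less noisy.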
Considerations
- Are there existing tools which can accomplish this for us?
- ...