Release-tools improvement around noticing and notifying unexpected behavior
Problem Description
When failures occur, we get plenty of notifications via Slack. However, if something silently isn't working, such as a job not running as intended, we may not get notified, which leads to incorrect expectations of our tooling.
Recent Example: production#15941 (closed)
In the above example, the merge-train was set to perform its job every 24 hours. This had a cascading impact: the Release Managers were unaware that they were not deploying new versions of GitLab. Instead, the same version of the GitLab codebase was being redeployed. Nothing in our tooling was ever going to communicate this. It was only discovered when the Release Engineer noticed that the version had not incremented as expected.
When processes are not working because they are never initiated, we have nothing in place to alert us that something is wrong.
Potential Solutions
Let's use this issue to discuss. This can be a very broad topic, but I'm hoping we can gather some ideas, and create more targeted end goals after some discussion.
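As one possible starting point for that discussion, here is a minimal sketch of the two signals that were missing in the example: a heartbeat check (has the job run recently?) and a progress check (did the deployed version change?). This is an illustration only, not existing release-tools code. The function name `find_silent_failures`, its parameters, and the threshold are all hypothetical, and it assumes we can obtain the job's last run time and a list of recently deployed versions from somewhere.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def find_silent_failures(
    deployed_versions: List[str],
    last_job_run: Optional[datetime],
    now: datetime,
    max_job_age: timedelta,
) -> List[str]:
    """Return warnings for expected events that silently did not happen.

    All inputs are hypothetical: the caller is assumed to supply the list of
    recently deployed versions (oldest first) and the last known run time of
    the job being watched.
    """
    warnings = []

    # Heartbeat check: a job that never ran, or has not run recently enough,
    # produces no failure notification on its own, so we flag its absence.
    if last_job_run is None:
        warnings.append("job has never reported a run")
    elif now - last_job_run > max_job_age:
        warnings.append(
            f"job last ran {now - last_job_run} ago (limit {max_job_age})"
        )

    # Progress check: two consecutive deployments of the same version mean
    # nothing new was shipped, even though each deployment "succeeded".
    if len(deployed_versions) >= 2 and deployed_versions[-1] == deployed_versions[-2]:
        warnings.append(
            f"version {deployed_versions[-1]} was redeployed unchanged"
        )

    return warnings
```

Any non-empty result could then be posted to Slack alongside the failure notifications we already receive. What counts as "too old" or "unchanged for too long" would need to be decided per job, which is probably part of what we should discuss here.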