Proposal: How do we validate that a CI/CD change is effective?
Context
I'll start with a concrete example.
As part of the Test Selection Gap Epic (&6 (closed)), I created gitlab-org/gitlab!107996 (merged), which improves the detect-tests
job so that it discovers more JS tests to run.
I would like to know how many times the Jest minimal
jobs failed thanks to that MR. In other words, I'd like to understand how effective that MR was at preventing potential master-broken issues.
More generally, I've been wondering how we could add more precise monitoring to our CI/CD jobs, so that we'd have metrics the EP team is interested in.
Another example: if an MR introduces a cache, it could be interesting to know that cache's hit/miss rate. This could be inferred from the time the job takes, but we could make it exact by adding some instrumentation (i.e. uploading metrics to a monitoring tool).
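For instance, a job could record whether its cache directory was restored and append that to a metrics file for later upload. This is a minimal sketch: `record_cache_metric`, the JSON-lines format, and the file names are assumptions for illustration, not an existing helper.

```python
import json
import os


def record_cache_metric(metrics_path, cache_dir):
    """Append a cache hit/miss metric to a JSON-lines metrics file.

    A cache "hit" here simply means the cache directory exists and is
    non-empty after the CI runner restored it (a rough heuristic).
    """
    hit = os.path.isdir(cache_dir) and len(os.listdir(cache_dir)) > 0
    metric = {"metric": "cache_hit", "value": 1 if hit else 0}
    with open(metrics_path, "a") as f:
        f.write(json.dumps(metric) + "\n")
    return hit
```

The metrics file would then be declared as a job artifact so the analytics project can pick it up.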
Goal
- Have more granular monitoring of our CI/CD jobs than the data we get from the GitLab analytics database.
- Have this new way of doing things documented on our handbook page (or wherever else it would be relevant: GitLab docs, ...)
Technical implementation
The good news: we already have a way to do this.
Looking at https://gitlab.com/gitlab-data/analytics/-/merge_requests/5438 and https://gitlab.com/gitlab-data/analytics/-/merge_requests/7505/diffs, it looks like the process is as follows:
- In your CI job, write the metrics you'd like to a file
- Upload this file as an artifact (or to GitLab Pages, according to the title of https://gitlab.com/gitlab-data/analytics/-/merge_requests/5438)
- In the https://gitlab.com/gitlab-data/analytics project, extract this artifact from all pipelines using Python
- Modify the DB schema of the data sources so that Sisense can use them (e.g. https://gitlab.com/gitlab-data/analytics/-/merge_requests/7505)
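The extraction step above could be sketched roughly as follows, assuming the job wrote a JSON-lines file named `metrics.jsonl`. The URL builder targets GitLab's real "download a single artifact file by job ID" API endpoint, but the function names, project/job IDs, and file format are illustrative assumptions, not the analytics project's actual code.

```python
import json

# Base URL of the GitLab REST API.
API = "https://gitlab.com/api/v4"


def artifact_url(project_id, job_id, path):
    """Build the API URL for a single file inside a job's artifacts.

    Corresponds to GET /projects/:id/jobs/:job_id/artifacts/*artifact_path.
    """
    return f"{API}/projects/{project_id}/jobs/{job_id}/artifacts/{path}"


def parse_metrics(raw):
    """Parse a JSON-lines metrics file downloaded from a job's artifacts."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]
```

The analytics pipeline would fetch `artifact_url(...)` for each job of interest (with an authenticated HTTP client), run the response body through `parse_metrics`, and load the resulting rows into the warehouse for Sisense.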