Monitor CI end to end with Prometheus
Currently, all the metrics we have from CI are extracted from the DB by running a set of queries. These metrics only exist in checkmk.
We need to start monitoring all the pieces of CI properly, starting with the application logic, then the managers, then the runners and finally the upcoming review apps.
The idea is to have someone proficient in Go instrument the managers that we own: measure everything that can be measured there and export those metrics so that our main Prometheus server can scrape them. After that, we can work out a way of scraping the runners.
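To make the manager side more concrete, here is a minimal sketch of exposing counters and gauges in the Prometheus text exposition format. It deliberately uses only the Go standard library so it stays self-contained; in practice the official Prometheus Go client would probably be the better choice. The `Registry` type and the metric names (`ci_builds_queued`, `ci_builds_started_total`) are hypothetical and only for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"strings"
	"sync"
)

// Registry is a hypothetical, hand-rolled metric store. It is NOT the
// official Prometheus client; it only illustrates the idea.
type Registry struct {
	mu     sync.Mutex
	values map[string]float64
	kinds  map[string]string // metric name -> "counter" or "gauge"
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{values: map[string]float64{}, kinds: map[string]string{}}
}

// Inc adds 1 to a counter, creating it on first use.
func (r *Registry) Inc(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.kinds[name] = "counter"
	r.values[name]++
}

// Set stores the current value of a gauge.
func (r *Registry) Set(name string, v float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.kinds[name] = "gauge"
	r.values[name] = v
}

// Render returns all metrics, sorted by name, in the text exposition format.
func (r *Registry) Render() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	names := make([]string, 0, len(r.values))
	for n := range r.values {
		names = append(names, n)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, n := range names {
		fmt.Fprintf(&b, "# TYPE %s %s\n%s %g\n", n, r.kinds[n], n, r.values[n])
	}
	return b.String()
}

// ServeHTTP lets the registry be mounted as the /metrics endpoint.
func (r *Registry) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, r.Render())
}
```

A manager could mount this with `http.Handle("/metrics", reg)` and then call `reg.Inc(...)` or `reg.Set(...)` wherever builds are queued, started or finished. Which events to instrument is exactly what we need to work out with the CI team.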
The runners are the tricky part: they are ephemeral, and we don't yet know what should be measured there, so we will need the CI team to explain how the whole process works and which metrics would be valuable to collect.
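One option for short-lived runners, offered only as a possibility to discuss, is to have each runner push its metrics to a Prometheus Pushgateway before it goes away, and let the main server scrape the gateway. The sketch below assumes a Pushgateway is reachable at some base URL; the helper names, the job name and the instance name are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// pushURL builds the Pushgateway grouping URL for one runner.
// Assumption: job and instance contain no "/" characters; such values
// would need the gateway's base64 path form, which is not handled here.
func pushURL(base, job, instance string) string {
	return fmt.Sprintf("%s/metrics/job/%s/instance/%s",
		strings.TrimRight(base, "/"), job, instance)
}

// pushMetrics sends a text-format payload with PUT, which replaces any
// metrics previously pushed for the same job/instance group.
func pushMetrics(client *http.Client, target, payload string) error {
	req, err := http.NewRequest(http.MethodPut, target, strings.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "text/plain; version=0.0.4")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("pushgateway returned %s", resp.Status)
	}
	return nil
}
```

A runner would call something like `pushMetrics(http.DefaultClient, pushURL(base, "ci_runner", name), payload)` right before shutting down. Whether a push model fits the runner lifecycle at all is something the CI team would need to confirm.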
Things that come to mind right now:
- General CI health
- Queued builds
- Started/Finished builds
- Started/Finished runners (shared/specific/in error)
- Builds sent to a given runner (health, reuse)
- Total requests sent to DO, successful, retried, failed (to detect when DO fails)
- Runners/builds
- Uptime per runner
- Startup time (from booting to when the build is ready to run, WALL clock)
- Build total time
- Bandwidth used during the build
- IOPS used during the build
- Storage used during the build
- Total number of requests required to get a build (success rate?)
- Artifact tracking (how many artifacts, how large)
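To illustrate how a couple of the items above could be derived, here is a small sketch that computes startup time, build time and a request success rate. It rests on assumptions that still need confirming: timestamps are wall-clock Unix seconds, "build time" runs from the moment the build is ready to when it finishes, and the success rate is successful requests divided by total requests. All type and field names are hypothetical.

```go
package main

import "time"

// BuildTimings holds the wall-clock timestamps of one build on one runner.
type BuildTimings struct {
	Booted   time.Time // runner started booting
	Ready    time.Time // build is ready to run
	Finished time.Time // build finished
}

// TimingsFromUnix builds a BuildTimings from Unix timestamps in seconds.
func TimingsFromUnix(booted, ready, finished int64) BuildTimings {
	return BuildTimings{
		Booted:   time.Unix(booted, 0),
		Ready:    time.Unix(ready, 0),
		Finished: time.Unix(finished, 0),
	}
}

// StartupSeconds is the time from booting until the build is ready to run.
func (t BuildTimings) StartupSeconds() float64 {
	return t.Ready.Sub(t.Booted).Seconds()
}

// BuildSeconds is the time from ready-to-run until the build finished
// (an assumed definition of "build total time").
func (t BuildTimings) BuildSeconds() float64 {
	return t.Finished.Sub(t.Ready).Seconds()
}

// RequestCounts tallies requests sent to the cloud provider (DO).
type RequestCounts struct {
	Total      int
	Successful int
	Retried    int
	Failed     int
}

// SuccessRate returns Successful/Total, or 0 when nothing was sent.
func (c RequestCounts) SuccessRate() float64 {
	if c.Total == 0 {
		return 0
	}
	return float64(c.Successful) / float64(c.Total)
}
```

Durations like these would most likely be exported as histograms rather than single values, so that we can look at percentiles instead of averages.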
I'm open to discussing what else we can measure here.
One thing that is missing here, and that will be worth including in the future, is Review Apps. We should lay the groundwork now so that monitoring for this feature can be added later.