FY21-Q3 Infra KR: Work with CI/CD - Verify team to improve insight into CI efficiency => 100%
I have broken this OKR out from https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9924 because several of its items cannot be completed in Q2.
Base Cost Metrics:
- Total CI Cost
- Total CI Cost Growth
- Total CI Cost by Runner Manager
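As a rough illustration of how these three metrics relate to each other, the sketch below computes them from a handful of billing rows. The row shape (month, runner manager label, cost), the manager names, and all amounts are assumptions made up for this example; they are not taken from the actual billing data.

```python
from collections import defaultdict

# Hypothetical billing rows: (month, runner manager label, cost in USD).
# All names and amounts are made up for illustration.
ROWS = [
    ("2020-08", "manager-a", 100.0),
    ("2020-08", "manager-b", 50.0),
    ("2020-09", "manager-a", 120.0),
    ("2020-09", "manager-b", 60.0),
]


def total_ci_cost(rows, month):
    """Total CI Cost: sum of all cost rows for one month."""
    return sum(cost for m, _, cost in rows if m == month)


def ci_cost_growth(rows, previous_month, current_month):
    """Total CI Cost Growth: relative change between two months."""
    previous = total_ci_cost(rows, previous_month)
    current = total_ci_cost(rows, current_month)
    if previous == 0:
        raise ValueError("no cost recorded for the previous month")
    return (current - previous) / previous


def ci_cost_by_runner_manager(rows, month):
    """Total CI Cost by Runner Manager: per-label totals for one month."""
    totals = defaultdict(float)
    for m, manager, cost in rows:
        if m == month:
            totals[manager] += cost
    return dict(totals)
```

The third metric is the only one that depends on labels, which is why the labeling work below is a prerequisite for it.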
Labeling Structure for billing insights:
| Requirements | Proposed Owner | 
|---|---|
| Runner Managers Labeled | Infra (manual label) | 
| Ephemeral Machines Labeled | Verify team | 
| Identify runner manager vs ephemeral machines with label | Infra (manual label) | 
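To make the third requirement concrete, here is a minimal sketch of telling runner managers apart from ephemeral machines by a label, and of spotting machines that are missing the label. The label key `runner_role` and its values are hypothetical; the real label names are not specified in this issue.

```python
# Hypothetical label key and values, chosen only for this example.
ROLE_LABEL = "runner_role"
MANAGER = "runner-manager"
EPHEMERAL = "ephemeral"


def classify_machine(labels):
    """Return the machine's role, or "unlabeled" if the label is missing or unknown."""
    role = labels.get(ROLE_LABEL)
    if role in (MANAGER, EPHEMERAL):
        return role
    return "unlabeled"


def unlabeled_machines(machines):
    """Names of machines whose cost cannot yet be attributed by role."""
    return [
        name
        for name, labels in machines.items()
        if classify_machine(labels) == "unlabeled"
    ]
```

A check like `unlabeled_machines` could help catch gaps while the runner manager labels are still applied by hand.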
End Goals
- Build base cost metric PIs for verify-runner group
- Label runner managers and ephemeral machines with their respective labels for cost-by-runner-manager analysis
OKR Retrospective
This retrospective is meant to be an impartial summary of what went well and where we could improve, so that we can do better next time.
Thanks to @DarrenEastman @steveazz @akohlbecker @ahanselka for the work on this OKR
What Went Well
We had good collaboration between the Verify team and the Infra team, and after a couple of initial meetings we were able to align on our goals mostly asynchronously. Although the work is not fully done, we are very close to having data points we did not have before: the actual cost breakdown by runner manager.
What didn't go well
It took a few iterations to nail down what our end goal should be. I think what we ended up with is good and aligns with what we are doing across the company, but it would have been better to settle on it earlier in the process. As a result, our timelines were slightly off, which is partly why the last remaining work is still ongoing.
What should we try to do in the future?
As a next step, I think we could do something similar but label servers with their respective job IDs, to get even more insight into our CI cost.
Labels on the runner managers themselves are still applied manually today, so it would be good to have an automatic process for that so we don't have to worry about it in the future.
It would be good to have continued collaboration on these CI costs between the Infra Analyst / Infra PM and the Verify PM, so that we have a better shared understanding of costs going forward.
Our end goal strayed from the efficiency of the service toward cost visibility. I think this was good, and it is where we should be focusing given our goals and how we are scaling cost insights from the Infra perspective. As a separate project, though, it would still be useful to analyze the CI service itself and how it could be more efficient, for example by looking at machine startup and shutdown times and seeing how to minimize that idle time.
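As one possible starting point for that kind of analysis, the sketch below computes how much of an ephemeral machine's lifetime is spent outside the job itself. The `MachineLifecycle` record and its four timestamps are hypothetical; they assume we could collect creation, job start, job finish, and removal times per machine, which this issue does not confirm.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class MachineLifecycle(NamedTuple):
    """Hypothetical per-machine timestamps, assumed for this example."""
    created_at: datetime
    job_started_at: datetime
    job_finished_at: datetime
    removed_at: datetime


def idle_time(machine):
    """Time spent before the job starts plus time spent after it finishes."""
    startup = machine.job_started_at - machine.created_at
    shutdown = machine.removed_at - machine.job_finished_at
    return startup + shutdown


def idle_fraction(machine):
    """Share of the machine's total lifetime that was idle (0.0 to 1.0)."""
    lifetime = machine.removed_at - machine.created_at
    if lifetime <= timedelta(0):
        raise ValueError("machine lifetime must be positive")
    return idle_time(machine) / lifetime
```

Aggregating `idle_fraction` per runner manager would tie this efficiency view back to the cost breakdown described above.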