
FY21-Q3 Infra KR: Work with CI/CD - Verify team to improve insight into CI efficiency => 100%

I have broken this OKR out from https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9924 as there are several items that will not be able to be done in Q2.

Base Cost Metrics:

  • Total CI Cost
  • Total CI Cost Growth
  • Total CI Cost by Runner Manager
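As a rough illustration of how these three metrics could be derived once billing rows carry a runner manager label, here is a minimal Python sketch. The row layout, the runner manager names, and all figures are hypothetical and only stand in for whatever the real billing export provides.

```python
from collections import defaultdict

# Hypothetical billing rows: (month, runner manager label, cost in USD).
# The label values and all figures are invented for illustration.
BILLING_ROWS = [
    ("2020-09", "runner-manager-a", 100.0),
    ("2020-09", "runner-manager-b", 50.0),
    ("2020-10", "runner-manager-a", 120.0),
    ("2020-10", "runner-manager-b", 60.0),
]


def total_cost_by_month(rows):
    """Total CI Cost: sum of all row costs per month."""
    totals = defaultdict(float)
    for month, _manager, cost in rows:
        totals[month] += cost
    return dict(totals)


def cost_growth(monthly_totals):
    """Total CI Cost Growth: month-over-month change as a fraction,
    keyed by the later month. Months with a zero baseline are skipped."""
    months = sorted(monthly_totals)
    return {
        cur: (monthly_totals[cur] - monthly_totals[prev]) / monthly_totals[prev]
        for prev, cur in zip(months, months[1:])
        if monthly_totals[prev] != 0
    }


def cost_by_runner_manager(rows):
    """Total CI Cost by Runner Manager: sum of row costs per label value."""
    totals = defaultdict(float)
    for _month, manager, cost in rows:
        totals[manager] += cost
    return dict(totals)
```

The per-manager breakdown only works if every row can be attributed to a runner manager, which is what the labeling structure below is meant to provide.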

Labeling Structure for billing insights:

Requirements | Proposed Owner
Runner Managers Labeled | Infra (manual label)
Ephemeral Machines Labeled | Verify team
Identify runner manager vs ephemeral machines with label | Infra (manual label)

End Goals

  • Build base cost metric PIs for verify-runner group
  • Label runner managers & Ephemeral Machines with respective labels for cost by runner manager analysis

OKR Retrospective

This retrospective is meant to be an impartial summary of what went well and where we could improve, so that we can do better next time.

Thanks to @DarrenEastman @steveazz @akohlbecker @ahanselka for the work on this OKR

What Went Well

We had good collaboration between the Verify team and the Infra team, and were able to align on our goals mostly async after a couple of initial meetings. Although the work is not fully done, we are very close to having data points that we did not have before: the actual cost breakdown by runner manager.

What didn't go well

It took a few iterations to nail down what our end goal should be. While I think what we ended up with is good and aligns with what we are doing across the company, it would have been better to solidify that earlier in the process. As a result, our timelines were slightly off, which is partly why the last remaining work is still ongoing.

What should we try to do in the future?

I think as a next step we could do something similar, but label servers with their respective job IDs to get even more insight into our CI cost.

Labels on the runner managers themselves are still applied manually today, so it would be good to have an automatic process for that so we don't have to worry about it in the future.

It would be good to have continued collaboration on these CI costs between the Infra Analyst / Infra PM and the Verify PM, so that we have a better shared understanding of costs going forward.

Our end goal strayed from the efficiency of the service toward cost visibility. I think this was good, and it is where we should be focusing given our goals and how we are scaling cost insights from an infra perspective. As a separate project, though, it would still be useful to do analytics on the CI service itself and how it could be more efficient, for example by looking at startup and shutdown times and seeing how to minimize that idle time.
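To make the idle-time idea concrete, here is a minimal Python sketch of how idle time per ephemeral machine might be estimated. It assumes four timestamps are available per machine (creation, job start, job finish, removal) and that each machine runs a single job. Both are assumptions for illustration, not a description of how the runners record data today.

```python
from datetime import datetime


def idle_seconds(created_at, job_started_at, job_finished_at, removed_at):
    """Seconds the machine existed without running the job:
    startup wait before the job plus shutdown lag after it."""
    startup = (job_started_at - created_at).total_seconds()
    shutdown = (removed_at - job_finished_at).total_seconds()
    return startup + shutdown


def idle_fraction(created_at, job_started_at, job_finished_at, removed_at):
    """Share of the machine's billed lifetime that was idle (0.0 to 1.0)."""
    lifetime = (removed_at - created_at).total_seconds()
    if lifetime <= 0:
        return 0.0
    idle = idle_seconds(created_at, job_started_at, job_finished_at, removed_at)
    return idle / lifetime


# Hypothetical machine: 1 minute to start, 10 minute job, 1 minute to remove.
example = (
    datetime(2020, 10, 1, 10, 0, 0),   # created_at
    datetime(2020, 10, 1, 10, 1, 0),   # job_started_at
    datetime(2020, 10, 1, 10, 11, 0),  # job_finished_at
    datetime(2020, 10, 1, 10, 12, 0),  # removed_at
)
```

Multiplying the idle fraction by a machine's cost would give a rough estimate of spend that did not go toward running jobs.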

Edited by Davis Townsend