Audit CI insights existing research
What did we learn?
| Insights | Resources |
|---|---|
| There is a lot of existing research that helped us come up with a JTBD for this category, as well as bucket the types of problems that platform engineers and software developers are facing. The problems can be bucketed into speed and status. Speed includes things like identifying bottlenecks in pipelines, understanding the duration of pipelines/jobs and comparing it to a baseline, and identifying the baseline duration for pipelines/jobs in general. When things are not running as expected speed-wise, users need information from GitLab to help them make decisions and troubleshoot. Status includes things like failure and success rates of jobs/pipelines, retry rates, and specifically the types of failures that occur most commonly. If jobs/pipelines fail at any rate over time, users need GitLab to help them identify the types of failures, why they are happening, and how to fix them. |  |
JTBD: When I am managing continuous integration of code at scale, I want to understand the pipeline health, so I can successfully resolve and prevent issues from occurring.
Speed
- As a platform engineer, I want to identify bottlenecks in my pipeline, so I can delegate tasks to the team responsible for that area to get them fixed. For example, infrastructure-related parts.
  - Questions that must be answered:
    - Which jobs take the longest in my pipeline?
    - Which parts of those jobs take the longest?
    - Which department or team can be responsible for speeding up that bottleneck?
- As a platform engineer, I want to understand whether the duration of a specific pipeline run (total time to complete) is close to the best-case duration for that pipeline under optimal conditions, so I can get a sense of its speed performance.
- As a software developer/platform engineer, I want to know what the baseline of a pipeline run is, so I can understand if the run time is expected.
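The baseline comparison above can be made concrete with a minimal sketch. This is an illustration only, not the actual GitLab data model: the durations are made-up sample values, and the median-of-recent-successful-runs baseline and the 25% tolerance are assumptions, not a decided design.

```python
from statistics import median

# Hypothetical sample data: durations (in seconds) of recent successful
# runs of the same pipeline. In practice these would come from pipeline
# records; the values here are invented for illustration.
recent_durations = [310, 295, 330, 305, 842, 300, 315]

# One possible baseline: the median of recent successful runs, which is
# robust to occasional outliers (e.g., the 842 s run above).
baseline = median(recent_durations)

def is_slower_than_expected(duration_s, baseline_s, tolerance=0.25):
    """Flag a run whose duration exceeds the baseline by more than
    the given tolerance (25% by default, an arbitrary choice here)."""
    return duration_s > baseline_s * (1 + tolerance)

print(baseline)                                # 310
print(is_slower_than_expected(320, baseline))  # False (within tolerance)
print(is_slower_than_expected(842, baseline))  # True (well over baseline)
```

A robust baseline like the median matters here because a single outlier run (the 842 s value) would noticeably skew a mean-based baseline.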
Status
- As a platform engineer, I want to know how often the pipelines fail and why they fail, so I can update job frameworks if they are flaky or outdated.
  - Questions that must be answered:
    - How often does my pipeline fail?
    - How often does my job fail?
    - What kind of failure is most common in my job or pipeline? Some bucket ideas: approval steps, version compatibility, infrastructure failures/config problems, service connection failures, auth connection issues.
    - How often does my pipeline get retried?
    - How often does my job get retried?
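To make the status questions concrete, here is a minimal sketch of the rates they describe. The run records and field names (`status`, `retries`, `failure_kind`) are hypothetical placeholders, not the GitLab API schema, and the failure buckets are the brainstormed ones from the list above.

```python
from collections import Counter

# Hypothetical sample data: recent runs of one pipeline, each with a
# final status, a retry count, and (for failures) a failure bucket.
runs = [
    {"status": "success", "retries": 0},
    {"status": "failed",  "retries": 2, "failure_kind": "infrastructure"},
    {"status": "success", "retries": 1},
    {"status": "failed",  "retries": 0, "failure_kind": "auth"},
    {"status": "failed",  "retries": 1, "failure_kind": "infrastructure"},
]

# "How often does my pipeline fail / get retried?"
failure_rate = sum(r["status"] == "failed" for r in runs) / len(runs)
retry_rate = sum(r["retries"] > 0 for r in runs) / len(runs)

# "What kind of failure is most common?" — bucket failures by kind.
failure_kinds = Counter(
    r["failure_kind"] for r in runs if r["status"] == "failed"
)
most_common_kind = failure_kinds.most_common(1)[0][0]

print(f"failure rate: {failure_rate:.0%}")         # failure rate: 60%
print(f"retry rate: {retry_rate:.0%}")             # retry rate: 60%
print(f"most common failure: {most_common_kind}")  # most common failure: infrastructure
```

The same aggregation applies at the job level by grouping records per job instead of per pipeline.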
What’s this issue all about?
As part of GitLab CI Insights - a unified CI builds & runn... (&11835), we already have research that tells us which types of data are important for users exploring CI metrics. Other studies have looked into this in the past, so this issue is meant to capture and summarize that data. After summarizing, we should be able to produce a list of next steps so we can move forward with exposing some CI insights.
Who is the target user of the feature?
Self-managed and SaaS users:
- Software developers
- Platform engineers
What questions are you trying to answer?
- Which insights have been found from existing research around CI metrics that could help set the direction for the CI insights category?
- What are the gaps that need more research?
- Which metrics are most important for users who are looking for CI insights?
- Who needs to access these metrics?
Core questions
- Which insights have been found from existing research around CI metrics that could help set the direction for the CI insights category?
Additional questions
What hypotheses and/or assumptions do you have?
What decisions will you make based on the research findings?
This research should help the Runner Fleet team set the direction for where CI insights will go and identify next steps for us.
What's the latest milestone at which the research will still be useful to you?
16.9