Investigate slow image build
Summary
This issue tracks the Distribution tasks from Investigate slow image build (gitlab-com/gl-infra/delivery#19877). It is created in the team tasks project because it affects both Omnibus and CNG.
Proposal
From gitlab-com/gl-infra/delivery#19877 (comment 1681445532)
- Distribution to continue investigating the root cause of the current increase in duration (3500s -> 5100s).
- Delivery and Distribution to review the current data and deployment process to agree on a realistic SLA.
- Delivery and/or Distribution to set up monitoring and alerting based on the agreed SLA.
- Delivery and Distribution to hold regular catch-ups to further shorten the duration and the SLA per future requirements.
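As a rough illustration of the numbers and the SLA check described above, here is a minimal sketch. The function names and the 4000s threshold in the usage note are hypothetical; no SLA has been agreed yet.

```python
def relative_increase(before_s, after_s):
    """Fractional increase between two durations given in seconds."""
    return (after_s - before_s) / before_s


def breaching_runs(durations_s, sla_s):
    """Return the durations (in seconds) that exceed the SLA threshold."""
    return [d for d in durations_s if d > sla_s]


# The increase quoted in this issue: 3500s -> 5100s.
increase = relative_increase(3500, 5100)
print(f"increase: {increase:.1%}")
```

For example, `breaching_runs(recent_durations, 4000)` would list the runs to alert on if the agreed SLA happened to be 4000s.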
Tips
From comment #1463 (comment 1771517129)
Investigate slow image build (gitlab-com/gl-infra/delivery#19877) was touched recently, and it simply was not something we could spend a ton of time on in %16.9. I want to ensure groupdistribution is focusing on the appropriate things, related not only to the full run length of the pipeline but also to what we can actually solve.
Key items to bring for consideration, by groupdistribution:
- Runtime per Job
  - Time between the start of the first and the end of the last CI action within the Job (this does not include any runner action: cloning, artifacts, ...)
  - Timing of individual steps within a Job: this evidences things like compile time, remote resource acquisition, cache usage, uploading to object storage / PackageCloud, and so forth
  - Time spent in Runner mechanisms: start of container, fetch (sources, artifacts, cache), finalization (artifacts, cache)
- Sum runtime of pipeline
  - Created time of pipeline
  - Time between Job ready / scheduled, and Job picked up / started by Runner
  - Finished time
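The per-job and per-pipeline timings listed above could be gathered with something like the following sketch. It assumes job records shaped like the output of the GitLab Jobs API (`name`, `created_at`, `started_at`, `finished_at`, `queued_duration`). The helper names are hypothetical, and step-level and runner-level timings are not covered because they are not part of those records.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as '2024-02-01T10:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def job_timings(job):
    """Queue time and runtime, in seconds, for one job record."""
    started = parse_ts(job["started_at"])
    finished = parse_ts(job["finished_at"])
    return {
        "name": job["name"],
        # Assumed to be the time between ready/scheduled and picked up.
        "queued_s": float(job["queued_duration"]),
        "runtime_s": (finished - started).total_seconds(),
    }


def pipeline_summary(jobs):
    """Aggregate timings across all job records of one pipeline."""
    timings = [job_timings(j) for j in jobs]
    first_created = min(parse_ts(j["created_at"]) for j in jobs)
    last_finished = max(parse_ts(j["finished_at"]) for j in jobs)
    return {
        "jobs": timings,
        # Wall-clock span from first job creation to last job finish.
        "wall_clock_s": (last_finished - first_created).total_seconds(),
        "total_queued_s": sum(t["queued_s"] for t in timings),
        "total_runtime_s": sum(t["runtime_s"] for t in timings),
    }
```

Comparing `total_queued_s` with `total_runtime_s` gives a first indication of whether time is going into the job content itself or into waiting for runners.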
I want to be clear: groupdistribution should focus on what we can do about these things. If there are systemic issues outside of our CI job content itself, they are outside the bounds of this team and should be addressed with ~Infrastructure as a whole, and with Engineering Productivity.