Investigate why our pipeline average/P50 duration went up by 10+ minutes last week
The problem
As discovered in a weekly check, the MR pipeline P50/average duration went up by 10+ minutes in the last two weeks (chart):
Pipelines started getting slower on September 3rd 2024 (chart):
We need to find the cause and, hopefully, fix it.
Root Cause Analysis
FOSS scheduled pipelines have not run for Ruby 3.2.5 since 2024-09-04. Those scheduled pipelines are responsible for creating the Ruby gems cache. Without a cache, CI/CD jobs built gems from scratch, an operation that takes around 6 minutes. This affected most CI/CD pipelines in the gitlab-org/gitlab project, because we run as-if-foss cross-project pipelines in various scenarios.
This issue started when !164548 (merged) was merged on 2024-09-04 at 3:10 AM UTC, which is consistent with the increase we see in the dashboards. We also see a first increase in pipeline duration the day before, which might have been caused by the S1/S2 incidents on that same day (internal).
We didn't run those pipelines because we apparently override the RUBY_VERSION variable in FOSS scheduled pipelines. The RUBY_VERSION value needs to be updated manually, and this was not done after the MR was merged (note that I'm not aware of any documentation regarding this process - it has been added as a corrective action).
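In GitLab CI/CD, a variable set on a pipeline schedule takes precedence over a variable of the same name defined in the CI configuration, which is why a stale override can go unnoticed. A minimal sketch of how such a mismatch could be detected is shown here. The function name, the dictionaries, and the stale value are hypothetical illustrations, not taken from our tooling or from the incident:

```python
from typing import Dict, Tuple


def find_shadowing_overrides(
    ci_defaults: Dict[str, str],
    schedule_variables: Dict[str, str],
) -> Dict[str, Tuple[str, str]]:
    """Return {name: (ci_default, schedule_value)} for every schedule
    variable that shadows a CI default with a different value."""
    return {
        name: (ci_defaults[name], value)
        for name, value in schedule_variables.items()
        if name in ci_defaults and ci_defaults[name] != value
    }


# Illustrative inputs only: "3.2.4" is a made-up stale value.
ci_defaults = {"RUBY_VERSION": "3.2.5"}
schedule_variables = {"RUBY_VERSION": "3.2.4", "SOME_FLAG": "true"}

print(find_shadowing_overrides(ci_defaults, schedule_variables))
# → {'RUBY_VERSION': ('3.2.5', '3.2.4')}
```

Variables that exist only on the schedule (SOME_FLAG in this example) are ignored, because they do not shadow anything in the CI configuration.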
Next steps
We manually updated the RUBY_VERSION in the FOSS scheduled pipeline to 3.2.5, and the Ruby gem cache was successfully built/used.
We'll keep monitoring over the next few days to see whether pipeline duration goes back to the values it had at the end of August 2024.
Corrective actions
- Ensure the update-setup-test-env-cache job can download/extract the gitlab_workhorse package.
  - Also, should we make this step fail, so that we get notified? (Maybe do this only once we have alerts in place 😆)
- Get notifications when FOSS scheduled pipelines are failing.
  - A sort of scheduled pipelines "health-check" would be great here:
    - Check that the latest scheduled pipelines passed
    - Check that scheduled pipelines in certain projects ran recently
    - Check that the scheduled pipeline is assigned to the correct user (i.e. gitlab-bot)
    - Do this for several projects the EP team manages:
      - gitlab-org/gitlab
      - gitlab-org/gitlab-foss
      - triage-ops
    - Important: We are already sending Slack messages when triage-ops scheduled pipelines fail (example). Please check whether we could reuse that logic for other projects.
  - Created: Receive notifications when scheduled pipelines fail health checks
- Ensure FOSS scheduled pipelines don't rely on a RUBY_VERSION hardcoded inside the scheduled pipeline (we currently have to manually edit the scheduled pipeline to see this variable). We should rely on the RUBY_VERSION and RUBY_VERSION_NEXT in .gitlab-ci.yml.
- Add a comment to .gitlab/ci/version.yml to remind ourselves that we need to update this gitlab-foss pipeline schedule setting when modifying Ruby versions.
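The scheduled pipelines "health-check" proposed in the corrective actions could look roughly like the sketch below. It only covers the three checks listed (latest pipeline passed, ran recently, owned by the expected user) and works on plain data, so fetching schedules from GitLab and sending Slack notifications are left out. The ScheduleSnapshot type, its field names, and the one-day freshness threshold are assumptions made for illustration, not an existing implementation:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ScheduleSnapshot:
    """Hypothetical, simplified view of one pipeline schedule."""
    project: str
    description: str
    owner_username: str
    last_pipeline_status: Optional[str]            # None if it never ran
    last_pipeline_created_at: Optional[datetime]   # None if it never ran


def check_schedule(
    snapshot: ScheduleSnapshot,
    now: datetime,
    expected_owner: str = "gitlab-bot",
    max_age: timedelta = timedelta(days=1),  # assumed threshold
) -> List[str]:
    """Return a list of human-readable problems; empty means healthy."""
    problems: List[str] = []

    if (snapshot.last_pipeline_status is None
            or snapshot.last_pipeline_created_at is None):
        problems.append("no pipeline has run for this schedule")
    else:
        # Check 1: the latest scheduled pipeline passed.
        if snapshot.last_pipeline_status != "success":
            problems.append(
                f"latest pipeline status is {snapshot.last_pipeline_status!r}"
            )
        # Check 2: the schedule ran recently.
        if now - snapshot.last_pipeline_created_at > max_age:
            problems.append("latest pipeline is older than the allowed age")

    # Check 3: the schedule is assigned to the expected user.
    if snapshot.owner_username != expected_owner:
        problems.append(
            f"owner is {snapshot.owner_username!r}, expected {expected_owner!r}"
        )

    return problems


# Illustrative data only.
now = datetime(2024, 9, 10, tzinfo=timezone.utc)
snapshot = ScheduleSnapshot(
    project="gitlab-org/gitlab-foss",
    description="example schedule",
    owner_username="gitlab-bot",
    last_pipeline_status="failed",
    last_pipeline_created_at=now - timedelta(hours=2),
)
for problem in check_schedule(snapshot, now):
    print(f"{snapshot.project}: {problem}")
```

Returning a list of problems rather than a boolean keeps the result easy to forward as a notification message, one line per failed check.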