Pipeline duration metrics do not seem correct with child pipelines

It seems our pipelines duration calculation could be wrong in GitLab (and therefore Sisense). See https://gitlab.com/gitlab-org/gitlab/-/pipelines/452082225 which reports taking only 29 minutes, while the child pipeline itself takes 62 minutes: https://gitlab.com/gitlab-org/gitlab/-/pipelines/452086059. I noticed that while looking at https://app.periscopedata.com/app/gitlab/564156/EP---Pipelines-health?widget=11750100&udv=895976 where “code” pipelines take longer than “frontend” ones, which seems unexpected.

  • @rymai - find out from with grouppipeline execution if this is desired or unexpected. We believe this should be unexpected for start-review-app-pipeline because of strategy:depend
  • @rymai - continue finding a few examples
  • @kwiebers identify the scope of impacted PIs and dashboards.

Discussed in https://docs.google.com/document/d/1vgEAUOzph1WS4_0olmywYai9HMyIvo63l2BaPIOrm90/edit#bookmark=id.lv9nlob25buz

Example:

  • Parent pipeline queued for 7 seconds (shown in the UI), started at 2022-03-11 10:23:10.063 and finished at 2022-03-13 11:26:05.387 (2 days, 1 hour, 3 minutes), duration reported as 53 minutes (shown in the UI)
  • Child pipeline queued for 2 days, 40 minutes (shown in the UI), started at 2022-03-13 11:24:03.062 and finished at 2022-03-13 11:26:05.249 (2 minutes!), duration reported as 32 minutes (shown in the UI)
    • review-build-cng-env queued for 1 second, started at 2022-03-11 10:44:45.615 and finished at 2022-03-11 10:45:36.228 (50 seconds)
    • allure-report-qa-reliable queued for 53 minutes, started at 2022-03-11 11:37:33.911 and finished at 2022-03-11 11:38:56.446 (1 minute 23 seconds)
    • review-stop queued for 2 days, 40 minutes, 2022-03-13 11:24:04.194 and finished at 2022-03-13 11:26:05.168 (2 minutes). This timestamp coincides with the AutoStopCronWorker which runs by default every hour on minute 24: https://gitlab.com/gitlab-org/gitlab/-/blob/3a71a48b5954d8b03f91885913b80c8178e30e00/config/gitlab.yml.example#L517-519

Notes:

  • Pipeline durations are based on jobs' durations (see https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/ci/pipeline/duration.rb), it's a CI minutes duration, not a wall-clock duration.
  • start-review-app-pipeline starts approximately 21 minutes 30 seconds (2022-03-11 10:44:44.184) after the parent pipeline started (2022-03-11 10:23:03.435)
  • Parent pipeline duration (including the child pipeline one) should then be 53 minutes - 21 minutes = 32 minutes, which what's stored (and shown in the UI) in the database

Open questions

  1. Why does the child pipeline started_at timestamp (2022-03-13 11:24:03.062) is set to (almost) the same started_at timestamp as the review-stop (2022-03-13 11:24:04.194)?

Conclusion

By looking closer, it seems start-review-app-pipeline is actually not in the critical path anymore: it takes 28 minutes on average these days (https://app.periscopedata.com/app/gitlab/496118/EP---Sandbox?widget=14116510&udv=785399), and they start after 20-21 minutes due to its prerequisite jobs.

That means some of the rspec * jobs are now the critical jobs in both code and frontend pipelines, but code pipelines run the additional rspec:undercoverage job, which explains why code pipelines are slower than frontend pipelines! 💡

Also, it seems rspec system jobs got 5 minutes slower since 2022-02-14 (Apollo was updated from 2 to 3 that day: https://gitlab.slack.com/archives/C0GQHHPGW/p1644837170251409):

Lastly, it seems the child pipelines went from 50 to 30 minutes since 2022-02-14/16... (We did merge gitlab-org/gitlab!80121 (merged) on 2022-02-15).

We should investigate that. 🤔

Edited by Rémy Coutable