Reduce baking time duration in coordinated pipelines
Context
Coordinated pipelines have a baking time job whose main purpose is to let the auto-deploy package sit on canary for one hour before it is promoted to production. This gives us a window to spot anomalies before the package reaches production.
The baking time job is triggered after the canary deployment completes, and one of the last tasks of a canary deployment is to execute QA. QA on canary runs in the `gprd-cny-qa` stage, which is composed of three jobs:
- `gprd-cny-gitlab-qa-full` - Usually short; its execution only takes a few seconds.
- `gprd-cny-gitlab-qa-smoke` - This one takes roughly 15 minutes. If it fails, it's automatically retried.
- `gprd-cny-gitlab-qa-smoke-main` - This one takes roughly 15 minutes and it's allowed to fail.
Problem
QA execution on canary is lengthy, and the Delivery team has little control over its duration. Consider the QA execution on canary for the last 10 deployments:
| Deployment | `gprd-cny-qa` stage duration |
|---|---|
| 14.4.202110182020-03ef2912ee9.2885f03af62 | 32 minutes |
| 14.4.202110181820-03ef2912ee9.741e21d73ff | 30 minutes |
| 14.4.202110182020-03ef2912ee9.2885f03af62 | 38 minutes |
| 14.4.202110180920-02ab3b585dc.522c391b5c7 | 36 minutes |
| 14.4.202110180620-94a652edd35.fd277033138 | 51 minutes |
| 14.4.202110180320-21388c5b44a.127b9d620ce | 31 minutes |
| 14.4.202110131820-5a266eef146.44d20885c78 | 30 minutes |
| 14.4.202110131612-187db88810c.b3a8f3e8817 | 31 minutes |
| 14.4.202110130720-0ac2f3dfa55.523842120d6 | 33 minutes |
| 14.4.202110122020-6cb22fc9cfb.2e9617f4a26 | 30 minutes |
On average, canary QA takes around 34 minutes. Adding one hour of baking time means the canary and production deployments are about 1h 34min apart. The gap can be longer if QA fails and is retried on canary, as happened with `14.4.202110180620-94a652edd35.fd277033138`: in that case, the canary and production deployments were separated by almost two hours.
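The arithmetic behind these figures can be reproduced with a short Python sketch. It assumes the gap between canary and production is simply the QA stage duration plus the baking time, ignoring any other delay in the pipeline; the variable and function names are illustrative only.

```python
from statistics import mean

# gprd-cny-qa stage durations in minutes, copied from the table above.
qa_minutes = [32, 30, 38, 36, 51, 31, 30, 31, 33, 30]

# Current baking time, in minutes.
BAKING_MINUTES = 60


def as_hours(minutes):
    """Format a duration in minutes as 'Xh YYmin', rounded to the nearest minute."""
    hours, rest = divmod(round(minutes), 60)
    return f"{hours}h {rest:02d}min"


average_qa = mean(qa_minutes)
# Simplifying assumption: gap = QA stage duration + baking time.
average_gap = average_qa + BAKING_MINUTES
longest_gap = max(qa_minutes) + BAKING_MINUTES

print(f"average QA duration: {average_qa:.1f} min")
print(f"average canary-to-production gap: {as_hours(average_gap)}")
print(f"longest canary-to-production gap: {as_hours(longest_gap)}")
```

Under that assumption, the ten samples give an average QA duration of 34.2 minutes, an average gap of 1h 34min, and a longest gap of 1h 51min.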
Proposal
The Delivery team does control the baking time duration, and we can decrease it to shorten the MTTP (Mean Time To Production). Based on the data above, we could consider decreasing baking time to 30 minutes. Doing so has the following advantages:
- Depending on timing, it gives us a chance to do another deployment to production.
- It brings us closer to #1644
- We still have roughly one hour to spot anomalies: by the time we start QA, canary is already running the new version for live traffic, so ~30 minutes of QA plus 30 minutes of baking time gives enough time to spot any issues before promoting the changes to production.
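As a rough check of the last point, the following Python sketch applies the proposed 30-minute baking time to the QA durations listed in the Problem section. It assumes the observation window on canary is the QA stage duration plus the baking time; the 60-minute target and all names are illustrative, not part of any existing tooling.

```python
# gprd-cny-qa stage durations in minutes, taken from the Problem section.
qa_minutes = [32, 30, 38, 36, 51, 31, 30, 31, 33, 30]

PROPOSED_BAKING_MINUTES = 30
TARGET_WINDOW_MINUTES = 60  # the one-hour window the current baking time provides


def observation_window(qa, baking):
    """Minutes the new version serves canary traffic before promotion.

    Assumption: the window is QA stage duration plus baking time.
    """
    return qa + baking


windows = [observation_window(qa, PROPOSED_BAKING_MINUTES) for qa in qa_minutes]

shortest_window = min(windows)
average_window = sum(windows) / len(windows)
meets_target = all(w >= TARGET_WINDOW_MINUTES for w in windows)

print(f"shortest window: {shortest_window} min")
print(f"average window: {average_window:.1f} min")
print(f"every deployment keeps at least {TARGET_WINDOW_MINUTES} min: {meets_target}")
```

For these ten samples the shortest window is exactly 60 minutes and the average is 64.2 minutes, so every deployment would still have had at least an hour on canary before promotion.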