2020-09-11: Child pipelines failing after Feature Flag enabled
Summary
Child pipelines failing
https://gitlab.slack.com/archives/C101F3796/p1599815105336900
The ci_child_of_child_pipeline feature flag, enabled at 09:00 UTC on 2020-09-11, broke child pipelines for one of our SaaS tenants (https://gitlab.my.salesforce.com/00161000004zoBW). The customer is seeing the error "(failed to create a child pipeline due to reaching maximum depth of child pipelines)".
Timeline
All times UTC.
Feature Flag Rollout Timeline
2020-09-10
- 09:36 - The feature flag ci_child_of_child_pipeline was first enabled for a personal project, furkanayhan/basic-test, to see if the feature works.
- 09:59 - After confirming it worked on the project, FF ci_child_of_child_pipeline was enabled for gitlab-com/www-gitlab-com and gitlab-org/gitlab-pages.
2020-09-11
- 08:58 - After writing a message to the #production channel, FF ci_child_of_child_pipeline was enabled for gitlab-org/gitlab-runner and later gitlab-org/gitlab.
- 09:04 - FF ci_child_of_child_pipeline enabled for 25% of projects.
2020-09-11
- 22:31 - The Customer's pipeline began failing for the first time.
- 22:44 - jrreid declares incident in Slack using the /incident declare command.
- 22:58 - Incident was resolved by disabling FF ci_child_of_child_pipeline.
Incident Review
Summary
A customer reported broken parent-child pipelines with errors that said "(failed to create a child pipeline due to reaching maximum depth of child pipelines)" (https://gitlab.my.salesforce.com/00161000004zoBW). Feature flag ci_child_of_child_pipeline had been enabled for 25% of all projects on production at 09:04 UTC, about 13.5 hours before the customer's first failing pipeline.
- Service(s) affected: CI
- Team attribution: group::pipeline authoring (part of CI)
- Minutes downtime or degradation: 27 minutes
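The 27-minute figure can be checked against the timeline above. The following is a minimal sketch of that arithmetic; the variable names and the helper minutes_between are illustrative only and not part of any GitLab tooling, and the timestamps are copied from the timeline entries.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) taken from the timeline in this review.
rollout_25_percent = datetime.strptime("2020-09-11 09:04", FMT)
first_failure = datetime.strptime("2020-09-11 22:31", FMT)
incident_declared = datetime.strptime("2020-09-11 22:44", FMT)
flag_disabled = datetime.strptime("2020-09-11 22:58", FMT)


def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps (hypothetical helper)."""
    return int((end - start).total_seconds() // 60)


# Time the flag was live at 25% before the customer's first failure.
exposure_minutes = minutes_between(rollout_25_percent, first_failure)
# Time from first failure until the incident was declared.
detection_minutes = minutes_between(first_failure, incident_declared)
# Time from declaration until the flag was disabled.
mitigation_minutes = minutes_between(incident_declared, flag_disabled)
# Total degradation: first failure until the flag was disabled.
degradation_minutes = minutes_between(first_failure, flag_disabled)

print(exposure_minutes, detection_minutes, mitigation_minutes, degradation_minutes)
```

Under these timestamps, detection and mitigation each account for roughly half of the degradation window.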
Metrics
Affected projects/pipelines count from @furkanayhan:
SELECT projects.id AS project_id,
COUNT(ci_pipelines.id) AS pipelines_count
FROM projects, ci_pipelines, ci_builds
WHERE projects.id = ci_pipelines.project_id
AND ci_pipelines.id = ci_builds.commit_id
AND ci_builds.type = 'Ci::Bridge'
AND ci_builds.status = 'failed'
AND ci_builds.failure_reason = 1009
GROUP BY projects.id;
---
project_id | pipelines_count
--------------+-----------------
[redacted] | 4
[redacted] | 6
(2 rows)
Total of 2 projects and 10 pipelines affected
Customer Impact
- Who was impacted by this incident? Specific customers
- What was the customer experience during the incident? The customer's configured child pipelines failed to run while the FF was enabled.
- How many customers were affected? 1
- If a precise customer impact number is unknown, what is the estimated potential impact? Child pipelines failed for just under 27 minutes.
Incident Response Analysis
- How was the event detected? Customer reached out to us
- How could detection time be improved? Investigate whether Grafana dashboards and more monitoring-based alerting could surface this kind of failure.
- How did we reach the point where we knew how to mitigate the impact? chatops command on #production discovered by @AnthonySandoval
- How could time to mitigation be improved? Having a dedicated SRE work with the CI team and take note of any feature flag rollouts, as well as improved communication on Slack about rollouts.
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved? Before rolling out a feature flag, run a query to see how many projects could be affected (e.g. by error code).
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? Not sure, but should see if future backlog items might incur a similar issue.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change? gitlab-org/gitlab#29651 (closed)
5 Whys
- The customer's pipelines were failing, why?
  - While implementing the child of child pipelines feature, we needed to limit the number of nested child pipelines in the same project. The limit was set to 2. We added this limit with the implementation, but it affects not only the intended same-project pipelines but also multi-project pipelines. This caused the pipelines to fail.
- Why did this bug not get noticed in staging? Why did this bug not get noticed prior to rolling out in production?
  - To reproduce this error, this kind of scenario is required:
    - Project A has Pipeline 1
    - Pipeline 1 triggers Pipeline 2 of Project B
    - Pipeline 2 triggers Pipeline 3 of Project C
    - Pipeline 3 tries to create a child pipeline of Project C: Pipeline 4
    - Pipeline 4 should be created successfully, but it cannot be, because our newly introduced hierarchy tree treats this child as being 3 levels deep in the same family.
  - We did not think of this kind of scenario when writing the tests. Also, none of our own projects, in staging or in production, uses pipelines with that scenario.
  - We first enabled the feature flag for our own projects (gitlab-com/www-gitlab-com, gitlab-org/gitlab-pages, gitlab-org/gitlab-runner and gitlab-org/gitlab), but no error occurred because none of them uses that kind of scenario.
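The reproduction scenario above can be modelled in a few lines. This is a simplified sketch, not GitLab's actual implementation: the Pipeline class, both depth functions and can_create_child are hypothetical names introduced for illustration, and the only facts taken from this review are the limit of 2 and the four-pipeline scenario.

```python
from dataclasses import dataclass
from typing import Optional

MAX_CHILD_DEPTH = 2  # limit on nested child pipelines, as stated in this review


@dataclass
class Pipeline:
    name: str
    project: str
    parent: Optional["Pipeline"] = None  # upstream pipeline that triggered this one


def depth_whole_tree(pipeline):
    """Count every upstream pipeline, regardless of project (models the bug)."""
    depth = 0
    while pipeline.parent is not None:
        depth += 1
        pipeline = pipeline.parent
    return depth


def depth_same_project(pipeline):
    """Count only consecutive upstream pipelines in the same project (the intent)."""
    depth = 0
    while pipeline.parent is not None and pipeline.parent.project == pipeline.project:
        depth += 1
        pipeline = pipeline.parent
    return depth


def can_create_child(parent, depth_fn):
    # The new child would sit one level below `parent`.
    return depth_fn(parent) + 1 <= MAX_CHILD_DEPTH


# The scenario from the 5 Whys: A -> B -> C, then a child pipeline inside C.
p1 = Pipeline("Pipeline 1", "A")
p2 = Pipeline("Pipeline 2", "B", parent=p1)
p3 = Pipeline("Pipeline 3", "C", parent=p2)

print(can_create_child(p3, depth_whole_tree))    # whole-tree count rejects Pipeline 4
print(can_create_child(p3, depth_same_project))  # same-project count allows Pipeline 4
```

Counting across the whole tree puts Pipeline 4 at depth 3, which exceeds the limit, while counting only within Project C puts it at depth 1. A test built on this multi-project shape would have exercised the failing path before rollout.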
Lessons Learned
What went well
- Incident resolved pretty quickly and had pretty low impact
- Not surprising this particular customer surfaced this issue, they use this feature pretty extensively
What could be improved
- As engineers, we need to have better communication in the feature rollout issues, #production and #support_gitlab-com Slack channels.
- Consider feature flagging to low profile customers only
Corrective Actions
See related issue tracking corrective action items: gitlab-org/gitlab#258217 (closed)
- 2020-09-11 22:44 UTC: Incident was declared.
- 2020-09-11 22:58 UTC: Incident was resolved by disabling FF.
- Afterwards, we fixed the bug and rolled out the feature flag again.