What can we do about cleaning up failed deploys in CI?
Currently, when a change is made to the charts, review environments are deployed (to GKE and EKS) in the Review stage. Tests are run against these environments in the Specs stage. If the deploy job fails, the environment fails, the stop job for the environment cannot be run, and resources must be cleaned up manually by Distribution engineers.
We need to investigate this and determine the right approach.
added For Scheduling devopssystems groupdistribution priority3 severity3 spike labels
- Dustin Collins marked this issue as related to #2075 (closed)
- Dustin Collins added backstage [DEPRECATED] label
- Contributor
To add some pipeline examples of this happening:
- https://gitlab.com/gitlab-org/charts/gitlab/-/jobs/547432615
- https://gitlab.com/gitlab-org/charts/gitlab/-/jobs/547595276
Also, the team mentioned this command to find the related CI job (& MR):
helm get values {name} | yq '.ci'
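As a rough sketch of how that command could be applied across failed deploys, the following wraps it in a loop over the releases Helm reports as failed. The function names are hypothetical, and it assumes a Helm 2 client with `helm list --failed --short` available and a `yq` that reads from stdin, as in the command above.

```shell
# Sketch only: function names are hypothetical, not part of the charts repo.

# Print the `ci` block stored in one release's values.
release_ci_info() {
  # </dev/null keeps helm from consuming the loop's input below.
  helm get values "$1" </dev/null | yq '.ci'
}

# For every release Helm lists as failed, print its name and CI metadata.
list_failed_release_ci_info() {
  helm list --failed --short | while IFS= read -r name; do
    [ -n "$name" ] || continue
    printf '== %s ==\n' "$name"
    release_ci_info "$name"
  done
}
```

Running `list_failed_release_ci_info` against the CI cluster would then show, per failed release, whatever CI job and MR details were recorded in its values.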
And a note that the CI cluster uses Helm 2 (`asdf shell helm 2.16.6`).
- Mitchell Nielsen marked this issue as related to #2083 (closed)
- 🤖 GitLab Bot 🤖 added sectioncore platform label
- Robert Marshall added to epic gitlab-org&5748
- Robert Marshall changed epic to gitlab-org&6693
- Balasankar 'Balu' C added group::distributiondeploy label
- Dilan Orrino removed For Scheduling label
- Dilan Orrino added Deliverable label
- Dilan Orrino changed milestone to %15.7
- Maintainer
@dorrino - please see the following guidance and update this issue.
1 Error: Please add a typebug, typefeature, typemaintenance, or a subtype label to this issue. If you do not feel the purpose of this issue matches one of the types, you may apply the typeignore label to exclude it from type tracking metrics and future prompts.
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#8940 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#9026 (closed)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#9125 (closed)
- Nailia Iskhakova added maintenancepipelines typemaintenance labels
- Nailia Iskhakova added quad-planningcomplete-no-action label
- DJ Mountney changed milestone to %15.10
- Peter Lu changed milestone to %Next 1-3 releases
- DJ Mountney mentioned in issue gitlab-org/distribution/team-tasks#1176 (closed)
- DJ Mountney added FY24Q4 label
- DJ Mountney added Distribution OKRO1KR1 FY24Q1 labels and removed FY24Q4 label
- Maintainer
Setting health status to on track as the milestone has just begun. Issue participants are welcome to override this by setting the health status to another value.
- 🤖 GitLab Bot 🤖 changed health status to on track
- DJ Mountney changed milestone to %Next 1-3 releases
- DJ Mountney added priority2 label and removed priority3 label
- Developer
!3453 (merged) fixed some of these issues by setting up `auto_stop_in` in a `create_review_*` job that cannot fail. However, if you retry a `review_*` job, the `auto_stop_in` in the corresponding `create_review_*` job is canceled and the environment is left hanging. There is a fix for the underlying issue at gitlab-org/gitlab#382549 (closed). I suggest enabling this feature (after some testing) to see if it remedies our problem of retried jobs not getting cleaned up.