delivery issues
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues
2024-03-27T16:15:10Z
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20072
📣 Announcement: Monthly Release Information dashboard
2024-03-27T16:15:10Z
Jenny Kim
GitLab monthly releases are published each month.
~"team::Delivery-Releases" is introducing a [Grafana dashboard "delivery: Release Information"](https://dashboards.gitlab.net/d/delivery-release_info/delivery3a-release-information?orgId=1) that includes information and status of the monthly releases in order to increase discoverability, visibility, and transparency about GitLab releases.
[Grafana dashboard "delivery: Release Information"](https://dashboards.gitlab.net/d/delivery-release_info/delivery3a-release-information?orgId=1)
## What is on the dashboard?
This is the first iteration of the dashboard. Currently it shows the following information:
* Active monthly release version
* Active monthly release date
* Current status of the active monthly release
* Links for release-related resources
![image.png](/uploads/439e9b689bccec6d30cd7b3f2094368a/image.png)
We are aiming for this dashboard to contain more information in future iterations, such as information and status about the patch/security release. We will also iterate on it further based on feedback.
## Why did we introduce this dashboard?
The impact of GitLab releases spans multiple GitLab departments; however, only the Delivery team has the full picture of them.
This dashboard is intended for the stage groups, so they can easily monitor the status of the monthly release and plan accordingly. We hope to increase discoverability and transparency about the release processes with the introduction of this dashboard.
## Who does this impact?
* Stage group engineers, PMs, EMs
* Those inquiring about the current monthly release status
* Those planning merge requests/commits to be included in the milestone release
## When does the dashboard information get updated?
The information displayed on this dashboard will continue to update as release managers proceed with the [self-managed releases process](https://handbook.gitlab.com/handbook/engineering/releases/#self-managed-releases-process).
## FAQ
### What about information about the patch/security releases?
Currently the dashboard contains information about the monthly releases. [Information about the active patch releases](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19988#priority-list) is planned for future iterations.
### Will there be any changes to the ChatOps announcements made in Slack channels?
The cadence of the announcements that the release managers make via the `ChatOps` bot in Slack channels such as `#releases` will not change. They will continue to contain information about the commit to be released. Going forward, the announcements will also contain the link to the dashboard for easier access.
### Looking at the status, how can I know if an MR merge commit is included in the release?
There is a guaranteed commit that we usually announce on the Friday before the monthly release. This is when the status on the dashboard changes to "announced".
* "Open": MR merge commits are expected to be included in the release
* "Announced": MR merge commit is not guaranteed to be included in the release (since it may not be fully deployed to production before the RC gets tagged)
* "RC Tagged": MR merge commit is not included in the release
Please refer to https://handbook.gitlab.com/handbook/engineering/releases/#how-can-i-determine-if-my-merge-request-will-make-it-into-the-monthly-release for more information.
## Feedback
If you have any comments, questions, or feedback about this dashboard, please post them on this issue, or on the [`#releases` Slack channel](https://gitlab.enterprise.slack.com/archives/C0XM5UU6B).
16.10
Jenny Kim
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20066
Define with Test Platform team about the usage of Release Environment in QA's workflow
2024-03-28T11:21:30Z
Dat Tang
We use Release Environment for validating stable branches, and with this use case, Test Platform is our direct customer - they need to know the status of the pipeline and take action accordingly.
According to the [epic](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/943#goal-goal "Release Environments Fully Operational for Stable Branches"), the Release Environment aims to provide:
* More confidence in our release process: backports are tested against a running environment that is as close to production as possible, so we deliver more stable patch releases to customers.
* A long-lived environment for each minor version inside the [maintenance policy](https://docs.gitlab.com/ee/policy/maintenance.html), to which we can deploy a new backport as soon as it is available.
* The ability to spin up any out-of-support GitLab minor version in case of out-of-policy backport requests.
* GitLab environments where teams can access and debug issues without manual setup on a local machine, which lowers the lead time of backport requests.
From what we know, Test Platform uses channels like "#qa-production" and "#qa-staging" to learn the results of QA pipelines. Delivery and QA need to agree on:
1. What does Test Platform expect from Release Environment in validating new commits in stable branches? (I think it is the same as on `.com`, except that Test Platform no longer needs to run the tests locally because they run automatically)
2. How to notify Test Platform about the result of release environment pipelines (e.g. which Slack channel to use)
3. How to implement the notification?
### Exit Criteria
* [ ] Contact, discuss and agree with Test Platform about the above questions
* [ ] Create issues accordingly
### Out of Scope
* Actual implementation of notification channels. It should be done by https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19999.
Dat Tang
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19944
Delivery::Deployments FY25 Q1 OKRs ideas and discussion
2024-02-09T18:58:36Z
Dave Smith dsmith@gitlab.com
### Context
We are a few weeks away from FY25 Q1 and it is time to start planning for OKRs. Let's use this issue to suggest possible ideas and discuss the best combination of ideas and scope to set us up for Q1.
### FY24 Q4 State
As Delivery::Orchestration, we have made great progress on the Security Release process and its automation over the last quarters through:
1. [Automating and combining bug fixes and security fixes into patch releases](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1073)
2. [Reduce Security release preparation to less than 24hrs](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1061)
3. The work on [Creating Release Environments, for testing new releases in the supported maintenance policy](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1061) restarted, and it has made significant progress.
4. [We are currently in the final weeks of the Pilot of two scheduled security releases per month in preparation for planned releases](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1125)
The Delivery::System team made significant progress in bringing the [Dedicated Tenant Upgrades to Zero Downtime](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1150).
### FY25 Q1 - What's next?
We have plenty of work to iterate further on the FY24 Q4 projects.
- Should we look at changes to our deployment pipelines to split out by environment(s)?
- Related to splitting out deployments by environments, what about by service/type of deployment?
- What further improvements do we need to make for deployments to Dedicated?
- What wrap-up work do we need to do for Release Environments?
- What tech debt do we have that needs to be taken care of?
#### What about Cells?
With Cells moving to an earlier timeline with a [different initial design](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/139519) it is time to start thinking about an Objective that could bring us a step closer to the solution. Considering that the first iteration of Cells will be fully based on Dedicated and we are gaining expertise in how to Release to Dedicated, are there opportunities for the Delivery::Release team in that direction?
Please share your OKR ideas for Q1.
cc @rpereira2, @ggillies, @vglafirov, @anganga, @nolith, @skarbek
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19460
Test GitLab merge requests as a long term solution to sync security changes to GitLab repository
2023-11-02T16:55:27Z
Mayra Cabrera
On https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19402, the merge-train was adopted to sync security changes to GitLab canonical once the security release is out. This strategy helped reduce the time release managers spent on the last section of a security release and made the syncing experience painless (details: https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19402#note_1451901930).
Using the merge-train as a solution works for now, but it might not work in the long term:
* The git traffic for the GitLab project is only going to increase with more engineers joining the team and more features being developed.
* It might conflict with the adoption of the [merge train feature](https://docs.gitlab.com/ee/ci/pipelines/merge_trains.html) in the GitLab project https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/195
As a long-term solution that can support our scaling needs, we would like to test the GitLab merge request feature as a way of syncing security commits into the canonical repository at the end of the security release.
Some of the key benefits of this approach are:
1. We are saving a lot of time and bandwidth by not cloning the security and the canonical repositories to perform the merge inside the CI job
2. It scales with the number of commits to the default branch, as GitLab will take care of running the merges in order without conflicting pushes
3. It will be compatible with the adoption of the [merge train feature](https://docs.gitlab.com/ee/ci/pipelines/merge_trains.html) in the GitLab project https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/195
4. We are removing a custom implementation in favor of a feature of the product
Things to figure out in our testing:
1. [x] Can we have a blazing-fast CI for the security:master -> canonical:master merge request? https://gitlab.com/gitlab-org/gitlab/-/merge_requests/125862
2. [x] Can we take advantage of [`triage-ops` reactive framework](https://gitlab.com/gitlab-org/quality/triage-ops/-/tree/master/doc/reactive) to automatically approve and merge this (and only this) type of merge request? https://gitlab.com/gitlab-org/quality/triage-ops/-/merge_requests/2310
3. [ ] Can we have the GitLab-bot as a codeowner approver for every file in the repo? https://gitlab.com/gitlab-org/gitlab/-/merge_requests/126503
## Sequence diagrams
### Today
```mermaid
sequenceDiagram
actor rm as Release Manager
participant rt as release-tools
participant mt as merge-train
participant gl as gitlab.com
actor Maintainer
rm->>rt: Sync security default branch
rt->>mt: merge security:master->canonical:master
activate mt
mt->>+gl: clone gitlab-org/security/gitlab
gl-->>-mt: security:master
mt->>+gl: clone gitlab-org/gitlab
gl-->>-mt: canonical:master
par Release Process
mt->>mt: merge security:master into canonical:master
and Development Process
Maintainer->>+gl: Merge gitlab-org/gitlab!12345
gl-->>-Maintainer: merged
end
mt->>+gl: git push
gl-->>-mt: ❌ push failure, master advanced
loop up to 5 times if push fails
mt->>+gl: pull gitlab-org/gitlab master
gl-->>-mt: canonical:master
mt->>mt: merge HEAD into canonical:master
mt->>+gl: git push
gl-->>-mt: push result
end
deactivate mt
rm->>rt: Verify mirroring status
```
### Desired state
```mermaid
sequenceDiagram
actor rm as Release Manager
participant rt as release-tools
participant gl as gitlab.com
participant ops as triage-ops
participant runner
actor Maintainer
rm->>rt: Sync security default branch
par Release Process
rt->>+gl: API merge security:master->canonical:master
gl-->>rt: created gitlab-org/gitlab!5555
and Development Process
Maintainer->>+gl: Merge gitlab-org/gitlab!12345
gl-->>-Maintainer: merged
end
runner->>+gl: pick a job
note right of runner: no-op single job pipeline
gl-->>-runner: single job for gitlab-org/gitlab!5555
activate runner
gl->>+ops: mergerequest.opened gitlab-org/gitlab!5555
ops->>gl: approve gitlab-org/gitlab!5555
ops->>gl: set auto-merge gitlab-org/gitlab!5555
ops-->>-gl: done
runner->>gl: job completed
deactivate runner
gl->>gl: auto-merge
deactivate gl
rm->>rt: Verify mirroring status
```
Alessio Caiazza
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1227
Add support for triggering auto-deployments on different GitLab instances, such as dev.gitlab.org or ops.gitlab.net
2023-02-02T18:18:03Z
Yorick Peterse
Auto-deployments are triggered using GitLab.com APIs. If GitLab.com goes down,
this means we can't trigger a new auto-deploy.
The code for auto-deploys already supports passing in a different GitLab client,
but based on a quick glance over the code it does not change the project paths
according to the client that is used. This means that even when passing in a dev
or ops client, the code will continue to use GitLab.com project paths.
To resolve this, I propose that we add a feature flag for performing the
auto-deploys using a different API host. When this flag is set, we:
1. Change the API client that we use
1. Change the project paths to match the host that is used
Since most build work happens on dev, I think we should use dev as a fallback;
ops.gitlab.net doesn't make much sense as IIRC we don't mirror everything there.
For more information, see the discussion at
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1214#note_416084871.
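The idea of switching the API host and the project paths together can be sketched as follows. This is only an illustration in Python, not release-tools code: the flag name `AUTO_DEPLOY_USE_DEV`, the `PROJECT_PATHS` table, and the dev project path are hypothetical placeholders.

```python
import os

DEFAULT_HOST = "gitlab.com"
FALLBACK_HOST = "dev.gitlab.org"

# Hypothetical lookup table: one project path per host.
# The dev path below is a placeholder, not the real path on dev.gitlab.org.
PROJECT_PATHS = {
    "gitlab.com": {"gitlab": "gitlab-org/gitlab"},
    "dev.gitlab.org": {"gitlab": "example-group/gitlab"},
}


def auto_deploy_host(env=None):
    """Pick the API host, falling back to dev when the (hypothetical) flag is set."""
    env = os.environ if env is None else env
    if env.get("AUTO_DEPLOY_USE_DEV") == "true":
        return FALLBACK_HOST
    return DEFAULT_HOST


def project_path(project, host):
    """Return the project path that matches the host in use."""
    return PROJECT_PATHS[host][project]
```

Resolving the path through the host, rather than hard-coding it, is what keeps the client and the paths from drifting apart when the fallback is active.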
## Proposal
Per https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1227#note_445719399:
1. Change the auto-deploy code so that it uses a different project path based on the GitLab API client that is used. This way we'd use the right paths on dev versus GitLab.com
* See https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1227#note_418261805 for some more info
1. Add support for setting an environment flag, and when this flag is set use dev for the auto-deploy process
1. Document that flag
1. Optionally allow setting this flag using chatops. This is probably not necessary as we can just set it using the CI/CD variables settings page.
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20106
Update resource_group process mode of jobs that set environment_state metric
2024-03-26T10:47:36Z
Reuben Pereira
## Summary
All CI jobs that set the `auto_deploy_environment_state` metric will run in a `resource_group`. This will prevent them from executing at the same time.
The default [process mode](https://docs.gitlab.com/ee/ci/resource_groups/index.html#process-modes) for a resource group is `unordered`. We need to change it to `oldest_first`, so that the CI jobs are executed in a first-in first-out pattern (FIFO).
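The process mode change described above maps to a single `PUT` request against the resource group API. The sketch below only builds that request; the endpoint shape and the `process_mode` parameter follow the linked API documentation, while the API URL, project path, resource group key, and token are placeholders. release-tools itself is Ruby, so this Python version is purely illustrative.

```python
import json
import urllib.parse
import urllib.request


def build_process_mode_request(api_url, project, resource_group, token,
                               mode="oldest_first"):
    """Build (but do not send) a request that edits a resource group's process mode."""
    # Project paths and resource group keys must be URL-encoded in the path.
    project_id = urllib.parse.quote(project, safe="")
    key = urllib.parse.quote(resource_group, safe="")
    url = f"{api_url}/projects/{project_id}/resource_groups/{key}"
    body = json.dumps({"process_mode": mode}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call; keeping construction separate makes the request easy to inspect before anything is sent.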
## Proposal
Call the [API](https://docs.gitlab.com/ee/api/resource_groups.html#edit-an-existing-resource-group) to set the process mode to `oldest_first` in [tasks/metrics/set_environment_state.rb#L19](https://gitlab.com/gitlab-org/release-tools/-/blob/aead3cef335b3e570f5d0f5ada53af9313e9b577/lib/release_tools/tasks/metrics/set_environment_state.rb#L19).
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20087
Create metric for patch release information + status
2024-03-27T20:20:34Z
Jenny Kim
### Context
In https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1245, we created the metric `delivery_release_monthly_status` to capture and display information about the monthly release on the [release information dashboard](https://dashboards.gitlab.net/d/delivery-release_info/delivery3a-release-information?orgId=1).
With this issue, we aim to create a very similar metric `delivery_release_patch_status` for the patch releases.
As captured on [this list of patch release information to display on Grafana](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1255#information-to-display-on-grafana "Release dashboard: Patch release status + information"), let's create that metric with the following labels:
* `versions`: Upcoming patch release versions, leveraging the [gitlab-releases gem's](https://gitlab.com/gitlab-org/ruby/gems/gitlab-releases) `next_versions` function in [GitlabReleasesClient in release-tools](https://gitlab.com/gitlab-org/release-tools/-/blob/master/lib/release_tools/gitlab_releases_client.rb?ref_type=heads)
* `release_date`: Upcoming patch release date, similarly leveraging the `next_patch_release_date` function
The value of the metric dictates the current status:
* 1 = open: Similar to the ["Open" status for the monthly release status](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1245#grafana-metrics "Create monthly release metrics for release dashboard") (green), this status signifies that any ~"security-target" labelled security issues would be included in the next patch release.
* Created during the "security_release_finalize:start" job of the previous patch release for the next versions. ([internal example](https://gitlab.com/gitlab-org/release/tasks/-/issues/8971#final-steps "Security patch release: 16.9.2, 16.8.4, 16.7.7"))
* 2 = warning: Similar to the ["Announced" status for the monthly release status](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1245#grafana-metrics "Create monthly release metrics for release dashboard") (yellow), this status signals that the merging date for patch release is getting closer.
* Metric value updated 4 business days (Thursday) before the next patch release date (Wednesday).
* Idea for implementation: a pipeline schedule that runs every Thursday at 00:00 UTC using the [cron job scheduler](https://docs.gitlab.com/ee/topics/cron/index.html#cron-syntax), checks whether it is the Thursday before the next patch release date, and, if so, updates the metric
* 3 = closed: Similar to the ["Tagged RC" status for the monthly release status](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1245#grafana-metrics "Create monthly release metrics for release dashboard") (red), this status signals that security MRs have been merged, and ~"security-target" security issues may no longer be added to the upcoming patch release.
* Metric value updated during default security MR merging pipeline. (`/chatops run release merge --security --default-branch`) ([internal example](https://gitlab.com/gitlab-org/release/tasks/-/issues/8971#two-days-before-due-date-2024-03-04 "Security patch release: 16.9.2, 16.8.4, 16.7.7"))
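The Thursday check for the `warning` status could look roughly like the sketch below. It assumes the schedule uses the cron expression `0 0 * * 4` (Thursdays at 00:00) and that the release date is a Wednesday, so the preceding Thursday is exactly six calendar days earlier. The function name `is_warning_day` is hypothetical.

```python
from datetime import date, timedelta


def is_warning_day(today: date, release_date: date) -> bool:
    """True when `today` is the Thursday immediately before a Wednesday release."""
    # date.weekday(): Monday is 0, so Thursday is 3.
    return today.weekday() == 3 and release_date - today == timedelta(days=6)
```

The scheduled job would call this with the current date and the value returned by `next_patch_release_date`, and only update the metric when it returns `True`.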
### Exit Criteria
* [ ] `delivery_release_patch_status` is created with the `open` status during the finalize job of the previous patch release for the next versions
* [ ] `delivery_release_patch_status` is updated with the `warning` status 4 business days before the next patch release date
* [ ] `delivery_release_patch_status` is updated with the `closed` status after the default security MRs are merged
Jenny Kim
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20075
Design dashboards for discovering deployment blockers
2024-03-28T06:39:17Z
Reuben Pereira
## Summary
We have 3 metrics as the output of the [spike to experiment with tracking environments' ability to receive a deployment](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19907).
- `delivery_auto_deploy_package_state`
This metric tracks the state of every auto-deploy package through the following states: `missing`, `pending`, `building`, `ready`, `failed`.
- `delivery_auto_deploy_environment_state`
This metric will be used to track the state of each environment, like `ready` (ready to receive a deployment), `locked`, `awaiting_promotion` (gstg/gprd can be in this state when there is a package ready to be promoted), `baking_time` (gprd-cny can be in this state when a package is baking on it).
- `delivery_auto_deploy_env_lock_state`
This metric will track why an environment is in the locked state; the previous metric (`delivery_auto_deploy_environment_state`) only tracks whether an environment is `locked`. An environment can be locked due to an ongoing deployment, a post-deploy migration, or QA. In future iterations, we can also track when an environment is locked due to an incident or change request.
This issue is for discussing what we would like to see in a Grafana dashboard. What will help us discover where to spend our efforts in trying to reduce the time that deployments are blocked?
Reuben Pereira
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20067
Extract version upgrades into a dedicated script for instrumentor configure stage
2024-03-27T20:27:02Z
Alessio Caiazza
In https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20056 we shrank the ring0 pipeline down to only run the `configure` stage.
That first iteration allowed us to validate that it was possible to run a single stage in the absence of external changes (only the GitLab version can change).
The next step is to identify the bare minimum amount of operations we should perform to only upgrade GitLab.
We want to edit the instrumentor image so that, for the `configure` stage, it adds an additional script `bin/deploy` that ignores all the Terraform changes and limits the Ansible run to only the tag necessary to upgrade the whole installation.
Vladimir Glafirov vglafirov@gitlab.com
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20064
Add metric to track locked states of environments
2024-03-01T05:02:34Z
Reuben Pereira
## Summary
The metric being created in https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20062 will have `locked` as one of the states. The metric added in this issue will track why an environment is locked.
An environment can be locked by a `deployment`, `QA`, `post-deploy migration`, or `failed_deployment`.
```plantuml
state "locked" as locked {
state "locked_deployment" as locked_deployment
state "locked_qa" as locked_qa
state "locked_deployment_failed" as locked_deployment_failed
state "locked_post_deploy_migration" as locked_pdm
[*] --> locked_deployment
[*] --> locked_pdm
locked_deployment --> locked_qa: Deployment completes and QA is triggered.
locked_deployment --> locked_deployment_failed: Deployment fails and env stays locked.
locked_deployment_failed --> locked_qa: Failure is retried and succeeds.
locked_deployment_failed --> locked_deployment: An RM manually unlocks the env and starts a new deployment.
locked_pdm --> locked_qa: PDM on gstg completes and QA starts.
locked_pdm --> [*]: PDM on gprd completes.
locked_qa --> [*]
}
```
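As a rough illustration of how a lock-reason metric with `target_env`, `target_stage`, and `lock_reason` labels might be set and unset, here is a small in-memory sketch. It is not the release-tools implementation: the `LockStateGauge` class is hypothetical, and clearing the previous reason whenever a new one is set is an assumption made for this example.

```python
from typing import Dict, Tuple

# Label tuple: (target_env, target_stage, lock_reason)
Labels = Tuple[str, str, str]


class LockStateGauge:
    """In-memory stand-in for a gauge keyed by its label values."""

    def __init__(self) -> None:
        self.values: Dict[Labels, int] = {}

    def lock(self, target_env: str, target_stage: str, lock_reason: str) -> None:
        # Assumption: only one lock reason is active per env/stage at a time,
        # so any earlier reason is reset to 0 before the new one is set to 1.
        self.unlock(target_env, target_stage)
        self.values[(target_env, target_stage, lock_reason)] = 1

    def unlock(self, target_env: str, target_stage: str) -> None:
        for key in self.values:
            if key[:2] == (target_env, target_stage):
                self.values[key] = 0
```

For example, calling `lock("gstg", "main", "deployment")` followed by `lock("gstg", "main", "qa")` leaves only the `qa` series at 1 (the stage value `main` is a placeholder).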
## Proposal
- [ ] Create a metric called `auto_deploy_env_lock_state{target_env="", target_stage="", lock_reason=""}`. It gets set to 1 with the appropriate value in label `lock_reason` when an environment is locked.
- [ ] Set the metric when environment is locked for deployment.
- [ ] Set metric when deployment fails.
- [ ] Set metric when QA starts, and unset it when QA ends.
- [ ] Set metric when environment is locked for post deploy migrations, and unset it when post deploy migrations complete.
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20062
Create metric to track auto deploy environment states
2024-03-07T05:47:45Z
Reuben Pereira
## Summary
Create a metric to track states of environments used for auto deploy deployments, i.e. gstg-cny, gstg-ref, gprd-cny, gstg, gprd.
The metric can have states like `locked`, `ready`, `baking_time`, `awaiting_promotion`. This was discussed in https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19907#note_1773273653.
```plantuml
title Environment states and their transitions
state "ready" as ready
state "baking_time" as baking_time
state "awaiting_promotion" as awaiting_promotion
awaiting_promotion: This state applies only to gstg and gprd.
baking_time: Only applies to gprd-cny.
ready: Environment is ready to receive another deployment
locked: Environment is locked by a deployment or post-deploy migration or QA, etc.
ready --> locked: Deployment starts
locked --> baking_time: QA on gprd-cny completes and baking time begins.
locked --> awaiting_promotion: QA on gstg/gprd completes and there is another package ready to be promoted.
locked --> ready: QA completes and env is ready for a new deployment.
ready --> awaiting_promotion: Package finishes baking on gprd-cny.
baking_time --> locked: Baking time completes and new deployment starts.
baking_time --> ready: Baking time completes and env is ready for another deployment.
awaiting_promotion --> locked: Deployment begins.
```
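The transitions drawn in the diagram above can be captured in a small lookup table that rejects any state change not listed. This is a minimal sketch in Python rather than the release-tools implementation; the names `ALLOWED_TRANSITIONS`, `can_transition`, and `transition` are hypothetical.

```python
# Edges copied from the diagram above; adjust if the final rules differ.
ALLOWED_TRANSITIONS = {
    "ready": {"locked", "awaiting_promotion"},
    "locked": {"baking_time", "awaiting_promotion", "ready"},
    "baking_time": {"locked", "ready"},
    "awaiting_promotion": {"locked"},
}


def can_transition(current: str, target: str) -> bool:
    """Return True when the diagram has an edge from `current` to `target`."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the change is not allowed."""
    if not can_transition(current, target):
        raise ValueError(f"invalid environment state change: {current} -> {target}")
    return target
```

Validating in one place means every CI job that sets the metric goes through the same check instead of each job re-implementing the rules.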
## Proposal
This will require release-tools to model the environment states as a state machine, so that we can verify that a state change makes sense. For example, when a deployment is ongoing, an environment should not be moved to the `awaiting_promotion` state. It can be moved to `awaiting_promotion` only from the `ready` state.
- [ ] Create a state machine to model and validate state transitions.
- [ ] Create an `auto_deploy_environment_state` metric.
- [ ] Create rake task for changing the state of the metric.
- [ ] Add CI jobs for changing the state of the metric at specific points in the deployment pipeline.
- [ ] Add CI job to release-tools to set environment state to `locked` when deployment starts.
- [ ] Add CI job to set env state when baking time starts and completes.
- [ ] Add CI job to set env state when post deploy migrations start and end.
Reuben Pereira
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20052
Proposal: Align blog post to patch release terminology
2024-03-27T21:55:57Z
Mayra Cabrera
## Context
Blog posts for security and patch releases are automatically generated by release tooling. As part of https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1193, the term "patch releases" is being repurposed for releases that include bug and security fixes. This is one of the last steps to officially combine patch and security releases into a single scheduled type; details are on https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20015.
The practice of automatically generating the blog post during the patch release process will continue as-is. This issue is to identify the areas of the blog post that need to be updated to account for the new terminology.
## State-of-art
The blog post consists of 6 sections:
1. Title indicating the type of release and the targeted versions
2. A summary describing the content of the release
3. A section for security fixes: composed of a high-level overview table and a description of each fix
4. A section for bug fixes: A list of bug fixes sorted per version
5. A generic section linking to the update information and to subscribe to security notifications
6. Social media actions
<details><summary>Last 5 security release / patch release blog posts:</summary>
* https://about.gitlab.com/releases/2024/02/07/security-release-gitlab-16-8-2-released/
* https://about.gitlab.com/releases/2024/01/12/gitlab-16-7-3-released/
* https://about.gitlab.com/releases/2024/01/25/critical-security-release-gitlab-16-8-1-released/
* https://about.gitlab.com/releases/2024/01/11/critical-security-release-gitlab-16-7-2-released/
* https://about.gitlab.com/releases/2023/12/13/security-release-gitlab-16-6-2-released/
</details>
## Proposal: Blog post adjustments to account for patch release terminology
For the blog post to be aligned with the patch release term, the following updates will be required:
- The title should be updated from `Security release` to `Patch release`
- The content of the summary section should be updated to use `Patch release`
- The file name should be updated from `YYYY/MM/DD/security-release-gitlab-m-n-x-released/` to `YYYY/MM/DD/patch-release-gitlab-m-n-x-released/`
The remaining sections, including the ones for security and bug fixes, stay the same.
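The file name change can be illustrated with a small helper that builds the new path from a publication date and a version. This is a sketch only: the function name `patch_release_post_path` is hypothetical and the real generator lives in release-tools.

```python
from datetime import date


def patch_release_post_path(published: date, version: str) -> str:
    """Build the YYYY/MM/DD/patch-release-gitlab-m-n-x-released/ path."""
    # Dots in the version become dashes, e.g. "16.8.2" -> "16-8-2".
    slug_version = version.replace(".", "-")
    return f"{published:%Y/%m/%d}/patch-release-gitlab-{slug_version}-released/"
```

For a post published on 2024-02-07 for version 16.8.2, this returns `2024/02/07/patch-release-gitlab-16-8-2-released/`, the renamed counterpart of the first path in the list of recent posts above.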
## Considerations
1. **What happens if the patch release doesn't include bug fixes?**
Considering that bug fixes to the current version can be self-served and given the ongoing influx of backport requests, this would be an edge case. In any event, the blog post should not include the bug fixes section.
For starters, the "bug fix" section can be manually dropped from the blog post. In upcoming iterations, the blog post tooling generator can be updated to skip this section if the patch doesn't include any bug fixes.
2. **What happens if the patch release doesn't include security fixes?**
Since 16.7, two security releases have been performed per month, and each of these releases included a good number of security fixes (details: [1](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19853), [2](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19957)). Based on this, it would also be an edge case for a patch release to not include any security fixes.
In this event, the security section can be manually dropped. Later, the blog post tooling generator can be updated to skip this section if the patch doesn't include security fixes.
In this event, AppSec should not be involved, so:
* The blog post should be assigned to release managers, specifying that it doesn't contain any security fixes
* AppSec should be notified about this so that no security alerts are sent.
3. **Where should the blog post be created?**
The blog post will always be created on security, regardless of whether it includes security fixes. In another iteration, the generation can be made smarter: create the blog post on canonical if the patch release only includes bug fixes, and on security if the patch includes security fixes.
4. **What categories should be used in the blog post?**
At the moment, security blog posts are in the following categories:
* Releases: https://about.gitlab.com/releases/categories/releases/
* Security: https://about.gitlab.com/blog/categories/security/
If the patch release includes bug and security fixes, the categories stay the same. If the patch release only includes bug fixes, the category should be `Releases` only.
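The considerations above can be summarized as a small decision function. This is a sketch under stated assumptions: the names `BlogPostPlan` and `plan_blog_post` are hypothetical, the section labels are shorthand for the six sections listed earlier, and the actual tooling may organize this differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlogPostPlan:
    sections: tuple
    categories: tuple
    assignee: str
    notify_appsec: bool
    repository: str


def plan_blog_post(has_bug_fixes: bool, has_security_fixes: bool) -> BlogPostPlan:
    """Decide sections, categories and ownership for a patch release post."""
    sections = ["title", "summary"]
    # Considerations 1 and 2: skip a fixes section when it would be empty.
    if has_security_fixes:
        sections.append("security fixes")
    if has_bug_fixes:
        sections.append("bug fixes")
    sections += ["update information", "social media"]

    # Consideration 4: `Security` only applies when security fixes are included.
    categories = ("Releases", "Security") if has_security_fixes else ("Releases",)

    return BlogPostPlan(
        sections=tuple(sections),
        categories=categories,
        # Without security fixes, release managers own the post and AppSec
        # is notified so that no security alerts are sent.
        assignee="AppSec" if has_security_fixes else "release managers",
        notify_appsec=not has_security_fixes,
        # Consideration 3: for now the post is always created on security.
        repository="security",
    )


if __name__ == "__main__":
    print(plan_blog_post(has_bug_fixes=True, has_security_fixes=False))
```

Keeping the decision in one place would make the later follow-up (creating the post on canonical when there are no security fixes) a one-line change to `repository`.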
## To do
### Implementation details
- [ ] Update blog post title, content and file name https://gitlab.com/gitlab-org/release-tools/-/merge_requests/2972
- [ ] Skip the bug fixes section if no bug fixes are present https://gitlab.com/gitlab-org/release-tools/-/merge_requests/2979
- [ ] Skip the security fixes section if no security fixes are present https://gitlab.com/gitlab-org/release-tools/-/merge_requests/2979
- [ ] Refactor the blog post to consider:
* Patch release blog post for three versions
* They can include bug and security fixes, bug fixes only, or security fixes only
* They might be planned or unplanned (critical)
* A single version blog post (in case a dedicated blog post for a single version is ever needed)
- [ ] Add the patch release blog post to upcoming ones
- [x] Assign release managers if the blog post doesn't include security fixes, assign AppSec if the blog post includes security fixes (already implemented)
- [ ] Notify AppSec managers if the blog post doesn't include security fixes.
- [ ] Follow up: Automatically detect if the patch release includes bug and security fixes
- [ ] Follow up: Create the blog post in canonical if the patch release doesn't include security fixes.
- [ ] Follow up: Document on how the blog post works
### Template
- [ ] A blog post template should be created and agreed with AppSec

Mayra Cabrera

https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20045
Build dashboards for displaying release-environments metrics
2024-02-27T13:05:30Z
Ahmad Tolba

# What
This issue tracks building the dashboards that display release-environments metrics. It's currently blocked by: https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20044
# Milestones
* [ ] We have dashboards that show the component's pod info:
* [ ] API pod info
* [ ] Web pod info
* [ ] Sidekiq pod info
* [ ] Websockets pod info

https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20044
On board release-environment to Mimir
2024-02-27T13:05:22Z
Ahmad Tolba

# Why
This issue is to onboard the release environments to `Mimir`, the new backend for our metrics.
It will be tackled as a joint effort between the Observability team (@rnaveiras) and the Delivery team.
# What
In the current setup, Prometheus is only installed on the release-environments cluster, so the only way to access the metrics is to use `port-forward` on the pods. This issue is dedicated to setting up remote write from Prometheus to Mimir and to accessing the metrics using Grafana.
# Milestones
- [ ] release-environments has the remote-write capability to Mimir
- [ ] Grafana can display the metrics for release-environments.

https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20027
Investigate "Failed to log to appender" errors
2024-02-23T11:26:37Z
Reuben Pereira

We've been seeing errors like the following in release-tools CI job logs for many months now. However, the logs do get uploaded to Elasticsearch since they are visible in https://nonprod-log.gitlab.net/.
This issue is for investigating the error.
Job logs from https://ops.gitlab.net/gitlab-org/release/tools/-/jobs/12864336:
```
2024-02-16 00:46:47.684337 E [17:SemanticLogger::Appenders] SemanticLogger::Appenders -- Failed to log to appender: SemanticLogger::Appender::ElasticsearchHttp -- Exception: Net::ReadTimeout: Net::ReadTimeout with #<TCPSocket:(closed)>
/usr/local/lib/ruby/3.2.0/net/protocol.rb:229:in `rbuf_fill'
/usr/local/lib/ruby/3.2.0/net/protocol.rb:199:in `readuntil'
/usr/local/lib/ruby/3.2.0/net/protocol.rb:209:in `readline'
/usr/local/lib/ruby/3.2.0/net/http/response.rb:158:in `read_status_line'
/usr/local/lib/ruby/3.2.0/net/http/response.rb:147:in `read_new'
/usr/local/lib/ruby/3.2.0/net/http.rb:2342:in `block in transport_request'
/usr/local/lib/ruby/3.2.0/net/http.rb:2333:in `catch'
/usr/local/lib/ruby/3.2.0/net/http.rb:2333:in `transport_request'
/usr/local/lib/ruby/3.2.0/net/http.rb:2306:in `request'
/usr/local/bundle/gems/sentry-ruby-5.16.1/lib/sentry/net/http.rb:30:in `request'
/usr/local/bundle/gems/semantic_logger-4.15.0/lib/semantic_logger/appender/http.rb:234:in `process_request'
/usr/local/bundle/gems/semantic_logger-4.15.0/lib/semantic_logger/appender/http.rb:213:in `post'
/usr/local/bundle/gems/semantic_logger-4.15.0/lib/semantic_logger/appender/elasticsearch_http.rb:70:in `log'
/usr/local/bundle/gems/semantic_logger-4.15.0/lib/semantic_logger/appenders.rb:31:in `block in log'
/usr/local/bundle/gems/semantic_logger-4.15.0/lib/semantic_logger/appenders.rb:30:in `each'
/usr/local/bundle/gems/semantic_logger-4.15.0/lib/semantic_logger/appenders.rb:30:in `log'
/usr/local/bundle/gems/semantic_logger-4.15.0/lib/semantic_logger/appender/async.rb:152:in `process_messages'
/usr/local/bundle/gems/semantic_logger-4.15.0/lib/semantic_logger/appender/async.rb:121:in `process'
/usr/local/bundle/gems/semantic_logger-4.15.0/lib/semantic_logger/appender/async.rb:77:in `block in thread'
```

Reuben Pereira

https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20008
Make sure only one running deployment to a release environment at a time
2024-02-16T09:06:01Z
Dat Tang

We need to make sure that only one deployment to a release environment runs at a time, so that we can use the deployment result as the validation for a change (i.e. MR) on a stable branch.
Some ideas:
* In the auto_deploy pipeline, we use Chef roles to do the same thing.
* We can use the [resource group](https://docs.gitlab.com/ee/ci/resource_groups/) feature of GitLab pipelines to make sure no two pipelines run at the same time on a release environment. It should be done in the GitLab canonical repo, since that is where release environment deployments are triggered.
### Exit Criteria
- [ ] only one running deployment to a release environment at a time
- [ ] deployments happen in order

https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20006
PoC auto-deploy packages deployment on switchboard_uat
2024-02-28T15:02:35Z
Alessio Caiazza

This is a discovery issue to track a PoC where each auto-deploy package rollout will simulate a ring 0 deployment by triggering a pipeline in switchboard_uat using [the delivery tenant](https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/sandbox/switchboard_uat/-/blob/main/tenant_models/deliverysandbox831f36.json).
### Status
The PoC repository is https://gitlab.com/gitlab-com/gl-infra/cells-tissue

Alessio Caiazza

https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20004
Scalability of release-environments
2024-03-01T13:23:50Z
Ahmad Tolba

# The problem
We have seen issues with the `pre` environment when we run QA pipelines against it. Pods sometimes run out of resources because they are not sized for the load. I am concerned that we may face the same issues with release environments as seen in:
- https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19491
- https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20003
# Proposal
Observe and measure the QA pipelines that run against release environments with our metrics/logging to see if we have the same issue as before and act accordingly to scale release environments if needed.
Currently blocked by: https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19970 & https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20005

https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20003
Pre environment enhancement for when QA runs.
2024-02-27T00:40:14Z
Ahmad Tolba

# The problem
The pre-environment has fewer resources than other environments. This makes sense, but it becomes a burden when QA runs, for example. It sometimes causes `OOM` issues and `502` errors on the [QA jobs](https://ops.gitlab.net/gitlab-org/quality/preprod/-/jobs/12654334).
We fixed this by bumping the pod's mem/CPU several times, as seen in [1](https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/6396), [2](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/merge_requests/3240), and [3](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/merge_requests/3396). But it's a recurring theme now: after some time, the issue resurfaces as Quality modifies the QA suite.
Addressed in https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19491
# Proposal
Adjust the HPA sensitivity to scale up pods for pre-components (Web, API, WebSockets, ...) when needed.
Right now, the threshold exists but is a bit high, and the pods are capped at a maximum of two for `web` and `API`, which are the components the QA job hits the most.
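To illustrate why both the threshold and the two-pod maximum matter, here is a rough sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler (`ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured bounds). It ignores the tolerance and stabilization window, and all numbers are made up for illustration rather than taken from the `pre` configuration.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: int,
                     target_utilization: int, min_replicas: int,
                     max_replicas: int) -> int:
    """Simplified HPA calculation: scale by the utilization ratio, then clamp."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical values: 2 pods running at 90% utilization during a QA run.
    # A high target with a cap of two pods leaves no room to scale up.
    print(desired_replicas(2, 90, 80, min_replicas=2, max_replicas=2))
    # A lower target and a higher cap allow an extra pod under the same load.
    print(desired_replicas(2, 90, 60, min_replicas=2, max_replicas=6))
```

In this simplified model, lowering the target alone changes nothing while the maximum stays at two, so both settings would need to be adjusted together.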
---
At the moment:
For `API mem`:
![Screenshot_2024-02-07_at_18.00.38](/uploads/fd28ee33dbc2d73e905e5601f28829a4/Screenshot_2024-02-07_at_18.00.38.png)
And for `Websockets mem`:
![Screenshot_2024-02-07_at_18.01.50](/uploads/e6225f9c6fe50c8f2d33c845d338b9c4/Screenshot_2024-02-07_at_18.01.50.png)
---
...

https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19999
Slack notification for Release Environment pipelines' result
2024-03-27T10:48:03Z
Dat Tang

To improve the visibility and usability of Release Environment, its pipelines' result should be posted on Slack, ideally similar to what we are doing with staging/prod environments (see the screenshot below).
![Screenshot 2024-03-01 at 14.16.44.png](/uploads/d2e6d9964282fe8b68854421eb6e4f56/Screenshot_2024-03-01_at_14.16.44.png)
### Exit Criteria
* [ ] Slack notifications are sent when release environment pipelines succeed/fail
* [ ] The notification has a link to the pipeline
* [ ] (Optional) The notification has a link to the failed job with the job name
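As a rough sketch of what a message covering these exit criteria could look like, the function below builds a payload in the simple `{"text": ...}` shape accepted by Slack incoming webhooks, using Slack's `<url|label>` link syntax. The function name, parameters, emoji, and wording are assumptions for illustration; the existing staging/prod notifications may be built differently.

```python
from typing import Optional


def build_notification(environment: str, status: str, pipeline_url: str,
                       failed_job_name: Optional[str] = None,
                       failed_job_url: Optional[str] = None) -> dict:
    """Build a Slack message payload for a release environment pipeline result."""
    icon = ":white_check_mark:" if status == "success" else ":x:"
    text = (f"{icon} Release environment `{environment}` pipeline {status}: "
            f"<{pipeline_url}|pipeline>")
    # Optional criterion: link to the failed job, labelled with the job name.
    if failed_job_name and failed_job_url:
        text += f"\nFailed job: <{failed_job_url}|{failed_job_name}>"
    return {"text": text}


if __name__ == "__main__":
    # Placeholder URLs and names, not real pipelines or environments.
    print(build_notification(
        "example-environment", "failed",
        "https://example.com/pipelines/1",
        failed_job_name="example-deploy-job",
        failed_job_url="https://example.com/jobs/2",
    ))
```

Sending the payload is left out on purpose, since the delivery mechanism (webhook, bot token, or existing release-tools code) is not specified in this issue.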
### Out of Scope
* The notification related to QA/Test Platform team is done in https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/20104

Dat Tang