Experimentation to reduce downtime of Dedicated deployments
### Context
[Cells](https://about.gitlab.com/direction/enablement/tenant-scale/#moving-towards-cells) (aka Tenant Scale) are the new direction GitLab is taking over the longer horizon. This translates to a [different GitLab Infrastructure organization](https://docs.google.com/document/d/113moKoycmCtjyYwng2kKOj1KrevbH8D6T_eQqnCTwNo/edit#heading=h.leo855jcxw7l), a different scale, and different capabilities needed to guarantee reliable deployments and rollbacks. Each Cell represents a critical deployment unit, and deployments and rollbacks at scale (across a fleet of Cells) will play a vital role in offering a reliable platform to customers.
The Delivery Group is [adopting this new direction](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19547 "Delivery Group FY24 Q3 direction") to create a better production deployment posture for Cells.
### Problem
The first iteration of Cells will be based on [Dedicated](https://gitlab-com.gitlab.io/gl-infra/gitlab-dedicated/team/) architecture. Dedicated is currently based on a different infrastructure than GitLab.com, using a different architecture and tooling. Customers of Dedicated have maintenance windows where upgrades and patches are applied. These upgrades often require downtime, even if minimal.
While we built a robust solution for releases and deployments for GitLab.com, the challenges for Cells will be different due to the different scale and impact deployments will have.
The initial problem we aim to solve is having zero downtime deployments to Cells. This is a mandatory requirement. A possible solution could leverage capabilities that we already started to build for GitLab.com in Q2 via the [dynamic routing traffic effort](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/962 "Develop ability to dynamically route traffic within different cluster deployments").
### Goal
_Define a detailed plan (tooling, features, rollout plan) to develop zero downtime deployments to Dedicated environments that can be implemented in FY24-Q4_
We plan to achieve this goal by investigating and fixing the gaps in deployments of a Dedicated environment (Experimentation Environment) to reduce downtime and risk:
1. Become familiar with the Dedicated architecture, tooling, and challenges, and be able to contribute independently to the solution
2. Review tooling and solutions, and propose a technical solution that Dedicated and the Delivery Group agree on
3. Implement a PoC solution for rolling out GitLab packages on the Experimentation environment
4. Implement a Blue/Green style deployment in the Experimentation environment to PoC Dedicated upgrades with no downtime.
#### Outcome
- Learnings from the experimentation and decisions on how to build the full solution are documented
- New components and functionality that need to be built in Dedicated and its tooling are identified
### DRI
@nolith
### References
- [OKR related to this effort](https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/3546)
- [Issue board](https://gitlab.com/gitlab-com/gl-infra/delivery/-/boards/6030299?label_name%5B%5D=Dedicated%20deployments)
- [Dedicated Documentation entry point](https://gitlab-com.gitlab.io/gl-infra/gitlab-dedicated/team/)
- [Delivery Guide to Dedicated](https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/gitlab-dedicated.md?ref_type=heads)
### Demos
<table>
<tr>
<th>Demo Date</th>
<th>Demo Link</th>
<th>Highlights</th>
</tr>
<tr>
<td>2023-08-17</td>
<td>
https://youtube.com/live/2Znvmtel284
</td>
<td>
- [Dedicated Documentation Rundown](https://www.youtube.com/watch?v=2Znvmtel284&t=242s)
- [Explanation of the problem with Zero Downtime Deployment](https://www.youtube.com/watch?v=2Znvmtel284&t=2218s)
</td>
</tr>
<tr>
<td>2023-08-24</td>
<td>
https://www.youtube.com/watch?v=gZLCJZfmupg
</td>
<td>
Discussion of our Sandbox onboarding experiences and the issues we faced, followed by a discussion of deployments based on our current knowledge of Dedicated, clarifying open points.</td>
</tr>
<tr>
<td>2023-08-31</td>
<td>
https://youtube.com/live/txHlhL57lpE
</td>
<td>
- [Instance Upgrade Demo](https://www.youtube.com/watch?v=txHlhL57lpE&t=18s)
- [Explanation of why we split Pre and Post migrations (from demo discussion)](https://www.youtube.com/watch?v=txHlhL57lpE&t=1722s)
- [General Agenda Questions](https://www.youtube.com/watch?v=txHlhL57lpE&t=2232s)
</td>
</tr>
<tr>
<td>2023-09-07</td>
<td>
https://www.youtube.com/watch?v=xZfCQD_gnrg
</td>
<td>
- [GET zero downtime upgrade investigation for Dedicated Demo](https://www.youtube.com/watch?v=xZfCQD_gnrg&t=16s)
- [Differences between Sandbox installation and regular Dedicated Tenant](https://www.youtube.com/watch?v=xZfCQD_gnrg&t=2215s)
- [Discussion around zero downtime upgrade on Cloud Native Hybrid Installation](https://www.youtube.com/watch?v=xZfCQD_gnrg&t=2543s)
</td>
</tr>
<tr>
<td>2023-09-14</td>
<td>
https://www.youtube.com/watch?v=9bjHoXBDQhI
</td>
<td>
- [Demo splitting regular and post-deployment migrations](https://www.youtube.com/watch?v=9bjHoXBDQhI&t=30s)
- [Updates to delivery guide to dedicated](https://www.youtube.com/watch?v=9bjHoXBDQhI&t=1480s)
- [Delivery Long Live UAT Environment](https://www.youtube.com/watch?v=9bjHoXBDQhI&t=2240s)
</td>
</tr>
<tr>
<td>2023-09-21</td>
<td>
https://www.youtube.com/live/7hbyeUbBQbU
</td>
<td>
* [tenant upgrade demo](https://www.youtube.com/watch?v=7hbyeUbBQbU&t=72s)
* [nginx sticky session demo and discussion](https://www.youtube.com/live/7hbyeUbBQbU?feature=shared&t=383)
</td>
</tr>
<tr>
<td>2023-09-28</td>
<td>
https://www.youtube.com/watch?v=lfY_CDqRJI4
</td>
<td>
- [Sticky session upgrade with PDM](https://www.youtube.com/live/lfY_CDqRJI4?feature=shared&t=68)
- [Run QA against an auto-deploy package in an Instrumentor sandbox](https://www.youtube.com/live/lfY_CDqRJI4?feature=shared&t=244)
- [Demo setup: post-deployment migrations and sticky sessions](https://www.youtube.com/watch?v=lfY_CDqRJI4&t=73s)
- [Demo: QA run for auto_deploy images in Instrumentor](https://www.youtube.com/watch?v=lfY_CDqRJI4&t=248s)
- [Demo: post-deployment migrations and sticky sessions](https://www.youtube.com/watch?v=lfY_CDqRJI4&t=1403s)
- [From the demo: sticky session discussion](https://www.youtube.com/watch?v=lfY_CDqRJI4&t=1848s)
- [Discussion](https://www.youtube.com/watch?v=lfY_CDqRJI4&t=2189s)
</td>
</tr>
<tr>
<td>2023-10-05</td>
<td>
https://www.youtube.com/watch?v=NHFxKx0TF6U
</td>
<td>
- [Traffic generation and Apdex monitoring for Dedicated sandbox Demo](https://www.youtube.com/watch?v=NHFxKx0TF6U&t=20s)
- [Discussion items](https://www.youtube.com/watch?v=NHFxKx0TF6U&t=1747s)
</td>
</tr>
<tr>
<td>2023-10-12</td>
<td>
https://www.youtube.com/watch?v=aJvecdnERAE
</td>
<td>
- [CMBR traffic generation from the UAT pipeline](https://www.youtube.com/watch?v=aJvecdnERAE&t=35s)
- [Discussion topics](https://www.youtube.com/watch?v=aJvecdnERAE&t=1286s)
</td>
</tr>
<tr>
<td>2023-10-19</td>
<td>
https://www.youtube.com/watch?v=E80vHaAz57s
</td>
<td>
- [Showcasing of Deployment downtime test results](https://www.youtube.com/watch?v=E80vHaAz57s&t=23s)
- [Discussion around Geo DB migrations and GitLab charts](https://www.youtube.com/watch?v=E80vHaAz57s&t=1013s)
</td>
</tr>
<tr>
<td>2023-10-26</td>
<td>
https://www.youtube.com/watch?v=Ui5lTUyJ8GA
</td>
<td>
- [Discussions](https://www.youtube.com/watch?v=Ui5lTUyJ8GA&t=25s)
</td>
</tr>
</table>
### Milestones
* [x] Evaluate options for tooling based on Dedicated/Cells needs and document them
* [x] Delivery Group team members have their own Dedicated cell available for development
* [x] New packages are rolled out on Experimental Dedicated Cell
* [x] Experimental Dedicated Tenant is upgraded ~~via Blue/Green deployment~~ with no downtime
* [x] Dedicated Rollout Metrics are collected and can be visualized in a Dashboard
* [x] Learnings and implementation details to officially add this support into Dedicated in FY24-Q4 are documented
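For the rollout-metrics milestone above, the kind of figures a downtime dashboard could show can be sketched as follows. This is a minimal sketch under stated assumptions, not the tooling used on the sandbox: the `Probe` record, the sample data, and the 0.5 s threshold are all hypothetical. It uses the standard Apdex definition (satisfied plus half of tolerating, divided by total, with failed requests counted as frustrated) and a simple downtime estimate based on the gaps that follow failed probes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """One synthetic request sent to the tenant during an upgrade (hypothetical shape)."""
    timestamp: float  # seconds since the start of the upgrade
    ok: bool          # True if the request got a successful response
    latency: float    # response time in seconds


def apdex(probes, threshold=0.5):
    """Apdex = (satisfied + tolerating / 2) / total; failures count as frustrated."""
    if not probes:
        raise ValueError("at least one probe is required")
    satisfied = sum(1 for p in probes if p.ok and p.latency <= threshold)
    tolerating = sum(
        1 for p in probes if p.ok and threshold < p.latency <= 4 * threshold
    )
    return (satisfied + tolerating / 2) / len(probes)


def downtime_seconds(probes):
    """Estimate downtime as the time between each failed probe and the next probe."""
    ordered = sorted(probes, key=lambda p: p.timestamp)
    return sum(
        nxt.timestamp - cur.timestamp
        for cur, nxt in zip(ordered, ordered[1:])
        if not cur.ok
    )


# Hypothetical samples: two failed probes in the middle of an upgrade.
samples = [
    Probe(0.0, True, 0.2),
    Probe(1.0, True, 0.9),
    Probe(2.0, False, 0.0),
    Probe(3.0, False, 0.0),
    Probe(4.0, True, 0.3),
]

print(f"apdex={apdex(samples):.2f} downtime={downtime_seconds(samples):.1f}s")
```

The downtime estimate depends on the probe interval: a shorter interval gives a tighter bound on how long the tenant was actually unavailable.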
### Administrative
For all new issues that are children of this Epic, utilize the following template:
```plaintext
/epic &1092
/label ~"Delivery::P2" ~"workflow-infra::Triage" ~"group::delivery" ~"Dedicated deployments"
```
## Decision log
<details>
<summary>2023-09-07</summary>
- This project will have a long-lived sandbox account to test upgrades and collect metrics https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1109
</details>
## Development log
<details>
<summary>Details</summary>
<details>
<summary>2023-08-10</summary>
* This epic defines the first iteration of work that is reflecting the new [Delivery Group Direction](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19547 "Delivery Group FY24 Q3 direction")
* The focus of this Epic is on solving the problem of zero downtime deployments by applying [learnings from Q2](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/962 "Develop ability to dynamically route traffic within different cluster deployments") to an experimentation Dedicated stack, gathering insights, and defining a plan to move to an implementation phase on Dedicated later on (Q4). The focus is on Dedicated because it is currently the building block of Cells.
* Q3 Initial Issues are created
* Delivery Group team members are focusing on [ramping up knowledge on the Dedicated Stack](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19552 "Create a Delivery Guide to Dedicated")
* [Role Entitlements](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19550 "Delivery Group members have Dedicated Role Entitlements") and [Onboarding issues](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19551 "Delivery Group members has completed the dedicated sandbox tutorial") are created and waiting for Dedicated input
</details>
<details>
<summary>2023-08-16</summary>
* [Delivery guide to dedicated](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19552 "Create a Delivery Guide to Dedicated") is being finalized.
* [Discussions and demo](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19552#links-learnings "Create a Delivery Guide to Dedicated")
* [Learnings and documentation will be added to release docs](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19552#note_1515163926 "Create a Delivery Guide to Dedicated").
* ~"group::GitLab Dedicated" is facilitating onboarding with permissions and documentation for ~"group::delivery" to be able to [spin up sandbox dedicated instances](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19551 "Delivery Group members has completed the dedicated sandbox tutorial").
* Validating the granted permissions before closing https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19550+.
</details>
<details>
<summary>2023-08-24</summary>
* [Delivery guide to dedicated](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19552 "Create a Delivery Guide to Dedicated") is merged and [now present in Release Docs](https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/gitlab-dedicated.md?ref_type=heads)
* Non-production Role Entitlements for Dedicated have been [provided to all team members](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19550#milestones "Delivery Group members have Dedicated Role Entitlements")
* [Spinning up sandbox dedicated instances](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19551 "Delivery Group members has completed the dedicated sandbox tutorial") in progress. ~"group::GitLab Dedicated" is helping with Sandbox spin-up. This is still in progress as only some of the team members completed it.
* Validating the granted permissions before closing https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19550+.
</details>
<details>
<summary>2023-08-30</summary>
* All the team members actively working on this epic completed their [sandbox onboarding](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19551 "Delivery Group members has completed the dedicated sandbox tutorial"). This is still in progress as some team members will join our effort later in the quarter.
* We started [investigating the current dedicated upgrade process](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19557 "Investigate and document sandbox upgrade process on the Instrumentor level"). The process is a bit brittle and we are trying to document knowledge on how to overcome common challenges.
</details>
<details>
<summary>2023-09-08</summary>
- Our effort to understand dedicated deployments continues with more team members getting up to speed with GET and Instrumentor. Our investigation issues are getting cut into narrower steps as we learn more. Our main goal is to understand where dedicated deployments differ from gitlab.com and identify the action points for a zero-downtime upgrade implementation.
- The team received the necessary approvals to set up a long-lived sandbox for this project https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1109
- We started contributing to the dedicated codebase: we [simplified the tenant model schema to make it easy to run auto_deploy packages](https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/tenant-model-schema/-/merge_requests/116 "fix: allow prerelease_version when enable_fips is false") and had our [first CI sandbox running an auto_deploy package](https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/merge_requests/2058 "fix: override all images repo for prerelease_versions")
- These changes will now provide us with many daily upgrade opportunities without destroying the environment and losing all the accumulated metrics.
</details>
<details>
<summary>2023-09-14</summary>
No blockers.
- Our UAT sandbox environment is now live with auto_deploy packages :rocket:
- We are deploying using an instrumentor integration branch to have a faster development cycle without affecting the dedicated team workload. https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/merge_requests/2093
- With our learning phase reaching the final stages, we are ready to parallelize our efforts in the following areas: DB migrations, load balancer, monitoring and traffic generation. [mindmap](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1092#note_1559114287 "Experimentation to reduce downtime of Dedicated deployments")
</details>
<details>
<summary>2023-09-20</summary>
- We are still discussing the best way forward to give delivery engineers direct access to the tenant resources (i.e. kubernetes API) https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1109#note_1566745570
- We ran three deployments this week, turning off post-deployment migrations, and the environment was online and serving traffic during the upgrade.
- We started investigating the load balancer setup https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19641 and identified a non-graceful shutdown during the nginx-ingress upgrade. We are now exploring sticky sessions at nginx to see if we can find a solution using the default components of the reference architecture.
- Unfortunately, QA is not running in our environment. That feature is not a shared part of the dedicated stack but is implemented in Instrumentor testing and switchboard_uat. In both cases, it does not work with prereleases https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19660
- We hope to have full access to the environment next week and to start working on downtime monitoring https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19642
</details>
<details>
<summary>2023-09-27</summary>
No blockers.
- Our tenant is now fully operational and all the engineers have been onboarded
- Load balancer investigation https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19641:
  - We had a working PoC of sticky sessions with the nginx-ingress
  - Another approach using an assets shim ingress was tested successfully
- We started working on QA for prerelease_versions https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19660
- With the release managers rotation we are now close to completing the onboarding https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19551
- **Next**
  - Produce a way to effectively monitor downtime in line with the SLA/SLO https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19642
  - Understand the side effects of enabling sticky sessions on the routing performance
</details>
<details>
<summary>2023-10-04</summary>
No blockers.
- We are working on QA for prerelease_versions https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19660
- We are working on effectively monitoring downtimes during upgrades https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19642
- We are investigating traffic generation tools to better monitor the instance behavior https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19691
- Everyone in the team completed the onboarding https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19551
- **Next**
  - Understand the side effects of enabling sticky sessions on the routing performance
  - Write a blueprint to explain our findings and proposals
</details>
<details>
<summary>2023-10-12</summary>
No blockers.
- We had our first automated QA run successfully on our tenant using the correct auto_deploy image https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19660
- We continue working on traffic generation by using CMBR https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19691
- The two items above will help us monitor downtime during upgrades https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19642
- **Next**
  - Understand the side effects of enabling sticky sessions on the routing performance
  - Write a blueprint to explain our findings and proposals
</details>
<details>
<summary>2023-10-19</summary>
[OKR Progress](https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/3546): 76% (pilot for https://gitlab.com/gitlab-com/gl-infra/mstaff/-/issues/255)
No blockers.
- With traffic generation we started monitoring the behavior during an application upgrade
- We started working on a blueprint describing our findings and proposals https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/merge_requests/598
- To finalize the post-deployment migrations handling we either need https://gitlab.com/gitlab-org/build/CNG/-/merge_requests/1548 or we will have to extract DB migrations outside of GET
- **Next**
  - Understand the side effects of enabling sticky sessions on the routing performance
  - Find a reproducible upgrade path with downtime and test it with our improvements
</details>
<details>
<summary>2023-10-26</summary>
[OKR Progress](https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/3546): 80% (pilot for https://gitlab.com/gitlab-com/gl-infra/mstaff/-/issues/255)
No blockers.
- Work is still in progress on:
  - [Understanding the use of sticky sessions on load balancers](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19641) to solve the asset routing
  - [Implementation of Post Deployment Migrations on Dedicated](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19688#top)
- The final blueprint describing our findings and proposals https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/merge_requests/598 is in review
- We identified actionable items to better monitor our progress while implementing the blueprint in Q4
A couple of open items will be carried over into the next quarter at lower priority, as they still provide valuable insights.
</details>
<details>
<summary>2023-11-02</summary>
[OKR Progress](https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/3546): End of Q3 **final score** 83%
The only remaining open issue is the [Blueprint for Zero Downtime in Dedicated](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19786): the MR https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/merge_requests/598 is in review and comments are being addressed.
Some items that are in progress and relevant to the continuation of this effort have been moved to the [Q4 Epic](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1150):
- [Implementation of Post Deployment Migrations on Dedicated](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19688#top) is almost complete.
- [Collecting k8s-workloads stored knowledge](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19632)
- [Delivery-specific UAT environment cannot send emails](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19668)
- [Porting post-deployment migration handling to GET](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19689) that will be unblocked by the [Implementation of Post Deployment Migrations on Dedicated](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19688)
</details>
<details>
<summary>2023-11-09</summary>
No blockers.
The only remaining open issue is the [Blueprint for Zero Downtime in Dedicated](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/19786): the MR https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/merge_requests/598 is in review and comments are being addressed.
</details>
</details>
### Status 2023-11-13
With [the blueprint](https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/merge_requests/598) merged, this work is now complete :rocket:
The Epic can be closed and moved to ~"workflow-infra::Done" :tada:
### MindMap
```mermaid
%%{init: {'theme':'default'}}%%
mindmap
  root((ZDU))
    Learning
      onboarding with development sandboxes
      upgrade process analysis
        Instrumentor
        GET
    Migrations
      split regular/post deployment migrations
      💡 decouple PDM from deployment
    Load balancer
      rollout optimization
      assets routing
      integration with the rollout
        instrumentor or GET?
    Upgrade automations
      today: pick a package and upgrade the tenant model
      💡 manual job in release-tools to bump the version
    Monitoring
      traffic generation
        mirroring some busy project
        QA
        traffic generation tools
      downtime evidences
    Tuning
      gitaly
      deployment analysis
        opt["deployment optimization (if needed)"]
      chart values
        differences with .com
    GEO secondary site
      disabled on the first iteration
```
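The "decouple PDM from deployment" idea in the mind map can be sketched as an ordering of upgrade steps. This is a minimal sketch, not how Instrumentor or GET implement it. The `Step` type, the step names, and the `command` labels are hypothetical placeholders. The only real identifier is `SKIP_POST_DEPLOYMENT_MIGRATIONS`, the environment variable that GitLab's zero-downtime upgrade documentation describes for the `db:migrate` task. How that variable would be wired into the Dedicated tooling is an assumption left open here.

```python
from typing import Dict, List, NamedTuple


class Step(NamedTuple):
    """One step of a hypothetical upgrade plan."""
    name: str
    command: str         # placeholder label, not a real Instrumentor/GET command
    env: Dict[str, str]  # extra environment for the step


def upgrade_plan(version: str) -> List[Step]:
    """Order the steps so post-deployment migrations run after the rollout."""
    skip_pdm = {"SKIP_POST_DEPLOYMENT_MIGRATIONS": "true"}
    return [
        # Regular migrations must stay compatible with the old application code.
        Step("regular migrations", "db:migrate", skip_pdm),
        # The new version is rolled out while the tenant keeps serving traffic.
        Step(f"roll out application {version}", "rollout", {}),
        # Post-deployment migrations run later, decoupled from the deployment.
        Step("post-deployment migrations", "db:migrate", {}),
    ]


def pdm_runs_after_rollout(plan: List[Step]) -> bool:
    """Check the ordering property the plan is meant to guarantee."""
    names = [step.name for step in plan]
    rollout = next(i for i, n in enumerate(names) if n.startswith("roll out"))
    return rollout < names.index("post-deployment migrations")


plan = upgrade_plan("16.5.0")  # hypothetical version number
for step in plan:
    print(step.name, step.env)
```

Keeping the plan as plain data makes the ordering easy to check before anything is executed against a tenant.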