Road to CD
Now that the GCP migration of GitLab.com is successfully completed, we are on the road to replacing parts of the infrastructure with Kubernetes.
To achieve this, we need to create a number of tools to change engineering habits, revamp the release process and bring infrastructure closer to the development stage.
Current situation
Development
Developers create features that target the master branch. Basic testing is done by developers using specs/tests, and once the reviewer is happy with the status they merge into the master branch. When something lands there, it basically becomes someone else's problem to deploy to GitLab.com environments, monitor the impact in those environments and finally release to the public.
Developers have access to staging to test the impact on a larger data set than they have in local environments.
Developers can also create packages from their branches to carry out manual tests or fire off GitLab QA tests using the package-and-qa job in their branch. This mimics an empty production-like environment.
Release
When the time of the month comes, Release managers take whatever is in the master branch and start following the release process to create a version of GitLab that will be deployed to GitLab.com and released to the public. Release managers deal with ce-to-ee merges and merge conflict resolution, as well as preparing the packages for deployment and deploying to GitLab.com environments.
When a Release manager triggers a release for the first time, the master branch is used to create a stable branch, and that fires off a number of other processes using release-tools.
These can be summed up with:
- All remotes get synced (dev.gitlab.org, gitlab.com) for GitLab CE and EE
- Tag gets created for CE/EE repositories
- *_VERSION files get read for each component of GitLab and that gets propagated to the omnibus-gitlab repository
- Tag gets created for the omnibus-gitlab repository
- The omnibus-gitlab repository starts a build which pulls all of the components with specific versions of each library that GitLab uses.
- Between 20 and 40 minutes later, packages for all supported distros are created and uploaded to a private package repository
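The step that reads the *_VERSION files can be sketched as follows. This is a minimal illustration in Python, not the actual release-tools code: the function name read_component_versions is hypothetical, and it assumes that each *_VERSION file holds a single version string.

```python
from pathlib import Path


def read_component_versions(repo_root):
    """Map each *_VERSION file in repo_root to the version string it contains.

    Hypothetical helper: assumes every file holds one version string.
    """
    versions = {}
    for path in sorted(Path(repo_root).glob("*_VERSION")):
        # Surrounding whitespace (such as a trailing newline) is dropped.
        versions[path.name] = path.read_text().strip()
    return versions
```

The resulting mapping is what would then be propagated to the packaging repository so that the build pulls each component at a pinned version.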
Release, quality and infrastructure
When the packages are ready, Release managers deploy to staging.gitlab.com. Staging has a fraction of the production database and also a fraction of the Git data. Deploys currently take between 10 and 15 minutes and are a manual action triggered by Release managers using the takeoff project. During this period, Release managers stare at the screen.
Once the deploy is completed, developers get a chance to test their changes with a bigger data set for the first time. Automated staging tests are also carried out during this period. A QA deadline is set and developers are asked to check their changes on staging.
In case of a problem, an issue gets logged by developers and, depending on the severity of the problem, this can block further release. The Release manager then needs to coordinate with developers and wait for the problem to be resolved.
If no problems are reported and the QA deadline has passed, Release managers proceed with deployment to canary. This is usually quick but can sometimes have consequences for GitLab.com given that the database is shared. If no further problems are found on canary, the release proceeds with deployment to GitLab.com.
Deployment to GitLab.com can take anything upwards of 20 minutes, in some cases taking hours. In case of an issue caused by a migration (due to the large data set) or an unpredictable usage pattern (number of CI builds, abuse or a DoS attack), a deploy can have severe effects on the availability of GitLab.com. Furthermore, performance regressions and regular regressions can have a high impact on productivity given the number of users and the fact that all changes are rolled out at the same time.
In case of an aborted deploy and a regression, either a post-deployment patch needs to be applied or a fix needs to be created by developers, which brings the release back to the first step.
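The staged progression described in this section (staging, then canary, then GitLab.com) can be sketched as a simple gate. This is an illustration only and not how the takeoff project works: the promote function and the two callables it takes are hypothetical names, and the check for blocking issues stands in for the manual QA and coordination steps described above.

```python
# Environment order taken from the text: staging, then canary, then production.
STAGES = ["staging", "canary", "production"]


def promote(deploy, has_blocking_issue):
    """Deploy stage by stage and stop at the first stage with a blocking issue.

    `deploy` and `has_blocking_issue` are hypothetical callables supplied by
    the caller; each takes the environment name. Returns the list of stages
    that were deployed and passed their check.
    """
    passed = []
    for stage in STAGES:
        deploy(stage)
        if has_blocking_issue(stage):
            # A blocking issue halts the rollout before the next environment.
            break
        passed.append(stage)
    return passed
```

In this sketch a problem found on canary means production is never touched, which mirrors the blocking behaviour described above.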
Wants
We want to have developers test their change and ensure quality of their features way before the feature is merged into the master branch.
When the feature does land in the master branch, we want to make sure that the feature can be deployed as soon as it lands.
Furthermore, we want to be able to roll out the change only to a subset of users first, observe the impact of the changes on performance and only roll out the change to everyone when we have a certain level of confidence.
If we roll out the change for everyone and something unexpected happens, we want to be able to turn off the feature or roll back the change that caused the problem without having to roll back all other items that came in in the meantime.
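The two wants above, rolling out to a subset of users and being able to switch a single feature off, can be illustrated with a percentage-based flag check. This is a minimal sketch under stated assumptions, not GitLab's feature flag implementation: the names bucket and enabled, the flag name used in the usage note and the choice of SHA-256 for bucketing are all hypothetical.

```python
import hashlib


def bucket(flag_name, user_id):
    """Deterministically map a (flag, user) pair to a bucket in 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def enabled(flag_name, user_id, percentage, killed=False):
    """Return True if the feature is on for this user.

    percentage: share of users (0-100) that should see the feature.
    killed: switch that turns the feature off for everyone at once.
    """
    if killed:
        return False
    return bucket(flag_name, user_id) < percentage
```

Because the bucket depends only on the flag name and the user, a user who sees the feature at 10% keeps seeing it when the percentage is raised to 50%, and calling enabled("new_ui", 42, 100, killed=True) turns the feature off without rolling back anything else.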
Finally, we want to be able to create a public release of what has already been rolled out to GitLab.com and have confidence that the release will work for our on-premises customers.
Providing developers with production-like test environments using Review apps
Description of problem in https://gitlab.com/gitlab-org/release/framework/issues/1#note_99453801
Work in progress:
- Documentation for Review apps: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/21574
- Review apps in useable state: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/6665
Production-like database on demand
Description of problem in https://gitlab.com/gitlab-org/release/framework/issues/1#note_99453801
Work in progress:
- Automating setup of PG replica using pg_basebackup: https://gitlab.com/gitlab-com/infrastructure/issues/4909
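As a rough idea of what automating a replica setup with pg_basebackup involves, the sketch below builds the command line for one run. It is an assumption-laden illustration, not the automation tracked in the issue above: the function name basebackup_command and its parameters are hypothetical, and only standard pg_basebackup short options are used.

```python
def basebackup_command(primary_host, replication_user, data_dir):
    """Build the argument list for a pg_basebackup run that seeds a replica."""
    return [
        "pg_basebackup",
        "-h", primary_host,       # primary server to copy from
        "-U", replication_user,   # role with the REPLICATION privilege
        "-D", data_dir,           # target data directory, must be empty
        "-X", "stream",           # stream WAL while the backup runs
        "-R",                     # write the replica's recovery settings
        "-P",                     # report progress
    ]
```

The list can be passed to subprocess.run(..., check=True) on the machine that should become the replica. Returning a list rather than a shell string avoids quoting problems with host names and paths.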
Faster testing and QA
Description of problem in https://gitlab.com/gitlab-org/release/framework/issues/1#note_99485657
Work in progress:
- Faster gitlab-qa runs: gitlab-org/gitlab-qa#276 (closed)
Controlling the impact of features
Description of problem in https://gitlab.com/gitlab-org/gitlab-ce/issues/49619#note_93138925
Work in progress:
- Documenting the use of feature flags: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/21478
- Create a method for using feature flags: https://gitlab.com/gitlab-org/gitlab-ce/issues/50127
- Expiration deadline of feature flags: https://gitlab.com/gitlab-org/gitlab-ce/issues/50128