2023-01-03: Enable direct deb download for SaaS omnibus installation
Production Change
Change Summary
For https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16340, this change request will allow us to start installing auto-deploy packages from a GCS bucket, instead of downloading and installing from package cloud.
There are two parts to this change request:
part1
When the omnibus package is built, if GITLAB_COM_PKGS_SA_FILE is set, the omnibus builder pipeline will upload the omnibus package to GCS in addition to uploading the package to package cloud.
The risk of this step is that the omnibus job will fail and we will need to unset the var and retry. The logic to upload the package to packagecloud has remained unchanged.
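The gating described in part1 can be sketched as follows. This is only an illustration, not the omnibus-gitlab implementation: the function name `upload_targets` is hypothetical, and treating an empty value as "unset" is an assumption.

```python
import os


def upload_targets(env=None):
    """Return the destinations an auto-deploy package would be uploaded to.

    Hypothetical helper: the real pipeline logic lives in omnibus-gitlab.
    """
    env = os.environ if env is None else env
    # The packagecloud upload is unchanged and always happens.
    targets = ["packagecloud"]
    # The GCS upload is additive and only happens when the variable is set
    # (assumption: an empty value counts as unset).
    if env.get("GITLAB_COM_PKGS_SA_FILE"):
        targets.append("gcs")
    return targets
```

With the variable unset, the result contains only the packagecloud upload, which is why unsetting it and retrying is enough to recover from a failed job.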
part2
When the omnibus package is installed using Ansible deploy-tooling with DEB_INSTALL_ENABLE set, we will first check whether there is a valid package in the GCS bucket. If so, we will use it. If that copy fails for any reason, we will revert to the old behavior of installing from packagecloud.
The risk of this step is that an unexpected failure causes the deployer pipeline to fail; in that case we can unset the env var and retry.
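The part2 install flow can be sketched roughly as below. The real logic is implemented in Ansible in deploy-tooling; this Python version only mirrors the decision order. The function `install_package` and its three callable parameters are hypothetical stand-ins for the actual copy and install tasks.

```python
def install_package(version, env, copy_from_gcs, install_deb, install_from_packagecloud):
    """Install `version`, preferring the GCS copy when DEB_INSTALL_ENABLE is set.

    All three callables are hypothetical stand-ins:
      copy_from_gcs(version) -> local .deb path, raises on any failure
      install_deb(path)      -> installs a local .deb file
      install_from_packagecloud(version) -> the old install behavior
    Returns the name of the source that was used.
    """
    if env.get("DEB_INSTALL_ENABLE"):
        try:
            deb_path = copy_from_gcs(version)
        except Exception as exc:
            # Any copy failure reverts to the old behavior.
            print(f"GCS copy failed ({exc}); falling back to packagecloud")
        else:
            # An install failure here is not caught: it would fail the
            # pipeline, which is the risk described for this step.
            install_deb(deb_path)
            return "gcs"
    install_from_packagecloud(version)
    return "packagecloud"
```

Only the copy is wrapped in the fallback, so a missing or unreadable object in the bucket does not block a deploy.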
References:
- Add gcloud to omnibus builder pipeline image gitlab-org/gitlab-omnibus-builder!246 (merged)
- Add feature flag to omnibus-gitlab to upload auto-deploy packages to the GCS bucket gitlab-org/omnibus-gitlab!6567 (merged)
- Deploy-tooling change to use GCS packages for download and installation https://ops.gitlab.net/gitlab-com/gl-infra/deploy-tooling/-/merge_requests/407
Change Details
- Services Impacted - ServiceInfrastructure
- Change Technician - @jarv
- Change Reviewer - @skarbek
- Time tracking - unknown
- Downtime Component - none
Detailed steps for the change
- Set label change::in-progress: /label ~change::in-progress
- Ensure that https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/4749 is merged and applied.
PART1: Upload autodeploy packages to the gitlab-com-pkgs GCS bucket
The time to complete this task will depend on when the next auto-deploy package is built; we should wait no longer than 10 hours, or for as long as someone is available to monitor this.
- Set GITLAB_COM_PKGS_SA_FILE in the dev omnibus-gitlab CI variable section (https://dev.gitlab.org/gitlab/omnibus-gitlab/-/settings/ci_cd).
- Confirm that the Ubuntu 20.04 job in https://dev.gitlab.org/gitlab/omnibus-gitlab/-/jobs (https://dev.gitlab.org/gitlab/omnibus-gitlab/-/jobs/13668929#L3586 as an example) is now uploading packages to the GCS bucket.
- Confirm that the object is in the bucket; the path should look something like: gs://gitlab-com-pkgs/ubuntu-{focal,xenial}/gitlab-ee_15.8.202212231220-59d3236fca5.78a3d16fa5b_amd64.deb
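When checking the bucket, it may help to compute the expected object paths up front. The sketch below assumes the layout is `ubuntu-<codename>/gitlab-ee_<version>_<arch>.deb`, which is inferred only from the single example path above; the function name `expected_object_paths` is hypothetical.

```python
BUCKET = "gitlab-com-pkgs"


def expected_object_paths(version, codenames=("focal", "xenial"), arch="amd64"):
    """Build the gs:// paths an auto-deploy package is expected to have.

    Assumption: one object per Ubuntu codename, named after the example
    path in the change plan.
    """
    return [
        f"gs://{BUCKET}/ubuntu-{codename}/gitlab-ee_{version}_{arch}.deb"
        for codename in codenames
    ]


# Version string taken from the example path in the step above.
for path in expected_object_paths("15.8.202212231220-59d3236fca5.78a3d16fa5b"):
    print(path)
```

Each printed path can then be compared against the upload job log or the bucket listing.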
PART2: Install using deployer from the gitlab-com-pkgs GCS bucket
The time to complete this task will depend on when the next auto-deploy package is deployed; we should wait no longer than 10 hours, or for as long as someone is available to monitor this.
- Set DEB_INSTALL_ENABLE in CI variables for deployer https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/settings/ci_cd .
- Watch the next deploy and confirm we are downloading and installing from a deb file.
- Set label change::complete: /label ~change::complete
Rollback
- part1: Rename GITLAB_COM_PKGS_SA_FILE to ___GITLAB_COM_PKGS_SA_FILE in https://dev.gitlab.org/gitlab/omnibus-gitlab/-/settings/ci_cd
- part2: Rename DEB_INSTALL_ENABLE to ___DEB_INSTALL_ENABLE in https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/settings/ci_cd
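Renaming a variable instead of deleting it works as a rollback because the pipelines look the variable up by its exact name, while the stored value is kept for a later re-enable. A minimal sketch of that idea, using a plain dict to stand in for the CI/CD variable settings (the helper names are hypothetical, and the exact-name, non-empty lookup is an assumption):

```python
def is_enabled(ci_variables, name):
    """True when the exact variable name is present with a non-empty value."""
    return bool(ci_variables.get(name))


def rollback_rename(ci_variables, name, prefix="___"):
    """Return a copy of the variables with `name` renamed to prefix + name."""
    renamed = dict(ci_variables)
    if name in renamed:
        # The value is preserved under the new name, so re-enabling later
        # is just a rename back.
        renamed[prefix + name] = renamed.pop(name)
    return renamed


before = {"DEB_INSTALL_ENABLE": "true"}
after = rollback_rename(before, "DEB_INSTALL_ENABLE")
print(is_enabled(before, "DEB_INSTALL_ENABLE"), is_enabled(after, "DEB_INSTALL_ENABLE"))
```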
Monitoring
- Omnibus builder jobs https://dev.gitlab.org/gitlab/omnibus-gitlab/-/jobs
- Deploy tooling pipelines https://ops.gitlab.net/gitlab-com/gl-infra/deploy-tooling/-/pipelines
- Also monitor #announcements for upgrade events.
Change Reviewer checklist
Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The labels blocks deployments and/or blocks feature-flags are applied as necessary
Change Technician checklist
Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- The change execution window respects the Production Change Lock periods.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are severity1 or severity2
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.