# 2023-09-18: Migrate packages.gitlab.com to k8s/GCP

Production Change

## Change Summary

This CR cuts over packages.gitlab.com, taking us from an unscalable and somewhat brittle service hosted on a single VM (plus a couple of managed services) on AWS to GKE, the way we run most of our other services.

We hope these changes will result in a more reliable and scalable packages.gitlab.com.

See https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24245 for more info.

- Issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24348
- Epic: &922 (closed)
## Architecture

**FROM**

![](/-/project/7444821/uploads/1976967692fc5b1d22d337909d085720/image.png)

**TO**

![](/-/project/7444821/uploads/426a767fe93fbb70be7e1c613794288d/image.png)
## Change Details

- **Services Impacted** - Service::PackageCloud
- **Change Technician** - @gsgl
- **Change Reviewer** - @f_santos
- **Time tracking** - 240 min to roll forward / 30 min to roll back
- **Downtime Component** - internal only: no package uploads for approximately 2-3 hours. Users should face no downtime downloading packages from our repositories.
## Detailed steps for the change

### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 240
- [ ] Set label `~change::in-progress`: `/label ~change::in-progress`
- [ ] Stop chef-client on the packages VM: `chef-client-disable "Migrating to k8s: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16271"`
- [ ] Stop resque (ensure no package indexing takes place from this point): `packagecloud-ctl stop resque`
- [ ] Stop rainbows (ensure no package uploads take place from this point): `packagecloud-ctl stop rainbows`
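  Before moving on, it may be worth confirming both services are actually down. A minimal sketch, assuming this omnibus-style `packagecloud-ctl` supports the usual `status` subcommand:

  ```shell
  # Expect resque and rainbows to be reported as down/stopped
  packagecloud-ctl status
  ```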
- [ ] On the packages VM, export all DB data except the `downloads` table using the `db_dump.sh` script. This should take approximately 20 minutes to run.

  ```shell
  OUTPUT_FILE=all-except-downloads.gz /var/opt/packagecloud/database-backups/db_dump.sh all-except-downloads
  ```
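  Before uploading, a quick sanity check on the dump can save a failed import later. A minimal sketch (the output path is an assumption based on the script's `OUTPUT_FILE` variable):

  ```shell
  # Verify the archive is a valid gzip file and eyeball its size
  gzip -t /var/opt/packagecloud/database-backups/all-except-downloads.gz && echo "gzip OK"
  ls -lh /var/opt/packagecloud/database-backups/all-except-downloads.gz
  ```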
- [ ] On the packages VM, upload the DB dump to GCS. This should take less than a minute. `gcloud` is already configured in `/root/.config/gcloud` with the right credentials to push into that bucket.

  ```shell
  cd /root
  google-cloud-sdk/bin/gcloud storage cp /var/opt/packagecloud/database-backups/all-except-downloads.gz gs://packagecloud-db-dumps/
  ```
- [ ] Scale all packagecloud deployments to 0. This is to avoid anything writing to the DB while we're restoring.

  ```shell
  kubectl -n packagecloud scale --replicas=0 deployments/packagecloud-rainbows deployments/packagecloud-web deployments/packagecloud-resque deployments/packagecloud-toolbox
  ```
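  A quick way to confirm everything has scaled down (a sketch; expect `READY 0/0` on all four deployments):

  ```shell
  kubectl -n packagecloud get deployments
  ```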
- [ ] Open https://console.cloud.google.com/sql/instances/packagecloud-f05c90f5/databases?project=gitlab-ops and delete the `packages_onpremise` database. This is quicker than dropping and recreating every table.
- [ ] On the same page, create a new database with these parameters:
  - Database Name: `packages_onpremise`
  - Character Set: `utf8mb4`
  - Collation: `utf8mb4_general_ci`
- [ ] Open https://console.cloud.google.com/sql/instances/packagecloud-f05c90f5/import?project=gitlab-ops and start the import:
  - Source: Click Browse -> Select "packagecloud-db-dumps" -> `all-except-downloads.gz`
  - File format: SQL
  - Destination Database: `packages_onpremise`
  - Click Import

  The data import takes approximately 1.5 hours. You can monitor the progress at https://console.cloud.google.com/sql/instances/packagecloud-f05c90f5/operations?project=gitlab-ops.
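  If the console is unavailable or you prefer the CLI, the same import can likely be triggered with `gcloud sql import sql`. A minimal sketch, assuming the operator's account has sufficient Cloud SQL permissions on the `gitlab-ops` project:

  ```shell
  gcloud sql import sql packagecloud-f05c90f5 \
    gs://packagecloud-db-dumps/all-except-downloads.gz \
    --database=packages_onpremise \
    --project=gitlab-ops
  ```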
- [ ] Once the import has finished, scale all packagecloud deployments to 1. The HPA should take care of adding more pods.

  ```shell
  kubectl -n packagecloud scale --replicas=1 deployments/packagecloud-rainbows deployments/packagecloud-web deployments/packagecloud-resque deployments/packagecloud-toolbox
  ```
- [ ] Wait for the packagecloud pods to be in `Running` state.
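  Rather than polling by hand, `kubectl wait` can block until the pods are ready. A sketch (the timeout value is an arbitrary assumption):

  ```shell
  kubectl -n packagecloud wait pods --all --for=condition=Ready --timeout=600s
  ```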
- [ ] Perform basic validation:
  - Open your browser.
  - Use your favorite plugin to add the header `route-to-k8s: true` to outgoing requests.
  - Open https://packages.gitlab.com in a new tab.
  - Log in as an admin.
  - Open https://packages.gitlab.com/admin/indexers and ensure worker names start with `packagecloud-resque-XXXX`. NOTE: if the worker names have the format `packages:NNNNN:indexer/delete` then STOP! You're looking at the VM, so figure out why before proceeding.
  - You should see `gitlab-ce`, `gitlab-ee`, etc. with package counts. Compare these with the previous tab; they should be identical.
  - Click into `gitlab-ce` and then the first package on the list.
  - Take note of the number of downloads just above the `wget` command on the right-hand side.
  - Download the package using the `wget` command, ensuring you add the `route-to-k8s: true` header. For example:

    ```shell
    wget \
      --header="route-to-k8s: true" \
      --content-disposition \
      https://packages.gitlab.com/gitlab/gitlab-ce/packages/debian/bullseye/gitlab-ce_16.3.2-ce.0_amd64.deb/download.deb
    ```

  - Reload the page and you should see the counter increase by 1.
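  For a quick headless spot-check alongside the browser walkthrough, a `curl` probe with the same header can confirm the service answers requests routed to k8s. A sketch (exactly which response headers distinguish the k8s backend from the VM is an assumption to verify):

  ```shell
  # Expect a healthy HTTP status; inspect response headers for backend hints
  curl -sI --header "route-to-k8s: true" https://packages.gitlab.com/ | head -n 20
  ```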
- [ ] On the packages VM, export the `downloads` table. This should take approximately 8-10 minutes to run.

  ```shell
  OUTPUT_FILE=only-downloads.gz /var/opt/packagecloud/database-backups/db_dump.sh only-downloads
  ```
- [ ] On the packages VM, upload the DB dump to GCS. This should take 20-30 seconds.

  ```shell
  cd /root
  google-cloud-sdk/bin/gcloud storage cp /var/opt/packagecloud/database-backups/only-downloads.gz gs://packagecloud-db-dumps/
  ```
- [ ] Open https://console.cloud.google.com/sql/instances/packagecloud-f05c90f5/import?project=gitlab-ops and start the import:
  - Source: Click Browse -> Select "packagecloud-db-dumps" -> `only-downloads.gz`
  - File format: SQL
  - Destination Database: `packages_onpremise`
  - Click Import

  The data import will take up to an hour. You can monitor the progress at https://console.cloud.google.com/sql/instances/packagecloud-f05c90f5/operations?project=gitlab-ops.
- [ ] Wait until the import is complete by checking the Operations page.
- [ ] Merge the following MR to point packages.gitlab.com at GCP/k8s: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/6661
- [ ] Remove the old DNS record from the packagecloud state:

  ```shell
  cd /path/to/config-mgmt/environments/packagecloud
  tf state rm 'cloudflare_record.packagecloud_dns'
  ```
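  To confirm the resource is gone from state before merging the removal MR, a sketch using the same `tf` wrapper:

  ```shell
  # Should print "not in state" once the record has been removed
  tf state list | grep packagecloud_dns || echo "not in state"
  ```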
- [ ] Merge the following MR to remove the old packages.gitlab.com record from `environments/packagecloud`: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/6711
- [ ] Monitor the nginx log on the VM. Traffic should eventually cease as DNS propagates: `tail -f /var/log/packagecloud/nginx/packagecloud_access.log`
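  Propagation can also be watched from the resolver side. A sketch (the expectation that the answer flips from the AWS VM's address to the GCP entrypoint is an assumption based on this CR's architecture):

  ```shell
  # Re-run periodically; the answer should change once the new record propagates
  dig +short packages.gitlab.com
  ```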
- [ ] Test that you can upload a package using the `package_cloud` CLI into your own personal repo, as in the sketch below.
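  A minimal sketch of such a test upload; the repo name and package file are hypothetical placeholders:

  ```shell
  # Push a throwaway .deb into a personal repo; clean it up afterwards
  package_cloud push <your-user>/<test-repo>/debian/bullseye ./dummy_1.0_amd64.deb
  ```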
- [ ] Test that you can configure the GitLab repo and install a package from it:

  ```shell
  curl -s https://packages.gitlab.com/install/repositories/gitlab/gitlab-ee/script.deb.sh | bash
  apt-get install gitlab-ee
  ```

- [ ] Set label `~change::complete`: `/label ~change::complete`
## Rollback

To avoid interfering with the release preparation date, we are setting a hard deadline of 2023-09-18 06:00 UTC. If the change is not complete by then, we will perform the following rollback steps and coordinate a new date to attempt the CR again.

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 30
Rollback should ideally take place before any new packages are uploaded. If any new packages have been uploaded since the cutover, they should be re-uploaded once the rollback steps are complete.

- [ ] Ensure services are running on the VM:

  ```shell
  packagecloud-ctl start resque
  packagecloud-ctl start rainbows
  ```

- [ ] Revert https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/6661
- [ ] Test https://packages.gitlab.com by performing tests similar to the ones carried out during the roll forward.
- [ ] Set label `~change::aborted`: `/label ~change::aborted`
## Monitoring

### Key metrics to observe

- Metric: 500 errors
  - Location: https://dash.cloudflare.com/852e9d53d0f8adbd9205389356f2303d/gitlab.com/analytics/traffic?host=packages.gitlab.com&status-code~geq=500
  - What changes to this metric should prompt a rollback: a concerning rise in the number of 500 errors.
- Metric: loadbalancer SLI error ratio
  - Location: https://dashboards.gitlab.net/d/packagecloud-main/packagecloud-overview
  - What changes to this metric should prompt a rollback: a concerning rise in the number of errors.
- Metric: CloudSQL metrics (near the bottom of the page)
  - Location: https://dashboards.gitlab.net/d/packagecloud-main/packagecloud-overview
  - What changes to this metric should prompt a rollback: we may need to upsize the instance if it is overwhelmed, but if upsizing still doesn't alleviate the performance bottleneck then we may need to look at rolling back.
- Metric: HPA current replicas vs max replicas
  - Location: Thanos
  - What changes to this metric should prompt a rollback: we expect to have to tweak min/max replicas and pod resources. If we cannot strike a good balance with these settings and we're getting close to the deadline, we should consider rolling back.
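During the change window, HPA saturation can also be checked directly from the cluster rather than Thanos. A sketch:

```shell
# Compare the REPLICAS column against MAXPODS; pinning at max suggests resource tuning is needed
kubectl -n packagecloud get hpa --watch
```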
## Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels blocks deployments and/or blocks feature-flags are applied as necessary.
## Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity1 or severity2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.