Migrate Camoproxy to K8s in GPRD
Production Change
Change Summary
We want to migrate camoproxy from VMs into K8s. Epic: &90 (closed)
The basic steps are as follows:
- change the asset-proxy configuration of GitLab to point to the new DNS entry (the current value can be read back as sketched below)
- invalidate the markdown cache to make all existing external images use the new DNS entry
- point the existing DNS entry `user-content.gitlab-static.net` to the new ingress IP and create an SSL cert for it on the new ingress LB
- switch the asset-proxy configuration of GitLab back to point to the old DNS entry again
- invalidate the markdown cache again
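A quick way to read back the current configuration before and after each switch - a minimal sketch, assuming an admin PAT exported as `ADMIN_TOKEN` as in the steps below; the fields shown are the application-settings attributes this plan touches:

```shell
# read back the asset-proxy settings and the current markdown cache version
curl --silent --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" \
  "https://gitlab.com/api/v4/application/settings" \
  | jq '{asset_proxy_enabled, asset_proxy_url, local_markdown_version}'
```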
Risks
We only see 10-20 requests/s to camoproxy, which means only a very small percentage of requests use camoproxy at all. If camoproxy fails, users would see empty content in place of the external image, but no other GitLab functionality would be compromised and the rest of the page would still render. Thus the risk of the change itself is relatively small.
There is a DB performance degradation risk though: we need to invalidate the markdown cache, which will lead to increased writes on the primary DB over the next hours and days - all markdown will be re-rendered and the cache entry in the DB updated at the first request after the invalidation. See gitlab-org/gitlab#330313 (closed) for some context.
Thus we should consider executing the change during low-traffic times or on a weekend, and get a review from the database team.
Change Details
- Services Impacted - ~"ServiceCamoproxy"
- Change Technician - @ahyield
- Change Reviewer - @skarbek
- Time tracking - 4h
- Downtime Component - none
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 180m
- Make sure you have access to the AWS console (for the Route 53 DNS entry of `user-content.gitlab-static.net`)
- Announce that no one should deploy changes to gitlab-helmfiles while this change is ongoing
- Set label ~change::in-progress: `/label ~change::in-progress`
- Obtain a gitlab.com admin PAT and export it in your shell: `export ADMIN_TOKEN='<XXXXX>'`
- Switch to the k8s context for gprd:

  ```shell
  glsh kube use-cluster gprd
  ```

- Check that the temporary cert exists and is active:

  ```shell
  kubectl -n camoproxy describe managedcertificates.networking.gke.io managed-cert-tmp
  ```
- Change the asset-proxy config in gprd to use the new URL. New external images will use the new proxy from now on; old markdown is still cached and will keep using the old proxy:

  ```shell
  curl --request "PUT" "https://gitlab.com/api/v4/application/settings?asset_proxy_url=https://user-content-2.gitlab-static.net" --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}"
  ```

- Restart the web pods in all clusters to make the config active (the verification sketch after this list shows how to confirm each rollout has finished):

  ```shell
  glsh kube use-cluster gprd-us-east1-b
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  glsh kube use-cluster gprd-us-east1-c
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  # wait for completion of the rollouts above before doing the next 2!
  glsh kube use-cluster gprd-us-east1-d
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  glsh kube use-cluster gprd
  kubectl -n gitlab-cny rollout restart deployment/gitlab-cny-webservice-web
  # wait for completion of the rollout!
  ```
- Test with henri.philipps/hptest1#1: adding a new image should now point to `user-content-2.gitlab-static.net` instead of `user-content.gitlab-static.net`, while existing images should still point to the old proxy.
- Invalidate the old markdown cache in the DB by increasing the markdown version (this will cause increased DB writes over the next hours and switch all requests over to the new proxy!):

  ```shell
  # get current markdown version (should be `5`)
  curl --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" "https://gitlab.com/api/v4/application/settings" | jq | grep markdown
  # bump markdown version to `6` (if it was `5` previously, see above)
  curl --request PUT --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" "https://gitlab.com/api/v4/application/settings?local_markdown_version=6"
  ```
- Test with henri.philipps/hptest1#1: old images should point to `user-content-2.gitlab-static.net` now too. Use `curl` to check - browsers show strange caching behaviour.
- Check in Thanos that traffic is switching from the VMs to K8s.
- Manually point the DNS entry of the old domain (`user-content.gitlab-static.net`) to the new ingress LB IP in the AWS console (remember the old value - it should be `35.190.114.86`).
- Enable and deploy `managed-cert` for `user-content.gitlab-static.net` on the ingress LB (use the verification sketch after this list to confirm DNS and the cert):

  ```shell
  # first remove the existing unvalidated managed-cert to speed up validation
  cat <<EOF | kubectl -n camoproxy delete -f -
  apiVersion: networking.gke.io/v1
  kind: ManagedCertificate
  metadata:
    name: managed-cert
  EOF
  # now create it again
  cat <<EOF | kubectl -n camoproxy apply -f -
  apiVersion: networking.gke.io/v1
  kind: ManagedCertificate
  metadata:
    name: managed-cert
  spec:
    domains:
      - user-content.gitlab-static.net
  EOF
  ```
- Add `managed-cert` to the ingress:

  ```shell
  kubectl -n camoproxy annotate --overwrite ingress camoproxy networking.gke.io/managed-certificates='managed-cert-tmp,managed-cert'
  ```

- Wait for the cert to be provisioned and active (can take up to 1h):

  ```shell
  kubectl -n camoproxy describe managedcertificates.networking.gke.io managed-cert
  ```

- Test that we see a valid cert for https://user-content.gitlab-static.net/d3e75310678bc10a9b271811add3c37e1afafdf0/68747470733a2f2f75706c6f61642e77696b696d656469612e6f72672f77696b6970656469612f636f6d6d6f6e732f662f66302f5468655f476f5f476f706865722e6a7067
- Change the asset-proxy config in gprd back to use the old URL. New external images will use the old URL again from now on; recently created markdown is still cached and will keep using the temporary URL:

  ```shell
  curl --request "PUT" "https://gitlab.com/api/v4/application/settings?asset_proxy_url=https://user-content.gitlab-static.net" --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}"
  ```
- Restart the web pods in all clusters to make the config active:

  ```shell
  glsh kube use-cluster gprd-us-east1-b
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  glsh kube use-cluster gprd-us-east1-c
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  # wait for completion of the rollouts above before doing the next 2!
  glsh kube use-cluster gprd-us-east1-d
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  glsh kube use-cluster gprd
  kubectl -n gitlab-cny rollout restart deployment/gitlab-cny-webservice-web
  # wait for completion of the rollout!
  ```
- Test with henri.philipps/hptest1#1: adding a new image should now point to `user-content.gitlab-static.net` instead of `user-content-2.gitlab-static.net`, while previously existing images should stay unchanged because of the markdown cache.
- Invalidate the markdown cache again (this shouldn't cause much additional strain, as most markdown has not been re-rendered since the first invalidation anyway):

  ```shell
  # get current markdown version (should be `6` now)
  curl --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" "https://gitlab.com/api/v4/application/settings" | jq | grep markdown
  # bump markdown version to `7` (if it was `6` previously, see above)
  curl --request PUT --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" "https://gitlab.com/api/v4/application/settings?local_markdown_version=7"
  ```

- Test with henri.philipps/hptest1#1: all images should point to `user-content.gitlab-static.net` now. Use `curl` to check - browsers show strange caching behaviour.
- Check for any issues (see the monitoring section).
- Set label ~change::complete: `/label ~change::complete`
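Verification sketch (referenced from the rollout, DNS, and cert steps above): a minimal set of read-only checks, assuming the kube context is already pointed at the relevant cluster. It only uses names that already appear in this plan; for the canary cluster the namespace and deployment name differ (`gitlab-cny` / `gitlab-cny-webservice-web`).

```shell
# wait until the web rollout in the current cluster context has finished
kubectl -n gitlab rollout status deployment/gitlab-webservice-web

# confirm the old domain resolves to the new ingress LB IP after the DNS change
dig +short user-content.gitlab-static.net

# confirm the managed cert is Active
kubectl -n camoproxy get managedcertificates.networking.gke.io managed-cert

# confirm the proxy serves the test image over HTTPS with a valid cert
curl -sI "https://user-content.gitlab-static.net/d3e75310678bc10a9b271811add3c37e1afafdf0/68747470733a2f2f75706c6f61642e77696b696d656469612e6f72672f77696b6970656469612f636f6d6d6f6e732f662f66302f5468655f476f5f476f706865722e6a7067" | head -n 1
```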
Post Change Steps
- Switch to the k8s context for gprd:

  ```shell
  glsh kube use-cluster gprd
  ```

- Remove `managed-cert-tmp` from the ingress (we only keep `managed-cert`):

  ```shell
  kubectl -n camoproxy annotate --overwrite ingress camoproxy networking.gke.io/managed-certificates='managed-cert'
  ```

- Remove `managed-cert-tmp` (see the sanity check sketched after this list):

  ```shell
  cat <<EOF | kubectl -n camoproxy delete -f -
  apiVersion: networking.gke.io/v1
  kind: ManagedCertificate
  metadata:
    name: managed-cert-tmp
  EOF
  ```
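As a quick sanity check after the cleanup - a small read-only sketch using the same namespace and ingress name as above:

```shell
# the ingress annotation should now only reference managed-cert
kubectl -n camoproxy get ingress camoproxy -o yaml | grep managed-certificates

# only managed-cert should be left; managed-cert-tmp should be gone
kubectl -n camoproxy get managedcertificates.networking.gke.io
```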
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 10m
The rollback steps depend heavily on the point from which we need to roll back, so no detailed plan is given here - we would essentially repeat the relevant steps above in reverse. If we get into serious trouble at a late stage of this CR, one option would be to disable camoproxy completely.
- Disable camoproxy in GitLab (external images will be linked directly, which could lead to security notifications in browsers; the sketch after this list shows how to confirm the setting):

  ```shell
  curl --request PUT --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" "https://gitlab.com/api/v4/application/settings?asset_proxy_enabled=false"
  ```

- Restart the web pods in all clusters to make the config active:

  ```shell
  glsh kube use-cluster gprd-us-east1-b
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  glsh kube use-cluster gprd-us-east1-c
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  # wait for completion of the rollouts above before doing the next 2!
  glsh kube use-cluster gprd-us-east1-d
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  glsh kube use-cluster gprd
  kubectl -n gitlab-cny rollout restart deployment/gitlab-cny-webservice-web
  # wait for completion of the rollout!
  ```

- Invalidate the markdown cache:

  ```shell
  # get current markdown version (should be `7` now)
  curl --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" "https://gitlab.com/api/v4/application/settings" | jq | grep markdown
  # bump markdown version to `8` (if it was `7` previously)
  curl --request PUT --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" "https://gitlab.com/api/v4/application/settings?local_markdown_version=8"
  ```

- Set label ~change::aborted: `/label ~change::aborted`
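If the full-disable path is taken, the setting can be confirmed with a quick read-back - a sketch assuming the same `ADMIN_TOKEN` as above:

```shell
# asset proxying should now report as disabled
curl --silent --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" \
  "https://gitlab.com/api/v4/application/settings" | jq '.asset_proxy_enabled'
```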
Monitoring
Key metrics to observe
- Metric: Camoproxy Dashboard
  - Location: Grafana
  - What changes to this metric should prompt a rollback: any significant drop in requests or increase in errors
- Logs: Camoproxy Logs
  - Location: Elastic
  - What changes should prompt a rollback: any significant drop in requests or increase in errors (see also the quick pod check sketched below)
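Besides the dashboards, a quick look at the camoproxy pods can help spot problems early. A sketch - the deployment name `camoproxy` is an assumption; check `kubectl -n camoproxy get deploy` for the actual name:

```shell
# pods should be Running and ready
kubectl -n camoproxy get pods

# tail recent logs for errors (deployment name `camoproxy` is assumed)
kubectl -n camoproxy logs deployment/camoproxy --tail=100 | grep -i error
```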
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - Release managers have been informed (if needed - cases include DB changes) prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity1 or severity2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.