Migrate Camoproxy to K8s in GPRD
Production Change
Change Summary
We want to migrate camoproxy from VMs into K8s. Epic: &90 (closed)
The basic steps are as follows:
- change the asset-proxy configuration of GitLab to point to the new DNS entry (the current value can be read back as sketched below)
- invalidate the markdown cache to make all existing external images use the new DNS entry
- point the existing DNS entry `user-content.gitlab-static.net` to the new ingress IP and create an SSL cert for it on the new ingress LB
- switch the asset-proxy configuration of GitLab back to point to the old DNS entry again
- invalidate the markdown cache again
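A quick way to read back the current configuration before and after each switch - a minimal sketch, assuming an admin PAT exported as `ADMIN_TOKEN` as in the steps below; the fields shown are the application-settings attributes this plan touches:

```shell
# read back the asset-proxy settings and the current markdown cache version
curl --silent --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" \
  "https://gitlab.com/api/v4/application/settings" \
  | jq '{asset_proxy_enabled, asset_proxy_url, local_markdown_version}'
```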
Risks
We only see 10-20 requests/s to camoproxy, which means only a very small percentage of requests use camoproxy at all. If camoproxy fails, users would see empty content in place of the external image, but no other GitLab functionality would be compromised and the rest of the page would still render. Thus the risk of the change itself is relatively small.
There is a DB performance degradation risk though: we need to invalidate the markdown cache, which will lead to increased writes on the primary DB over the next hours and days - all markdown will be re-rendered and the cache entry in the DB updated at the first request after the invalidation. See gitlab-org/gitlab#330313 (closed) for some context.
Thus we should consider executing the change during low-traffic times or on a weekend, and get a review from the database team.
Change Details
- Services Impacted - ~"ServiceCamoproxy"
- Change Technician - @ahyield
- Change Reviewer - @skarbek
- Time tracking - 4h
- Downtime Component - none
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 180m
- Make sure you have access to the AWS console (for the Route 53 DNS entry of `user-content.gitlab-static.net`)
- Announce that no one should deploy changes to gitlab-helmfiles while this change is ongoing
- Set label ~change::in-progress: `/label ~change::in-progress`
- Obtain a gitlab.com admin PAT and export it in your shell: `export ADMIN_TOKEN='<XXXXX>'`
- Switch to the k8s context for gprd:

  ```shell
  glsh kube use-cluster gprd
  ```

- Check that the temporary cert exists and is active:

  ```shell
  kubectl -n camoproxy describe managedcertificates.networking.gke.io managed-cert-tmp
  ```
- Change the asset-proxy config in gprd to use the new URL. New external images will use the new proxy from now on; old markdown is still cached and will keep using the old proxy:

  ```shell
  curl --request "PUT" "https://gitlab.com/api/v4/application/settings?asset_proxy_url=https://user-content-2.gitlab-static.net" --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}"
  ```

- Restart the web pods in all clusters to make the config active (the verification sketch after this list shows how to confirm each rollout has finished):

  ```shell
  glsh kube use-cluster gprd-us-east1-b
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  glsh kube use-cluster gprd-us-east1-c
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  # wait for completion of the rollouts above before doing the next 2!
  glsh kube use-cluster gprd-us-east1-d
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  glsh kube use-cluster gprd
  kubectl -n gitlab-cny rollout restart deployment/gitlab-cny-webservice-web
  # wait for completion of the rollout!
  ```
- Test with henri.philipps/hptest1#1: adding a new image should now point to `user-content-2.gitlab-static.net` instead of `user-content.gitlab-static.net`, while existing images should still point to the old proxy.
- Invalidate the old markdown cache in the DB by increasing the markdown version (this will cause increased DB writes over the next hours and switch all requests over to the new proxy!):

  ```shell
  # get current markdown version (should be `5`)
  curl --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" "https://gitlab.com/api/v4/application/settings" | jq | grep markdown
  # bump markdown version to `6` (if it was `5` previously, see above)
  curl --request PUT --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" "https://gitlab.com/api/v4/application/settings?local_markdown_version=6"
  ```
- Test with henri.philipps/hptest1#1: old images should point to `user-content-2.gitlab-static.net` now too. Use `curl` to check - browsers show strange caching behaviour.
- Check in Thanos that traffic is switching from the VMs to K8s.
- Manually point the DNS entry of the old domain (`user-content.gitlab-static.net`) to the new ingress LB IP in the AWS console (remember the old value - it should be `35.190.114.86`).
- Enable and deploy `managed-cert` for `user-content.gitlab-static.net` on the ingress LB (use the verification sketch after this list to confirm DNS and the cert):

  ```shell
  # first remove the existing unvalidated managed-cert to speed up validation
  cat <<EOF | kubectl -n camoproxy delete -f -
  apiVersion: networking.gke.io/v1
  kind: ManagedCertificate
  metadata:
    name: managed-cert
  EOF
  # now create it again
  cat <<EOF | kubectl -n camoproxy apply -f -
  apiVersion: networking.gke.io/v1
  kind: ManagedCertificate
  metadata:
    name: managed-cert
  spec:
    domains:
      - user-content.gitlab-static.net
  EOF
  ```
- Add `managed-cert` to the ingress:

  ```shell
  kubectl -n camoproxy annotate --overwrite ingress camoproxy networking.gke.io/managed-certificates='managed-cert-tmp,managed-cert'
  ```

- Wait for the cert to be provisioned and active (can take up to 1h):

  ```shell
  kubectl -n camoproxy describe managedcertificates.networking.gke.io managed-cert
  ```

- Test that we see a valid cert for https://user-content.gitlab-static.net/d3e75310678bc10a9b271811add3c37e1afafdf0/68747470733a2f2f75706c6f61642e77696b696d656469612e6f72672f77696b6970656469612f636f6d6d6f6e732f662f66302f5468655f476f5f476f706865722e6a7067
- Change the asset-proxy config in gprd back to use the old URL. New external images will use the old URL again from now on; recently created markdown is still cached and will keep using the temporary URL:

  ```shell
  curl --request "PUT" "https://gitlab.com/api/v4/application/settings?asset_proxy_url=https://user-content.gitlab-static.net" --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}"
  ```
- Restart the web pods in all clusters to make the config active:

  ```shell
  glsh kube use-cluster gprd-us-east1-b
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  glsh kube use-cluster gprd-us-east1-c
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  # wait for completion of the rollouts above before doing the next 2!
  glsh kube use-cluster gprd-us-east1-d
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  glsh kube use-cluster gprd
  kubectl -n gitlab-cny rollout restart deployment/gitlab-cny-webservice-web
  # wait for completion of the rollout!
  ```
- Test with henri.philipps/hptest1#1: adding a new image should now point to `user-content.gitlab-static.net` instead of `user-content-2.gitlab-static.net`, while previously existing images should stay unchanged because of the markdown cache.
- Invalidate the markdown cache again (this shouldn't cause much additional strain, as most markdown has not been re-rendered since the first invalidation anyway):

  ```shell
  # get current markdown version (should be `6` now)
  curl --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" "https://gitlab.com/api/v4/application/settings" | jq | grep markdown
  # bump markdown version to `7` (if it was `6` previously, see above)
  curl --request PUT --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" "https://gitlab.com/api/v4/application/settings?local_markdown_version=7"
  ```

- Test with henri.philipps/hptest1#1: all images should point to `user-content.gitlab-static.net` now. Use `curl` to check - browsers show strange caching behaviour.
- Check for any issues (see the monitoring section).
- Set label ~change::complete: `/label ~change::complete`
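Verification sketch (referenced from the rollout, DNS, and cert steps above): a minimal set of read-only checks, assuming the kube context is already pointed at the relevant cluster. It only uses names that already appear in this plan; for the canary cluster the namespace and deployment name differ (`gitlab-cny` / `gitlab-cny-webservice-web`).

```shell
# wait until the web rollout in the current cluster context has finished
kubectl -n gitlab rollout status deployment/gitlab-webservice-web

# confirm the old domain resolves to the new ingress LB IP after the DNS change
dig +short user-content.gitlab-static.net

# confirm the managed cert is Active
kubectl -n camoproxy get managedcertificates.networking.gke.io managed-cert

# confirm the proxy serves the test image over HTTPS with a valid cert
curl -sI "https://user-content.gitlab-static.net/d3e75310678bc10a9b271811add3c37e1afafdf0/68747470733a2f2f75706c6f61642e77696b696d656469612e6f72672f77696b6970656469612f636f6d6d6f6e732f662f66302f5468655f476f5f476f706865722e6a7067" | head -n 1
```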
Post Change Steps
- Switch to the k8s context for gprd:

  ```shell
  glsh kube use-cluster gprd
  ```

- Remove `managed-cert-tmp` from the ingress (we only keep `managed-cert`):

  ```shell
  kubectl -n camoproxy annotate --overwrite ingress camoproxy networking.gke.io/managed-certificates='managed-cert'
  ```

- Remove `managed-cert-tmp` (see the sanity check sketched after this list):

  ```shell
  cat <<EOF | kubectl -n camoproxy delete -f -
  apiVersion: networking.gke.io/v1
  kind: ManagedCertificate
  metadata:
    name: managed-cert-tmp
  EOF
  ```
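As a quick sanity check after the cleanup - a small read-only sketch using the same namespace and ingress name as above:

```shell
# the ingress annotation should now only reference managed-cert
kubectl -n camoproxy get ingress camoproxy -o yaml | grep managed-certificates

# only managed-cert should be left; managed-cert-tmp should be gone
kubectl -n camoproxy get managedcertificates.networking.gke.io
```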
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 10m
The rollback steps depend heavily on the point from which we need to roll back, so no detailed plan is given here - we would essentially repeat the relevant steps above in reverse. If we get into serious trouble at a late stage of this CR, one option would be to disable camoproxy completely.
- Disable camoproxy in GitLab (external images will be linked directly, which could lead to security notifications in browsers; the sketch after this list shows how to confirm the setting):

  ```shell
  curl --request PUT --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" "https://gitlab.com/api/v4/application/settings?asset_proxy_enabled=false"
  ```

- Restart the web pods in all clusters to make the config active:

  ```shell
  glsh kube use-cluster gprd-us-east1-b
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  glsh kube use-cluster gprd-us-east1-c
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  # wait for completion of the rollouts above before doing the next 2!
  glsh kube use-cluster gprd-us-east1-d
  kubectl -n gitlab rollout restart deployment/gitlab-webservice-web
  glsh kube use-cluster gprd
  kubectl -n gitlab-cny rollout restart deployment/gitlab-cny-webservice-web
  # wait for completion of the rollout!
  ```

- Invalidate the markdown cache:

  ```shell
  # get current markdown version (should be `7` now)
  curl --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" "https://gitlab.com/api/v4/application/settings" | jq | grep markdown
  # bump markdown version to `8` (if it was `7` previously)
  curl --request PUT --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" "https://gitlab.com/api/v4/application/settings?local_markdown_version=8"
  ```

- Set label ~change::aborted: `/label ~change::aborted`
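If the full-disable path is taken, the setting can be confirmed with a quick read-back - a sketch assuming the same `ADMIN_TOKEN` as above:

```shell
# asset proxying should now report as disabled
curl --silent --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" \
  "https://gitlab.com/api/v4/application/settings" | jq '.asset_proxy_enabled'
```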
Monitoring
Key metrics to observe
- Metric: Camoproxy Dashboard
  - Location: Grafana
  - What changes to this metric should prompt a rollback: any significant drop in requests or increase in errors
- Logs: Camoproxy Logs
  - Location: Elastic
  - What changes should prompt a rollback: any significant drop in requests or increase in errors (see also the quick pod check sketched below)
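Besides the dashboards, a quick look at the camoproxy pods can help spot problems early. A sketch - the deployment name `camoproxy` is an assumption; check `kubectl -n camoproxy get deploy` for the actual name:

```shell
# pods should be Running and ready
kubectl -n camoproxy get pods

# tail recent logs for errors (deployment name `camoproxy` is assumed)
kubectl -n camoproxy logs deployment/camoproxy --tail=100 | grep -i error
```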
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - Release managers have been informed (if needed - cases include DB changes) prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity1 or severity2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.