2021-10-28: Enable registry DB in gprd
Production Change
Change Summary
After finishing the readiness review and extensive testing in pre and gstg, we want to enable the Registry metadata DB in gprd with this CR, for Epic &577 (closed), delivery#2023 (closed).
This will enable us to migrate container registry metadata from a GCS bucket over to the DB later in a controlled, gradual way.
This CR enables the metadata DB in the container registry configuration, but does not trigger any metadata migrations; those will be done later, controlled by feature flags. In practice, the registry will only start issuing DB read requests from its GC background workers to check for garbage collection tasks (there won't be any in the DB yet), but nothing else.
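To make that concrete, the only queries the registry should start issuing are the online GC workers polling their review queues, both of which will be empty. A sketch of that check (the queue table names are taken from the registry's metadata schema and are assumptions about the exact version deployed):

```sql
-- Sketch: what the GC background workers poll for. Both queues should be
-- empty, since no metadata migrations have run yet.
SELECT count(*) FROM gc_blob_review_queue;
SELECT count(*) FROM gc_manifest_review_queue;
```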
Risks
There are two kinds of risks: deployment related (low severity) and application related (high severity).
- Deployment related: with the DB configuration enabled, new deployments will try to run DB schema migrations, which might fail. In that case there would be no customer impact: the new pods just wouldn't be scaled up, after 1h Helm would give up on deploying the new release, and we could fix and retry.
- Application related: we are using new code paths and now depend on the DB cluster. But we tested extensively in pre and gstg, and without enabling migrations there should be no customer impact for now even if the DB cluster were to fail, as all data still remains in the bucket.
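Should a deployment get stuck on a failed schema migration, a sketch of how we could confirm it from the Kubernetes side (namespace, deployment name, and label selector are assumptions based on the kubectl commands in the monitoring section below):

```shell
# Did the new registry pods come up? (names as in the monitoring section)
kubectl -n gitlab rollout status deployment/gitlab-registry --timeout=10m

# If not, inspect the pods and their logs for migration/DB errors before
# fixing and retrying (label selector is an assumption about the chart).
kubectl -n gitlab get pods -l app=registry
kubectl -n gitlab logs <failing-registry-pod> | grep -i -E 'migrat|database'
```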
Change Details
- Services Impacted - Service::Container Registry
- Change Technician - @hphilipps
- Change Reviewer - @skarbek
- Time tracking - 60 minutes
- Downtime Component - none
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 5m
- Set label change::in-progress on this issue
- Make sure the MRs are approved: canary, gprd
- Make sure to have coverage by the Package team (@hswimelar or @jdrpereira)
- Manually run DB migrations if we are missing some for the current version, using the two steps below.
- Copy a basic config to a pod at `/tmp/config.yml`:
```yaml
version: 0.1
log:
  accesslog:
    formatter: json
  fields:
    service: registry
  formatter: json
  level: info
database:
  enabled: true
  host: pgbouncer-registry.int.gprd.gitlab.net
  port: 6432
  user: gitlab-registry
  password: <FOUND_IN_GKMS>
  dbname: gitlabhq_registry
  sslmode: disable
  pool:
    maxidle: 5
    maxopen: 5
    maxlifetime: 5m
```
- On the pod, run `registry database migrate up /tmp/config.yml`
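Before running `up`, it can be useful to check which migrations are pending; a sketch, assuming the CLI offers a `status` subcommand alongside `up` (verify with `registry database migrate --help`):

```shell
# List applied and pending schema migrations against the same config
# ('status' subcommand is an assumption; confirm via --help).
registry database migrate status /tmp/config.yml
```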
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 50m
- Merge canary MR
- Watch deployment pipeline
- Make sure nothing is added to the DB: `ssh hphilipps-registry-db@console-01-sv-gprd.c.gitlab-production.internal`, then run `select * from top_level_namespaces;`
- Make sure no `gitlab/` root prefix was created in the GCS bucket: `gsutil ls gs://gitlab-gprd-registry/gitlab` (see the verification sketch after this list)
- Monitor registry in gprd-cny for at least 30m (see monitoring section below)
- Rebase gprd MR
- Ensure the job diffs look as expected (only registry changes for gprd, no changes for canary or the DB password)
- Merge gprd MR
- Watch deployment pipeline
- Make sure nothing is added to the DB: `ssh hphilipps-registry-db@console-01-sv-gprd.c.gitlab-production.internal`, then run `select * from top_level_namespaces;`
- Make sure no `gitlab/` root prefix was created in the GCS bucket: `gsutil ls gs://gitlab-gprd-registry/gitlab` (same checks as for canary above)
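For reference, what a clean result looks like for the DB and bucket checks above (the exact psql invocation on the console host depends on its setup):

```shell
# 1) DB check on the console host: expect zero rows.
ssh hphilipps-registry-db@console-01-sv-gprd.c.gitlab-production.internal
#   select * from top_level_namespaces;   -- expect (0 rows)

# 2) Bucket check: no 'gitlab/' root prefix should exist yet, so gsutil
#    should fail with "CommandException: One or more URLs matched no objects."
gsutil ls gs://gitlab-gprd-registry/gitlab
```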
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 30m
- Monitor registry in gprd for at least 30m (see monitoring section below)
- Wait for the next gprd deploy to happen and watch for any issues with DB migrations or containers failing to connect to the DB (see the sketch below)
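A sketch of what watching that next deploy could look like (label selector is an assumption about the chart's pod labels):

```shell
# Watch registry pods roll over during the next deploy.
kubectl -n gitlab get pods -l app=registry -w

# Scan recent registry logs for DB connection or migration errors.
kubectl -n gitlab logs -l app=registry --since=30m | grep -i -E 'error|database|migrat'
```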
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 10m
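A rollback would amount to reverting the canary and gprd MRs so that the registry config disables the DB again (an assumption based on the shape of this change); a minimal sketch of the resulting setting, assuming the config layout from the pre-change section:

```yaml
# Sketch only: reverting the MRs effectively flips this back in the registry config.
database:
  enabled: false
```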
Monitoring
Key metrics to observe
- Metric: K8s Registry Deployment status
  - Location: Thanos (see the query sketch after this list). Also `kubectl -n gitlab-cny describe deployment gitlab-cny-registry` or `kubectl -n gitlab describe deployment gitlab-registry`
  - What changes to this metric should prompt a rollback: a significant drop of pods
- Metric: K8s Cluster Scaleup errors
  - Location: Grafana Kube dashboard
  - What changes to this metric should prompt a rollback: a constant increase of scaleup errors
- Metric: Container Registry SLOs
  - Location: Registry Overview Dashboard
  - What changes to this metric should prompt a rollback: any significant degradation of Apdex or increase of error rate that could be caused by the change. Note that the DB request rate will be very low, which might skew the DB Apdex, so we should rely more on the garbage-collector SLIs.
- Metric: Container Registry DB Detail
  - Location: Dashboard
  - What changes to this metric should prompt a rollback: DB pool connection saturation or high (>1s) DB wait times (see the pool-check sketch after this list)
- Metric: Patroni Registry DB SLOs
  - Location: Dashboard
  - What changes to this metric should prompt a rollback: any significant SLO degradation or increased error rate
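Two reference sketches for the checks above. First, querying the deployment-status series in Thanos over the HTTP API (`kube_deployment_status_replicas_available` is a standard kube-state-metrics series; the Thanos URL and label values are assumptions):

```shell
# Available replicas of the registry deployment (URL and labels assumed).
curl -sG 'https://thanos.gitlab.net/api/v1/query' \
  --data-urlencode 'query=kube_deployment_status_replicas_available{namespace="gitlab", deployment="gitlab-registry"}'
```

Second, pool saturation can be checked directly on PgBouncer's admin console (`SHOW POOLS;` is a standard PgBouncer admin command; admin access via the `pgbouncer` user is an assumption, host/port are from the config above):

```shell
# Inspect PgBouncer pool usage for the registry DB (admin access assumed).
psql -h pgbouncer-registry.int.gprd.gitlab.net -p 6432 -U pgbouncer pgbouncer -c 'SHOW POOLS;'
```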
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

No new instances are added or resized, but we do enable the usage of an already provisioned new Patroni cluster for the container registry.
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.