2021-10-28: Enable registry DB in gprd
Production Change
Change Summary
After finishing the readiness review and extensive testing in pre and gstg, we want to enable the Registry metadata DB in gprd with this CR, for Epic &577 (closed), delivery#2023 (closed).
This will enable us to migrate container registry metadata from a GCS bucket over to the DB later in a controlled, gradual way.
This CR enables the metadata DB in the container registry configuration, but does not trigger any metadata migrations; those will be done later, controlled by feature flags. In practice, the registry will only start issuing DB read requests from its GC background workers to check for garbage collection tasks (there won't be any in the DB yet), but nothing else.
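To make that concrete, the only queries the registry should start issuing are the online GC workers polling their review queues, both of which will be empty. A sketch of that check (the queue table names are taken from the registry's metadata schema and are assumptions about the exact version deployed):

```sql
-- Sketch: what the GC background workers poll for. Both queues should be
-- empty, since no metadata migrations have run yet.
SELECT count(*) FROM gc_blob_review_queue;
SELECT count(*) FROM gc_manifest_review_queue;
```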
Risks
There are two kinds of risks: deployment related (low severity) and application related (high severity).
- Deployment related: with the DB configuration enabled, new deployments will try to run DB schema migrations, which might fail. In that case there would be no customer impact: the new pods just wouldn't be scaled up, after 1h Helm would give up on deploying the new release, and we could fix and retry.
- Application related: we are using new code paths and now depend on the DB cluster. But we tested extensively in pre and gstg, and without enabling migrations there should be no customer impact for now even if the DB cluster were to fail, as all data still remains in the bucket.
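Should a deployment get stuck on a failed schema migration, a sketch of how we could confirm it from the Kubernetes side (namespace, deployment name, and label selector are assumptions based on the kubectl commands in the monitoring section below):

```shell
# Did the new registry pods come up? (names as in the monitoring section)
kubectl -n gitlab rollout status deployment/gitlab-registry --timeout=10m

# If not, inspect the pods and their logs for migration/DB errors before
# fixing and retrying (label selector is an assumption about the chart).
kubectl -n gitlab get pods -l app=registry
kubectl -n gitlab logs <failing-registry-pod> | grep -i -E 'migrat|database'
```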
Change Details
- Services Impacted - Service::Container Registry
- Change Technician - @hphilipps
- Change Reviewer - @skarbek
- Time tracking - 60 minutes
- Downtime Component - none
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 5m
- Set label change::in-progress on this issue
- Make sure the MRs are approved: canary, gprd
- Make sure to have coverage by the Package team (@hswimelar or @jdrpereira)
- Manually run DB migrations if we are missing some for the current version, using the two steps below.
- Copy a basic config to a pod at `/tmp/config.yml`:
```yaml
version: 0.1
log:
  accesslog:
    formatter: json
  fields:
    service: registry
  formatter: json
  level: info
database:
  enabled: true
  host: pgbouncer-registry.int.gprd.gitlab.net
  port: 6432
  user: gitlab-registry
  password: <FOUND_IN_GKMS>
  dbname: gitlabhq_registry
  sslmode: disable
  pool:
    maxidle: 5
    maxopen: 5
    maxlifetime: 5m
```
- On the pod, run `registry database migrate up /tmp/config.yml`
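Before running `up`, it can be useful to check which migrations are pending; a sketch, assuming the CLI offers a `status` subcommand alongside `up` (verify with `registry database migrate --help`):

```shell
# List applied and pending schema migrations against the same config
# ('status' subcommand is an assumption; confirm via --help).
registry database migrate status /tmp/config.yml
```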
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 50m
- Merge canary MR
- Watch deployment pipeline
- Make sure nothing is added to the DB: `ssh hphilipps-registry-db@console-01-sv-gprd.c.gitlab-production.internal`, then run `select * from top_level_namespaces;`
- Make sure no `gitlab/` root prefix was created in the GCS bucket: `gsutil ls gs://gitlab-gprd-registry/gitlab` (see the verification sketch after this list)
- Monitor registry in gprd-cny for at least 30m (see monitoring section below)
- Rebase gprd MR
- Ensure the job diffs look as expected (only registry changes for gprd, no changes for canary or the DB password)
- Merge gprd MR
- Watch deployment pipeline
- Make sure nothing is added to the DB: `ssh hphilipps-registry-db@console-01-sv-gprd.c.gitlab-production.internal`, then run `select * from top_level_namespaces;`
- Make sure no `gitlab/` root prefix was created in the GCS bucket: `gsutil ls gs://gitlab-gprd-registry/gitlab` (same checks as for canary above)
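For reference, what a clean result looks like for the DB and bucket checks above (the exact psql invocation on the console host depends on its setup):

```shell
# 1) DB check on the console host: expect zero rows.
ssh hphilipps-registry-db@console-01-sv-gprd.c.gitlab-production.internal
#   select * from top_level_namespaces;   -- expect (0 rows)

# 2) Bucket check: no 'gitlab/' root prefix should exist yet, so gsutil
#    should fail with "CommandException: One or more URLs matched no objects."
gsutil ls gs://gitlab-gprd-registry/gitlab
```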
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 30m
- Monitor registry in gprd for at least 30m (see monitoring section below)
- Wait for the next gprd deploy to happen and watch for any issues with DB migrations or containers failing to connect to the DB (see the sketch below)
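A sketch of what watching that next deploy could look like (label selector is an assumption about the chart's pod labels):

```shell
# Watch registry pods roll over during the next deploy.
kubectl -n gitlab get pods -l app=registry -w

# Scan recent registry logs for DB connection or migration errors.
kubectl -n gitlab logs -l app=registry --since=30m | grep -i -E 'error|database|migrat'
```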
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 10m
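A rollback would amount to reverting the canary and gprd MRs so that the registry config disables the DB again (an assumption based on the shape of this change); a minimal sketch of the resulting setting, assuming the config layout from the pre-change section:

```yaml
# Sketch only: reverting the MRs effectively flips this back in the registry config.
database:
  enabled: false
```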
Monitoring
Key metrics to observe
- Metric: K8s Registry Deployment status
  - Location: Thanos (see the query sketch after this list). Also `kubectl -n gitlab-cny describe deployment gitlab-cny-registry` or `kubectl -n gitlab describe deployment gitlab-registry`
  - What changes to this metric should prompt a rollback: a significant drop of pods
- Metric: K8s Cluster Scaleup errors
  - Location: Grafana Kube dashboard
  - What changes to this metric should prompt a rollback: a constant increase of scaleup errors
- Metric: Container Registry SLOs
  - Location: Registry Overview Dashboard
  - What changes to this metric should prompt a rollback: any significant degradation of Apdex or increase of error rate that could be caused by the change. Note that the DB request rate will be very low, which might skew the DB Apdex, so we should rely more on the garbage-collector SLIs.
- Metric: Container Registry DB Detail
  - Location: Dashboard
  - What changes to this metric should prompt a rollback: DB pool connection saturation or high (>1s) DB wait times (see the pool-check sketch after this list)
- Metric: Patroni Registry DB SLOs
  - Location: Dashboard
  - What changes to this metric should prompt a rollback: any significant SLO degradation or increased error rate
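Two reference sketches for the checks above. First, querying the deployment-status series in Thanos over the HTTP API (`kube_deployment_status_replicas_available` is a standard kube-state-metrics series; the Thanos URL and label values are assumptions):

```shell
# Available replicas of the registry deployment (URL and labels assumed).
curl -sG 'https://thanos.gitlab.net/api/v1/query' \
  --data-urlencode 'query=kube_deployment_status_replicas_available{namespace="gitlab", deployment="gitlab-registry"}'
```

Second, pool saturation can be checked directly on PgBouncer's admin console (`SHOW POOLS;` is a standard PgBouncer admin command; admin access via the `pgbouncer` user is an assumption, host/port are from the config above):

```shell
# Inspect PgBouncer pool usage for the registry DB (admin access assumed).
psql -h pgbouncer-registry.int.gprd.gitlab.net -p 6432 -U pgbouncer pgbouncer -c 'SHOW POOLS;'
```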
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

No new instances are added or resized, but we do enable the usage of an already provisioned new Patroni cluster for the container registry.
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.