Container Registry: Apply post-deployment migrations in v3.81.0
Production Change
Change Summary
Request to manually apply 7 new post-deployment migrations included in the container registry v3.81.0 in the pre, gstg, and gprd environments.
These new post-deployment migrations are related to Enable registry to process OCI 1.1 reference ma... (gitlab-org/container-registry#967 - closed) and the Software Supply Chain Security Working Group. The change was introduced and reviewed/approved by the Database team in gitlab-org/container-registry!1350 (merged).
Please read the Context section in this runbook to understand why manual intervention is currently needed.
Target post-deployment migrations:
- 20230723085831_post_add_fk_manifests_subject_id_manifests_not_valid: Adds a new (NOT VALID) foreign key constraint to all the 63 partitions of the manifests table.
- 20230724040947_post_validate_fk_manifests_subject_id_manifests_batch_1: Validates the previously added foreign key constraint for a batch of table partitions.
- 20230724040949_post_validate_fk_manifests_subject_id_manifests_batch_2: Validates the previously added foreign key constraint for a batch of table partitions.
- 20230724040951_post_validate_fk_manifests_subject_id_manifests_batch_3: Validates the previously added foreign key constraint for a batch of table partitions.
- 20230724040952_post_validate_fk_manifests_subject_id_manifests_batch_4: Validates the previously added foreign key constraint for a batch of table partitions.
- 20230724040953_post_validate_fk_manifests_subject_id_manifests_batch_5: Validates the previously added foreign key constraint for a batch of table partitions.
- 20230724040955_post_add_fk_manifests_subject_id_manifests_parent: Adds the foreign key constraint to the parent manifests table.
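The migrations listed above follow the common PostgreSQL pattern of adding a foreign key as NOT VALID first (which skips the check of existing rows) and validating it afterwards in smaller steps. As a rough sketch only, the script below generates the kind of statements such a sequence might run. The partition naming scheme, the constraint name, the column list, and the near-even split into five batches are all assumptions made for illustration; none of them are taken from the actual migration files.

```python
from typing import List

# Everything below is an illustrative assumption, not copied from the real migrations.
PARTITION_COUNT = 63  # partition count as stated in this change request
BATCH_COUNT = 5       # one batch per "validate" migration listed above
CONSTRAINT = "fk_manifests_subject_id_manifests"  # hypothetical constraint name


def partition_name(index: int) -> str:
    """Hypothetical naming scheme for a partition of the manifests table."""
    return f"partitions.manifests_p_{index}"


def add_not_valid_sql(index: int) -> str:
    """Add the FK without checking existing rows (assumed column list)."""
    return (
        f"ALTER TABLE {partition_name(index)} "
        f"ADD CONSTRAINT {CONSTRAINT} "
        "FOREIGN KEY (top_level_namespace_id, repository_id, subject_id) "
        "REFERENCES manifests (top_level_namespace_id, repository_id, id) "
        "NOT VALID;"
    )


def validate_sql(index: int) -> str:
    """Validate the previously added constraint on a single partition."""
    return f"ALTER TABLE {partition_name(index)} VALIDATE CONSTRAINT {CONSTRAINT};"


def split_batches(count: int, batches: int) -> List[List[int]]:
    """Split partition indexes 0..count-1 into contiguous, near-equal batches."""
    size, extra = divmod(count, batches)
    result: List[List[int]] = []
    start = 0
    for b in range(batches):
        end = start + size + (1 if b < extra else 0)
        result.append(list(range(start, end)))
        start = end
    return result


if __name__ == "__main__":
    # Step 1: add the NOT VALID constraint to every partition.
    for i in range(PARTITION_COUNT):
        print(add_not_valid_sql(i))
    # Step 2: validate batch by batch.
    for number, batch in enumerate(split_batches(PARTITION_COUNT, BATCH_COUNT), start=1):
        print(f"-- batch {number}")
        for i in batch:
            print(validate_sql(i))
```

Splitting validation into batches keeps each migration short, so a timeout or failure only affects the partitions in that batch and the remaining batches can be retried independently.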
Change Details
- Services Impacted - Service::Container Registry
- Change Technician - @ahyield
- Change Reviewer - @dat.tang.gitlab
- Time tracking - 45 minutes (sum of all environments)
- Downtime Component - NA
Detailed steps for the change
Repeat for each environment:
Change Steps - steps to take to execute the change
PRE / GSTG
Estimated Time to Complete (mins) - 5 minutes for pre/gstg, 40 minutes for gprd
- Set label change::in-progress: /label ~change::in-progress
- Proceed as described here. Please note that for gprd we must follow the additional steps required for long-running migrations. The pre and gstg databases are small enough that we don't need to worry about timeouts, so the extra steps are not required there.
GPRD
- Proceed as described here. Please note that for gprd we must follow the additional steps required for long-running migrations. The pre and gstg databases are small enough that we don't need to worry about timeouts, so the extra steps are not required there.
- Set label change::complete if no environments are left: /label ~change::complete
Rollback
NA. The post-deployment migrations included in this release introduce constraints for a column that remains unused. In the worst-case scenario, the creation of a constraint fails and aborts the execution. The only side effect is that we would have to repeat this change after a fix was released.
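If a run does abort partway, one way to see how far it got is to inspect the constraint state in the PostgreSQL catalog, where pg_constraint.convalidated is false for a constraint that was added as NOT VALID and has not been validated yet. The following is a minimal sketch under stated assumptions: the constraint name passed to the query is a hypothetical placeholder, the `%(name)s` parameter style assumes a psycopg-like driver, and the helper only summarizes rows that have already been fetched.

```python
from typing import Iterable, List, Tuple

# Catalog query to run against the registry database.
# The constraint name supplied for %(name)s is a hypothetical placeholder.
CHECK_SQL = """
SELECT conrelid::regclass::text AS table_name, convalidated
FROM pg_constraint
WHERE conname = %(name)s AND contype = 'f'
ORDER BY 1;
"""


def pending_validation(rows: Iterable[Tuple[str, bool]]) -> List[str]:
    """Return the tables whose constraint exists but is still NOT VALID."""
    return [table for table, validated in rows if not validated]


if __name__ == "__main__":
    # Example rows shaped like the query result (made-up table names).
    sample = [("manifests_p_0", True), ("manifests_p_1", False)]
    print(pending_validation(sample))
```

An empty result from `pending_validation` would indicate that every partition returned by the query has a validated constraint; partitions missing from the query result never received the constraint at all.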
Monitoring
Key metrics to observe
- Metric: Postgres CPU Usage
- Location: https://dashboards.gitlab.net/d/patroni-registry-main/patroni-registry-overview?orgId=1&viewPanel=92
- What changes to this metric should prompt a rollback: If an abnormal CPU usage spike is observed around the execution of this change, please abort the ongoing CLI command.
There are no relevant application metrics to observe here as the introduced column will remain unused until later.
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels blocks deployments and/or blocks feature-flags are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity::1 or severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.