2023-12-14: Enable running ClickHouse migrations during deploy
Production Change
Change Summary
In Run ClickHouse migrations during gitlab:db:conf... (gitlab-org/gitlab!138661 - merged), we introduced running ClickHouse migrations as part of the gitlab:db:configure task, which runs during the normal deploy.
The behaviour is behind the run_clickhouse_migrations_automatically feature flag, so we only need to enable the flag and confirm that the deployment succeeds.
Because enabling this feature flag could, in theory, block or fail production deployments, I decided to create a change-management issue.
Note about locks and Sidekiq workers: when we initiate a ClickHouse migration, we:
- acquire an exclusive lock for running migrations
- pause the ClickHouse background sync workers queue
- wait until existing ClickHouse background sync workers finish their jobs
- run migrations
- unpause workers
There are two more feature flags:
- wait_for_clickhouse_workers_during_migration
- pause_clickhouse_workers_during_migration
Making it a total of three feature flags:
FF name | Meaning | Enabled on production before this change is executed |
---|---|---|
run_clickhouse_migrations_automatically | Whether to run ClickHouse migrations together with gitlab:db:configure. If migrations fail, disable this and redeploy. | - [ ] |
wait_for_clickhouse_workers_during_migration | Wait for existing workers to finish. Try disabling this if the migration step is blocked for too long. | - [x] |
pause_clickhouse_workers_during_migration | Pause workers to avoid conflicts. Try disabling this if ClickHouse workers pile up in the paused queue. | - [x] |
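The sequence and the three flags described above can be sketched as follows. This is a minimal illustration and not the GitLab implementation: the class name, the step symbols, and the choice to unpause in an ensure block are all assumptions made for the example. Each step only records its name, so the ordering under different flag combinations can be inspected.

```ruby
# Hypothetical sketch of the migration sequence; NOT GitLab's actual code.
# Steps are recorded as symbols instead of touching ClickHouse or Sidekiq.
class ClickHouseMigrationSketch
  attr_reader :steps

  # flags: Hash of feature flag name (Symbol) => true/false
  def initialize(flags)
    @flags = flags
    @steps = []
  end

  def run
    unless enabled?(:run_clickhouse_migrations_automatically)
      @steps << :skipped
      return @steps
    end

    @steps << :acquire_exclusive_lock
    paused = enabled?(:pause_clickhouse_workers_during_migration)
    @steps << :pause_workers if paused
    @steps << :wait_for_workers if enabled?(:wait_for_clickhouse_workers_during_migration)

    begin
      @steps << :run_migrations
    ensure
      # Assumption: unpause even if migrations raise, so the queue is not left paused.
      @steps << :unpause_workers if paused
    end

    @steps
  end

  private

  def enabled?(name)
    @flags.fetch(name, false)
  end
end

sketch = ClickHouseMigrationSketch.new(
  run_clickhouse_migrations_automatically: true,
  wait_for_clickhouse_workers_during_migration: true,
  pause_clickhouse_workers_during_migration: true
)
puts sketch.run.inspect
```

With only run_clickhouse_migrations_automatically disabled, the sketch records a single skipped step, which mirrors the rollback path of this change.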
Change Details
- Services Impacted - ServiceDeploy-Node ServiceDeployTooling
- Change Technician - @vshushlin
- Change Reviewer -
- Time tracking - unknown
- Downtime Component - none
Set Maintenance Mode in GitLab
If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- All of the steps below should first be done on staging.
- Set the change::in-progress label: /label ~change::in-progress
- Execute /chatops run feature set run_clickhouse_migrations_automatically true
- Wait for the deploy to finish and check the deploy logs.
- Go to the rails console and execute this: ::ClickHouse::Client.select("SELECT * from schema_migrations", :main)
It should return something like:
=> [{"version"=>"20230705124511", "active"=>1, "applied_at"=>Mon, 11 Dec 2023 15:57:11.750000000 UTC +00:00}, {"version"=>"20230707151359", "active"=>1, "applied_at"=>Mon, 11 Dec 2023 15:57:11.876000000 UTC +00:00}, {"version"=>"20230719101806", "active"=>1, "applied_at"=>Mon, 11 Dec 2023 15:57:11.995000000 UTC +00:00}, {"version"=>"20230724064832", "active"=>1, "applied_at"=>Mon, 11 Dec 2023 15:57:12.117000000 UTC +00:00}, {"version"=>"20230724064918", "active"=>1, "applied_at"=>Mon, 11 Dec 2023 15:57:12.236000000 UTC +00:00}, {"version"=>"20230808070520", "active"=>1, "applied_at"=>Mon, 11 Dec 2023 15:57:12.350000000 UTC +00:00}, {"version"=>"20230808140217", "active"=>1, "applied_at"=>Mon, 11 Dec 2023 15:57:12.470000000 UTC +00:00}, {"version"=>"20231106202300", "active"=>1, "applied_at"=>Mon, 11 Dec 2023 15:57:12.605000000 UTC +00:00}, {"version"=>"20231114142100", "active"=>1, "applied_at"=>Mon, 11 Dec 2023 15:57:12.765000000 UTC +00:00}, {"version"=>"20231129062064", "active"=>1, "applied_at"=>Mon, 11 Dec 2023 15:57:12.871000000 UTC +00:00}, {"version"=>"20231129062151", "active"=>1, "applied_at"=>Mon, 11 Dec 2023 15:57:12.985000000 UTC +00:00}, {"version"=>"20231205104100", "active"=>1, "applied_at"=>Mon, 11 Dec 2023 15:57:13.096000000 UTC +00:00}, {"version"=>"20231205104101", "active"=>0, "applied_at"=>Mon, 11 Dec 2023 15:58:21.214000000 UTC +00:00}, {"version"=>"20231205112200", "active"=>0, "applied_at"=>Mon, 11 Dec 2023 15:58:21.096000000 UTC +00:00}]
- Set the change::complete label: /label ~change::complete
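For the rails console check in the steps above, the returned rows can be long to read by eye. The helper below is a hypothetical convenience, not part of GitLab: it assumes only the row shape shown in the sample output (hashes with "version" and "active" keys) and splits versions by their "active" value. What active = 0 means is not stated in this issue, so the sketch only surfaces those rows for review.

```ruby
# Hypothetical console helper; `rows` is the array of hashes returned by the
# schema_migrations select shown above.
def summarize_schema_migrations(rows)
  active, inactive = rows.partition { |row| row["active"] == 1 }
  {
    total: rows.size,
    # Versions are fixed-width timestamp strings, so string comparison orders them.
    latest_version: rows.map { |row| row["version"] }.max,
    active_versions: active.map { |row| row["version"] }.sort,
    inactive_versions: inactive.map { |row| row["version"] }.sort
  }
end

# Three rows copied from the sample output above (applied_at omitted).
sample_rows = [
  { "version" => "20231205104100", "active" => 1 },
  { "version" => "20231205104101", "active" => 0 },
  { "version" => "20231205112200", "active" => 0 }
]
puts summarize_schema_migrations(sample_rows).inspect
```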
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Execute /chatops run feature set run_clickhouse_migrations_automatically false
- Set the change::aborted label: /label ~change::aborted
Monitoring
Look at the deploy logs first, specifically for the "Running gitlab:clickhouse:migrate:main rake task" line.
It should print something like this:
Running gitlab:clickhouse:migrate:main rake task
== 20230705124511 CreateEvents: migrating =====================================
== 20230705124511 CreateEvents: migrated (0.0203s) ============================
== 20230707151359 CreateCiFinishedBuilds: migrating ===========================
== 20230707151359 CreateCiFinishedBuilds: migrated (0.0148s) ==================
== 20230719101806 CreateCiFinishedBuildsAggregatedQueueingDelayPercentiles: migrating
== 20230719101806 CreateCiFinishedBuildsAggregatedQueueingDelayPercentiles: migrated (0.0114s)
== 20230724064832 CreateContributionAnalyticsEvents: migrating ================
== 20230724064832 CreateContributionAnalyticsEvents: migrated (0.0150s) =======
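If the deploy log is saved to a file or pasted into a console, a small script can flag migrations that started but never reported completion. This is a sketch that assumes the log lines look exactly like the sample above; the constant and method names are hypothetical.

```ruby
# Hypothetical log check; assumes the "== <version> <Name>: migrating/migrated" format above.
MIGRATION_LINE = /\A== (?<version>\d+) (?<name>\w+): (?<state>migrating|migrated)(?: \((?<seconds>[\d.]+)s\))?/

# Returns the versions that logged "migrating" without a matching "migrated" line.
def unfinished_migrations(log_text)
  started = []
  finished = []
  log_text.each_line do |line|
    match = MIGRATION_LINE.match(line.strip)
    next unless match

    (match[:state] == "migrated" ? finished : started) << match[:version]
  end
  started - finished
end

log = <<~LOG
  Running gitlab:clickhouse:migrate:main rake task
  == 20230705124511 CreateEvents: migrating =====================================
  == 20230705124511 CreateEvents: migrated (0.0203s) ============================
  == 20230707151359 CreateCiFinishedBuilds: migrating ===========================
LOG
puts unfinished_migrations(log).inspect
```

An empty result means every migration that started also logged a "migrated" line; a non-empty result points at the version to investigate before deciding on a rollback.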
Key metrics to observe
- Metric: Deploy status
- Location: No idea
- What changes to this metric should prompt a rollback: Failed deploy
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels blocks deployments and/or blocks feature-flags are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity1 or severity2
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.