[Production] Disable index_ci_builds_metadata_on_build_id
Production Change
Change Summary
We want to disable - not remove - index_ci_builds_metadata_on_build_id to verify that it's safe to delete without adverse performance impact. We know that we need to remove this index in order to partition ci_builds_metadata.
Once this change is done we should roll it back as we'll then delete the index entirely in gitlab-org/gitlab!98429 (merged).
According to Thanos it seems still being used though
| Old index | New index |
|---|---|
| Thanos query | Thanos query |
Change Details
- Services Impacted - ServicePostgres (and everything that uses that; here, ServiceWeb and ServiceSidekiq specifically)
- Change Technician - @steveazz
- Change Criticality - C2
- Change Type - changescheduled
- Change Reviewer - TODO
- Due Date - 2020-09-23
- Time tracking - TODO
- Downtime Component - None
Detailed steps for the change
patroni-ci
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 25.169 ms
-
Verify old index ( index_ci_builds_metadata_on_build_id) is chosen by the plannerEXPLAIN SELECT * FROM ci_builds_metadata WHERE build_id IN (12365354, 2454856);
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 0.126 ms
-
Disable old index ( index_ci_builds_metadata_on_build_id)update pg_index set indisvalid = false where indexrelid = (select oid from pg_class where relname = 'index_ci_builds_metadata_on_build_id');
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 0.048 ms
-
Verify new index ( index_ci_builds_metadata_on_build_id_partition_id_unique) is being used by postgres plannerEXPLAIN SELECT * FROM ci_builds_metadata WHERE build_id IN (12365354, 2454856); -
Verify the old index ( index_ci_builds_metadata_on_build_id) is invalid on replicasSELECT indisvalid FROM pg_index WHERE indexrelid = (select oid from pg_class where relname = 'index_ci_builds_metadata_on_build_id'); -
Check that jobs listing timings are not affected: https://dashboards.gitlab.net/d/web-rails-controller/web-rails-controller?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-controller=Projects::JobsController&var-action=All&query=pipeline%20 -
Old index ( index_ci_builds_metadata_on_build_id) no longer being used and thanos metrics drops to0 -
New index ( index_ci_builds_metadata_on_build_id_partition_id_unique) is being used and thanos metrics increases
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 0.133 ms
-
Re-enable index update pg_index set indisvalid = true where indexrelid = (select oid from pg_class where relname = 'index_ci_builds_metadata_on_build_id'); -
Verify that it's used again EXPLAIN SELECT * FROM ci_builds_metadata WHERE build_id IN (12365354, 2454856); -
Old index ( index_ci_builds_metadata_on_build_id) is being used and thanos metrics increases -
New index ( index_ci_builds_metadata_on_build_id_partition_id_unique) is less being used and thanos metrics decreases
Monitoring
Key metrics to observe
- Metric: Latency, SQL latency per controller request and slow/timeout queries on pipeline creation/viewing.
- Location: https://dashboards.gitlab.net/d/web-rails-controller/web-rails-controller?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-controller=Projects::JobsController&var-action=All&query=pipeline%20
- https://dashboards.gitlab.net/d/stage-groups-pipeline_execution/stage-groups-pipeline-execution-group-dashboard?orgId=1
Summary of infrastructure changes
-
Does this change introduce new compute instances? -
Does this change re-size any existing compute instances? -
Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
No.
Summary of the above
Changes checklist
-
This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled). -
This issue has the change technician as the assignee. -
Pre-Change, Change, Post-Change, and Rollback steps and have been filled out and reviewed. -
Necessary approvals have been completed based on the Change Management Workflow. -
Change has been tested in staging and resultes noted in a comment on this issue. -
A dry-run has been conducted and results noted in a comment on this issue. -
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncalland this issue.) -
There are currently no active incidents.