Drop index_ci_builds_on_queued_at index from ci_builds
Production Change
Change Summary
Because of #6821 (closed) we are not sure if dropping any index from ci_builds
using a regular database migration would cause a similar incident.
In gitlab-org/gitlab!91470 (comment 1014323157) it was suggested to drop the index via a CR issue because it can be executed when there isn't any autovacuum process on the ci_builds
table and then the regular migration would be turned into a no-op for production. We know that this index is no longer used by the application:
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+------------------------------+-------+---------+-----------+-------+-------------
public | index_ci_builds_on_queued_at | index | joe_292 | ci_builds | 67 GB |
(1 row)
Change Details
- Services Impacted - ServicePostgres
- Change Technician - DRI for the execution of this change
- Change Reviewer - DRI for the review of this change
- Time tracking - 1 day
- Downtime Component - None
Detailed steps for the change
Change Steps - steps to take to execute the change
-
Set label changein-progress /label ~change::in-progress
patroni-ci
Estimated Time to Complete (mins) - 11s
-
Verify that there isn't any autovacuum to prevent wraparound process running on ci_builds
:SELECT pid, application_name, client_addr, state, age(clock_timestamp(), query_start) as duration, query FROM pg_stat_activity WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%' AND state != 'idle' AND age(clock_timestamp(), query_start) > '00:01:00' ORDER BY age(clock_timestamp(), query_start) DESC;
-
If autovaccum
not running continue the procedure otherwise abort. -
Drop the index_ci_builds_on_queued_at
indexDROP INDEX CONCURRENTLY "index_ci_builds_on_queued_at";
-
Verify the index no longer exists \di+ index_ci_builds_on_queued_at;
patroni
Estimated Time to Complete (mins) - 11s
-
Verify that there isn't any autovacuum to prevent wraparound process running on ci_builds
:SELECT pid, application_name, client_addr, state, age(clock_timestamp(), query_start) as duration, query FROM pg_stat_activity WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%' AND state != 'idle' AND age(clock_timestamp(), query_start) > '00:01:00' ORDER BY age(clock_timestamp(), query_start) DESC;
-
If autovaccum
not running continue the procedure otherwise abort. -
Drop the index_ci_builds_on_queued_at
indexDROP INDEX CONCURRENTLY "index_ci_builds_on_queued_at";
-
Verify the index no longer exists \di+ index_ci_builds_on_queued_at;
-
Set label changecomplete /label ~change::complete
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Rollback not necessary as we no longer need this index
Estimated Time to Complete (mins) - ~ 24 hours
-
Recreate the index using: CREATE INDEX CONCURRENTLY index_ci_builds_on_queued_at ON ci_builds USING btree (queued_at);
-
Set label changeaborted /label ~change::aborted
Monitoring
Key metrics to observe
This index was with our previous queuing strategy which no longer exist so we can remove it safely, gitlab-org/gitlab!91473 (comment 1014635883).
- Metric: 500 error rates
- Location: https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1 Hide charts
- What changes to this metric should prompt a rollback: increase in errors related to pipeline creation/job token authentication
- Metric: Sequential reads on
ci_builds
- Location: https://thanos.gitlab.net/graph?g0.expr=rate(pg_stat_user_tables_seq_scan%7Benvironment%3D%22gprd%22%2Crelname%3D%22ci_builds%22%7D%5B5m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- What changes to this metric should prompt a rollback: increase
Change Reviewer checklist
-
Check if the following applies: - The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
-
Check if the following applies: - The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The labels blocks deployments and/or blocks feature-flags are applied as necessary
Change Technician checklist
-
Check if all items below are complete: - The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention
@sre-oncall
and this issue and await their acknowledgement.) - Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention
@release-managers
and this issue and await their acknowledgment.) - There are currently no active incidents that are severity1 or severity2
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.