Drop index_ci_builds_on_queued_at index from ci_builds

Production Change

Change Summary

Because of #6821 (closed) we are not sure if dropping any index from ci_builds using a regular database migration would cause a similar incident.

In gitlab-org/gitlab!91470 (comment 1014323157) it was suggested to drop the index via a CR issue because it can be executed when there isn't any autovacuum process on the ci_builds table and then the regular migration would be turned into a no-op for production. We know that this index is no longer used by the application:

https://thanos.gitlab.net/graph?g0.expr=sum(rate(pg_stat_user_indexes_idx_tup_read%7Benv%3D%22gprd%22%2C%20indexrelname%3D%22index_ci_builds_on_queued_at%22%7D%5B5m%5D))&g0.tab=0&g0.stacked=0&g0.range_input=4w&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g1.expr=sum(rate(pg_stat_user_indexes_idx_tup_fetch%7Benv%3D%22gprd%22%2C%20indexrelname%3D%22index_ci_builds_on_runner_id%22%7D%5B5m%5D))&g1.tab=0&g1.stacked=0&g1.range_input=2w&g1.max_source_resolution=0s&g1.deduplicate=1&g1.partial_response=0&g1.store_matches=%5B%5D&g2.expr=sum(rate(pg_stat_user_indexes_idx_scan%7Benv%3D%22gprd%22%2C%20indexrelname%3D%22index_ci_builds_on_token_partial%22%7D%5B5m%5D))&g2.tab=0&g2.stacked=0&g2.range_input=2w&g2.max_source_resolution=0s&g2.deduplicate=1&g2.partial_response=0&g2.store_matches=%5B%5D

List of relations  
 Schema |             Name             | Type  |  Owner  |   Table   | Size  | Description   
--------+------------------------------+-------+---------+-----------+-------+-------------  
 public | index_ci_builds_on_queued_at | index | joe_292 | ci_builds | 67 GB |   
(1 row)

Change Details

Services Impacted - ServicePostgres
Change Technician - DRI for the execution of this change
Change Reviewer - DRI for the review of this change
Time tracking - 1 day
Downtime Component - None

Detailed steps for the change

Change Steps - steps to take to execute the change

Set label changein-progress /label ~change::in-progress

patroni-ci

Estimated Time to Complete (mins) - 11s

Verify that there isn't any autovacuum to prevent wraparound process running on ci_builds:

      SELECT pid, application_name, client_addr, state, age(clock_timestamp(), query_start) as duration, query
      FROM pg_stat_activity
      WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%' AND state != 'idle' AND age(clock_timestamp(), query_start) > '00:01:00'
      ORDER BY age(clock_timestamp(), query_start) DESC;

If autovaccum not running continue the procedure otherwise abort.

Drop the index_ci_builds_on_queued_at index

      DROP INDEX CONCURRENTLY "index_ci_builds_on_queued_at";

Verify the index no longer exists
```
\di+ index_ci_builds_on_queued_at;
```

patroni

Estimated Time to Complete (mins) - 11s

Verify that there isn't any autovacuum to prevent wraparound process running on ci_builds:

      SELECT pid, application_name, client_addr, state, age(clock_timestamp(), query_start) as duration, query
      FROM pg_stat_activity
      WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%' AND state != 'idle' AND age(clock_timestamp(), query_start) > '00:01:00'
      ORDER BY age(clock_timestamp(), query_start) DESC;

If autovaccum not running continue the procedure otherwise abort.

Drop the index_ci_builds_on_queued_at index

      DROP INDEX CONCURRENTLY "index_ci_builds_on_queued_at";

Verify the index no longer exists
```
\di+ index_ci_builds_on_queued_at;
```
Set label changecomplete /label ~change::complete

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Rollback not necessary as we no longer need this index

Estimated Time to Complete (mins) - ~ 24 hours

Recreate the index using:

      CREATE INDEX CONCURRENTLY index_ci_builds_on_queued_at ON ci_builds USING btree (queued_at);

Set label changeaborted /label ~change::aborted

Monitoring

Key metrics to observe

This index was with our previous queuing strategy which no longer exist so we can remove it safely, gitlab-org/gitlab!91473 (comment 1014635883).

Metric: 500 error rates
- Location: https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1 Hide charts
- What changes to this metric should prompt a rollback: increase in errors related to pipeline creation/job token authentication
Metric: Sequential reads on ci_builds
- Location: https://thanos.gitlab.net/graph?g0.expr=rate(pg_stat_user_tables_seq_scan%7Benvironment%3D%22gprd%22%2Crelname%3D%22ci_builds%22%7D%5B5m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- What changes to this metric should prompt a rollback: increase

Change Reviewer checklist

C4 C3 C2 C1:

Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
  - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The labels blocks deployments and/or blocks feature-flags are applied as necessary

Change Technician checklist

Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are severity1 or severity2
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.

Edited Jul 21, 2022 by Steve Xuereb - Out of Office Back 2025-01-03