Unblock gitlab-org and gitlab-com workflows by rebalancing issue relative positions
Production Change
Change Summary
The goal is to run the issue relative position rebalancing for the gitlab-org and gitlab-com groups to buy us some time and unblock workflows while we work on the root cause fix. There is more information about the overall problem in gitlab-org/gitlab#276483 (closed), and previous rebalances were completed in #3528 (closed) and #3892 (closed).
The operation runs in a transaction, so it either succeeds or rolls back in case of a timeout or exception. If the change is successful, it should help unblock gitlab-org and gitlab-com users who currently encounter broken issue ordering on boards.
Previous rebalances were done off-hours on a Saturday to target a time when load is low. The PG upgrade is this weekend, so it's unlikely we can run it then, but it would be good to get this through as soon as possible because it's a significant blocker for some workflows within GitLab.
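For readers unfamiliar with the operation, here is a minimal plain-Ruby sketch of what rebalancing relative positions means. It is not the IssueRebalancingService implementation: the `rebalance` helper, the `RANGE_START`/`RANGE_END` bounds, and the sample data are all hypothetical. It assumes rebalancing keeps the existing order, places issues without a position last, and spreads positions evenly across the range.

```ruby
# Illustrative only: a plain-Ruby stand-in, not GitLab's IssueRebalancingService.
# RANGE_START and RANGE_END are hypothetical bounds chosen for the example.
RANGE_START = 0
RANGE_END = 1_000

# Takes a hash of issue id => relative position (nil when unset) and returns a
# new hash with evenly spaced positions, preserving the existing order.
# Assumption: issues without a position are placed last, ordered by id.
def rebalance(positions)
  ordered = positions.sort_by { |id, pos| [pos.nil? ? 1 : 0, pos || 0, id] }
  gap = (RANGE_END - RANGE_START) / (ordered.size + 1)
  ordered.each_with_index.to_h { |(id, _pos), index| [id, RANGE_START + gap * (index + 1)] }
end

# Three issues crowded next to each other plus one with no position.
crowded = { 1 => 500, 2 => 501, 3 => 502, 4 => nil }
balanced = rebalance(crowded)
p balanced
```

In this sketch the crowded positions 500, 501, 502 and the unset one end up 200 apart, which leaves room to insert issues between neighbours again.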
Change Details
- Services Impacted - ~14817212
- Change Technician - @acroitor
- Change Criticality - C2
- Change Type - changeunscheduled
- Change Reviewer -
- Due Date - Date and time (in UTC) for the execution of the change
- Time tracking - Time, in minutes, needed to execute all change steps, including rollback
- Downtime Component - If there is a need for downtime, include downtime estimate here
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Rails console access
- Check rebalance_issues FF is enabled
- Check issue_rebalancing_optimization FF is enabled
- Prepare a modified version of the IssueRebalancingService to allow for a bigger number of issues and smaller batch size.
    class IssueRebalancingService
      MAX_ISSUE_COUNT = 250_000
      BATCH_SIZE = 50
    end
    ...
- Check that the limit change was applied
    IssueRebalancingService::MAX_ISSUE_COUNT
- Check issue count gitlab-org group
    service = IssueRebalancingService.new(Project.find(278964).issues.take)
    service.send(:issue_count)
- Check issue count gitlab-com group
    service = IssueRebalancingService.new(Project.find(7444821).issues.take)
    service.send(:issue_count)
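The constant override in the pre-change steps relies on Ruby allowing a class to be reopened. A minimal sketch of that mechanism follows, using a hypothetical `DemoRebalancingService` stand-in with made-up original limits (the real service's defaults are not stated in this issue). The use of `remove_const` is an optional variation, not part of the steps above.

```ruby
# Hypothetical stand-in class; the original values below are placeholders,
# not the real defaults of IssueRebalancingService.
class DemoRebalancingService
  MAX_ISSUE_COUNT = 1_000
  BATCH_SIZE = 100
end

# Reopen the class and replace the constants. remove_const avoids the
# "already initialized constant" warning Ruby prints on plain reassignment.
class DemoRebalancingService
  remove_const(:MAX_ISSUE_COUNT)
  remove_const(:BATCH_SIZE)
  MAX_ISSUE_COUNT = 250_000
  BATCH_SIZE = 50
end

# Confirm the override took effect, mirroring the "limit change was applied" check.
puts DemoRebalancingService::MAX_ISSUE_COUNT
puts DemoRebalancingService::BATCH_SIZE
```

An override like this only affects the Ruby process it runs in, so in a console session it would need to be applied in the same session that later runs the service.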
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Run IssueRebalancingService on gitlab-org group - ~20 mins.
    IssueRebalancingService.new(Project.find(278964).issues.take).execute
- Run IssueRebalancingService on gitlab-com group - ~20 mins.
    IssueRebalancingService.new(Project.find(7444821).issues.take).execute
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Check there are no more issues with relative position = NULL on gitlab-org group
    Issue.relative_positioning_query_base(Project.find(278964).issues.take).where(relative_position: nil).count
- Check there are no more issues with relative position = NULL on gitlab-com group
    Issue.relative_positioning_query_base(Project.find(7444821).issues.take).where(relative_position: nil).count
- Check the min and max relative positions on gitlab-org group
    Issue.relative_positioning_query_base(Project.find(278964).issues.take).minimum(:relative_position)
    Issue.relative_positioning_query_base(Project.find(278964).issues.take).maximum(:relative_position)
- Check the min and max relative positions on gitlab-com group
    Issue.relative_positioning_query_base(Project.find(7444821).issues.take).minimum(:relative_position)
    Issue.relative_positioning_query_base(Project.find(7444821).issues.take).maximum(:relative_position)
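The three post-change checks (NULL count, minimum, maximum) can be summarised in one pass. The sketch below shows the idea on plain Ruby values rather than ActiveRecord relations. The `position_report` helper and the sample positions are hypothetical and only illustrate what a passing result would look like.

```ruby
# Hypothetical helper working on a plain array of positions, where nil stands
# for a NULL relative_position. It does not query the database.
def position_report(positions)
  present = positions.compact
  {
    null_count: positions.count(&:nil?),
    min: present.min,
    max: present.max
  }
end

# Sample data: one issue still has no position, so the check should flag it.
report = position_report([200, 400, nil, 800])
p report

# The verification passes only when no NULL positions remain.
puts report[:null_count].zero? ? "ok" : "NULL positions remain"
```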
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- The change rolls back automatically on a query timeout or exception, because it runs in a transaction.
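The all-or-nothing behaviour this rollback plan relies on can be illustrated with a small in-memory sketch. It is not how the database or the service implements transactions: `with_transaction` is a hypothetical helper that applies changes to a copy and keeps them only if the block finishes without raising.

```ruby
# Hypothetical in-memory stand-in for a database transaction: the block works on
# a copy, and the copy replaces the original only if the block finishes.
def with_transaction(store)
  working_copy = store.dup
  yield working_copy
  store.replace(working_copy)
  :committed
rescue StandardError
  :rolled_back
end

positions = { 1 => 500, 2 => 501 }

# Simulate a failure partway through: the first update is made, then an error
# (standing in for a statement timeout) interrupts the block.
outcome = with_transaction(positions) do |tx|
  tx[1] = 200
  raise "statement timeout"
end

p outcome
p positions
```

Because the failure happens before the copy is written back, `positions` keeps its original values, which is why no manual rollback step is listed.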
Monitoring
Key metrics to observe
- Metric: Metric Name
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Describe Changes
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.