Some board lists are not returned (500 error due to statement timeout)
Final update
We've now resolved the known root causes of this problem. We have a couple of longer-term follow-ups to improve the resilience of issue `relative_position` handling (https://gitlab.com/gitlab-org/gitlab/-/issues/329357).
If you've been directed to this issue because you have a broken board, your group may need a manual rebalancing - please leave a comment in this issue if you're having problems.
Summary
History
- Our old `RelativePositioning` code was inefficient when shifting sibling elements to make space. It did this recursively, which meant a separate `UPDATE` query for each element moved (#230953).
- This was improved in !37724 (merged), but there were bugs: #238416 (closed).
- One of the bugs caused us to add issues to the end using the midpoint without the `IDEAL_DISTANCE` cap, meaning we quickly reached our max position after only a few issues. In normal cases it would have taken millions of issues under a group to reach the max. (See the sketch after this list.)
- Once the max was reached, adding issues to the end meant shifting elements to the left every time, and the number of elements that need to be shifted keeps growing. This resulted in DB timeouts.
- We completed a few rebalances for gitlab-org and gitlab-com to unblock board lists (gitlab-com/gl-infra/production#3528 (closed), gitlab-com/gl-infra/production#3896 (closed), gitlab-com/gl-infra/production#3856 (closed)).
- We originally expected those manual rebalances to be long-term fixes, but we quickly reached the max position again and discovered another bug causing this. That bug is related to project imports and is being worked on in !59175 (merged), though it is not a simple fix and there is active discussion on the approach in the MR. Because project templates use the same code path, any time a new project is created from a template there is an opportunity for this bug to manifest if the root project has high positions.
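To make the midpoint problem concrete, here's a minimal sketch (not our actual `RelativePositioning` code; the constant values are illustrative) of why an uncapped midpoint exhausts the position space after a few dozen appended issues, while a capped step leaves room for millions:

```ruby
# Illustrative only -- constant values are assumptions, not GitLab's.
MAX_POSITION   = 2**31 - 1 # positions live in a signed 32-bit integer range
IDEAL_DISTANCE = 500       # hypothetical cap on the gap between neighbours

# Uncapped: the midpoint between the last item and MAX_POSITION halves the
# remaining space on every append.
def next_position_uncapped(last_position)
  (last_position + MAX_POSITION) / 2
end

# Capped: each append advances by at most IDEAL_DISTANCE, so it takes
# millions of issues to approach MAX_POSITION.
def next_position_capped(last_position)
  [last_position + IDEAL_DISTANCE, MAX_POSITION].min
end

pos = 0
steps = 0
while pos < MAX_POSITION - 1
  pos = next_position_uncapped(pos)
  steps += 1
end
puts steps # => 31 -- only ~31 appends before new issues can no longer be placed
```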
Original bug report:
When visiting this board: https://gitlab.com/groups/gitlab-org/-/boards/978615, my own list returns a 500 error due to statement timeout.
I found the relevant logs here: https://log.gprd.gitlab.net/app/kibana#/discover?_g=h@78a3ee2&_a=h@09ceaa2
Stacktrace:
ee/lib/gitlab/database/load_balancing/connection_proxy.rb:88:in `block in write_using_load_balancer',
ee/lib/gitlab/database/load_balancing/load_balancer.rb:84:in `block in read_write',
ee/lib/gitlab/database/load_balancing/load_balancer.rb:130:in `retry_with_backoff',
ee/lib/gitlab/database/load_balancing/load_balancer.rb:83:in `read_write',
ee/lib/gitlab/database/load_balancing/connection_proxy.rb:82:in `write_using_load_balancer',
ee/lib/gitlab/database/load_balancing/connection_proxy.rb:56:in `block (2 levels) in <class:ConnectionProxy>',
app/models/concerns/relative_positioning.rb:174:in `update_relative_siblings',
lib/gitlab/relative_positioning/item_context.rb:254:in `move_sequence',
lib/gitlab/relative_positioning/item_context.rb:216:in `move_sequence_before',
lib/gitlab/relative_positioning/item_context.rb:133:in `shift_left',
app/models/concerns/relative_positioning.rb:58:in `gap_size',
app/models/concerns/relative_positioning.rb:96:in `move_nulls',
app/models/concerns/relative_positioning.rb:34:in `move_nulls_to_end',
app/controllers/boards/issues_controller.rb:30:in `index',
...
This is quite annoying and prevents me from working efficiently with my board list. I think it started approximately on 2020-11-03.
Workaround steps:
We're working on resolving the root cause and rebalancing the groups so that this issue doesn't recur, but in the meantime this workaround has been effective at getting board columns loading when things are in a bad state:
- Load up the issue list for the group you're having problems with. For example, this is the list for gitlab-com: https://gitlab.com/groups/gitlab-com/-/issues
- Filter the list by labels that match the board column causing you problems
- Change the sorting order to "Manual"
- Your newest issues are probably the problematic ones, so try dragging one of those (just one row down or up). This may fail; if so, try again. Once you're able to get the new issues to update their position, the board column should load again.
A simplified explanation of the issue: when new issues are created, we try to set their position in relation to other issues. The logic around this is complicated and has been failing with timeouts, which leads to the board loading problem. What you're doing in this workaround is manually forcing a position to be set.
Long term fix investigation
Moving the `relative_position` column to a different table. Main reason: batch updates (rebalancing) on a narrow table are faster and don't block `issues`.
Table structure:
Table "public.issues_relative_positions"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-----------------------+-----------------------+-----------+----------+-------------------------------------------------------+----------+--------------+-------------
id | bigint | | not null | nextval('issues_relative_positions_id_seq'::regclass) | plain | |
issue_id | bigint | | not null | | plain | |
project_id | bigint | | not null | | plain | |
relative_position | bigint | | | | plain | |
relative_position_str | character varying(12) | | | | extended | |
bucket | smallint | | | | plain | |
clumped_position | integer | | | 0 | plain | |
Note: `relative_position_str` (lexorank) is pretty much the same as `bucket` + `relative_position`.
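For illustration only, the structure above could be created with a Rails migration along these lines (column names and types come from the output above; the class name, migration version, and the unique index on `issue_id` are assumptions):

```ruby
# Sketch only: mirrors the proposed issues_relative_positions structure above.
class CreateIssuesRelativePositions < ActiveRecord::Migration[6.0]
  def change
    create_table :issues_relative_positions do |t| # id: bigint by default
      t.bigint  :issue_id,   null: false
      t.bigint  :project_id, null: false
      t.bigint  :relative_position
      t.string  :relative_position_str, limit: 12 # lexorank-style position
      t.integer :bucket, limit: 2                 # smallint
      t.integer :clumped_position, default: 0
    end

    # Assumed: one position row per issue.
    add_index :issues_relative_positions, :issue_id, unique: true
  end
end
```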
Project level issues API (ordered by relative_position)
Current query + plan: https://explain.depesz.com/s/DZdH
Observations:
- Doesn't benefit from a good index; it reads N rows, where N is the number of issues in the project
- Index needed: `(relative_position ASC NULLS LAST, id)` (a migration sketch follows this list)
- Sorts in memory
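A hedged sketch of adding that index (the migration class name and index name are made up; the column list and ordering come from the observation above):

```ruby
# Sketch only: the composite index suggested above, built concurrently so the
# issues table is not locked while it is created.
class AddRelativePositionIndexToIssues < ActiveRecord::Migration[6.0]
  disable_ddl_transaction!

  def change
    add_index :issues, [:relative_position, :id],
              order: { relative_position: 'ASC NULLS LAST' },
              algorithm: :concurrently,
              name: 'index_issues_on_relative_position_and_id'
  end
end
```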
New query + plan: https://explain.depesz.com/s/pXte
Observations:
- Similar problems + an extra JOIN that increases the I/O
Improved query + plan: https://explain.depesz.com/s/aMOQ
Observations:
- Reads at most LIMIT * 2 rows
- It needs a hacky CTE, which might not be needed with PG12 (an illustrative sketch of the shape follows this list)
- This query might just work well: https://explain.depesz.com/s/ig2J
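Without reproducing the exact query behind the plans above, this is the general shape of a query that reads at most LIMIT * 2 rows by scanning positioned and unpositioned issues separately and re-sorting the small union; the table and column names match the current schema, but the page size and overall structure are assumptions:

```ruby
# Illustrative only -- not the query behind the depesz links above.
project_id = 42 # example value
page_size  = 20
sql = <<~SQL
  WITH positioned AS (
    SELECT * FROM issues
    WHERE project_id = #{project_id} AND relative_position IS NOT NULL
    ORDER BY relative_position ASC, id ASC
    LIMIT #{page_size}
  ), unpositioned AS (
    SELECT * FROM issues
    WHERE project_id = #{project_id} AND relative_position IS NULL
    ORDER BY id ASC
    LIMIT #{page_size}
  )
  SELECT * FROM (
    SELECT * FROM positioned
    UNION ALL
    SELECT * FROM unpositioned
  ) issues_page
  ORDER BY relative_position ASC NULLS LAST, id ASC
  LIMIT #{page_size}
SQL
puts sql
```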
Group level issues API (ordered by relative_position)
Current query + plan: https://explain.depesz.com/s/Kmw2
Observations:
- Same problem as the project level
- Extra complexity is the recursive namespace lookup
- It's unlikely that the query will benefit from a specialized index (many project_ids due to the namespace lookup; see the sketch at the end of this section)
New query + plan: https://explain.depesz.com/s/tmif
Observations:
- Similar problems + an extra JOIN that increases the I/O
Note: I don't see an easy way to optimize this.
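For context, this is roughly the shape of the recursive namespace lookup mentioned above (illustrative only; the actual query differs). It shows why the filter ends up as a large set of project_ids that a single specialized index on `issues` can't cover:

```ruby
# Illustrative only: expand a group into all descendant namespaces, then into
# project ids, and only then order issues by relative_position.
group_id  = 42 # example value
page_size = 20
sql = <<~SQL
  WITH RECURSIVE namespaces_cte AS (
    SELECT id FROM namespaces WHERE id = #{group_id}
    UNION ALL
    SELECT n.id
    FROM namespaces n
    INNER JOIN namespaces_cte cte ON n.parent_id = cte.id
  )
  SELECT issues.*
  FROM issues
  WHERE issues.project_id IN (
    SELECT projects.id FROM projects
    WHERE projects.namespace_id IN (SELECT id FROM namespaces_cte)
  )
  ORDER BY issues.relative_position ASC NULLS LAST, issues.id ASC
  LIMIT #{page_size}
SQL
puts sql
```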
Update April 28, 2021
We have several MRs currently in review to resolve the remaining known root cause and allow us to successfully rebalance namespaces that are currently in a dirty state.
There are several interrelated issues causing problems:
- Mass creation of issues (most commonly by automations or imports) was enqueuing `IssuePlacementWorker` jobs at an untenable rate, as outlined in this comment. This worker is responsible for setting the `relative_position` on newly created issues. We mitigated this in !60116 (merged). The impact of this improvement can be seen in this graph, which shows that as of 2021-04-27 we are no longer enqueuing duplicate jobs. (See the sketch after this list.)
- We have one remaining known root cause for issues getting positions near the max integer value. This means that namespaces can quickly get into a state where issue placement will fail, which is what causes the board column problem. The fix for this is complicated and has been through several rounds of review in !59175 (merged). It is nearly finished, but as @acroitor is headed out on PTO we may need to find someone to pick it up and carry it over the finish line.
- The result of the above bug is that namespaces end up in a state where issue positions are stacked up near the extremes of the integer range, and thus issues cannot be positioned properly. Previously, we've been able to temporarily mitigate the problem by performing a manual rebalance on a namespace. However, because of the scale of the data in larger namespaces, this logic would time out, and during our most recent manual rebalance of gitlab-org and gitlab-com it caused a DB replication lag incident. We have an MR in progress here: !59744 (merged) to improve the resilience of this rebalancing while reducing its impact on the database. This should allow us to successfully rebalance larger namespaces without incident, and will give us a mechanism to get the currently dirty data back to a healthy state. This, combined with the root cause fix from above, should allow us to more permanently fix currently broken workflows and give us space to breathe while we implement longer-term improvements to issue positioning.
- Once the above fixes are on production, we will need to work with infrastructure to attempt to rebalance the three largest namespaces (which we know are currently at the max integer position). I will open a change request to complete this once we've shipped the improvements to the rebalancing service.
- Beyond these fixes, we have more improvements to make to issue rebalancing, including a goal to move this outside of a transaction (https://gitlab.com/gitlab-org/gitlab/-/issues/329357) so that we can complete repositioning without locking, but we don't yet have a viable plan for how to do this. The above fixes should resolve the immediate problems, and we can collaborate with the database team on how we might be able to make longer-term improvements.
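As a rough sketch of the kind of deduplication described in the first item above (hedged: this is not copied from !60116; it assumes GitLab's `ApplicationWorker` concern and its `idempotent!` / `deduplicate` worker attributes are available):

```ruby
# Hedged sketch only -- not the actual change in !60116.
class IssuePlacementWorker
  include ApplicationWorker

  idempotent!
  # Drop newly enqueued jobs when an identical job is already queued, so mass
  # issue creation (imports, automations) cannot enqueue placement work faster
  # than it can be processed.
  deduplicate :until_executed, including_scheduled: true

  def perform(issue_id, project_id = nil)
    # Place the issue (and any other unpositioned issues nearby) relative to
    # its siblings by assigning relative_position values.
  end
end
```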
To summarize:
We have a fix in progress for the last remaining known root cause of the bug. That bug leads to the data around issue positions being stuck in a problematic state. We have another fix in progress for the rebalancing service, which is what enables us to fix the problematic data. Once these two things are shipped, we can rebalance the namespaces that currently have data in a bad state and the problem will be resolved. There is always a possibility that there is another bug that is unknown to us right now that will put the data in a bad state again, but as of right now we expect the ongoing fixes to resolve the issue. @acroitor is headed out on PTO but is heroically trying to get these through review today. I will try to find someone who can carry these over the finish line in his absence if we can't get them wrapped up today.