The `ci_build_needs.id` column is an `int4` primary key that is currently more than 50% saturated. As a result, the key must be migrated to `bigint` to ensure we don't exhaust the available values.
Timeline
This process needs to be started ASAP - exhausting this key would cause serious harm to the running application.
Hey @marknuzzo! groupdatabase recently discovered that our alerting about primary key exhaustion was broken, and we have a few tables at risk of exhaustion. Among them is ci_build_needs, which is attributed to your group. I've added this issue to 15.10, but it takes a minimum of 3 releases to mitigate, and we're hoping your team can get started on this as soon as possible in order to avoid an incident. I've linked to the process in the documentation, and we're here to support your team in implementing this.
Thanks @lauraX - do you have an idea on weight here?
@dhershkovitch - Given the urgency noted above, I would lean towards this being the priority but please confirm before Laura takes the next steps on Monday.
@marknuzzo with this being a multi-milestone issue, should the milestone be set to when it should be completed (15.11) or started (15.9)? Otherwise the bot will mark it as missed:15.9 and missed:15.10.
Hi @lauraX - that's a great callout, as I'm not sure we've had a multi-milestone scenario like this come up in the past. Since we know it will start in %15.9, I would vote for leaving it as-is with %15.9, and as we move through %15.10 and %15.11, we remove the missed: labels as we go since the work is expected to continue. WDYT? If we agree, I can also socialize this on the channel for awareness.
@marknuzzo whatever works is fine by me! For transparency and clarity, I have updated the steps in the description with their proper milestones, and will use those milestones in the MRs.
Works for me - I just polished up the table above so the milestone is now a dedicated column, which is a great way to convey the multi-milestone effort. Thank you @lauraX!
Is this still the case, or can it wait for %15.10?
Totally reasonable question @dhershkovitch! We (groupdatabase) feel this is pretty urgent. Since the issue takes so long to mitigate, if the usage patterns change or spike we could be faced with serious production issues if not mitigated first.
@marknuzzo @dhershkovitch please note that this issue will still be part of 15.10, although with the milestone set to when it started, it might not be visible on the board.
@marknuzzo @dhershkovitch I am going to finish step 2 of this migration first, then go back to the spike, since step 2 is time-sensitive and requires some collaboration with Max O.
Update: Due to a few changes in the process and some additional discussion, the second MR has been updated a few times; it has been in what I hope is final review since last week.
There is some context in the discussion in the MR, plus additional context here. TL;DR: We need to guard the migrations with `!Gitlab.jh? && (Gitlab.com? || Gitlab.dev_or_test_env?)`, so the final second MR adds the guard to both previous migrations and the migration in the MR.
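A minimal sketch of that guard logic, stubbing the `Gitlab` environment predicates purely for illustration (the real predicates live in the GitLab codebase, and in the actual migrations the check gates the migration bodies):

```ruby
# Stubbed stand-ins for the real Gitlab environment predicates,
# so the guard expression can be exercised in isolation.
module Gitlab
  def self.jh?
    false # JiHu edition?
  end

  def self.com?
    true # running on GitLab.com?
  end

  def self.dev_or_test_env?
    false # local dev or test environment?
  end
end

# The guard from the MR: skip the bigint migration on JH builds,
# and only run it on GitLab.com or in dev/test environments.
def run_bigint_migration?
  !Gitlab.jh? && (Gitlab.com? || Gitlab.dev_or_test_env?)
end

puts run_bigint_migration? # => true with the stubs above
```

With these stubs the guard evaluates to true (non-JH, running on GitLab.com); flipping `jh?` to true, or making both `com?` and `dev_or_test_env?` false, would skip the migration.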