Dropping partitions on p_duo_workflows_checkpoints is causing incidents owing to foreign keys on namespaces and projects
Summary
We have had multiple incidents on GitLab.com caused by heavyweight table locking: the partition management job repeatedly tries, fails, and backs off while dropping the FK constraints from an obsolete partition of p_duo_workflows_checkpoints.
We've seen this with the FKs to both projects and namespaces.
Note
I have not reviewed the discussions on the incidents, so I don't know if it's just these constraints that should be dropped, or the constraints to any table.
Since the partitions are relatively short-lived, it's not essential to have the cascading deletes from tables like projects, so during earlier incidents it was concluded that the preventative fix is to stop creating those FK constraints for this partitioned table.
It appears we did stop creating them explicitly on the partitions, but this is not sufficient: they're now being created implicitly from the constraint definitions on the parent table p_duo_workflows_checkpoints.
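To confirm which FKs the partitions still pick up from the parent, something like the following could be used. This is a minimal sketch, not something taken from the incidents: the helper name `inherited_fk_query` is hypothetical, and it only builds a query against the standard PostgreSQL catalogs `pg_constraint` and `pg_inherits` for someone to run in a console. It only covers partitions that are still attached to the parent.

```python
def inherited_fk_query(parent_table: str) -> str:
    """Build a catalog query listing FK constraints on a partitioned
    table and on its currently attached partitions.

    Hypothetical helper for illustration; it returns SQL text only.
    """
    # The name is interpolated into the SQL, so accept only simple identifiers.
    if not parent_table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {parent_table!r}")

    return f"""
SELECT c.conrelid::regclass  AS table_name,
       c.conname             AS constraint_name,
       c.confrelid::regclass AS referenced_table,
       c.conparentid <> 0    AS inherited_from_parent
FROM pg_constraint c
WHERE c.contype = 'f'
  AND (c.conrelid = '{parent_table}'::regclass
       OR c.conrelid IN (SELECT inhrelid
                         FROM pg_inherits
                         WHERE inhparent = '{parent_table}'::regclass))
ORDER BY 1, 2;
""".strip()


if __name__ == "__main__":
    print(inherited_fk_query("p_duo_workflows_checkpoints"))
```

Rows where `inherited_from_parent` is true are constraints a partition received from the parent definition rather than from an explicit per-partition statement, which is the case described above.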
Fix
- Drop the foreign keys to namespaces and projects from p_duo_workflows_checkpoints: Remove foreign key constraints to namespace_id... (!207591 - merged)
- Drop the foreign keys to duo_workflows_workflows from p_duo_workflows_checkpoints
- Add LFK (loose foreign keys) for all 3
Steps to reproduce
Example Project
What is the current bug behavior?
APDEX violations on GitLab.com caused by heavyweight locks.
What is the expected correct behavior?
Possible fixes
Patch release information for backports
If the bug fix needs to be backported in a patch release to a version under the maintenance policy, please follow the steps on the patch release runbook for GitLab engineers.
Refer to the internal "Release Information" dashboard for information about the next patch release, including the targeted versions, expected release date, and current status.
High-severity bug remediation
To remediate high-severity issues requiring an internal release for single-tenant SaaS instances, refer to the internal release process for engineers.