Reduce LockManager LWLock Contention
## Problem Gitlab.com's database has had several incidents related to light weight lock saturation on replicas. While we've been able to mitigate these so far by adding replicas, our long term ability to do so will be limited by contention on the primary. As a result we need to pursue other options for reducing lightweight locks taken by queries. ## Proposals - [ ] Reduce number of indexes on high traffic tables. - [X] A number of indexes were removed: - https://gitlab.com/groups/gitlab-org/-/epics/11760+ - https://gitlab.com/groups/gitlab-org/-/epics/11729+ - https://gitlab.com/groups/gitlab-org/-/epics/11658+ - [X] We now have a Rubocop rule to prevent adding new indexes to some high traffic tables. - [ ] There are still a lot of indexes reported as redundant/duplicated that we should look at and remove if possible (https://gitlab.com/groups/gitlab-org/-/epics/10638). We may use GitLab Housekeeper to automate some of the process. - [x] Cache frequently executed queries - [X] Initial investigation was done in https://gitlab.com/gitlab-org/gitlab/-/issues/427770. - [X] Candidates for caching are listed in https://gitlab.com/gitlab-org/gitlab/-/issues/436078. - Status - https://gitlab.com/groups/gitlab-org/-/epics/11543#note_1833693271 - [ ] Replace frequently executed queries with prepared statements or pl/sql functions. - [X] Implemented for `#find` with https://gitlab.com/gitlab-org/gitlab/-/merge_requests/135196. - [ ] Rollout issue in progress - https://gitlab.com/gitlab-org/gitlab/-/issues/429479. - The above still leaves many PK look-ups unaffected. - [x] Vertically split tables so that the highest-traffic queries can be moved to tables with fewer than 16 indexes. - [X] PoC created and documented (https://gitlab.com/gitlab-org/gitlab/-/issues/430906), and this is something we look for and suggest for new features. - [ ] Applying this for existing features ideally should be done be the teams that own them. ## Challenges There is no obvious metric we can directly use to monitor and alert on. This makes it hard to reason about how much any of the mitigations already completed actually helps, and the utilization of this resource is not currently part of our capacity forecasting. There is a [`bpftrace` sctipt](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365238572) that we can run adhoc and collect some data, but due to the overhead it causes we can not run this for too long. Maybe this situation improved with the addition of `pg_wait_sampling` metrics to Prometheus (https://gitlab.com/gitlab-org/gitlab/-/merge_requests/135196#note_1640317075)? ## Other information At the moment, this issue mainly impacts our primary database.
epic