Optimize SQL queries sent by UpdateAllMirrorsWorker

Summary

Pull mirroring feature is implemented using a "pipeline" which consists of multiple Sidekiq workers. A number of different problems were identified with the feature, e.g. an inconsistent state leading to jobs being dropped: gitlab-com/gl-infra/production#7223 (comment 977020553) , wrong handling of mirroring in case of the wrong license: #216783 (closed) . Some of those problems have already been addressed. This issue, is about one of the problems that hasn't been addressed yet, but was identified during investigations, i.e. slow SQL queries during scheduling.

One of the workers, called UpdateAllMirrorsWorker, is responsible for scheduling mirroring jobs. It gets the state of mirroring from Postgres, from a table called project_mirror_data. The state is pulled from Postgres using multiple queries (it's not a single query or a single transaction, more like hundreds of queries). In multiple incidents we've seen that:

it can take in the order of tens of seconds for individual queries to execute
queries are slow enough to be hitting statement timeout on Postgres: https://log.gprd.gitlab.net/goto/88ca9490-08f9-11ed-8656-f5f2137823ba
it can take minutes for the worker to get the state from the database: gitlab-com/gl-infra/production#7420 (comment 1025123284)

At the moment of writing, the entire pull mirroring feature is in a much healthier state, so the timings and cancellations are not as bad as they were during incidents. However, there are still cases when the statements are canceled. There are also cases of long db durations, here are executions of UpdateAllMirrorsWorker which spent more than 50s in querying the database: https://log.gprd.gitlab.net/goto/3eabf890-08f9-11ed-8656-f5f2137823ba . If more users were to configure pull mirroring or we had again a backlog of jobs to process, these times would increase again.

Impact

we observed up to 5 mins delay between subsequent executions of UpdateAllMirrorsWorker (this adds to the overall delay between project mirroring)
if the SQL timeouts reach a high enough frequency, the worker won't be able to update the list of projects to be mirrored
the impact is global (not limited to a specific user or a group)

Recommendation

rethink the querying strategy (the way we get state from Postgres), perhaps using hundreds of queries is sub optimal or perhaps the opposite, maybe more queries which are lighter would bring down the overall time
SeqScan is used – Consider adding an index <https://docs.gitlab.com/ee/development/understanding_explain_plans.html#optimising-queries%7CShow details>
Query processes too much data to return a relatively small number of rows. – Reduce data cardinality as early as possible during the execution, using one or several of the following techniques: new indexes, partitioning, query rewriting, denormalization. See the visualization of the plan to understand which plan nodes are the main bottlenecks. <http://momjian.us/main/writings/pgsql/hw_performance/%7CShow details>
Specialized index needed – The index(es) currently used does not serve quite well for the needs of this query (notice Rows Removed by Filter: ..., meaning that the index fetched many non-target rows). Consider adding more specialized index(es). <https://postgres.ai/#tip-index-inefficient-high-filtered%7CShow details>

Extra details

From #368019 (closed)

Suggested in !92422 (comment 1026848136)
Follow-up for #216783 (closed)

Problem

We currently use a complex and slow query to select pull mirrors we need to update

Merge request !92422 (merged) added deactivation feature for mirrors that we don't update due to license check.

Proposal

It should be possible to remove check_mirror_plans_in_query? branch from the query and improve the performance.

Potential problems

The query without check_mirror_plans_in_query? is less restrictive and can select pull mirrors we haven't updated before. We need to evaluate the impact on customers.

Edited Jul 26, 2022 by Vasilii Iakliushin