Query for checking projects with mirrors has gotten slower

Background

In !27796 (merged), I changed the query which fetches mirrors to update to take the plan into account.

That means that instead of finding all projects that have mirrors created - even if they aren't on a plan that supports mirroring - we only find projects on the correct plans. Finding all projects led to a ~S3 production incident: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9587

Problem

When I created !27796 (merged), the query was reasonable performant. However, it appears to have gotten worse over time, which was noted in this ~S4 production incident: gitlab-com/gl-infra/production#2027 (closed)

gitlab-com/gl-infra/production#2027 (comment 333819686) shows the plan for the same query, just run at different points in time (with corresponding changes in the timestamp in the query).

Although the plan is broadly the same in both cases, we've gone from execution time of 177ms to execution time exceeding 2s.

https://thanos-query.ops.gitlab.net/graph?g0.range_input=4w&g0.max_source_resolution=0s&g0.expr=sum(rate(sidekiq_jobs_db_seconds_sum%7Benvironment%3D%22gprd%22%2C%20type%3D%22sidekiq%22%2C%20stage%3D%22main%22%2C%20queue%3D%22cronjob%3Aupdate_all_mirrors%22%7D%5B1h%5D))&g0.tab=0 shows that the DB duration of the UpdateAllMirrorsWorker - which runs this query - has been steadily increasing over time:

Possible solutions

Improve the query to be more specific

@ahegyi gave a very helpful analysis in gitlab-com/gl-infra/production#2027 (comment 333895196). Excerpted:

If I understand the query plan correctly, we had to scan 50k projects to get 500 matching projects (https://explain.depesz.com/s/wFMg)

Proposal:

Build a subselect from this query:
select project_id from project_mirror_data import_state 
where "import_state"."status" NOT IN ('scheduled', 'started')
AND next_execution_timestamp <= '2020-04-29 15:03:13.433262'
and import_state.retry_count <= 14 
order by next_execution_timestamp
LIMIT 500
projects = Project.where(complex_plan_query_conditions).where(id: subselect_query)
This will give you <=500 items that are matching the search conditions. You can easily get ~500 records if run this query in a loop with different offsets (pagination on the subselect).

Mark unlicensed mirrors as no longer mirrors

Another possible solution was mentioned by @reprazent in #212074 (comment 309322592):

What if, instead of still trying 15 times, we use the ProjectImportScheduleWorker to update the project so it is no longer marked as a mirror:
diff --git a/ee/app/workers/project_import_schedule_worker.rb b/ee/app/workers/project_import_schedule_worker.rb
index ac809cbbf8b..c895cb5c434 100644
--- a/ee/app/workers/project_import_schedule_worker.rb
+++ b/ee/app/workers/project_import_schedule_worker.rb
@@ -16,7 +16,11 @@ class ProjectImportScheduleWorker # rubocop:disable Scalability/IdempotentWorker
     raise ImportStateNotFound unless project&.import_state
 
     with_context(project: project) do
-      project.import_state.schedule
+      if project.mirror?
+        project.import_state.schedule
+      else
+        project.update(mirror: false)
+      end
     end
   end
 end
The Project#mirror? method is already overriding this attribute, so flipping this would make sure the database is in sync with the license.

This would cause that the mirror is no longer included in the query in UpdateAllMirrorsWorker, meaning we don't need to go over this project again in the next pass.

At the time, I didn't like that option because it meant upgrading wouldn't get mirroring working automatically without some extra work. But maybe it's the simplest option.

Additional considerations

We use similar queries in gitlab-exporter: https://gitlab.com/gitlab-org/gitlab-exporter/-/merge_requests/110 Any changes we make in the application should be reflected there. (gitlab-com/gl-infra/scalability#342 discusses making this kind of sampling the responsibility of the application.)