Optimize compliance violations query

Problem to solve

The default query used to fetch compliance violations in the compliance report has the following issues:

  1. The current query plan, which uses a recursive CTE, takes 679.987 ms and processes ~5.70 GiB of data to return 20 rows.
  2. The Elastic logs show that the current query using traversal IDs is quite slow (~2.4 s) to return a result.
    • Note: The use_traversal_ids feature flag is enabled on gitlab.com.

As a result, a database maintainer recommended that we first improve the current query before adding the additional filters (&7879 (closed)), which would reduce performance even further.

Additional information

Default query parameters used in the compliance report:

  • Date range: 30 days
  • Sort: SEVERITY_LEVEL_DESC
  • Limit: 20
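
Putting those parameters together, the unoptimized query roughly corresponds to the following shape. This is a hypothetical reconstruction for illustration only; the real finder, scopes, and association names may differ, and `group` stands in for the group the report is scoped to.

# Rough sketch of the current (slow) query shape: a large IN list over every
# project in the group hierarchy, then sort and limit.
MergeRequests::ComplianceViolation
  .joins(merge_request: :metrics)
  .where(merge_requests: { target_project_id: group.all_projects.select(:id) })
  .where(merge_request_metrics: { merged_at: 30.days.ago..Time.current })
  .order(severity_level: :desc, id: :desc)
  .limit(20)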

Recommendations

As noted by @krasio in this comment, the compliance violations query suffers from performance problems with IN queries:

If we create an index on merge_requests_compliance_violations (merge_request_id, severity_level DESC, id DESC), we can apply the solution described in the above guide. This will only work for sorting on severity_level DESC, id DESC; other orders will need their own index. Not sure how easy (or even possible) it will be to combine this solution with GraphQL.

The gist of it will look something like:

Gitlab::Pagination::Keyset::InOperatorOptimization::QueryBuilder.new(
  # `order` is the keyset order definition (severity_level DESC, id DESC)
  scope: MergeRequests::ComplianceViolation.order(order),
  # Merge request IDs for all group projects merged in the date range
  array_scope: MergeRequest::Metrics.where(
    target_project_id: Group.find(22).all_projects.select(:id),
    merged_at: (Time.now - 1.month).beginning_of_day..Time.now.end_of_day
  ).select(:merge_request_id),
  # Maps each merge request ID to its compliance violations
  array_mapping_scope: ->(id_expression) { MergeRequests::ComplianceViolation.where(MergeRequests::ComplianceViolation.arel_table[:merge_request_id].eq(id_expression)) },
  # Loads the final rows by primary key (the severity expression is unused here)
  finder_query: ->(_severity_level_expression, id_expression) { MergeRequests::ComplianceViolation.where(MergeRequests::ComplianceViolation.arel_table[:id].eq(id_expression)) }
).execute.limit(20)

This will produce a query like the one in https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/10199/commands/36162. 🙀 All of this is just to improve the current query. Then we can think about applying the filter by protected branches (the goal of this MR), which may or may not work.
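
Note that the gist above leaves `order` undefined; it is a keyset order definition. A minimal sketch of building one with the keyset pagination API follows, with the caveat that the nullable/distinct options are assumptions about the severity_level column:

# Keyset order matching the proposed index (severity_level DESC, id DESC).
order = Gitlab::Pagination::Keyset::Order.build([
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'severity_level',
    order_expression: MergeRequests::ComplianceViolation.arel_table[:severity_level].desc,
    nullable: :not_nullable, # assumption: severity_level is NOT NULL
    distinct: false
  ),
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'id',
    order_expression: MergeRequests::ComplianceViolation.arel_table[:id].desc
  )
])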

With the optimization, the query time improves from 679.987 ms to 200.187 ms, and the data processed drops from ~5.70 GiB to ~246.10 MiB.
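
For reference, the recommended index could be added with a migration along these lines. This is a sketch: the class and index names are made up, and the helper usage follows standard GitLab migration conventions.

class AddSeverityIndexToComplianceViolations < Gitlab::Database::Migration[2.1]
  disable_ddl_transaction!

  INDEX_NAME = 'i_compliance_violations_on_mr_id_severity_and_id_desc'

  def up
    # Composite index supporting the severity_level DESC, id DESC keyset order.
    add_concurrent_index :merge_requests_compliance_violations,
      [:merge_request_id, :severity_level, :id],
      order: { severity_level: :desc, id: :desc },
      name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :merge_requests_compliance_violations, INDEX_NAME
  end
end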

Considerations

Before we consider using InOperatorOptimization we should answer the following:

  1. Is it compatible with GraphQL resolvers?
  2. We'd need to create 8 indexes, one for each sort type. Is that feasible?
  3. Is this approach compatible with optional filters?
  4. Would it support additional filters, such as branch names or status (&7879 (closed)), that add further IN queries? (See the sketch below.)
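
On points 3 and 4, one direction worth testing is to push optional filters into the array_scope, leaving the optimized keyset scan itself untouched. This is purely a sketch; `group`, `branches`, and the join are hypothetical:

# Hypothetical: apply optional filters to the array_scope before handing it
# to the InOperatorOptimization query builder.
metrics = MergeRequest::Metrics
  .where(target_project_id: group.all_projects.select(:id))
  .where(merged_at: 30.days.ago..Time.current)

# Optional target-branch filter (&7879) via the merge request join.
if branches.present?
  metrics = metrics.joins(:merge_request).where(merge_requests: { target_branch: branches })
end

# `metrics.select(:merge_request_id)` then becomes the array_scope.

Whether the planner still benefits from the optimization once these extra joins are added is exactly what would need to be verified against production data.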

Implementation plan

...
