Sidekiq: `StoreSecurityReportsWorker`: Optimise/reduce N+1 queries
## Overview
In &5434 (closed) we identified `StoreSecurityReportsWorker` as a significant contributor to Postgres primary pressure. There have been instances of this worker issuing as many as 160k database requests in a single run, spending 400s in the database alone. Sidekiq workers exclusively contact the primary database server, which makes it difficult to absorb these queries. We have also identified a number of N+1s originating from this worker, which in turn lead to runaway object allocations due to heavy use of ActiveRecord methods acting on objects one by one rather than in bulk. The worker has contributed to production incidents as well.
We have made attempts at reducing the cardinality of queries/statements originating from this worker, but hit a wall due to the complexity of the logic and the depth at which some of these N+1s occur. I think we're now at a point where this is a big enough liability to consider:

- Short-term: turning off this worker for projects of significant size
- Mid- to long-term: completely reworking the way this worker, and its underlying service(s), are structured
## Challenges
I spent the last few days drilling into the classes that make up this worker, and found three primary reasons for the inefficiencies observed, and why they are hard to address:
1 - Work is being done in nested loops.
This isn't immediately obvious, but the worker processes data roughly as follows:

```
for each report in reports(latest_builds(current_pipeline))
  for each finding in report
    vulnerability_finding = find_or_create_vulnerability_finding_record(finding)
    update_or_create_relation_1(vulnerability_finding)
    ...
    update_or_create_relation_N(vulnerability_finding)
    create_vulnerability(vulnerability_finding)
```
This is a simplified view, but it reflects roughly how the service objects operate. The relations created or updated in this case are things such as statistics, remediations, fingerprints, finding links, and more. Each of these intermediate steps issues both reads and writes.
The algorithm is implemented in `StoreReportService`, which acts on a single report and its findings. It is this class that we should focus on.
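To see why the nested loops matter, here is a back-of-the-envelope estimate of the statement count. The input sizes below are illustrative, not measured from production; only the 160k figure above is real:

```ruby
# Illustrative input sizes (hypothetical, not production numbers).
reports               = 10
findings_per_report   = 100
relations_per_finding = 8    # statistics, remediations, fingerprints, links, ...

# Each step couples a read with a write (the find_or_create_by pattern),
# so every finding costs roughly 2 statements per relation, plus 2 for the
# finding lookup/creation itself and 2 for the vulnerability row.
per_finding = 2 * relations_per_finding + 2 + 2   # => 20
total = reports * findings_per_report * per_finding

puts total   # => 20000 statements for this moderately sized pipeline
```

The statement count is multiplicative in every dimension, which is how a single run can reach six figures.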
2 - Reads are coupled with writes.
What is most problematic about the current implementation is that many of the individual steps in this algorithm rely on `find_or_create_by` or some variation of this ActiveRecord helper. This method strongly couples reads and writes, which, when used in any sort of loop, is an immediate N+1. We should never use this method in a loop. We will have to break up this coupling by separating the inputs to the above algorithm from its outputs, such that all necessary inputs are queried upfront in batches, any necessary updates to these data are made, and the results are then written back to the database in batches.
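The batching idea can be sketched in plain Ruby. The in-memory "table" below is a stand-in for the database so the query counts are easy to see; all names are hypothetical, not the real schema:

```ruby
# In-memory stand-in for a findings table, counting "queries" issued.
class FakeFindingsTable
  attr_reader :query_count

  def initialize
    @rows = {}          # uuid => row
    @query_count = 0
  end

  # The find_or_create_by pattern: one SELECT (and possibly one INSERT)
  # per call -- an immediate N+1 when used in a loop.
  def find_or_create_by(uuid)
    @query_count += 1                  # SELECT
    return @rows[uuid] if @rows.key?(uuid)

    @query_count += 1                  # INSERT
    @rows[uuid] = { uuid: uuid }
  end

  # Batched alternative: one SELECT for all uuids, one multi-row INSERT
  # for the misses, regardless of how many findings there are.
  def find_or_create_all(uuids)
    @query_count += 1                  # SELECT ... WHERE uuid IN (...)
    missing = uuids.reject { |u| @rows.key?(u) }
    unless missing.empty?
      @query_count += 1                # INSERT ... VALUES (...), (...), ...
      missing.each { |u| @rows[u] = { uuid: u } }
    end
    uuids.map { |u| @rows[u] }
  end
end

uuids = (1..100).map { |i| "uuid-#{i}" }

looped = FakeFindingsTable.new
uuids.each { |u| looped.find_or_create_by(u) }

batched = FakeFindingsTable.new
batched.find_or_create_all(uuids)

puts "looped:  #{looped.query_count} queries"   # 200 for 100 new findings
puts "batched: #{batched.query_count} queries"  # 2, independent of N
```

In the real code the batched path would map to something like a single `where(uuid: uuids)` read plus an `insert_all` for the misses, but the exact calls would need to be worked out against the actual models.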
3 - Use of nested service objects.
There is a further complication to breaking up these coupled reads and writes: some of them are not even in `StoreReportService`, but are abstracted away behind other service objects such as `Vulnerabilities::[Create|Update]Service`, to which the store report service delegates. This makes these N+1s both difficult to see and difficult to address.
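One way out is to hoist the hidden reads up to the caller and inject preloaded data into the nested service. The classes below are invented stand-ins that only mirror the shape of the problem, not the real `Vulnerabilities` services:

```ruby
$queries = 0

FakeIdentifier = Struct.new(:fingerprint)

# Stand-in for a lookup the nested service performs internally.
def lookup_identifier(fingerprint)
  $queries += 1                        # one hidden SELECT per call
  FakeIdentifier.new(fingerprint)
end

# Before: each per-finding service call performs its own lookup,
# invisible to the outer StoreReportService-like caller.
class NestedCreateService
  def initialize(fingerprint)
    @fingerprint = fingerprint
  end

  def execute
    lookup_identifier(@fingerprint)
  end
end

# After: the caller batches the lookups once and injects the results,
# so the nested service issues no queries of its own.
class BatchedCreateService
  def initialize(fingerprint, identifiers_by_fingerprint)
    @fingerprint = fingerprint
    @identifiers = identifiers_by_fingerprint
  end

  def execute
    @identifiers.fetch(@fingerprint)
  end
end

fingerprints = (1..50).map { |i| "fp-#{i}" }

$queries = 0
fingerprints.each { |fp| NestedCreateService.new(fp).execute }
nested_queries = $queries            # 50: one hidden query per finding

$queries = 0
$queries += 1                        # single SELECT ... WHERE fingerprint IN (...)
identifiers = fingerprints.to_h { |fp| [fp, FakeIdentifier.new(fp)] }
fingerprints.each { |fp| BatchedCreateService.new(fp, identifiers).execute }
batched_queries = $queries           # 1
```

The design point is that the query becomes visible at the call site, where it can be batched, instead of being buried inside a delegate.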
## Evidence
A snapshot showing some runs with more than 100,000 SQL statements issued:
A one-week breakdown of how this worker contributes to query duration (left) and query count (right) across all Sidekiq workers that spend at least 30s in the database. The two large boxes at the bottom are `StoreSecurityReportsWorker`, accounting for 25% of time spent and 45% of queries sent:
Source: https://log.gprd.gitlab.net/goto/a5938d9c3948f503036d049c1e3e0069