Detect bad SQL in CI tests
Background:
We have, like many groups, regular issues with poorly performing SQL entering production. The reason this is hard to prevent is that a lot of ActiveRecord's power comes from its compositional style. But that makes it difficult to audit new code without knowing every call site and the context at that call site. On top of that many ActiveRecord methods such as includes
and preload
can cause interesting combinations of SQL to be used making even finding all the call sites difficult.
Our rspec tests already aim to exercise all these call sites meaning all the SQL we need to be concerned with should already be in the Rails logs and could be in the Postgres logs whenever the rspec tests are run. Some simple post-processing of those logs would allow any new SQL to be highlighted and explain plans to be reported.
Proposal 1 -- gather and post-analyze SQL from CI jobs:
We could leverage CI to gather a list of all SQL run during tests by simply enabling postgres logging of all queries in our CI tests and include those postgres logs as artifacts. If we enable CSV logs they'll be much easier to post-process than the Rails logs and we can be more certain that they include all SQL exactly as it's actually being executed.
Then we can provide a tool which downloads the sql log artifacts from two pipelines and reports on any new sql in the newer one. That seems fairly doable with the infrastructure we have today and would already provide a big boost to our ability to reliably audit any new sql in an MR.
Directions for future work:
-
When we have a mechanism to provide explain plans to developers from production we can even have that tool call out to that service directly and provide explain plans for the new sql. (There are some security issues to take care of to make this safe but they should be manageable.)
-
We could have a fairly simple set of policies about the explain plans (no sequential scans except on whitelisted tables, no index scans without index conds except on partial indexes, cost estimates larger than X, no sorts of estimated cardinality larger than Y, etc).
-
We could have a CI job which does all this automatically -- though I'm not sure how easily the inception-like examining of artifacts from earlier jobs would be...
Proposal 2 -- check and apply policies to SQL as they're run during tests
When running tests we can try to set up an environment which forces Postgres to generate plans similar to production, then detect plans that don't meet our policy goals and throw an error as soon as they're executed.
Pros:
- This would throw the errors close to the bad code rather than only after the full tests are run
- It would not be affected by superficial changes such as changing a column name which do not affect the plans
- It wouldn't increase the burden of reviewers
Cons:
- It wouldn't provide any visibility to new queries unless they triggered the heuristics
- It may be hard to create realistic plans in the test environment which may lead to false positives
- It may be hard to write effective heuristics
Implementation Details
To actually evaluate the policy it would be necessary to either have the Rails driver run explain
on every query and inspect the plan, or implement a Postgres extension which inspects the plan and throws a database error if it violates the policy.
An example policy would be:
- Run with
enable_seqscan=off
- If any sequential scans remain in the plan (except for whitelisted small tables such as
application_settings
) fail - If any index scans are in the plan which lack an
Index Cond
then fail (except for partial indexes) - If the total estimated cost is larger than X then fail
This would be a simple enough heuristic to implement in C in a Postgres extension or in Rails by examining the explain plan. More complex heuristics might be easier one way or the other.