Scoping work related to making queries faster
Carrying over from https://gitlab.com/gitlab-com/infrastructure/issues/1786, which has a long list of slow SQL queries (output from pg_stat_statements_history in PostgreSQL), this issue is about scoping how much work will be involved in reaching the OKR goal:
Database: Reduce the p99 of SQL timings across the board to 200 ms (100-200 ms less than what we have now).
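For context on the goal above: "p99" is the 99th-percentile timing, i.e. the value that 99% of samples fall at or under. A minimal sketch of a nearest-rank percentile computation (the monitoring stack's exact percentile method may differ):

```python
import math

def p99(timings_ms):
    """99th percentile of a list of timings, using the nearest-rank method."""
    ordered = sorted(timings_ms)
    # Nearest-rank: index of the smallest value covering 99% of samples.
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[rank]

# 100 samples from 1 ms to 100 ms: the p99 is the 99th-ranked sample.
print(p99(list(range(1, 101))))  # -> 99
```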
- While https://gitlab.com/gitlab-com/infrastructure/issues/1786 lists slow queries, it does not reveal where in the application they are called, which limits the usefulness of the list.
- @yorickpeterse added a new dashboard to our monitoring infrastructure (alas, InfluxDB data, so not yet viewable by the world). It shows SQL timings per controller, filters out controllers with a p99 < 200 ms or fewer than 500 calls, and sorts by the controllers where the p99 of SQL timing makes up the highest percentage of the p99 of the controller's overall transaction timing.
- This yields a list of 36 controllers, 15 of which spend 47% or more of their total time on SQL; there is a drop to 29% for the next controller on the list.
- It makes sense to prioritize working on slow queries associated with the controllers identified in this manner.
- Each controller may very well call multiple queries (as seen in the example at https://about.gitlab.com/handbook/engineering/performance/#flow-of-web-request, where loading the dashboard led to 29 SQL queries, not necessarily unique ones). It is not easy to know how many unique SQL queries are associated with these top 15 controllers.
- Each controller takes about 5 person-days to research and fix, so the 15 controllers amount to roughly 75 person-days, i.e. one full-time person to fix them all in Q3.
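The dashboard's selection logic described above (drop controllers with a SQL p99 under 200 ms or fewer than 500 calls, then sort by the SQL share of the transaction p99) can be sketched as follows. The data, controller names, and field names are hypothetical illustrations, not the actual InfluxDB schema:

```python
# Hypothetical per-controller stats; values are made up for illustration.
controllers = [
    {"name": "ProjectsController#show", "calls": 1200,
     "sql_p99_ms": 450, "txn_p99_ms": 900},   # SQL share: 0.50
    {"name": "MergeRequestsController#index", "calls": 800,
     "sql_p99_ms": 300, "txn_p99_ms": 400},   # SQL share: 0.75
    {"name": "UsersController#show", "calls": 300,   # too few calls
     "sql_p99_ms": 900, "txn_p99_ms": 1000},
    {"name": "GroupsController#index", "calls": 2500,  # SQL p99 under cutoff
     "sql_p99_ms": 150, "txn_p99_ms": 800},
]

def shortlist(stats, min_p99_ms=200, min_calls=500):
    """Keep busy controllers with slow SQL, sorted by SQL share of total p99."""
    kept = [c for c in stats
            if c["sql_p99_ms"] >= min_p99_ms and c["calls"] >= min_calls]
    return sorted(kept,
                  key=lambda c: c["sql_p99_ms"] / c["txn_p99_ms"],
                  reverse=True)

for c in shortlist(controllers):
    print(c["name"])
```

Running this keeps only the first two controllers, with MergeRequestsController#index first because a larger fraction of its transaction time is spent in SQL.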
Edited by Ernst van Nierop