Consider optimizing query for usage ping SLA metric
The following discussion from !45092 (merged) should be addressed:
-
@seanarnold started a discussion: (+3 comments) @dstull Are you able to review the new database scope here please? If it's good, do you mind passing to the maintainer. Thanks!
@seanarnold this looks good from database side. The one note I have for maintainer review is that I'm curious if a partial index on project_incident_management_settings.sla_timer would help or not for the boolean field. It is less than 1_000 rows filtered in
database-lab
, but I'm not sure the rate of growth there and if it would help now or in the future.@iroussos can you review?
Thank you @seanarnold,
That's an interesting question @dstull that made me stop and think! The following response from stackoverflow covers it pretty nicely:
PostgreSQL will use an index only if it thinks it will be cheaper that way. An index on a Boolean column, which can only take two possible values, will almost never be used, because it is cheaper to sequentially read the whole table than to use random I/O on the index and the table if a high percentage of the table has to be retrieved.
An index on a Boolean column is only useful
- in data warehouse scenarios, where it can be combined with other indexes via a bitmap index scan.
- if only a small fraction of the table has the value TRUE (or FALSE for that matter).
I can see from our production data that only 3 out of 907 records of
project_incident_management_settings
havesla_timer = TRUE
@seanarnold If the following are true, then I think that it would be beneficial to add a partial index:
What we have in production right now is representative of the distribution that we expect to see as this table grows.
If not and the distribution is more close to the 20-80% range (rough estimates from the top of my head - we can check them if necessary), then I think that the optimizer will either way pick a Seq Scan.
We expect this table to grow significantly in GitLab.com or large self hosted instances (if I get this correctly, the maximum potential size of this table is the same as that of
projects
which is at 15M+ in production, but that doesn't mean that there aren't other assumptions I don't know about)If not, then it is not worth it to add an index for a table with a couple thousand rows maximum.
The index that I would add in that case would also facilitate the join (your query) through an Index only scan:
ON project_incident_management_settings USING btree (project_id) WHERE sla_timer = TRUE;
@seanarnold other than the above, everything else looks good to me!
If you don't mind, check my comment and if there is no reason to add such an index, send the MR back to me to merge it, otherwise check the index I mentioned and how the queries behave. (a similar distribution would be required for the test dataset, as a random one will result to a Seq Scan even if an index is there)