Fix calculation of averages for instrumentation metrics
## What does this MR do and why?
This merge request fixes the batch calculation of averages for instrumentation metrics. The fix mitigates an issue with the implementation of the `batch_average` operation, discovered while investigating why the `ProtectedEnvironmentsRequiredApprovalsAverageMetric` metric was failing in the Service Ping generated for GitLab.com (in %15.2).
## Problem
TL;DR: we were summing up the averages of each batch and returning that as the result. The calculation also broke at times because `nil` couldn't be coerced into `BigDecimal`.
### More details
- The `merge_results` method was not written with the `average` operation in mind. Unfortunately, this slipped in with !89913 (merged), and we ended up getting the average of each batch separately and then summing them all up, essentially returning a "sum of averages".
- Additionally, an average could equal `nil`, resulting in `TypeError (nil can't be coerced into BigDecimal)` (#367213 (closed)).
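To illustrate the second failure mode, here is a small, database-free sketch (the variable names and the toy `total` value are illustrative, not taken from the actual implementation): `AVG()` over a batch with no matching rows comes back from the database as `nil`, and adding `nil` to a `BigDecimal` running total raises exactly the error from the issue.

```ruby
require 'bigdecimal'

total = BigDecimal("39.5") # running total across batches (illustrative value)
batch_average = nil        # e.g. AVG(duration) over a value range with no rows

begin
  total + batch_average
rescue TypeError => e
  puts e.message # => "nil can't be coerced into BigDecimal"
end
```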
## Solution
Instead of adding more complexity to the implementation of the `BatchCounter` class (perhaps it would help to break it down into several simpler classes), I opted to introduce a new class, `BatchAverageCounter`, which calculates the sum of all batches first and then divides the result by the total number of records that have non-null values for that column.
Note: it is important to understand that *batches* in the context of `BatchCounter` are not simply multiple sets of records with the same number of elements. Instead, each batch covers a range of column values: we start from `minimum(column)` and move up by `batch_size` until we reach `maximum(column)`. See the example in the SQL Queries section below.
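As a minimal, database-free sketch of this strategy (a hypothetical helper operating on a plain array, not the actual `BatchAverageCounter` implementation), the key points are that batches are ranges of *column values*, so each batch can hold any number of records, and that the final division uses the total count of non-null values:

```ruby
# Hypothetical array-based analogue of the batching strategy.
def batch_average(values, batch_size:)
  non_null = values.compact
  return nil if non_null.empty?

  sum = 0
  lower = non_null.min
  max = non_null.max

  while lower <= max
    upper = lower + batch_size
    # Mirrors: SELECT SUM(column) ... WHERE column >= lower AND column < upper
    sum += non_null.select { |v| v >= lower && v < upper }.sum
    lower = upper
  end

  # Divide once, by the total number of non-null values.
  sum.to_f / non_null.count
end

durations = [10, 250, 430, 605, nil]
batch_average(durations, batch_size: 200) # => 323.75, same as 1295.0 / 4
```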
Here is a quick rundown of why I chose this approach. When I started working on this, a few ideas came up for how to calculate the average across batches:
- Average of averages: calculate the average of each batch, sum them all up, divide by the number of batches.
- Weighted averages: calculate the average of each batch, give each batch a weight, multiply each average by its weight, add them up.
The first option is statistically incorrect: batches will have different numbers of elements, so one batch could disproportionately affect the result. Weighted averages, on the other hand, would work fine for our use case. However, there was a simpler solution: we could sum the values across all batches, and then divide that number by the total count of non-null values for that column.
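A toy example (made-up numbers, chosen only to make the skew obvious) shows why the average of averages goes wrong when batches have different sizes:

```ruby
batch_a = [1, 2, 3] # average 2.0
batch_b = [40]      # average 40.0, but only one record

# Average of averages: the single-record batch gets equal weight.
average_of_averages = (batch_a.sum.fdiv(batch_a.size) + batch_b.sum.fdiv(batch_b.size)) / 2
# => 21.0, wildly off

# Sum everything, divide by the total count.
true_average = (batch_a.sum + batch_b.sum).fdiv(batch_a.size + batch_b.size)
# => 11.5, the correct result
```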
I opted for the simple, almost boring solution here, but I'm happy to discuss other approaches if you believe something else works better.
## Notes
While creating `BatchAverageCounter`, I noticed that a lot of extra code (e.g. methods, constants, variables) was unused or not needed (at least for the `batch_average` use case), so I took the liberty of removing some of it and adjusting the rest to match what this new class should do.
- Removed the `merge_results` method in favour of calculating the sum of each batch directly (grouped relations are no longer supported).
- Removed `start` and `finish` from the new class, as they do not really work for the `average` use case.
- Removed `mode` and the `check_mode` method because they are not needed for `average`.
- Removed the `batch_size_for_mode_and_operation` method, again because it is not needed.
- Removed `unwanted_configuration` because the current implementation will not allow any of those configurations anyway!
- Also removed the `not_group_by_query?` method from the `BatchCounter` class (dead code, never used).
Please let me know if any of these need to be reinstated or shouldn't have been removed.
## Next steps
As a follow-up, we could introduce a base class that both `BatchCounter` and `BatchAverageCounter` inherit from:
- `BatchCounter::Base`: for shared logic.
- `BatchCounter::Generic` (renamed from `BatchCounter`): for `batch_count`, `batch_distinct_count`, and `batch_sum`.
- `BatchCounter::Average` (renamed from `BatchAverageCounter`): for `batch_average`.
- Perhaps we could also move `sum` to its own class: `BatchCounter::Sum`?
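The proposed hierarchy could look roughly like this. To be clear, this is a speculative sketch of the shape only: the class names come from the list above, none of this code exists yet, and the method bodies are deliberately empty.

```ruby
module BatchCounter
  class Base
    def initialize(relation, column:)
      @relation = relation
      @column = column
    end
    # Shared batching helpers (min/max bounds, batch iteration) would live here.
  end

  class Generic < Base # today's BatchCounter
    def count; end
    def distinct_count; end
    def sum; end
  end

  class Average < Base # today's BatchAverageCounter
    def average; end
  end
end
```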
Resolves #367213 (closed).
## SQL Queries
Below is an example of the SQL queries produced by running `batch_average` (from my local GDK setup); you can see that it returns the same result as ActiveRecord's `.average` method:
```
[17] pry(main)> Ci::Pipeline.average(:duration).to_f
   (1.0ms)  SELECT AVG("ci_pipelines"."duration") FROM "ci_pipelines"
=> 39.03669724770642

[12] pry(main)> Gitlab::Database::BatchCount.batch_average(Ci::Pipeline, :duration, batch_size: 200).to_f
   (0.5ms)  SELECT MIN("ci_pipelines"."duration") FROM "ci_pipelines"
   (0.2ms)  SELECT MAX("ci_pipelines"."duration") FROM "ci_pipelines"
   (0.2ms)  SELECT SUM("ci_pipelines"."duration"), COUNT("ci_pipelines"."duration") FROM "ci_pipelines" WHERE "ci_pipelines"."duration" >= 1 AND "ci_pipelines"."duration" < 201 LIMIT 1
   (0.2ms)  SELECT SUM("ci_pipelines"."duration"), COUNT("ci_pipelines"."duration") FROM "ci_pipelines" WHERE "ci_pipelines"."duration" >= 201 AND "ci_pipelines"."duration" < 401 LIMIT 1
   (0.2ms)  SELECT SUM("ci_pipelines"."duration"), COUNT("ci_pipelines"."duration") FROM "ci_pipelines" WHERE "ci_pipelines"."duration" >= 401 AND "ci_pipelines"."duration" < 601 LIMIT 1
   (0.2ms)  SELECT SUM("ci_pipelines"."duration"), COUNT("ci_pipelines"."duration") FROM "ci_pipelines" WHERE "ci_pipelines"."duration" >= 601 AND "ci_pipelines"."duration" < 607 LIMIT 1
=> 39.03669724770642
```
## MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.