
Fix calculation of averages for instrumentation metrics

What does this MR do and why?

This merge request fixes the batch calculation of averages for instrumentation metrics. The fix mitigates an issue discovered in the implementation of the batch_average operation while investigating why the ProtectedEnvironmentsRequiredApprovalsAverageMetric metric was failing in the Service Ping generated for GitLab.com (in %15.2).

Problem

TL;DR – We were summing up the averages of each batch and returning that as the result. This also broke at times because nil can't be coerced into BigDecimal.

More details 👇🏼

  1. The merge_results method was not written with the average operation in mind. Unfortunately, this slipped in with !89913 (merged), and we ended up calculating the average of each batch separately and then summing them all up, essentially returning a "sum of averages".

  2. Additionally, an average could be nil, resulting in TypeError (nil can't be coerced into BigDecimal) (#367213 (closed)).
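To make the first point concrete, here is a minimal sketch (plain arrays standing in for batched relations, values invented for illustration) of how summing per-batch averages diverges from the true average:

```ruby
# Two unevenly-sized batches of column values (illustrative data only).
batches = [[10, 20], [30, 40, 50, 60]]

# Buggy behaviour described above: average each batch, then sum the averages.
sum_of_averages = batches.sum { |batch| batch.sum.to_f / batch.size }
# 15.0 + 45.0 => 60.0

# The correct average over all records.
all_values = batches.flatten
true_average = all_values.sum.to_f / all_values.size
# 210 / 6 => 35.0
```

The two results only coincide when every batch happens to contain the same number of elements.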

Solution

Instead of adding more complexity to the implementation of the BatchCounter class (it could perhaps be broken down into several simpler classes), I opted to introduce a new class, BatchAverageCounter, that calculates the sum of all batches first and then divides the result by the total number of records that have non-null values for that column.

Note: it is important to understand that a batch, in the context of BatchCounter, is not simply one of several sets of records with the same number of elements. Instead, each batch covers a range of column values: we start from minimum(column) and move up by batch_size until we reach maximum(column). See the example in the SQL queries section below.
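A rough sketch of that batching scheme (a hypothetical helper, not the actual BatchCounter code), which reproduces the value ranges shown in the SQL queries section below for min = 1, max = 606, batch_size = 200:

```ruby
# Build the half-open value ranges that batching walks over.
# Each range maps to: WHERE column >= lower AND column < upper
def batch_ranges(min, max, batch_size)
  ranges = []
  lower = min
  while lower <= max
    # The final batch is capped so it stops just past the maximum value.
    upper = [lower + batch_size, max + 1].min
    ranges << (lower...upper)
    lower = upper
  end
  ranges
end

batch_ranges(1, 606, 200)
# => [1...201, 201...401, 401...601, 601...607]
```

Note that the number of records inside each range is arbitrary; only the width of the value range is fixed.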

Here is a quick rundown of why I chose this approach.

When I started working on this, a few ideas came up for how to calculate the average over batches:

  • Average of averages: calculate the average for each batch, sum them all up, divide by the number of batches.
  • Weighted averages: calculate the average for each batch, give each batch a weight, multiply the average by the weight, add them up.

The first option is statistically incorrect: batches will have different numbers of elements, so one batch could disproportionately affect the result. Weighted averages, on the other hand, would work fine for our use case. However, there was a simpler solution: we can sum the values across all batches, and then divide that number by the total count of non-null values for that column.
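In plain Ruby, the chosen approach looks roughly like this (arrays stand in for the per-batch SUM/COUNT queries; the data is illustrative, not from the actual implementation):

```ruby
# Two unevenly-sized batches of column values (illustrative data only).
batches = [[10, 20], [30, 40, 50, 60]]

total_sum   = batches.sum { |batch| batch.sum }   # sum of each batch's SUM(column)
total_count = batches.sum { |batch| batch.size }  # total COUNT of non-null values

# Guard against an empty relation, which would otherwise divide by zero
# (and relates to the nil-coercion failure described in the Problem section).
average = total_count.zero? ? nil : total_sum.to_f / total_count
# 210 / 6 => 35.0
```

This matches the true average over all records, regardless of how the values are distributed across batches.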

I opted for the simple, almost boring solution here, but I'm happy to discuss other approaches if you believe something else works better.

Notes

While creating BatchAverageCounter, I noticed that a lot of extra stuff (e.g. methods, constants, variables) was unused or not needed (at least for the batch_average use case), so I took the liberty of removing some of it and adjusting the rest to match what this new class should do.

  • Removed merge_results method in favour of calculating the sum of each batch directly (grouped relations are no longer supported).
  • Removed start and finish from the new class, as it does not really work for the average use case.
  • Removed mode and check_mode method because they are not needed for average.
  • Removed batch_size_for_mode_and_operation method, again, because it is not needed.
  • Removed unwanted_configuration because the current implementation does not allow any of those configurations anyway.

Also removed the not_group_by_query? method from the BatchCounter class (dead code, never used).

Please let me know if any of them need to be reinstated or shouldn't have been removed.

Next steps

As a follow-up to this, we could possibly introduce a base class that both BatchCounter and BatchAverageCounter inherit from:

  • BatchCounter::Base: for shared logic.
  • BatchCounter::Generic (rename from BatchCounter): for batch_count, batch_distinct_count, and batch_sum.
  • BatchCounter::Average (rename from BatchAverageCounter): for batch_average.
  • Perhaps we could also move sum to its own class: BatchCounter::Sum?
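One possible shape for that hierarchy (class names taken from the list above; everything else is a placeholder sketch, not a proposal for the actual implementation):

```ruby
module BatchCounter
  # Shared batching logic: iterating value ranges, clamping batch sizes, etc.
  class Base
    def initialize(relation, column:)
      @relation = relation
      @column = column
    end
  end

  # Rename of the current BatchCounter:
  # batch_count, batch_distinct_count, batch_sum.
  class Generic < Base
  end

  # Rename of BatchAverageCounter: batch_average.
  class Average < Base
  end
end
```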

Resolves #367213 (closed).

SQL Queries

Below is an example of the SQL queries produced by running batch_average (from my local GDK setup). You can see it returns the same result as ActiveRecord's .average method:

```
[17] pry(main)> Ci::Pipeline.average(:duration).to_f
   (1.0ms)  SELECT AVG("ci_pipelines"."duration") FROM "ci_pipelines"
=> 39.03669724770642
[12] pry(main)> Gitlab::Database::BatchCount.batch_average(Ci::Pipeline, :duration, batch_size: 200).to_f
   (0.5ms)  SELECT MIN("ci_pipelines"."duration") FROM "ci_pipelines"
   (0.2ms)  SELECT MAX("ci_pipelines"."duration") FROM "ci_pipelines"
   (0.2ms)  SELECT SUM("ci_pipelines"."duration"), COUNT("ci_pipelines"."duration") FROM "ci_pipelines" WHERE "ci_pipelines"."duration" >= 1 AND "ci_pipelines"."duration" < 201 LIMIT 1
   (0.2ms)  SELECT SUM("ci_pipelines"."duration"), COUNT("ci_pipelines"."duration") FROM "ci_pipelines" WHERE "ci_pipelines"."duration" >= 201 AND "ci_pipelines"."duration" < 401 LIMIT 1
   (0.2ms)  SELECT SUM("ci_pipelines"."duration"), COUNT("ci_pipelines"."duration") FROM "ci_pipelines" WHERE "ci_pipelines"."duration" >= 401 AND "ci_pipelines"."duration" < 601 LIMIT 1
   (0.2ms)  SELECT SUM("ci_pipelines"."duration"), COUNT("ci_pipelines"."duration") FROM "ci_pipelines" WHERE "ci_pipelines"."duration" >= 601 AND "ci_pipelines"."duration" < 607 LIMIT 1
=> 39.03669724770642
```

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Ahmed Hemdan
