Service/usage ping: high database load caused by ci_builds
Summary
A customer raised a ticket because their RDS burst balance was running low one day every week. This is a large customer, currently running GitLab 13.12.15. They isolated a particular Sidekiq worker and the query that was causing it:
```sql
/*application:sidekiq,correlation_id:xx,jid:yy,endpoint_id:GitlabUsagePingWorker*/
SELECT COUNT("ci_builds"."id") FROM "ci_builds"
WHERE "ci_builds"."type" = ? AND "ci_builds"."id" >= ? AND "ci_builds"."id" < ?
```
I suspect this isn't the only query, as the graphs on the ticket show high datafile read IOPS for roughly 30 minutes.
GitLab team members can find out more:
Workaround
Support advised the customer to disable service ping, and the IOPS issue did not repeat the following week. The downside is that GitLab no longer receives the data from the customer to inform future product direction.
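For reference, on Omnibus installs the workaround amounts to one setting in `/etc/gitlab/gitlab.rb` followed by a reconfigure (service ping can also be toggled in the admin area under Settings > Metrics and profiling):

```ruby
# /etc/gitlab/gitlab.rb — disable service/usage ping, then run:
#   sudo gitlab-ctl reconfigure
gitlab_rails['usage_ping_enabled'] = false
```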
Steps to reproduce
RDS PostgreSQL backed by gp2 storage, which has an IOPS rating based on volume size. AWS provides an article with more detail on gp2 and burst balances.
What is the current bug behavior?
Service ping can generate enough database I/O to consume a sizable portion of the burst balance.
Burst credit is continuously replenished at 3 IOPS per GiB of volume size. Consider a daily ETL workload that uses a lot of I/O: for the daily job, gp2 can burst, and during downtime the burst credit is replenished for the next day's run.
So the balance does replenish over time, but it would be better to keep it in reserve for business demands on GitLab. Exhausting the credit entirely would also be bad, as the volume drops to its baseline IOPS, i.e. it gets throttled.
In the following example, I created an Amazon RDS instance with a 20 GiB gp2 volume. Such a volume bursts to 3,000 IOPS, but once the burst balance is exhausted it delivers only 100 IOPS, the gp2 minimum. The point to stress here is that the small volume performs very well for this simulated nightly job, and then over the course of the next 12 hours the burst credits accumulate in time for the next day's run.
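As a rough sanity check, AWS's published gp2 constants (an initial bucket of 5.4 million I/O credits, replenished at 3 IOPS per GiB of volume size; these numbers come from the AWS documentation, not the ticket) put full-burst exhaustion of this 20 GiB example at about half an hour, which lines up with the ~30 minutes of elevated read IOPS mentioned above:

```sql
-- Back-of-the-envelope gp2 burst math for a 20 GiB volume:
-- bursting at 3000 IOPS while earning back 3 IOPS/GiB = 60 credits/second.
SELECT round(5400000.0 / (3000 - 60) / 60, 1) AS minutes_to_empty; -- ~30.6
```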
What is the expected correct behavior?
Service ping should tread lightly.
Relevant logs and/or screenshots
See the ticket.
Possible fixes
There are two suggested routes to fix this in the comment below:
- Use `CounterAttribute` in fast-growing tables, though we may lose `distinct_user` counts with that; the impact of this is TBD.
- Migrate this metric to a RedisHLL counter. We do not know what impact that would have on other metrics that use `ci_builds` data, so that would need investigation.
Proposal
From @grzesiek
The simplest way I can think of is to store, for each service ping run, an offset ID together with the number of builds previously counted:
| date       | id | count     |
|------------|----|-----------|
| 2023-05-25 | x  | 1,000,000 |
| 2023-05-26 | y  | 1,200,000 |
Each run would not start from zero, but from a given ID, knowing that everything prior to that ID had already been counted the previous day. So in the next ping we send the previous total plus the number of builds created since ID x.
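A minimal SQL sketch of that bookkeeping, using a hypothetical table (the names are illustrative, not part of the GitLab schema; the first row would be seeded by one full batched count):

```sql
-- Hypothetical offset table for the incremental build count.
CREATE TABLE service_ping_build_counts (
  recorded_on     date   PRIMARY KEY,
  last_counted_id bigint NOT NULL, -- highest ci_builds.id included in the total
  running_total   bigint NOT NULL  -- cumulative count reported in the ping
);

-- Each subsequent ping counts only rows created since the stored offset
-- and adds them to the previous total, instead of re-counting from zero.
INSERT INTO service_ping_build_counts (recorded_on, last_counted_id, running_total)
SELECT CURRENT_DATE,
       COALESCE(MAX(b.id), prev.last_counted_id),
       prev.running_total + COUNT(b.id)
FROM (SELECT last_counted_id, running_total
      FROM service_ping_build_counts
      ORDER BY recorded_on DESC
      LIMIT 1) AS prev
LEFT JOIN ci_builds b
  ON b.type = 'Ci::Build' AND b.id > prev.last_counted_id
GROUP BY prev.last_counted_id, prev.running_total;
```

The counting query then only ever touches the newest slice of the table, at the cost of never reflecting deleted builds in the reported total.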