Service/usage ping: high database load caused by ci_builds
Summary
A customer raised a ticket because their RDS burst balance was running low one day every week. This is a large customer, currently running GitLab 13.12.15. They isolated a particular Sidekiq worker and the query that was causing it:
```sql
/*application:sidekiq,correlation_id:xx,jid:yy,endpoint_id:GitlabUsagePingWorker*/
SELECT COUNT("ci_builds"."id") FROM "ci_builds"
WHERE "ci_builds"."type" = ? AND "ci_builds"."id" >= ? AND "ci_builds"."id" < ?
```
I suspect this isn't the only query, as the graphs on the ticket show high datafile read IOPS for roughly 30 minutes.
GitLab team members can find out more:
Workaround
Support advised the customer to disable service ping, and the IOPS issue did not repeat the following week. The downside is that GitLab no longer receives the data from the customer to inform future product direction.
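For reference, on Omnibus installs the workaround amounts to one setting in `/etc/gitlab/gitlab.rb` followed by a reconfigure (service ping can also be toggled in the admin area under Settings > Metrics and profiling):

```ruby
# /etc/gitlab/gitlab.rb — disable service/usage ping, then run:
#   sudo gitlab-ctl reconfigure
gitlab_rails['usage_ping_enabled'] = false
```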
Steps to reproduce
RDS PostgreSQL backed by gp2 storage, which has an IOPS rating based on volume size. AWS provides an article with more detail on gp2 and burst balances.
What is the current bug behavior?
Service ping can generate enough database I/O to consume a sizable portion of the burst balance.
Burst credit is continuously replenished at 3 IOPS per GiB of volume size. Consider a daily ETL workload that uses a lot of I/O: for the daily job, gp2 can burst, and during downtime the burst credit is replenished for the next day's run.
So the balance does replenish over time, but it would be better to keep it in reserve for business demands on GitLab. Exhausting the credit entirely would also be bad, as the volume drops to its baseline IOPS, i.e. it gets throttled.
In the following example, I created an Amazon RDS instance with a 20 GiB gp2 volume. Such a volume bursts to 3,000 IOPS, but once the burst balance is exhausted it delivers only 100 IOPS, the gp2 minimum. The point to stress here is that the small volume performs very well for this simulated nightly job, and then over the course of the next 12 hours the burst credits accumulate in time for the next day's run.
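As a rough sanity check, AWS's published gp2 constants (an initial bucket of 5.4 million I/O credits, replenished at 3 IOPS per GiB of volume size; these numbers come from the AWS documentation, not the ticket) put full-burst exhaustion of this 20 GiB example at about half an hour, which lines up with the ~30 minutes of elevated read IOPS mentioned above:

```sql
-- Back-of-the-envelope gp2 burst math for a 20 GiB volume:
-- bursting at 3000 IOPS while earning back 3 IOPS/GiB = 60 credits/second.
SELECT round(5400000.0 / (3000 - 60) / 60, 1) AS minutes_to_empty; -- ~30.6
```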
What is the expected correct behavior?
Service ping should tread lightly.
Relevant logs and/or screenshots
See the ticket.
Possible fixes
There are two suggested routes to fix this in the comment below:
- Use `CounterAttribute` in fast-growing tables, though we may lose `distinct_user` counts with that; the impact of this is TBD.
- Migrate this metric to a RedisHLL counter. We do not know what impact that would have on other metrics that use `ci_builds` data, so that would need investigation.
Proposal
From @grzesiek
The simplest way I can think of is to store, for each service ping run, an offset ID together with the number of builds previously counted:
| date       | id | count     |
|------------|----|-----------|
| 2023-05-25 | x  | 1,000,000 |
| 2023-05-26 | y  | 1,200,000 |
Each run would not start from zero, but from a given ID, knowing that everything prior to that ID had already been counted the previous day. So in the next ping we send the previous total plus the number of builds created since ID x.
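A minimal SQL sketch of that bookkeeping, using a hypothetical table (the names are illustrative, not part of the GitLab schema; the first row would be seeded by one full batched count):

```sql
-- Hypothetical offset table for the incremental build count.
CREATE TABLE service_ping_build_counts (
  recorded_on     date   PRIMARY KEY,
  last_counted_id bigint NOT NULL, -- highest ci_builds.id included in the total
  running_total   bigint NOT NULL  -- cumulative count reported in the ping
);

-- Each subsequent ping counts only rows created since the stored offset
-- and adds them to the previous total, instead of re-counting from zero.
INSERT INTO service_ping_build_counts (recorded_on, last_counted_id, running_total)
SELECT CURRENT_DATE,
       COALESCE(MAX(b.id), prev.last_counted_id),
       prev.running_total + COUNT(b.id)
FROM (SELECT last_counted_id, running_total
      FROM service_ping_build_counts
      ORDER BY recorded_on DESC
      LIMIT 1) AS prev
LEFT JOIN ci_builds b
  ON b.type = 'Ci::Build' AND b.id > prev.last_counted_id
GROUP BY prev.last_counted_id, prev.running_total;
```

The counting query then only ever touches the newest slice of the table, at the cost of never reflecting deleted builds in the reported total.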