Retrospective on failures from ClearSharedRunnersMinutesWorker
Problem to solve
In https://gitlab.com/gitlab-org/gitlab-ce/issues/66679 and https://gitlab.com/gitlab-org/gitlab-ce/issues/65540 experienced a bug when resetting shared Runners minutes for customers on a monthly basis. This issue is intended to be a retrospective issue with observations and suggestions for improvements.
The minutes reset procedure had 2 primary issues:
- Some time it did not complete because of the query on
project_statistics
table started timing out due to the size of the table and the lack of index - We successfully recalculated the remaining additional minutes for the next month while failing to reset the minutes used. This caused customer to lose minutes on the next monthly reset.
Impacts
This problem impacted customers on Gitlab.com that use shared Runners minutes. Customers that had purchased additional minutes ended up losing a number of minutes. All customers might have experienced some mismatches in the usage quota either in the form of too many minutes used being shown or in the form of minutes from all project in the namespace not adding up to the namespace value.
Improvements
While investigating the issues, and then providing the fix, I've noticed some improvements we could make to the code to avoid certain issues from happening again. Here below a list of solutions that I consider sorted from the most important to the least important.
1. Ensure that changes to customer minutes are transactional
- recalculate the remaining additional minutes for all namespaces
- reset CI minutes notifications
- reset the minutes used to
0
for all namespaces - reset the minutes used to
0
for all projects
The problem with this algorithm is that any failure will have very large customer impact as it would leave a number of inconsistencies in the statistics that will cause customer losing money, confusion and make it also hard to troubleshoot from tech support / dev personnel.
With the last change we introduced some transactional changes but this should be expanded to the whole ClearSharedRunnersMinutesWorker
. For example, we should be able to do something like this:
Namespace.select(:id).each_batch do |namespaces|
Namespace.transaction do
Namespaces.where(id: namespaces).recalculate_extra_shared_runners_minutes_limit!
Namespaces.where(id: namespaces).reset_last_ci_minutes_notification!
NamespaceStatistics.where(namespaces: namespaces).reset_shared_runners_seconds!
ProjectStatistics.where(namespaces: namespaces).reset_shared_runners_seconds!
end
end
The above would ensure that changes will happen to all namespaces in the batch or none and would not leave inconsistent numbers.
2. Enable proper alerting for system level workers
The first bug (https://gitlab.com/gitlab-org/gitlab-ce/issues/65540) was filed after several customers had reported that their statistics where wrong. In reality the error encountered by ClearSharedRunnersMinutesWorker
was reported to Sentry (https://sentry.gitlab.net/gitlab/gitlabcom/issues/882609/?query=is:unresolved%20ClearSharedRunnersMinutesWorker) but it went unnoticed. Perhaps because Sentry considered the error to be seen by 0 Users
as there was no session data associated to it. In reality this was a system-wide failure and we should monitor these differently.
Maybe we can mark specific workers to be system-wide or critical
and we could send additional alerts (e.g. to infrastructure? a dedicated team?) + logging.
class ClearSharedRunnersMinutesWorker
include ApplicationWorker
system_worker!
# rest of the code
end
3. Always show the full list of projects in the project usage breakdown
some customers complained that on top of the namespace statistic eing wrong some projects where not even listed. In reality this is by design. We do not list public projects in the usage breakdown. I think we should always show all projects and instead, add explicitly that "this project is public and it does not contribute to the quota calculation": https://gitlab.com/gitlab-org/gitlab-ce/issues/65540#note_200800151