Disabled Sidekiq scheduled cron job gets automatically re-enabled

Insight

It all started when we experienced slowness in DELETE operations against our self-hosted NetApp object store (GitLab EE Premium 16.10.1, Omnibus setup).
The Ci::DeleteObjectsWorker had a long execution time, as did other cron jobs performing DELETE operations:
[Screenshot: Sidekiq worker execution times]

Supporting evidence

The NetApp (S3) slowness led to the Sidekiq queue filling up with Ci::DeleteObjectsWorker jobs, because the jobs hitting S3 timeouts could not be worked off.
We created a separate queue for them with the following Sidekiq settings (templated with Ansible/Jinja2):

sidekiq['routing_rules'] = [
  [
    'worker_name=Ci::DeleteObjectsWorker', # filter
    'ci_delete_objects' # queue group name
  ],
  # Wildcard matching, route the rest to `default` queue
  ['*', 'default'],
]

sidekiq['queue_groups'] = [
  'ci_delete_objects',
{% if ansible_processor_vcpus >= 4 %}
{% for n in range(ansible_processor_vcpus - 4) %}
  'default,mailers',
{% endfor %}
{% endif %}
  'default,mailers',
]
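
As far as we understand, each entry in queue_groups starts one Sidekiq process listening on the listed queues, so this template dedicates one process to ci_delete_objects and scales the number of default/mailers processes with the host's vCPU count; the settings are applied with gitlab-ctl reconfigure.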

Creating this many queue groups caused memory issues for us, because the queues were still not being worked off. So we changed the Sidekiq settings to:

# sidekiq['routing_rules'] unchanged, see above
sidekiq['queue_groups'] = [
  'ci_delete_objects',
  'default,mailers',
  'default,mailers',
]

With this setup, jobs kept piling up in the newly created separate queue:
[Screenshot: backlog in the ci_delete_objects queue]
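
For completeness, the size of that backlog can also be checked from the Rails console with the standard Sidekiq API (a minimal sketch; the queue name matches the routing rule above):

# Run inside gitlab-rails console on a Rails/Sidekiq node.
require 'sidekiq/api'

queue = Sidekiq::Queue.new('ci_delete_objects')
puts "#{queue.name}: #{queue.size} jobs, oldest enqueued ~#{queue.latency.round}s ago"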

Knowing the main problem was Ci::DeleteObjectsWorker, we looked into disabling the cron job that spawns it.
https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/workers/ci/delete_objects_worker.rb

The Ci::DeleteObjectsWorker seems to be spawned by the ci_schedule_delete_objects_worker cron job:
[Screenshot: the ci_schedule_delete_objects_worker cron job]
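
To confirm that relationship, the registered cron jobs can be listed from the Rails console via sidekiq-cron (a minimal sketch that simply filters by class name):

# List registered cron jobs whose worker class relates to object deletion.
Sidekiq::Cron::Job.all
  .select { |job| job.klass.to_s.include?('DeleteObjects') }
  .each { |job| puts "#{job.name} -> #{job.klass} (#{job.cron}, #{job.status})" }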

We tried disabling the cron job via the Disable button in the UI, as described here.

Unfortunately, the cron job was automatically re-enabled after a few minutes and jobs kept piling up in the Sidekiq queue.
Disabling the cron job via the Rails console yielded the same result:

job = Sidekiq::Cron::Job.find('ci_schedule_delete_objects_worker')
job.disable!
=> true
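# a few minutes later, the job reports as enabled again: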
job = Sidekiq::Cron::Job.find('ci_schedule_delete_objects_worker')
job.status
=> "enabled"

We found the following comment in the gitlab.yml configuration referencing a self-healing mechanism:

  ## Auxiliary jobs
  # Periodically executed jobs, to self-heal GitLab, do external synchronizations, etc.
  # Please read here for more information: https://github.com/ondrejbartas/sidekiq-cron#adding-cron-job
  cron_jobs:
    # Interval, in seconds, for each Sidekiq process to check for scheduled cron jobs that need to be enqueued. If not
    # set, the interval scales dynamically with the number of Sidekiq processes. If set to 0, disable polling for cron
    # jobs entirely.
    # poll_interval: 30

This led us to conclude that the cron job is restored by this self-healing mechanism, which periodically re-registers the configured cron jobs and thereby overrides a manual disable.
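
Our working theory, sketched below with sidekiq-cron's own API (an illustration, not GitLab's actual initializer code), is that re-loading a job definition from configuration re-creates the stored job and leaves it enabled:

# Hypothetical illustration: loading a cron job definition from a hash
# overwrites the stored job, so a previous disable! is lost.
require 'sidekiq/cron/job'

Sidekiq::Cron::Job.load_from_hash!(
  'ci_schedule_delete_objects_worker' => {
    'cron'  => '*/16 * * * *',                       # placeholder schedule
    'class' => 'Ci::ScheduleDeleteObjectsCronWorker' # assumed spawning worker class
  }
)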

We tried modifying the schedule of the cron job in the /etc/gitlab/gitlab.rb file with:

gitlab_rails['ci_schedule_delete_objects_worker'] = "0 * 1 12 *"  # hourly on 1 December, effectively postponing the job until 2024-12-01

This change did not take effect, and /var/opt/gitlab/gitlab-rails/etc/gitlab.yml did not reflect it either.
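
At least the schedule that was actually loaded can be inspected from the Rails console; assuming the cron_jobs section of gitlab.yml is exposed via Settings (as it is for other cron workers), something like the following should show it, and given that gitlab.yml was unchanged it would still show the original schedule:

# Run inside gitlab-rails console.
Settings.cron_jobs['ci_schedule_delete_objects_worker']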

After this, we considered switching back to local storage instead of object storage. Fortunately, the situation with our S3 backend improved and we didn't have to.
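
For reference, that fallback would roughly have meant disabling the object storage configuration in /etc/gitlab/gitlab.rb again (a sketch, assuming the consolidated object storage form is in use; migrating existing objects back to local storage is a separate step):

gitlab_rails['object_store']['enabled'] = false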

Action

Ensure GitLab takes this kind of external object storage failure into account.
Provide a way to deal with object storage failures, and check that the documentation accurately describes how to handle them.

Resources

Tasks

  • Assign this issue to the appropriate Product Manager, Product Designer, or UX Researcher.
  • Add the appropriate Group (such as ~"group::source code") label to the issue. This helps identify and track actionable insights at the group level.
  • Link this issue back to the original research issue in the GitLab UX Research project and the Dovetail project.
  • Adjust template link for this Issue type, because it leads to a 404: https://about.gitlab.com/handbook/product/ux/ux-research-training/research-insights/#actionable-insights