
Corruption of Sidekiq cron job enqueue_time prevents Sidekiq from initializing - 'invalid date' logged during crash loop


Summary

This issue summarises the symptoms and 'fix' for a customer emergency.

  • Customer reported severe impact to GitLab functionality, and very large numbers of Sidekiq jobs queued

  • On checking Sidekiq (Admin area > Monitoring > Background jobs) Busy showed as zero

  • Clicking through to this view showed no Sidekiq processes or jobs listed (Sidekiq was down)

  • Sidekiq was found to be in a crash loop on their Sidekiq virtual machines: the service in gitlab-ctl status showed an uptime of 2-3 minutes or less (roughly the time it takes for Sidekiq to initialize the Rails stack)

  • Sidekiq logs showed the usual initialisation messages, such as:

    {"severity":"INFO","time":"YYYY-MM-DDTHH:MM:SS.000Z","message":"Sidekiq x.y.z connecting to Redis with ptions ...

    followed by:

    invalid date
  • The top of the stack trace:

    /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:631:in `parse'
    /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:631:in `rescue in parse_enqueue_time'
    /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:628:in `parse_enqueue_time'
    /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:290:in `initialize'
    /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:228:in `new'
    /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:228:in `block in all'
    /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:226:in `collect'
    /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:226:in `all'
    /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:534:in `destroy_removed_jobs'
    /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:177:in `load_from_hash!'
    /opt/gitlab/embedded/service/gitlab-rails/config/initializers/sidekiq.rb:9:in `load_cron_jobs!'
    /opt/gitlab/embedded/service/gitlab-rails/config/initializers/sidekiq.rb:113:in `block in <top (required)>'
    /opt/gitlab/embedded/service/gitlab-rails/vendor/gems/sidekiq-7.1.6/lib/sidekiq.rb:98:in `configure_server'
    /opt/gitlab/embedded/service/gitlab-rails/config/initializers/sidekiq.rb:41:in `<top (required)>'
    /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.1/lib/rails/engine.rb:667:in `load'
    /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.1/lib/rails/engine.rb:667:in `block in load_config_initializer'
    /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/activesupport-7.0.8.1/lib/active_support/notifications.rb:208:in `instrument'
  • Observations

    • This is occurring during the initializers, that is: when Sidekiq is booting.

    • This can also be reproduced on the Rails console, for example, all three of these lines generate the same error.

      current_job_names = Sidekiq::Cron::Job.all.filter_map { |j| j.name if j.source == "schedule" }
      current_job_names = Sidekiq::Cron::Job.all.map(&:name)
      Sidekiq::Cron::Job.destroy_all! 
  • Conclusions from looking at this version of the upstream code (job.rb)

    • On initialisation, Sidekiq refreshes the cron jobs via the job names.

    • To populate current_job_names, Sidekiq::Cron::Job.all calls Sidekiq::Cron::Job.new on each entry found in Redis:

      532       # Remove "removed jobs" between current jobs and new jobs
      533       def self.destroy_removed_jobs new_job_names
      534         current_job_names = Sidekiq::Cron::Job.all.filter_map { |j| j.name if j.source == "schedule" }
      535         removed_job_names = current_job_names - new_job_names
      536         removed_job_names.each { |j| Sidekiq::Cron::Job.destroy(j) }
      537         removed_job_names
      538       end
  • This results in initialize being called, which attempts to determine last_enqueue_time:

    288        # Set last enqueue time - from args or from existing job.
    289        if args['last_enqueue_time'] && !args['last_enqueue_time'].empty?
    290          @last_enqueue_time = parse_enqueue_time(args['last_enqueue_time'])
    291        else
    292          @last_enqueue_time = last_enqueue_time_from_redis
    293        end
  • We don't need to know last_enqueue_time to perform (for example) destroy_all! - those records are all going to be destroyed anyway - but parsing it is presumably just part of creating objects for all the cron jobs. A sketch for inspecting the raw values directly in Redis, without creating any job objects, follows this list.
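
The suspect value can be inspected without going through Sidekiq::Cron::Job.all (which is what crashes). The following is a minimal sketch for the gitlab-rails console, assuming the sidekiq-cron 1.x Redis layout (a "cron_jobs" set whose members are per-job hash keys); the key names are assumptions and should be verified against the installed gem version.

    # Hedged sketch: read the raw cron job hashes straight from Redis so that
    # parse_enqueue_time is never called. The key names ("cron_jobs" set, per-job
    # hash keys) are assumptions based on sidekiq-cron 1.x - verify them against
    # the installed gem before relying on this.
    require 'time'

    Sidekiq.redis do |conn|
      conn.smembers("cron_jobs").each do |job_key|
        value = conn.hget(job_key, "last_enqueue_time")
        next if value.nil? || value.empty?

        begin
          Time.parse(value)
        rescue StandardError => e
          puts "Suspect last_enqueue_time for #{job_key}: #{value.inspect} (#{e.class})"
        end
      end
    end

Note that Time.parse only approximates the gem's own parsing (a strict strptime format with a lenient parse fallback), so a value that passes here is not conclusively clean; this is mainly a quick way to spot obvious corruption.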

Steps to reproduce

Unknown.

This might be random corruption of data written to Redis, or it might be a locale issue, though in that case it would more likely be reproducible and seen more widely. For example, there's an upstream PR related to Japanese formatting in datestamps: https://github.com/sidekiq-cron/sidekiq-cron/pull/354
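
One way to confirm the failure mode (though not the original cause) is to simulate the corruption on a disposable test instance: write a garbage last_enqueue_time into one cron job hash and confirm that Sidekiq::Cron::Job.all then fails the same way. This is a hedged sketch only - the key layout is assumed from sidekiq-cron 1.x - and must not be run against a production instance.

    # Hedged sketch for a disposable test instance ONLY: corrupt one cron job's
    # last_enqueue_time and confirm the same "invalid date" failure mode.
    # Key names are assumptions based on sidekiq-cron 1.x.
    Sidekiq.redis do |conn|
      job_key = conn.smembers("cron_jobs").first
      conn.hset(job_key, "last_enqueue_time", "not a timestamp") if job_key
    end

    Sidekiq::Cron::Job.all  # expected to raise the same "invalid date" error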

Workaround

  1. Stop Sidekiq and Puma on one node - gitlab-ctl stop sidekiq ; gitlab-ctl stop puma

  2. Back up job.rb

    # version-specific directories are replaced with * - this might need manual modification
    cd /opt/gitlab/embedded/lib/ruby/gems/*/gems/sidekiq-cron-*/lib/sidekiq/cron
    cp -a job.rb job.rb_bkp
  3. Modify def initialize so @last_enqueue_time is NOT set via parse_enqueue_time

    before

         # Set last enqueue time - from args or from existing job.
         if args['last_enqueue_time'] && !args['last_enqueue_time'].empty?
           @last_enqueue_time = parse_enqueue_time(args['last_enqueue_time'])
         else
           @last_enqueue_time = last_enqueue_time_from_redis
         end

    after

         # Set last enqueue time - from args or from existing job.
         #if args['last_enqueue_time'] && !args['last_enqueue_time'].empty?
         #  @last_enqueue_time = parse_enqueue_time(args['last_enqueue_time'])
         #else
           @last_enqueue_time = last_enqueue_time_from_redis
         #end
  4. Start the Rails console on this node: gitlab-rails console

  5. Remove all cron jobs (via this modified code)

    Sidekiq::Cron::Job.destroy_all!
  6. Quit the Rails console

  7. Revert the code change

    mv job.rb_bkp job.rb
  8. Start Sidekiq and Puma: gitlab-ctl start sidekiq; gitlab-ctl start puma

If there are other nodes running Sidekiq, you should find that they have already initialized successfully and are now processing jobs.
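
After step 8 it is worth confirming from a Rails console that the schedule was repopulated and that Sidekiq processes have registered again. A minimal verification sketch, assuming the same "cron_jobs" set name as above:

    require 'sidekiq/api'

    # Hedged verification sketch: after the initializer has re-created the cron
    # jobs, the schedule should be non-empty again and Job.all should no longer
    # raise. The "cron_jobs" set name is an assumption based on sidekiq-cron 1.x.
    schedule_size = Sidekiq.redis { |conn| conn.scard("cron_jobs") }
    puts "cron job hashes in Redis: #{schedule_size}"

    puts "Job.all loads cleanly: #{Sidekiq::Cron::Job.all.size} jobs"

    # Sidekiq's process registry should list the restarted worker processes.
    puts "registered Sidekiq processes: #{Sidekiq::ProcessSet.new.size}"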

Example Project

What is the current bug behavior?

If Sidekiq::Cron::Job.all fails (in this case in parse_enqueue_time, but it could be a problem with other attributes), Sidekiq gets into a state where it can neither start nor have its cron jobs removed, because the code to remove them also depends on Sidekiq::Cron::Job.all.

What is the expected correct behavior?

If we find this is a broader problem, we might need a rake task that can go directly to Redis and deal with the cron jobs. But for the moment: %Awaiting further demand
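
For reference, the body of such a rake task could be quite small. The sketch below is illustrative only: the task name and namespace are invented, the key names ("cron_jobs" plus the per-job hash keys and their :enqueued / :jid_history companions) are assumed from sidekiq-cron 1.x, and anything shipped would need to be validated against the pinned gem version.

    # Hedged sketch of a hypothetical rake task (name and namespace invented for
    # illustration) that removes the schedule data directly from Redis, bypassing
    # Sidekiq::Cron::Job.all entirely. Key names are assumptions based on
    # sidekiq-cron 1.x and must be verified against the pinned gem version.
    namespace :gitlab do
      namespace :sidekiq_cron do
        desc 'Remove all sidekiq-cron schedule data directly from Redis'
        task wipe_schedule: :environment do
          Sidekiq.redis do |conn|
            conn.smembers("cron_jobs").each do |job_key|
              [job_key, "#{job_key}:enqueued", "#{job_key}:jid_history"].each do |key|
                conn.del(key)
              end
            end
            conn.del("cron_jobs")
          end
        end
      end
    end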

Relevant logs and/or screenshots

Output of checks

Results of GitLab environment info

This occurred on GitLab 16.11.


Results of GitLab application Check


Possible fixes
