corruption of Sidekiq cron job enqueue_time prevents Sidekiq from initializing - 'invalid date' logged during crash loop
Everyone can contribute. Help move this issue forward while earning points, leveling up and collecting rewards.
Summary
This issue summarises the symptoms and 'fix' for a customer emergency.
-
Customer reported severe impact to GitLab functionality, and very large numbers of Sidekiq jobs queued
-
On checking Sidekiq (Admin area > Monitoring > Background jobs)
Busyshowed as zero -
Clicking through to this view, there were no sidekiq processes or jobs listed (Sidekiq is down)
-
Sidekiq was found to be in a crash loop on their Sidekiq virtual machines: the service in
gitlab-ctl statusshowing an uptime of 2-3 minutes or less. (Roughly the time it takes for Sidekiq to initialize the Rails stack) -
Sidekiq logs showed usual initialisation messages, such as:
{"severity":"INFO","time":"YYYY-MM-DDTHH:MM:SS.000Z","message":"Sidekiq x.y.z connecting to Redis with ptions ...followed by:
invalid date -
The top of the stack trace:
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:631:in `parse' /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:631:in `rescue in parse_enqueue_time' /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:628:in `parse_enqueue_time' /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:290:in `initialize' /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:228:in `new' /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:228:in `block in all' /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:226:in `collect' /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:226:in `all' /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:534:in `destroy_removed_jobs' /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:177:in `load_from_hash!' /opt/gitlab/embedded/service/gitlab-rails/config/initializers/sidekiq.rb:9:in `load_cron_jobs!' /opt/gitlab/embedded/service/gitlab-rails/config/initializers/sidekiq.rb:113:in `block in <top (required)>' /opt/gitlab/embedded/service/gitlab-rails/vendor/gems/sidekiq-7.1.6/lib/sidekiq.rb:98:in `configure_server' /opt/gitlab/embedded/service/gitlab-rails/config/initializers/sidekiq.rb:41:in `<top (required)>' /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.1/lib/rails/engine.rb:667:in `load' /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.1/lib/rails/engine.rb:667:in `block in load_config_initializer' /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/activesupport-7.0.8.1/lib/active_support/notifications.rb:208:in `instrument' -
Observations
-
This is occurring during the
initializers, that is: when Sidekiq is booting. -
This can also be reproduced on the Rails console, for example, all three of these lines generate the same error.
current_job_names = Sidekiq::Cron::Job.all.filter_map { |j| j.name if j.source == "schedule" } current_job_names = Sidekiq::Cron::Job.all.map(&:name) Sidekiq::Cron::Job.destroy_all!
-
-
Conclusions from looking at this version of the upstream code (job.rb)
-
On initialisation, Sidekiq refreshes the cron jobs via the job names.
-
To populate
current_job_namesSidekiq::Cron::Job.allperformsSidekiq::Cron::Job.newon what's been found in Redis532 # Remove "removed jobs" between current jobs and new jobs 533 def self.destroy_removed_jobs new_job_names 534 current_job_names = Sidekiq::Cron::Job.all.filter_map { |j| j.name if j.source == "schedule" } 535 removed_job_names = current_job_names - new_job_names 536 removed_job_names.each { |j| Sidekiq::Cron::Job.destroy(j) } 537 removed_job_names 538 end
-
-
this results in
initializebeing called, and this attempts to determinelast_enqueue_time288 # Set last enqueue time - from args or from existing job. 289 if args['last_enqueue_time'] && !args['last_enqueue_time'].empty? 290 @last_enqueue_time = parse_enqueue_time(args['last_enqueue_time']) 291 else 292 @last_enqueue_time = last_enqueue_time_from_redis 293 end -
We don't need to know
last_enqueue_timeto perform (for example)destroy_all!- those records are all going to get destroyed anyway, but I guess it's part of creating objects for all the cronjobs.
Steps to reproduce
Unknown.
This might be random corruption of data written to Redis, or it might be a locale issue, though in that case it'd more likely to be reproducible and seen more widely. For example, there's an upstream PR related to Japanese formatting in datestamps: https://github.com/sidekiq-cron/sidekiq-cron/pull/354
Workaround
-
Stop Sidekiq and Puma on one node -
gitlab-ctl start sidekiq ; gitlab-ctl start puma -
Back up
job.rb# version specific directories replaced with * - this might need manual modification cd /opt/gitlab/embedded/lib/ruby/gems/*/gems/sidekiq-cron-*/lib/sidekiq/cron cp -a job.rb job.rb_bkp -
Modify
def initializeso@last_enqueue_timeis NOT set viaparse_enqueue_timebefore
# Set last enqueue time - from args or from existing job. if args['last_enqueue_time'] && !args['last_enqueue_time'].empty? @last_enqueue_time = parse_enqueue_time(args['last_enqueue_time']) else @last_enqueue_time = last_enqueue_time_from_redis endafter
# Set last enqueue time - from args or from existing job. #if args['last_enqueue_time'] && !args['last_enqueue_time'].empty? # @last_enqueue_time = parse_enqueue_time(args['last_enqueue_time']) #else @last_enqueue_time = last_enqueue_time_from_redis #end -
Start the Rails console on this node:
gitlab-rails console -
Remove all cron jobs (via this modified code)
Sidekiq::Cron::Job.destroy_all! -
Quit the rails console
-
Revert the code change
mv job.rb_bkp job.rb -
Start sidekiq:
gitlab-ctl start sidekiq; gitlab-ctl start puma
If there are other nodes running Sidekiq, you should find that they will have already initialized successfully and are now processing jobs.
Example Project
What is the current bug behavior?
If Sidekiq::Cron::Job.all fails (in this case for parse_enqueue_time but it could be a problem with other attributes) Sidekiq gets into state where it can neither start, nor can the cron jobs be removed as the code to do so depends on Sidekiq::Cron::Job.all
What is the expected correct behavior?
IF we find this is a broader problem we might need a rake task that can go direct to Redis and deal with the cron jobs. But for the moment, %Awaiting further demand
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
This occurred on 16.11
Expand for output related to GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
Results of GitLab application Check
Expand for output related to the GitLab application check
(For installations with omnibus-gitlab package run and paste the output of:
sudo gitlab-rake gitlab:check SANITIZE=true)(For installations from source run and paste the output of:
sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true)(we will only investigate if the tests are passing)