Created Dec 30, 2016 by Valery Sizov@vsizovDeveloper

More reliable Sidekiq queue

Sidekiq uses the BRPOP command to pop a job off the queue in Redis, which means the job is removed from the queue. Usually that's not a problem, because if a job raises an error, Sidekiq puts it back (in another queue, with a "failed" status) and it will be retried later. But it becomes a problem when we kill Sidekiq, or when it segfaults or crashes: in that case, we lose the job forever.
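
To make the failure mode concrete, here is a minimal in-memory sketch. It does not talk to Redis or Sidekiq; `FakeQueue` and `process` are hypothetical names that only mimic a destructive pop followed by either an ordinary exception or a hard kill.

```ruby
# In-memory stand-in for a Redis list; this is NOT the real Sidekiq fetch code.
class FakeQueue
  attr_reader :pending, :retries

  def initialize(jobs)
    @pending = jobs.dup
    @retries = []
  end

  # Like BRPOP, this removes the job from the queue before it is processed.
  def pop
    @pending.pop
  end

  # Stand-in for Sidekiq's retry handling after a job raises.
  def schedule_retry(job)
    @retries << job
  end
end

# Process one job. `outcome` is :ok, :error (the job raises) or :killed
# (the process dies, so no Ruby-level rescue gets a chance to run).
def process(queue, outcome)
  job = queue.pop
  case outcome
  when :error
    queue.schedule_retry(job) # the exception is rescued and the job is kept
  when :killed
    # Nothing runs here: the only copy of the job lived in the dead process.
  end
  job
end

queue = FakeQueue.new(%w[job_a job_b])
process(queue, :error)  # job_b raised: it ends up in the retry set
process(queue, :killed) # job_a was in flight during a kill: it is gone

puts "pending: #{queue.pending.inspect}"
puts "retries: #{queue.retries.inspect}"
```

After both calls the queue is empty and only the job that raised is still recorded anywhere; the job that was in flight during the simulated kill exists nowhere.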

This made me take a look at how we kill our Sidekiq processes. We use https://gitlab.com/gitlab-org/gitlab-ce/blob/master/lib/gitlab/sidekiq_middleware/memory_killer.rb for that. There is a variable, SIDEKIQ_MEMORY_KILLER_SHUTDOWN_WAIT, which sets how long we wait between when the process stops accepting new jobs and when we kill it. So we give a job at most 30 seconds to finish. After that, we kill the job and never retry it (for the reason given above). I think there is no reason to be so aggressive, because our jobs sometimes exceed that time (see http://performance.gitlab.net/dashboard/db/sidekiq-workers). I propose setting it to 90 seconds.

The second optimization would be to change the value of SIDEKIQ_MEMORY_KILLER_GRACE_TIME, which is now 15 minutes. That means we wait 15 minutes after writing this warning to the log: "this thread will shut down PID #{Process.pid} - Worker #{worker.class} - JID-# in #{GRACE_TIME} seconds". With the current implementation, I don't see any reason to keep this fat process running that long. For now, I propose setting it to several seconds. Later, we can improve it by adding one more get_rss call after GRACE_TIME, to make sure the memory growth comes from a memory leak and not from one fat job that frees its resources afterward. That would make our shots more accurate, but I don't think it will have practical benefits in real life.
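
The extra get_rss check mentioned above could look roughly like the sketch below. It is not the code in memory_killer.rb: `shutdown_needed?`, its keyword arguments, and the assumption that RSS is measured in kilobytes are all made up for illustration, and the RSS reader and the sleep call are injected so the logic can be exercised without a real process.

```ruby
# Hypothetical helper, not part of GitLab's memory killer.
# rss_reader: callable returning the current RSS (assumed to be in kilobytes).
# sleeper:    callable used to wait; injectable so the sketch runs instantly.
def shutdown_needed?(max_rss_kb:, grace_time:, rss_reader:, sleeper: ->(s) { sleep(s) })
  return false if rss_reader.call <= max_rss_kb # under the limit: nothing to do

  sleeper.call(grace_time) # give the current job time to finish

  # Second reading: if one fat job finished and released its memory,
  # the process is healthy again and does not need to be killed.
  rss_reader.call > max_rss_kb
end

# Simulated readings: a spike that recovers, then growth that does not.
spike   = [1_200_000, 400_000].each
leak    = [1_200_000, 1_300_000].each
no_wait = ->(_seconds) {}

recovered = shutdown_needed?(max_rss_kb: 1_000_000, grace_time: 10,
                             rss_reader: -> { spike.next }, sleeper: no_wait)
leaking   = shutdown_needed?(max_rss_kb: 1_000_000, grace_time: 10,
                             rss_reader: -> { leak.next }, sleeper: no_wait)

puts "kill after spike? #{recovered}"
puts "kill after leak?  #{leaking}"
```

With these simulated readings the process that recovers is spared and the one that keeps growing is flagged for shutdown.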

Summary

Right now we have the following picture: the process exceeds the memory limit -> we write a warning to the log -> wait 15 minutes -> stop accepting new jobs -> wait 30 seconds -> shut everything down.

I propose setting SHUTDOWN_WAIT to 90 seconds (default 30) and GRACE_TIME to 10 seconds (now 15 minutes). What to expect?

  • Lower memory consumption, because fat processes are no longer kept around for 15 minutes before being killed.
  • A more reliable queue, because jobs get more time to finish their work.
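
As a quick check of the arithmetic behind the proposal, the sketch below compares the two configurations. The method name `seconds_until_shutdown` is hypothetical; the numbers are the ones quoted in this issue.

```ruby
# Worst-case seconds between the logged warning and the process being shut
# down: the grace period plus the time given to in-flight jobs.
# Illustrative only; this method does not exist in memory_killer.rb.
def seconds_until_shutdown(grace_time:, shutdown_wait:)
  grace_time + shutdown_wait
end

current  = seconds_until_shutdown(grace_time: 15 * 60, shutdown_wait: 30)
proposed = seconds_until_shutdown(grace_time: 10, shutdown_wait: 90)

puts "current:  up to #{current} seconds over the limit"
puts "proposed: up to #{proposed} seconds over the limit"
```

Under the proposed values a process that is over the memory limit lives for at most 100 seconds instead of 930, while running jobs get three times as long to finish.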

This should close issue https://gitlab.com/gitlab-org/gitlab-ce/issues/23646. I'm also sure that everyone has seen lost jobs; at least I have, many times.

/cc @jacobvosmaer-gitlab @pcarranza @jnijhof

REFERENCES:

https://github.com/mperham/sidekiq/wiki/Error-Handling

https://github.com/mperham/sidekiq/wiki/Reliability

Edited Sep 10, 2018 by Valery Sizov