Closed · Opened Dec 30, 2016 by Valery Sizov (@vsizov), Maintainer

More reliable Sidekiq queue

Sidekiq uses the Redis BRPOP command to pop a job off the queue, which removes the job from Redis the moment it is fetched. Usually that's not a problem: if the job raises an error, Sidekiq puts it back (into another queue) with a "failed" status and it will be retried later. But it becomes a problem when we kill Sidekiq, or when it segfaults or crashes; in that case we lose the job forever.
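For illustration, here is a minimal sketch of the difference using plain redis-rb calls against Sidekiq's `queue:default` list. The `queue:default:inprogress` list and the worker loop are hypothetical; they only illustrate the BRPOPLPUSH pattern described in the Sidekiq Reliability wiki page linked below, not GitLab's current code.

```ruby
require 'redis'
require 'json'

redis = Redis.new

# What Sidekiq does today (simplified): BRPOP removes the job from Redis
# before the worker runs it, so a crash or SIGKILL loses the payload.
_queue, payload = redis.brpop('queue:default', timeout: 1)

# A more reliable pattern (see the Sidekiq "Reliability" wiki page below):
# atomically move the job to an "in progress" list and only delete it once
# the job has finished, so a crashed worker's jobs can be re-enqueued later.
payload = redis.brpoplpush('queue:default', 'queue:default:inprogress', timeout: 1)
if payload
  begin
    job = JSON.parse(payload)
    # ... perform the job described by `job` ...
  ensure
    redis.lrem('queue:default:inprogress', 1, payload)
  end
end
```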

This made me take a look at how we kill our Sidekiq processes. We use https://gitlab.com/gitlab-org/gitlab-ce/blob/master/lib/gitlab/sidekiq_middleware/memory_killer.rb for that. The SIDEKIQ_MEMORY_KILLER_SHUTDOWN_WAIT variable controls how long we wait between stopping the acceptance of new jobs and killing the process, so a running job gets at most 30 seconds to finish its work. After that we kill the job and, for the reason above, it will never be retried. I see no reason to be that aggressive, because our jobs sometimes exceed that time (http://performance.gitlab.net/dashboard/db/sidekiq-workers). I propose setting it to 90 seconds.
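For reference, the sequence described here looks roughly like this. This is a simplified sketch, not the actual memory_killer.rb: the signal names, the SIDEKIQ_MEMORY_KILLER_MAX_RSS variable, and the get_rss helper are assumptions for illustration.

```ruby
# Simplified sketch of the memory killer sequence discussed in this issue.
class MemoryKillerSketch
  # Values are in seconds; defaults mirror the ones mentioned above.
  GRACE_TIME    = (ENV['SIDEKIQ_MEMORY_KILLER_GRACE_TIME'] || 15 * 60).to_i
  SHUTDOWN_WAIT = (ENV['SIDEKIQ_MEMORY_KILLER_SHUTDOWN_WAIT'] || 30).to_i
  MAX_RSS       = (ENV['SIDEKIQ_MEMORY_KILLER_MAX_RSS'] || 0).to_i # assumed limit, in kilobytes

  # Sidekiq server middleware interface: runs around every job.
  def call(worker, job, queue)
    yield # run the job first
  ensure
    if MAX_RSS > 0 && get_rss > MAX_RSS
      Thread.new do
        # (a warning is written to the log here)
        sleep(GRACE_TIME)                    # keep processing jobs for GRACE_TIME
        Process.kill('SIGUSR1', Process.pid) # stop accepting new jobs
        sleep(SHUTDOWN_WAIT)                 # give running jobs SHUTDOWN_WAIT seconds
        Process.kill('SIGTERM', Process.pid) # shut everything down
      end
    end
  end

  private

  def get_rss
    `ps -o rss= -p #{Process.pid}`.to_i # resident set size, in kilobytes
  end
end
```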

The second optimization would be to reduce SIDEKIQ_MEMORY_KILLER_GRACE_TIME, which is currently 15 minutes. That means we wait 15 minutes after writing the warning this thread will shut down PID #{Process.pid} - Worker #{worker.class} - JID-# in #{GRACE_TIME} seconds to the log. With the current implementation I see no reason to keep such a fat process running, so for now I propose setting it to a few seconds. Later we can improve this by adding one more get_rss call after GRACE_TIME, to make sure the memory was flooded by leaks and not by a single fat job that may free its resources afterwards; that would make our shots more accurate (sketched below). But I don't think this last refinement would have much practical benefit in real life.
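A minimal sketch of that refinement, reusing the hypothetical names from the sketch above (get_rss, the RSS limit, and the signals are assumptions, not the current middleware):

```ruby
# Sketch: measure RSS a second time after the grace period and only kill the
# process if memory is still above the limit, i.e. it was a leak rather than
# one fat job that has since freed its memory.
def shutdown_if_still_bloated(grace_time:, shutdown_wait:, max_rss:)
  Thread.new do
    sleep(grace_time)
    next if get_rss <= max_rss              # memory recovered: keep the process alive

    Process.kill('SIGUSR1', Process.pid)    # stop accepting new jobs
    sleep(shutdown_wait)                    # let running jobs finish
    Process.kill('SIGTERM', Process.pid)    # shut everything down
  end
end

def get_rss
  `ps -o rss= -p #{Process.pid}`.to_i       # resident set size, in kilobytes
end
```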

Summary

Now we have the following picture: the process exceeds the memory limit -> we write a warning to the log -> wait 15 minutes -> stop accepting new jobs -> wait 30 seconds -> shut everything down.

I propose setting SHUTDOWN_WAIT to 90 seconds (the default is 30) and GRACE_TIME to 10 seconds (currently 15 minutes); see the snippet after the list below. What should we expect?

  • Lower memory consumption, because we no longer keep fat processes alive for 15 minutes before killing them.
  • A more reliable queue, because jobs get more time to finish their work.
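Concretely, the proposal amounts to changing the two existing environment variables mentioned above; a minimal sketch, assuming they are set in the Sidekiq process environment (values in seconds):

```ruby
ENV['SIDEKIQ_MEMORY_KILLER_SHUTDOWN_WAIT'] = '90' # current default: 30
ENV['SIDEKIQ_MEMORY_KILLER_GRACE_TIME']    = '10' # current value: 900 (15 minutes)
```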

It should close https://gitlab.com/gitlab-org/gitlab-ce/issues/23646. I'm also sure everyone has seen lost jobs; I certainly have, many times.

/cc @jacobvosmaer-gitlab @pcarranza @jnijhof

REFERENCES:

https://github.com/mperham/sidekiq/wiki/Error-Handling

https://github.com/mperham/sidekiq/wiki/Reliability

Edited Sep 10, 2018 by Valery Sizov
Milestone: Next 3-4 releases
Reference: gitlab-org/gitlab-foss#36791