Geo discussion: Avoid long-running jobs

Long-running jobs:

  • extend the duration of zero-downtime updates
  • are likely to get killed ungracefully, because stopping them gracefully takes a lot of care, e.g.
    • when restarting the machine
    • when restarting Sidekiq

To stop or restart Sidekiq without killing a job in the middle of its work, we send TSTP to Sidekiq so that it stops fetching new jobs, and then wait for all jobs that are already running to finish.
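
The quiet-then-wait behavior described above can be illustrated with a small, self-contained sketch. It is not Sidekiq's implementation: `QuietableProcessor` and its methods are hypothetical names, and the TSTP signal is simulated by a job that calls `quiet!` while it is running. The point it shows is that a job that has already started is allowed to finish, while jobs that have not started are left alone.

```ruby
# Hypothetical stand-in for a job processor. This is not Sidekiq's code.
class QuietableProcessor
  attr_reader :completed

  def initialize(jobs)
    @pending = jobs.dup
    @completed = []
    @quiet = false
  end

  # What a TSTP handler would call: stop fetching, leave running work alone.
  def quiet!
    @quiet = true
  end

  # Runs jobs one at a time and returns the names of jobs never started.
  def run
    until @quiet || @pending.empty?
      name, work = @pending.shift
      work.call(self) # a job that has started always runs to completion
      @completed << name
    end
    @pending.map(&:first)
  end
end

jobs = [
  [:first,  ->(_processor) {}],
  [:second, ->(processor) { processor.quiet! }], # simulated signal mid-job
  [:third,  ->(_processor) {}]
]

processor = QuietableProcessor.new(jobs)
left_over = processor.run
puts "completed: #{processor.completed.inspect}, not started: #{left_over.inspect}"
```

The longer an individual job runs, the longer the wait between quieting and a safe stop, which is why long-running jobs extend zero-downtime updates.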

An ungracefully killed Sidekiq:

  • may orphan a lease
  • may cause bad data

Proposal

  1. Don't add new jobs that run long. Ideally a job exits within 25 seconds; if that isn't possible, the shorter the better.
  2. If we introduce reusable code to handle scheduler-like situations, we should try to refactor existing schedulers away from running long. (Edit: !22031 (merged) was merged.)
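
One possible shape for a scheduler that does not run long is sketched below. This is an illustration only, not the code from !22031: `BoundedScheduler`, its `budget_seconds` parameter, and the 20-second budget are all assumptions made for the example. The scheduler works through items until a time budget is spent, then returns whatever is left so that the caller can enqueue a fresh, equally short job for the remainder.

```ruby
# Hypothetical sketch: the class name, parameters, and budget are illustrative.
class BoundedScheduler
  def initialize(budget_seconds:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @budget_seconds = budget_seconds
    @clock = clock
  end

  # Processes items until the list is exhausted or the budget is spent.
  # Returns the items that still need processing, so that the caller can
  # enqueue a new short job for them instead of continuing to run.
  def run(items)
    deadline = @clock.call + @budget_seconds
    remaining = items.dup
    until remaining.empty? || @clock.call >= deadline
      yield remaining.shift
    end
    remaining
  end
end

# Usage with a fake clock: each item "costs" 10 seconds, the budget is 20.
fake_now = 0
scheduler = BoundedScheduler.new(budget_seconds: 20, clock: -> { fake_now })
processed = []
remaining = scheduler.run([1, 2, 3, 4, 5]) do |item|
  processed << item
  fake_now += 10
end
puts "processed: #{processed.inspect}, remaining: #{remaining.inspect}"
```

The budget is only checked between items, so this approach assumes each individual item is itself quick to process. The clock is injectable purely to keep the example deterministic.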

To do

  • Wait for #121697 (closed) to be closed
  • Smoke test !22051 (closed) on an Omnibus installation if it is nearing merge
  • Open a follow-up issue to remove the feature flag, including enabling and verifying it on a real installation (ops DR secondary?).
Edited by Michael Kozono