Geo discussion: Avoid long-running jobs

Long-running jobs:

- extend the duration of zero-downtime updates
- are likely to get killed ungracefully, because stopping them gracefully takes a lot of care, e.g.
  - when restarting the machine
  - when restarting Sidekiq

To stop or restart Sidekiq without killing a job in the middle of its work, we send TSTP to Sidekiq so that it stops picking up new jobs, and then wait for all running jobs to finish. The longer a job runs, the longer that wait becomes.
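The quiet-then-wait sequence can be sketched roughly as follows. This is an illustrative Python sketch, not how GitLab or Omnibus actually implements it: `quiet_then_wait` and the `busy_count` callback (a caller-supplied function that reports how many jobs are still running) are hypothetical names, and the timeout values are arbitrary.

```python
import os
import signal
import time
from typing import Callable


def quiet_then_wait(
    pid: int,
    busy_count: Callable[[], int],  # hypothetical: returns the number of running jobs
    timeout: float = 60.0,          # arbitrary example value
    poll_interval: float = 1.0,
    send_signal: Callable[[int, int], None] = os.kill,
) -> bool:
    """Ask a worker process to stop fetching jobs, then wait for it to drain.

    Returns True if all running jobs finished before the timeout, and
    False if some were still running (a restart now would kill them).
    """
    # TSTP asks Sidekiq to stop picking up new jobs; running jobs continue.
    send_signal(pid, signal.SIGTSTP)

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if busy_count() == 0:
            return True
        time.sleep(poll_interval)
    # One last check after the deadline has passed.
    return busy_count() == 0
```

The wait is bounded by the slowest running job, so a single long-running job either delays the restart for its full duration or pushes the caller past the timeout and into an ungraceful kill.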

An ungracefully killed Sidekiq process:

- may orphan a lease
- may cause bad data

Proposal

- Don't add new jobs that run long. Ideally a job exits within 25 seconds; if that isn't possible, the shorter the better.
- ~~If we introduce reusable code to handle scheduler-like situations, we should try to~~ (Edit: !22031 was merged.) Refactor existing schedulers so that they no longer run long.
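One way to keep a job short is to give it a time budget and have it hand off the remaining work. The sketch below is a minimal Python illustration of that idea, not the approach taken in !22031: `run_with_budget`, `handle`, and the resume index are hypothetical, and the 25-second default is simply the target named in the proposal.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")

# Assumed budget, taken from the 25-second target in the proposal.
TIME_BUDGET_SECONDS = 25.0


def run_with_budget(
    items: List[T],
    handle: Callable[[T], None],  # hypothetical per-item work function
    start: int = 0,
    budget: float = TIME_BUDGET_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[int]:
    """Process items from `start` until done or the time budget is spent.

    Returns None when everything was processed, otherwise the index to
    resume from, so the caller can enqueue a follow-up job.
    """
    deadline = clock() + budget
    for index in range(start, len(items)):
        if clock() >= deadline:
            return index  # out of time: stop cleanly between items
        handle(items[index])
    return None
```

A worker built this way would enqueue a new job with the returned index instead of continuing to loop. The budget is only checked between items, so a single slow item can still overrun it; each item has to be short on its own for the bound to hold.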

To do

- Wait for #121697 (closed) to be closed
- Smoke test !22051 (closed) on an Omnibus installation if it is nearing merge
- Open a follow-up issue to remove the feature flag, including enabling and verifying it on a real installation (ops DR secondary?)

Edited Jan 03, 2020 by Michael Kozono