Geo discussion: Avoid long running jobs
Long running jobs:
- extend the duration of zero-downtime updates
- are likely to get killed ungracefully, because stopping them gracefully takes a lot of care, e.g.
  - when restarting the machine
  - when restarting Sidekiq
To stop or restart Sidekiq without killing a job in the middle of its work, we send TSTP to Sidekiq and then wait for all running jobs to finish. The longer a job runs, the longer that wait can take.
And an ungracefully killed Sidekiq:
- may orphan a lease
- may cause bad data
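The quiet-then-wait sequence described above can be sketched roughly as follows. This is a minimal illustration, not GitLab's or Sidekiq's actual tooling: `quiet_and_wait`, its parameters, and the way the busy count is obtained are all hypothetical, and the timeout values are placeholders.

```ruby
# Hypothetical helper: send TSTP so the process stops picking up new jobs,
# then poll until no jobs are running or the timeout expires.
#
# pid        - process ID of the Sidekiq process (assumed known to the caller)
# busy_count - callable returning how many jobs are still running
#              (assumption: the caller has some way to obtain this)
# signal     - callable that delivers the signal; defaults to Process.kill
def quiet_and_wait(pid, busy_count:, timeout: 60, poll_interval: 1,
                   signal: Process.method(:kill))
  signal.call("TSTP", pid)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  loop do
    # All running jobs have finished; it is safe to stop the process.
    return true if busy_count.call.zero?
    # Jobs are still running and we are out of time; the caller must decide
    # whether to keep waiting or accept an ungraceful kill.
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep poll_interval
  end
end
```

A `false` return is exactly the case this issue wants to make rare: a job ran longer than the restart was willing to wait.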
Proposal
- Don't add new jobs that run long. Ideally a job exits within 25 seconds; if that isn't possible, the shorter the better.
- Refactor existing schedulers away from running long. (Originally conditional on introducing reusable code to handle scheduler-like situations. Edit: !22031 (merged) was merged.)
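One way a job could stay within the 25-second target is to work against a time budget and hand back whatever it did not get to. The sketch below is an assumption about what that might look like, not the approach taken in !22031; `process_within_budget` and `MAX_RUNTIME_SECONDS` are hypothetical names, and the injectable `clock` exists only to make the helper easy to test.

```ruby
# Hypothetical time budget, taken from the 25-second target in the proposal.
MAX_RUNTIME_SECONDS = 25

# Process items one at a time until the list is empty or the budget is spent.
# Returns the items that were NOT processed so the caller can schedule a
# follow-up job for them instead of running long.
def process_within_budget(items, budget: MAX_RUNTIME_SECONDS,
                          clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
  deadline = clock.call + budget
  remaining = items.dup

  until remaining.empty?
    # Check the clock before starting each item, never in the middle of one.
    break if clock.call >= deadline

    yield remaining.shift
  end

  remaining
end
```

A caller would pass the work as a block, for example `leftover = process_within_budget(ids) { |id| handle(id) }`, and enqueue a new job for `leftover` if it is not empty. Note that the budget is only checked between items, so a single slow item can still overrun it.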
To do
- Wait for #121697 (closed) to be closed
- Smoke test !22051 (closed) on an Omnibus installation if it is nearing merge
- Open a follow-up issue to remove the feature flag, including enabling and verifying it on a real installation (ops DR secondary?)
Edited by Michael Kozono