Handle overlapping on-call shifts when persisting past rotations
What does this MR do and why?
Related issues:
- Changing restricted times on rotations fails if... (#329265 - closed)
- Investigate Monitor:Respond error budget (2023-... (#430793 - closed)
Context:
- For on-call rotations managed in gitlab, historical shifts are persisted to the database in a cronjob. (
IncidentManagement::OncallRotations::PersistShiftsJob
- There are 4 on-call rotations in production for which saving shifts fails with error
Shift timeframe cannot overlap with other existing shifts
; This is making the grouprespond error budget very🔴 - On-call rotations can have an 'active period' for which the values are timezone-dependent. When a the timezone of an on-call schedule is updated, the actual on-call shifts shouldn't be impacted. So we update the on-call rotation's attributes to make sure that stays true.
Changes in this MR:
Before | After | Why |
---|---|---|
When an on-call scheduled timezone is changed, shifts are saved using the new timezone & old rotation attributes (BUG). Attempt to save a shift with the new timezone & new rotation attributes. | When an on-call scheduled timezone is changed, shifts are backfilled using the old timezone & rotation attributes; Defer saving shifts with the new timezone & rotation params to the cronjob | Fixes root cause of bad data |
PersistShiftsJob generates shifts assuming the last saved shift and the next shift can't overlap | PersistShiftsJob trims the next shift to start when the last saved shift ends (if they overlap) | Ensure existing bad data gets fixed |
How to set up and validate locally
See #329265 (closed) for repro.
The gist:
- Create an on-call schedule in a project (
Monitor > On-call Schedules
) - Add a rotation with an active period
- Change the schedule's timezone between a bunch of different things & make sure it doesn't error
- After each change, run
IncidentManagement::OncallRotations::PersistShiftsJob.new.perform(IncidentManagement::OncallRotation.last.id)
from rails console to make sure it doesn't error
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Related to #329265 (closed)
Edited by Sarah Yasonik