Skip to content

Handle overlapping on-call shifts when persisting past rotations

Sarah Yasonik requested to merge sy-fix-oncall-schedule-persistence into master

What does this MR do and why?

Related issues:

Context:

  • For on-call rotations managed in gitlab, historical shifts are persisted to the database in a cronjob. (IncidentManagement::OncallRotations::PersistShiftsJob
  • There are 4 on-call rotations in production for which saving shifts fails with error Shift timeframe cannot overlap with other existing shifts; This is making the grouprespond error budget very 🔴
  • On-call rotations can have an 'active period' for which the values are timezone-dependent. When a the timezone of an on-call schedule is updated, the actual on-call shifts shouldn't be impacted. So we update the on-call rotation's attributes to make sure that stays true.

Changes in this MR:

Before After Why
When an on-call scheduled timezone is changed, shifts are saved using the new timezone & old rotation attributes (BUG). Attempt to save a shift with the new timezone & new rotation attributes. When an on-call scheduled timezone is changed, shifts are backfilled using the old timezone & rotation attributes; Defer saving shifts with the new timezone & rotation params to the cronjob Fixes root cause of bad data
PersistShiftsJob generates shifts assuming the last saved shift and the next shift can't overlap PersistShiftsJob trims the next shift to start when the last saved shift ends (if they overlap) Ensure existing bad data gets fixed

How to set up and validate locally

See #329265 (closed) for repro.

The gist:

  1. Create an on-call schedule in a project (Monitor > On-call Schedules)
  2. Add a rotation with an active period
  3. Change the schedule's timezone between a bunch of different things & make sure it doesn't error
  4. After each change, run IncidentManagement::OncallRotations::PersistShiftsJob.new.perform(IncidentManagement::OncallRotation.last.id) from rails console to make sure it doesn't error

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #329265 (closed)

Edited by Sarah Yasonik

Merge request reports