Need to rotate new IMs into the rotation and rotate others out. The current IMs have the best understanding of the experience. Collect feedback from current Incident Managers based on their experience, and get their input on potential changes.
Next Actions
Collect feedback
Decide on any changes to the schedule/rotation - create MR (if needed)
Rotate new IMs into the rotation
Proposal
(would like to get some initial feedback on this before progressing to an MR)
One of the key pieces of feedback so far is that shift rotations are too frequent. Additionally, some shifts are clearly busier than others. See evaluation of number of PD events per shift per month.
Proposal tl;dr:
transition two of our shifts from 4 hours to 6 hours in duration. Result: eliminate 8 shift slots completely.
increase the number of IMs in each shift rotation from 4 to 6. Result: increase time between rotation shifts.
increase the duration of each set of shifts from 3 days to 4. Result: increase time between rotation shifts.
overall results:
with the same overall number of team members (24), shifts will occur less frequently. Instead of having 9 days in between sets of shifts, there will be 19 days.
each set of shifts will be 4 days instead of 3
two shifts will be longer (6 hours) during the time of day when there is less activity (from 2300 -> 0500 and 0500 -> 1100 UTC)
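For anyone who wants to check the spacing arithmetic, here is a minimal sketch. It assumes a plain round-robin model in which the IMs in one shift rotation take turns in a fixed order, each covering one set of consecutive days; the function name `days_between_sets` is made up for illustration, and this is not how PagerDuty itself computes the schedule. Under this model the current setup gives 9 days off, and the proposed one gives 20, which lines up with the "20 days off-shift" figure a commenter mentions in the thread; the 19 quoted above may simply count the boundary days differently.

```python
def days_between_sets(ims_per_rotation: int, days_per_set: int) -> int:
    """Days off between one IM's consecutive sets of shifts.

    Assumption: the IMs in a rotation take turns in a fixed order,
    each covering one set of `days_per_set` consecutive days.
    """
    cycle_length = ims_per_rotation * days_per_set  # days until the same IM is up again
    return cycle_length - days_per_set


# Current setup: 4 IMs per rotation, 3-day sets.
current = days_between_sets(ims_per_rotation=4, days_per_set=3)
# Proposed setup: 6 IMs per rotation, 4-day sets.
proposed = days_between_sets(ims_per_rotation=6, days_per_set=4)

print(f"current: {current} days off, proposed: {proposed} days off")
```

The same function can be reused to estimate the gap for a larger pool by changing `ims_per_rotation`.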
Need to rotate new IMs into the rotation and rotate others out.
@sloyd out of curiosity, how many eligible team members do we have for IMOC? My understanding is it's the total number of Directors / Sr EMs / EMs from Development and Infrastructure + some Staff Engineers.
I'm asking because I'm trying to understand how often team members will be rotated out if each rotation has 24 team members.
There are some details about some individual contracts as well, but in general your description is correct. The other item to consider is those who are in other on-call rotations - we don't want anyone doing more than one.
Taking the above into consideration this nets to approx 90 team members total for IM rotation.
I think shift swaps are easier among a smaller, more cohesive group (i.e. Reliability, or Infrastructure) than a bigger group (Infrastructure + Engineering), weirdly. I think when there's a bigger group there tends to be more of a bystander effect, but also less of an 'I know that person and they work on things related to me, I'll help them out' - which may be the same thing.
Hopefully the reduction in shift numbers will help, and same with onboarding more IMs, but I wonder if we need to come up with something to improve the swap situation. There are quite a few unfilled swap requests in #imoc_general (full disclosure, this includes some for me) and it's also quite hard to keep track of.
I'm not suggesting changing anything now as we're already making changes, but you did ask for misc feedback
@smcgivern thanks, this is great feedback. For the record, I have some shift coverage to ask for myself and haven't done so
I think one thing we can do here is to clarify an expectation that everyone who has completed onboarding should assist with coverage needs, this includes those not currently in the active rotation. I realize there can be some cost for an individual of catching up on current events, but believe that the benefit of easier coverage for those in active rotations would be worth that.
everyone who has completed onboarding should assist with coverage needs, this includes those not currently in the active rotation.
I think this is a good idea, and my hope is that the next rotation will have a better time of this because the current rotation will still be around picking up shifts.
@sloyd overall I think this is a positive change, as mentioned before I found current rotations too frequent and it impacted my plans more than I would wish.
I haven't been paged for an incident during my shifts yet - I'm not sure what the "X"s in number of PD events per shift per month are based on. Maybe this also tracks S3 and other incidents where the IM is not actually paged? (Oops, sorry, I was in the wrong tab - these are weeks with shifts, and "Event volume" is the right tab. I think the low number of incidents explains why I wasn't paged.)
I found current rotations too frequent and it impacted my plans more than I would wish.
My experience so far has been that the shifts are short enough that I fail to plan properly for them and frequent enough that I try to just account for them in my overall capacity. Neither of these works well and I commonly end up tight for capacity on other, common tasks. So I'm supportive of a change to less frequent, longer shifts.
I wonder if we could have a 48-72 hour time period where everyone puts in the #imoc_general channel shifts they want to swap. Maybe we could do this June 28-July 1. Timeboxing this may make it easier for people to swap shifts.
This would bring everyone's attention to looking at their calendar and get everyone comfortable with adjusting the schedule. We would continue to have one-off shift swapping that occurs in the chat room, but this may reduce some of the need for it.
I realized that the rotation system that PagerDuty puts in place sometimes gives me Saturday shifts (I'm in APAC), but there are still people online in AMER timezones. Conversely, there are people in AMER that are getting Sunday shifts, when I'm online Monday in APAC. For me, I generally would rather pick up a later or earlier shift on a workday than tie up a weekend day. Also, when we give people weekend shifts it creates a "time-in-lieu" liability for GitLab. Ultimately there is no way around some weekend shifts, but we may be able to reduce some of them further. I don't have a great solution for this other than maybe point 1 could help minimize this scenario.
have a 48-72 hour time period where everyone puts in the #imoc_general channel shifts they want to swap
@sethgitlab that is a great idea. I'll look to incorporate that into the plan.
On your other point, PagerDuty has long been considering more features around full workforce management, but no specifics or timeline are available (to my knowledge). That would provide a more sophisticated way to match team members' preferences with available shifts. I don't have any recommendation or shorter term solution for this though.
I'm apprehensive about this, but it's balanced by more IMs in a rotation, so the effect is that I will have a 4-day shift followed by 20 days off-shift. This seems OK.
4 days increases everyone's likelihood of having a weekend shift when their days come up, but the 19 days off will keep individuals from losing multiple weekends in a row, like what happened recently with the temporary shift changes, which resulted in the folks on weekends being on for the weekend 5 weeks in a row. With 19 days between, even with some minor shift changes we shouldn't easily get into that situation again.
The current local shift time for me (Ireland) is incorrect, it should be 0800-1200 (rather than 9am - 1pm) and that's only because of daylight savings. In winter it will go back to 0700-1100.
That means the new shift would be 5am - 11am local time. This would also be true for UK folk too.
If it results in a page at 5am every now and then it's not a big deal. But any Slack notifications won't get to me at that time and a regular 5am start would be challenging.
@sloyd local times are not correct (all should be -1): | jprovaznik | 9am - 1pm | 7am - 1pm |
Expansion to earlier hours is fine by me (expanding to afternoon hours would be a problem), so overall this seems like a positive change if shifts will be less frequent.
@johnhope @jprovaznik thanks for noting the time mistake. I believe I've corrected it now (not yet accounting for the DST switch)
It is a good point that within the few hours of difference within a given region the shift can be a lot different.
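Since local times and the DST switch have already tripped us up once, here is a small sketch for converting the proposed 0500 -> 1100 UTC shift into local time. It uses Python's standard `zoneinfo` module; the helper name `local_shift` and the sample dates in 2022 are assumptions chosen for illustration, and the zones are just the Ireland and UK cases mentioned above.

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo


def local_shift(start_utc: time, end_utc: time, day: date, tz_name: str) -> str:
    """Render a UTC shift window in local time for a given calendar day."""
    tz = ZoneInfo(tz_name)
    start = datetime.combine(day, start_utc, tzinfo=timezone.utc).astimezone(tz)
    end = datetime.combine(day, end_utc, tzinfo=timezone.utc).astimezone(tz)
    return f"{start:%H:%M}-{end:%H:%M}"


# Sample days (assumed): one during summer time, one during winter time.
summer = date(2022, 7, 1)
winter = date(2022, 12, 1)

for tz_name in ("Europe/Dublin", "Europe/London"):
    print(tz_name,
          "summer:", local_shift(time(5), time(11), summer, tz_name),
          "winter:", local_shift(time(5), time(11), winter, tz_name))
```

If I have this right, the 0500 UTC start lands at 06:00 local for Ireland and the UK during summer time and at 05:00 local in winter.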
If it results in a page at 5am every now and then it's not a big deal. But any Slack notifications won't get to me at that time and a regular 5am start would be challenging.
Fully support this - the IM role isn't meant to require an active check-in at the beginning of a shift (unless it is convenient), but instead to be ready to respond if paged. For example, my most frequent shift goes until midnight local time and I'm usually not directly engaged unless paged. I've also covered the shift that starts at 4am local time (for me) and I didn't get up at 4am - but would have if I were paged.
I looked at all the pager duty events since Oct 2021 and in the 0500-0800 UTC period there were 7 events. So, there was some activity, but at least looking historically, this shouldn't result in a large % of early morning demands.
I'm okay with the rotation change as @sloyd, @kwanyangu, @jarv and I have all been in multiple rotations and have had many shifts overlap (and run into weekends) - there were 4 or 5 weekends in the last few months where I finished one 3 day set and started another 3 day alternate set. Having some more space between rotations and hopefully not having us cover multiple rotations would be useful.
I looked at all the pager duty events since Oct 2021 and in the 0500-0800 UTC period there were 7 events. So, there was some activity, but at least looking historically, this shouldn't result in a large % of early morning demands.
@sloyd : Why have only a rotation of 24 instead of spreading the load over the whole 90? The workload is high on many (including @dcroft and me, who do 3 shifts every 10 days).
This would also allow more team members to understand and appreciate incidents that impact our users (tangentially related to our goals to increase "user empathy journeys")
@whaber with the concern addressed in the agenda regarding people who are not ready to fulfill the shift, do you recommend that we put the whole ~90 into the rotation? Similar to the dev on-call, this would effectively mean 1-2+ months between shifts. That seems neither so short that people burn out nor so long that they lose touch.
@whaber @m_gill The concern with this is that there would be too much time in between active shifts and that team members would need more effort each time to get back up to speed. However, the benefits of having a larger pool could certainly be more valuable.
Since the majority of staffing for IM role comes from Development, we should prioritize how you'd like your team members involved. I see from the meeting notes the leadership is all noting this should be tried.
I'll work up another proposal that is based on everyone in the pool at once. Note that the time in between shifts isn't expected to be equitable across the shifts - we'll still have fewer participants for those shifts aligned with APAC.
Also noted (in this issue description) a proposed schedule for implementing these changes. Proposing the expansion of the two adjacent shifts to 6 hours starting on June 15, then the change to 4 day blocks and adding all eligible IMs on June 27.
Created this sheet with all team members meeting the description in the MR. Will need to get the preferences noted. @timzallmann @cdu1 @sgoldstein @whaber @marin can you assist with this for all your team members? I started filling it in with everyone already on the schedule.
@sloyd thanks, that's exciting! How will this work for the existing shift swaps that some people have put in for July and August (not me; I am not organised enough).
I also deleted Nick Thomas and Andreas Brandl from that sheet as they no longer work here. There might be others, too.
How will this work for the existing shift swaps that some people have put in for July and August
@smcgivern This will mess those up. Once the selections in the shift preferences are completed I'll map out what the schedule will look like for the next two months so that everyone can see it ahead of time.
We'll need to interrogate each of the existing overrides that are on or after June 27. I expect most of the existing overrides won't be needed, but that we'll have asks for new ones.
Once the selections in the shift preferences are completed I'll map out what the schedule will look like for the next two months so that everyone can see it ahead of time.
Thanks @sloyd, I'm glad my laziness is paying off!
I wanted to draw your attention to Incident Timelines, the MVC which is being released in 15.1. We are hopeful that this replaces the highly manual timeline that's in the incident template today, makes it easier to capture what happened during an incident, and lets the incident serve as the SSOT.
As you manage and help resolve the next incident, we'd love your feedback as you use the feature (please use gitlab-org&6376).
We hope to officially incorporate the capability as part of our incident response process in the near future.
@kbychu This is very cool, and we probably should open a new issue to figure out how to dogfood this properly, since we probably don't want to take over this issue with a lengthy discussion.
It does look like there may be some experimentation with Incidents but as you can see most incidents are in Issues. It also looks like we probably need to turn on the feature flag for the timeline, as I'm not seeing the capability there yet.
EDIT:
I realized after writing the above that Incidents are actually a type of Issue, and we are in fact using Incidents. I've always accessed Incidents through Issues, so I did not realize they show up in two places: Issues and Monitor > Incidents. It looks like the timeline feature could be dogfooded without changing much in our current process.
The next set of changes has been added to PagerDuty so that everyone can see the schedule ahead. It is important to move ahead to this step so that we can start to move on coverage where needed.
The updates to the schedule include:
Renamed it to “Incident Manager - GitLab.com SaaS”
Removed cases where some people were on multiple shifts (some on up to 3). Nobody is on more than a single shift now.
Adjusted the shift blocks to 4 days instead of 3
Added new incident managers
As noted in the last point, new incident managers have been added. Some still need onboarding help (we’ll work on that).
There will still be more added, this is just the start.
Ask: Please look at the resulting schedule to ensure you are available; if you aren’t, ask for coverage here in this channel. At this point only look until Aug 1. Past Aug 1 will still change as we add more team members to the rotation. Already the shifts are more than a month apart.
When logging in to the PagerDuty mobile app, I'm asked to select a Service Region (US login or EU login), and for SSO choose either US SSO Login or EU SSO Login.
What should APAC team members choose here? (or does it not really matter?)