For the MVC, people will only be able to escalate to an already-created schedule (see proposal here). We should follow this MVC with the additional option to escalate to a specific user.
Release notes
When creating the escalation policy for alerts, it is useful to be able to email a specific user. In this release, we added the ability for teams to define exactly whom to contact, even if that specific user is not part of the on-call rotation.
@crystalpoole - here's the issue for allowing the escalation to a specific user. While not part of the MVC for escalation policies, it feels like it should probably be a "fast-follow" to it. Maybe something to consider for 14.1?
This should be part of the first batch of improvements after the MVC, so let's schedule it in 14.1 for now and I can push it out if our priorities change.
@ohoral My backend MR above is functional, but needs more work to be ready.
If we release the backend first then people can use the API before the frontend is ready, which is the risk here.
It's probably not a big deal so long as they both go out in the same release.
While looking through the backend MR for this, it seems the data model for this is getting really complicated. The idea of adding a user to an escalation rule results in a lot of different code paths to manage, since the escalation rule is the starting point of a lot of different behavior on the backend. So it means supporting either a user or a schedule on a lot of different objects.
I'm really wondering if this is the direction we should be going at all.
Some thoughts I have on this:
If there are scenarios in which an individual needs to be notified directly, it's usually because that individual is head of engineering or a particular team. It's not actually that it matters who they are, but which role they represent. That sounds a lot like a schedule with a single participant to me.
If it's a matter of convenience for setting up policies in that flow, we could start by allowing a user to be selected at rule-creation time, then simply set up a schedule with a single participant called 'Direct to ' or something. We'd have a lot fewer code-paths by re-routing the notification at the schedule-level than within the escalation-flow.
this would also facilitate hand-offs at a scheduled date & time in case of reorganization
this would let the user have overrides in case of vacations, etc
with this direction, we could even look into ways to make the user-based-schedule a first-class citizen of sorts?
this would avoid the issue of "what do we do with users removed from the project?" since we've already addressed it via schedules
the very simplest version of this approach could actually even be frontend -only
An alternate proposal
Add support for an always option for rotation lengths
This would go along with a requirement of only a single person on the rotation
From the EscalationRule UI, allow users to search/add a direct-to-user escalation rule, but auto-setup an oncall schedule for that user in UTC, with a rotation that they are always on
If you view the saved escalation rules or edit, you'll see the schedule created based on the user, rather than the user token
If that rendering option doesn't work, we could allow an OncallSchedule to be associated with an individual user (only in the DB), and conditionally render the user token/Email user dropdown menu option from the policy modal/etc. This'd also let us sort the on-call schedule index view so that the user-associated schedules would also be listed at the end if we wanted, since they'll likely have somewhat infrequent interactions
@ohoral@crystalpoole - any thoughts on this? It's not my intention to slow this issue down with @seanarnold & @ameliabauerly OOO, but I just want to make sure we're not taking our data model/UX down a path that's really difficult to maintain.
Thanks for this alternate proposal, @syasonik. I feel very hesitant to change the design at this point. I'm not convinced that the always/direct to option will be intuitive to users. It's an approach that hasn't been tested or validated, so it feels pretty risky to me to move forward with it. On the other hand, we did test the proposal we are implementing thus far, so I do feel confident people will be able use it as expected/intended.
I'd also note that these are just the first of two options we originally discussed adding to this dropdown menu. I know that Opsgenie, for example, also offers the option to route to the next on-call user in schedule, all members of a team, admins of a team, etc, in addition to a schedule and to a single user. So, with this first addition to that dropdown, we're setting ourselves up for the possibility of expanding the escalation options, in future, based on what people need. It feels a little worrisome if the model is already so complicated that we can't expand on it. Does that mean all future additions to that dropdown will also be off the table?
The design as is does feel like the right approach from an experience POV in that it's straightforward and tested well. So I guess I wonder if there anything else that can be done to help address the data model concerns that doesn't impact the experience so heavily and that will allow us to continue to expand upon the escalation options in future?
If not, and if it's actually unlikely we'll be able to expand the options within this menu going forward, I think we'd be better off not trying for a complicated UI workaround and just force people who need this workflow to create single user schedules from the get-go, which I believe was your original preferred approach. If we decide to go this route instead, we can rework the escalation policy modal so this dropdown is just text rather than a field since we'll only ever have one option available. That feels a bit limiting this early on in the game but, if that's what's going to be realistically maintainable for us, then we can adjust the UI to better reflect that reality.
It's an approach that hasn't been tested or validated, so it feels pretty risky to me to move forward with it. On the other hand, we did test the proposal we are implementing thus far, so I do feel confident people will be able use it as expected/intended.
I'd also note that these are just the first of two options we originally discussed adding to this dropdown menu. I know that Opsgenie, for example, also offers the option to route to the next on-call user in schedule, all members of a team, admins of a team, etc, in addition to a schedule and to a single user. So, with this first addition to that dropdown, we're setting ourselves up for the possibility of expanding the escalation options, in future, based on what people need. It feels a little worrisome if the model is already so complicated that we can't expand on it. Does that mean all future additions to that dropdown will also be off the table?
This is ridiculously helpful context! 😀 Thanks for all the detail. We don't need to rule out any future additions at all - even knowing the intention for the feature can make it much easier to "see" possible database architectures, if that makes sense.
I think we can definitely continue with the UX plan as-is (as well the API in !65208 (closed)), but I have a few backend-related ideas I'll ruminate on over the weekend and try to work in on Monday.
Sorry for the noise and pulling you away from OOO @ameliabauerly! I really appreciate the perspective. 💯🙏🙇
No worries at all! I appreciate your raising these points, @syasonik, and for ensuring we have a solution that is maintainable in the future 💪
This is ridiculously helpful context! 😀 Thanks for all the detail. We don't need to rule out any future additions at all - even knowing the intention for the feature can make it much easier to "see" possible database architectures, if that makes sense.
Okay, that's really great to know. Let us know if you have ideas for improving things here. I know we're sort of setting the stage for everything that will follow so it's super important we figure these details out, and get them right. Thank you for your close attention and care with this, I really appreciate it! 🙇
The use case you talked about with notifying roles is a good one that I can see people using. Like you said this can be achieved with regular schedules and overrides.
I’m not sure I agree with the premise that the DB model is getting too complicated, so I’m keen to see what your alternative solution might be 😃.
it looks like the discussion resolved by this time and as frontend is almost ready this change would be very unexpected at least. 😰
Yeah, the frontend MR won't need to change - sorry for the noise @ohoral
I’m not sure I agree with the premise that the DB model is getting too complicated, so I’m keen to see what your alternative solution might be 😃.
@seanarnold My main concerns are 1) too many optional attributes and 2) data duplication.
Many optional attributes can lead to surprising behavior and missed edge cases. A lot of data duplications means we're having to keep a lot of info up-to-date across tables, which leads to lower maintainability (the BE MR has 6 migrations which all introduce complexity - and we haven't yet introduced incident escalations). In total, we'll have a lot of edge cases in a lot of repeated locations.
Modifications I'm hoping to make to the current backend MR
Especially knowing that there could be more types of rules in the future, I think it makes sense to:
Make the rule_id required on Escalations (w/ cascading delete)
Remove schedule_id from Escalations entirely (& possibly status as well? depending on query performance, I think)
Retain a triggered_at (or comparable) on Alert/IssueEscalationStatus
On changes to escalation policies/rules, reset Escalations according to triggered_at, the new rule definitions, and the current time
I think we'd talked about this^^ approach initially, but decided to duplicate attributes instead, thinking it would be simpler for handling policy updates. Between the complexity of this MR & the previous MR for adding system note for escalations, I'm thinking we might have made a misstep.
However, walking back that decision would mean we could offload a lot of conditionals onto the rule, containing the surface area of optional-attribute-logic. It would remove the data duplication as well, since the rule model would instead act as the single point which routes the flow, instead of dividing it between the rule and the escalation. And if we make the changes now, we lower the impact of the data for us & users.
Why I was thinking that the alternate proposal would improve the data model
The goal of the counter proposal in #330648 (comment 617899773) was to take the same idea of a single notification routing point, and contain it to the schedule, rather than the escalation rule. Since the schedule already acts as the "router" for rotations/shifts/soon-to-be-overrides, that would make some sense. But it would also mean the rule & escalation models could stay unchanged. More types of escalations rules would likely push us away from this approach, though.
We could potentially make "types of schedules" look like "types of rules" and keep the API/UI as-is while routing everything to (hidden) schedules anyways, but that'd probably just lead to communication breakdowns more an anything else and not ultimately improve anything.
Likely future direction of the data model
The other idea I was thinking about was breaking up EscalationRule into multiple subtypes - ie) ScheduleEscalationRule, UserEscalationRule, RoleEscalationRule etc. This approach would elimniate the bulk of the optional attributes, but would make the notification routing piece more complex. So it's honestly overkill for our current needs, though that may change down the line as escalation rules grow.
the BE MR has 6 migrations which all introduce complexity
The migrations are pretty simple, just adding a column each to two tables, and removing a null constraint. Our DB migration policies which cause them to be split into multiple MRs.
Duplicate data
In general I don't disagree with you, however there could be some reasons to keep it duplicated:
If an escalation policy is modified, existing alerts should follow the original escalation rules
If we are ok with not doing that, that's fine but I wanted to raise we're changing behaviour here too.
Any complexity with optional attributes can be contained in a shared model concern, and shared between the two models.
On changes to escalation policies/rules, reset Escalations according to triggered_at, the new rule definitions, and the current time
Can you elaborate on what you mean by this? I assume you mean delete Escalations that are no longer valid. Would you also create escalations row which might now be valid under the rules -- I'm not sure that's required or expected. I'd expect it'd only affect new alerts.
Likely future direction of the data model
The other idea I was thinking about was breaking up EscalationRule into multiple subtypes - ie) ScheduleEscalationRule, UserEscalationRule, RoleEscalationRule etc
Yeah I was thinking this too. We might be able to get away with keeping them in a single table, and just applying different validation rules and whatnot based on a type without going down the rabbithole of a full STI🤔.
Overall I'm ok either way, as long as we acknowledge we're changing behaviour now for escalations before a rule edit.
I see you've gone ahead and started on this transition !65635 (merged)👍 .
We could perhaps add do it in the reverse too, with adding the ability to mention a user and then removing the column & changing logic to use the rule only if that's how we want to go. It might unblock us easier, however probably not worth the extra work to do that.
The migrations are pretty simple, just adding a column each to two tables, and removing a null constraint. Our DB migration policies which cause them to be split into multiple MRs.
@seanarnold Oh, totally agreed! I don't think those migrations themselves are particularly complicated. It's that the resultant data structure gives me pause.
If an escalation policy is modified, existing alerts should follow the original escalation rules
If we are ok with not doing that, that's fine but I wanted to raise we're changing behaviour here too.
That was a LOE-based decision. So I think we're ok to take it either direction and long as we put it in the docs correctly.
Any complexity with optional attributes can be contained in a shared model concern, and shared between the two models.
I think it's basically N+1 models (number of pending escalation tables plus escalation rules). Shared code can help, but I'm not loving the prospect of maintaining that. When we make future changes to either escalation rules or escalations, it'll mean updating any/all of those tables. That's a lot of places for things to go wrong when the feature set is so new. 🤔
Can you elaborate on what you mean by this? I assume you mean delete Escalations that are no longer valid. Would you also create escalations row which might now be valid under the rules -- I'm not sure that's required or expected. I'd expect it'd only affect new alerts.
When an escalation policy is updated, I'm thinking that we'll have 3 outcomes:
Changes to rules for a policy
Impact to pending escalations
Existing rules that are being deleted
Corresponding escalations will be cascading deleted
Existing rules that are left unchanged
No impact
New rules
A pending escalation will be created for each alert which still has pending escalations
So this would roughly mean:
We wouldn't create pending escalations for alerts created before policies were created
We wouldn't create pending escalations for alerts which haven't been resolved/acknowledged, but were "triggered" > 24 hours ago or have already "completed" the configured escalation policy
We could perhaps add do it in the reverse too, with adding the ability to mention a user and then removing the column & changing logic to use the rule only if that's how we want to go. It might unblock us easier, however probably not worth the extra work to do that.
Yeah, I think it wouldn't been great to have technical planning time for this issue in advance to iron out the data structures & ordering in advance of starting work... oh well 🤷
A pending escalation will be created for each alert which still has pending escalations
Oh yeah that's a nice to have. Like I said I wouldn't expect that to happen as a user, but it might add some delight 😃. If it's tricky to implement we could potentially drop that, but by the looks of your MR(s) you have it under control 💪.
Oh yeah that's a nice to have. Like I said I wouldn't expect that to happen as a user, but it might add some delight
Hmm, I'd interpreted https://gitlab.com/gitlab-org/monitor/monitor/-/issues/56#note_505162527 to mean that it is required that that either the old escalation policy or the new policy is followed when the policy is updated for alerts which were created after the policy. Otherwise we'd basically be "dropping" notifications from the user perspective? I dunno. Maybe I misunderstood... Either way, it's implemented now so 🤷
@syasonik or @ohoral - in #334739 (comment 616386938) we discussed the situation where the user who is involved in the escalation policy is deleted. We have an existing warning message for this scenario for on-call schedules:
I imagine we'd want an additional bullet added to this confirmation modal if the user is also listed as part of an escalation policy. Do we need a separate issue for adding this bullet? Or, has that work already been completed as part of this issue? Wasn't sure so wanted to double check. Happy to do whatever makes most sense :)
Is there an environment where I could go to look at this feature?
I am not able to find documentation around this change. Did I miss it? Or is this something I could help with so we can reference the documentation in the release post?
@abellucci I was also unable to find documentation for this. I just added the documentation label to this issue.
@syasonik@ohoral You can add the docs for this in the existing Escalation Policies doc. It doesn't look like it will take much--we just need to cover how to configure an escalation policy to email a specific user. Just let me know if you need any assistance 🙂
Is there an environment where I could go to look at this feature?
Yes @abellucci! You can check out monitor-sandbox on staging (to which I've just added you as a Maintainer) or tanuki-inc on production. These are the projects that ~"group::monitor" generally uses to test/demo features. (There's a monitor-sandbox on production too, which is used in the same way, but less frequently.)
@syasonik - Thanks! I overlooked the option to toggle right after the THEN statement.
. Thanks for putting together an MR for the documentation too. ;)