We have multiple-day-breached FRT tickets that do not get the level of visibility needed to prioritise them.
What is the problem?
For a variety of reasons, we sometimes have new tickets that go multiple days without a first response comment being made on them.
Why is this a problem?
Having a ticket go multiple days without a response is a very poor customer experience and can lead to escalations (STARs) and frustration for both customers and team members.
Proposal
Create an alert with a list of all new tickets that are older than 2 days with no response. This alert will go to managers, who will be responsible for finding an assignee for the ticket and prioritising a response to it.
A Very Breached ticket is any ticket that has gone for more than 2 days with no first response made on it.
The alert is triggered daily with an attached list of tickets.
The alert is made in Slack in either the #spt_gg_forest or #spt_managers channel and [at]mentions all managers globally OR
The alert is made as a STAR digest (i.e. a single STAR entry with a link to all Very Breached tickets) in the #support_ticket-attention-requests channel and [at]mentions the regional on-call managers
Support Operations help may be needed to automate the list of very breached tickets and implement the alerting solution.
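As a rough illustration of how such a list could be generated, here is a minimal sketch, assuming the standard Zendesk Search API is used and that a ticket still in "new" status is a reasonable proxy for "no first response yet"; the subdomain, credentials and query details are placeholders, not the actual implementation:

```python
# Minimal sketch (not the actual implementation): find tickets created more
# than `days` ago that are still in "new" status, used here as a rough proxy
# for "no first response yet". Subdomain and credentials are placeholders.
import os
from datetime import datetime, timedelta, timezone

import requests

ZENDESK_BASE = "https://example.zendesk.com/api/v2"  # placeholder subdomain
AUTH = (f"{os.environ['ZD_EMAIL']}/token", os.environ["ZD_API_TOKEN"])


def very_breached_frt_tickets(days=2):
    """Return tickets older than `days` with no first response (approximate)."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=days)).strftime("%Y-%m-%d")
    query = f"type:ticket status:new created<{cutoff}"
    resp = requests.get(
        f"{ZENDESK_BASE}/search.json",
        params={"query": query, "sort_by": "created_at", "sort_order": "asc"},
        auth=AUTH,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])
```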
Potential Roadblocks/Things to consider
This issue is focused on FRT tickets only for now and does not include very breached NRT tickets.
This alert is a safety mechanism for when we have already failed to respond in a timely manner once a ticket has breached. There could be many reasons why this fail state has occurred, but the focus for this issue is to introduce a safety mechanism, not to dig deeper into the underlying causes of why a breached ticket has gone multiple days without a response. That investigation should take place in another issue.
Once we have deployed and refined this alert, we can use another issue to define a similar approach for NRT tickets. That is out of scope for this issue.
Desired Outcome
Managers are alerted on a regular basis for tickets that are very breached.
What does success look like?
Once the list of very breached tickets has been given to managers, all tickets on the list get an assignee and a first response as a priority.
How do we measure success?
All tickets included in the alert are assigned and have a first response made within 1 day.
Regarding the implementation ideas in the Proposal section:
The regions are taking a couple of different approaches to how managers correlate to SGGs, so the implementation method should be sensitive to this.
I strongly do not want the on-call manager to deal with these. On-call shifts are stressful and busy, at least for EMEA/AMER.
Please do not tag "all managers globally" in a notification about this; I don't want to see these before- or after-hours. I would much prefer to go by the DRI model, and have us identify a specific role to deal with these. Maybe it's the SSAT Manager, if we believe these missed FRTs are strongly correlated with SSAT. Maybe it's the regionally-associated SGG manager. See next item for a suggestion.
The forest channel seems a reasonable choice for a Slack post, possibly inside the daily regional thread; all regionally available managers are tagged in the thread already (it respects OOOs), and participants could decide who-does-what (maybe we want to split the work). Most of the threads currently have zero comments in them, so it won't overload a busy channel/thread.
The alert is made as a STAR digest (i.e. a single STAR entry with a link to all Very Breached tickets) in the #support_ticket-attention-requests channel and [at]mentions the regional on-call managers
For similar reasons to those Rebecca has highlighted, Manager on Call can be very stressful in the EMEA region, as we get a lot of STARs and also often have to coordinate or assist with Customer Emergencies. For this reason, my preference would be to update the Forest SGG like Jane did yesterday, but also tag the global or impacted regional management to take action using the appropriate Slack handle.
I wonder if there's an opportunity to partner with Senior Support Engineers on this?
Maybe a digest of very breached FRT go to both Seniors and Managers, where Seniors help facilitate assignment/first replies within their SGGs, pulling in managers for additional support when needed.
Seniors could offer to pair and/or add themselves as cc's to provide support (as needed) to the assignee.
Seniors could get a first reply out and encourage others to shadow for those who are interested in the ticket, but feel the ticket is out of their comfort zone.
Might make this feel less like yet another thing Managers are chasing down help for, while infusing and strengthening collaboration among the SEs and SSEs.
Definitely sensitive to this possibly adding more weight to already full plates, but thought I'd suggest an idea from a different angle that (to me) feels more natural, as Seniors are already working out of the queues alongside their SGGs, and this would offer an opportunity to further strengthen our approach to FRT.
I think this makes a lot of sense, and while I'm pushing forward with my updated proposal, which does not include Senior Support Engineers at this time, I do think it could be included in a future iteration.
My priority is to have some kind of alerting workflow in place as soon as we are able to, which means I am looking to implement the smallest scoped solution as quickly as we can with a commitment to continue to iterate on how we deal with Very Breached Tickets.
A few thoughts: (numbered list to allow for reference-ability)
Currently, Monday business hours typically begin with an average of 10+ tickets showing -2d breached, many of which were raised late on the customer's Friday. If APAC sets out to clear all of those as a priority when business hours start, there is a risk that it will have an adverse impact on keeping ahead of tickets that are coming up for breach.
SCOPE CREEP (noted for a later iteration): In many ways I would like to see us move the net to a different point in the cycle. I appreciate this is a safety net to stop things getting overly breached, knowing that we won't ever stop everything from breaching (and in fact we have provision to miss 5%), but particularly for those that occur in the above scenario, something like a list, produced 4-6 hours before end of business hours on Friday, of tickets that will breach between then and the first 2 hours of business hours the following week could be useful. Again, this is really beyond the scope of the current topic, but worth keeping in mind.
Building on Rebecca's comments about the need for a DRI and not overloading an already distracting and busy on-call manager responsibility, we need to avoid the risk of "diffusion of responsibility". We can be flexible about how that looks, and it may differ from region to region; I would encourage us to be clear about what we agree on without insisting it has to look the same everywhere.
QUESTION (for all of us): how do we load balance this out globally? I think using the Forest Slack channel is a good approach in terms of where this goes (STARs are already busy and distracting enough, and the SSAT review manager is a role requiring less timeliness than this warrants; for example, if I am on PTO for 3 days of my SSAT review allocation, I don't get someone to swap, I just make sure I plan to get it done around those days of PTO), but we also need to decide/document what the trigger is for this alert, or whether it will run on a schedule at various times during the week/business hours. I'm still pondering this so don't have suggestions right now, but wanted to raise the need to solve that.
I think I've addressed your concerns on diffusion of responsibility and load balancing globally (points 3 and 4) with my updated proposal, which is adding to the current daily message in spt_gg_forest. If I haven't though, please do let me know so we can work out what needs to change for this iteration, and what could possibly be included in a future iteration.
I'm definitely in agreement with you on your second point; I would prefer not to solve that at this time though, and to prioritise getting an alert of some description in place first, with future iterations giving more focus on solving the problem before it happens.
APAC only: Jane, Ket and myself are comfortable with having the on-call support manager be accountable for Very Breached FRT Tickets during APAC shift hours. To be clear, "accountable" here means that the APAC support manager on-call will ensure that Very Breached FRT Tickets are taken care of, and we may eventually choose to delegate the responsibility for monitoring and actioning these to one or a group of support engineers.
Thanks everyone for contributing to this. It seems the most desirable first iteration is leveraging the spt_gg_forest channel for the Where to alert/report, and the on-call managers per region as the Who to report to.
Bearing that in mind I'll refine the proposal for the first iteration to:
An automated report of all tickets that are >=2 days old and have not yet had a first response is attached to a message in the #spt_gg_forest Slack channel, pinging the on-call managers for that day.
I think we could leverage the current daily welcome message (made by the Support Team Bot) that tags the regional managers in that Slack channel and add a section that includes the report of Very Breached Tickets. That way it's a single message with all the relevant information needed by managers, rather than adding additional messages that are vying for attention.
@gitlab-com/support/support-ops Shaun has asked me to get the first iteration of this underway while he is on PTO.
The desired outcome is as follows:
Append an additional section to the 3 daily Forest Group pings that get sent to APAC, EMEA and AMER. Currently the sections are "The following people have scheduled PTO", "SGG Report" and "Oncall Information"; this would add a "Very Breached FRTs" section.
This new section will have content of a list of tickets that meet the following criteria:
Are at FRT stage
Have an SLA
The SLA has been breached for more than 2 business days
Ideally the ticket numbers would be clickable links
The output could look like the following: (NB: These tickets are NOT very breached, I've just chosen a couple of random tickets for the purposes of illustrating the desired format)
I should note that the list will often be empty. A blank list is OK in that case, though perhaps "Nil" as the content would be useful in that instance, as a check that the list is actually empty as opposed to something not having worked.
In terms of timing: if the list of tickets could be queried as close as possible to the time the regional Forest ping is sent, that would be ideal. I am hopeful that could be within the 30 minutes prior to the post, but I appreciate there will be technical limitations on what can be achieved there, so let's discuss that if necessary.
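To illustrate how the section above could be rendered (clickable ticket links, an explicit "Nil" when the list is empty), here is a minimal sketch; the webhook URL, Zendesk subdomain and exact formatting are assumptions rather than the agreed implementation:

```python
# Minimal sketch (assumed format, not the agreed implementation): build the
# "Very Breached FRTs" section with clickable ticket links, using an explicit
# "Nil" when the list is empty, and post it via a Slack incoming webhook.
import os

import requests

WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]                 # placeholder webhook
TICKET_URL = "https://example.zendesk.com/agent/tickets/{}"   # placeholder subdomain


def format_vbt_section(ticket_ids):
    if not ticket_ids:
        # "Nil" distinguishes "no VBTs today" from "the query did not run".
        return "*Very Breached FRTs*: Nil"
    links = ", ".join(f"<{TICKET_URL.format(tid)}|#{tid}>" for tid in ticket_ids)
    return f"*Very Breached FRTs* ({len(ticket_ids)}): {links}"


def post_to_slack(text):
    requests.post(WEBHOOK_URL, json={"text": text}, timeout=30).raise_for_status()
```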
I don't think the SGG Bot (which is what that is) is the correct method for doing this:
For one, it is going to require the bot to now access another system to gather data. We should avoid doing that
For two, the exact information is visible via the "All FRT and Emergencies" view in Zendesk. The tickets would be at the top of the view, given breach time.
Thank you for sharing your guidance on this Jason.
I do need to keep this moving though, so could you please help me by suggesting a more correct method to achieve:
an alert in Slack of Very Breached Tickets (VBT), where a VBT is any FRT that has breached for more than 2 business days?
As mentioned previously, it would be preferable if this could come out once per day for each region with a list of tickets meeting the criteria, but I don't want to put limits on the ways you might suggest we achieve an alert as I did in my previous scoping of the desired outcome, so let's leave it as open as that. Can you help please?
Also, yes, we are aware that these VBTs are already visible to managers in the "All FRT and Emergencies" view in Zendesk; however, the goal of this issue is to implement an alert in Slack drawing attention to these. That view is only visible to managers; each region's managers will determine the best way to action this alert, and this may involve Senior+ SEs who do not have visibility of that view.
We do not really do "Slack notifications for xyz" like this anymore. It does not scale (and is not very accurate). We have also seen (time and time again) that it uses resources without producing any results (on the contrary, they end up ignored).
That view is only visible to managers
It is visible to Managers and CMOCs currently, not just managers. But that is also something that can readily be changed.
The problem statement is:
We have multiple-day-breached FRT tickets that do not get the level of visibility needed to prioritise them.
I disagree with that problem statement, namely in that we do have visibility into them. The problem is not doing anything about them. Copying the visibility we very much already have outside of our primary tool (Zendesk) is not really going to change that. We already have the very thing we need to see the tickets and do something about them. If the view needs to be widened to be accessed team wide, that makes sense and would fall in line with normal procedures and documentation (and is very much doable).
Let me try to express what I was trying to say again, as the previous wording might not have done so in the clearest way.
We need to be careful in granting bots API token access, especially considering all Zendesk API tokens are at the admin level (and can be very destructive in nature). If at all possible, we should avoid granting a bot access to systems like that, as it opens it to new vectors of security risk. As such, I would advise we not have this tied to something that has no connection to Zendesk at this time.
As the core of the ask is just to take a list of very breached FRT tickets and post them in Slack, that is likely more doable. The mention of the SGG bot makes it seem like #spt_gg_forest is the desired channel and the goal is to post them around the same time the SGG bot fires (you mentioned around 30 minutes, so I will assume at the same time unless specified otherwise).
The goal is to have a single line comment made listing the tickets in a comma separated fashion (with the IDs being clickable if possible).
Do I have that all correct? If not, please let me know where I am off so I can adjust my understanding. If it is correct, I can make an issue to determine how to engineer the solution you are seeking.
we do have the visibility into them. The problem is not doing anything on them.
That is absolutely correct.
But I think Slack provides a higher level of visibility/urgency than ZD, simply due to the nature of the two tools and how we instinctively use them. In general, I think we notice and respond to alerts in our key Slack channels more effectively than we perform regular reviews of the queue status pages in ZD. And unfortunately we are not so much in the position at present of regularly saying "I've run out of ticket/manager work - I'll check the queues" and more in the state of saying "oh-oh, there's an urgent alert in Slack about a ticket - I'm pretty slammed, but I'd better stop what I'm doing and take a look". If we could get everyone reliably checking ZD as often as they notice Slack messages that would be ideal, but I don't see that happening (in general, that is - in this specific case the requirement is for one person to check one queue in ZD once per shift, which I guess a calendar reminder could take care of).
We also have seen (time and time again) that it uses resources without producing any results (to the contrary, they end up ignored).
The SGG alerts for breaching tickets have had a positive effect on our FRT handling, despite them only telling us what ZD already does. So do please count that as a win for your efforts implementing the Slack alert functionality.
Thanks Jason, and thanks too to you Justin for sharing the positive impact you've observed from the Slack alerting function!
Jason - a first attempt to answer your questions and come to an agreed understanding:
We need to be careful in granting bots API token access, especially considering all Zendesk API tokens are at the admin level (and can be very destructive in nature). If at all possible, we should avoid granting a bot access to systems like that, as it opens it to new vectors of security risk. As such, I would advise we not have this tied to something that has no connection to Zendesk at this time.
Understood. Thank you for the insight into this, that really helps to understand.
As the core of the ask is just to take a list of very breached FRT tickets and post them in Slack, that is likely more doable. The mention of the SGG bot makes it seem like #spt_gg_forest is the desired channel and the goal is to post them around the same time the SGG bot fires (you mentioned around 30 minutes, so I will assume at the same time unless specified otherwise).
Kind of! To clarify, from discussions I had with Shaun about this prior to him being on leave, the preference was to leverage the existing daily Forest ping for this, because it already pings the Managers in the region who are present that day. The feedback received when Shaun first proposed this issue resulted in a scenario that didn't have a single answer everyone would be happy with. So putting the information (alert) in the existing ping that goes out to all managers was the solution he decided on for this first iteration, as it means that it:
doesn't assign the oncall manager for the day - at least one manager requested this NOT be done as the oncall manager responsibilities were already very loaded
doesn't send an alert with no one tagged, which would lead to significant diffusion of responsibility
supports regional managers to devise an acceptable workflow for who would be DRI for acting on this alert in their region (this is what I was referring to in my initial ping to support-ops when I said I will work with the regional managers in a separate thread to document how we respond to/action the list. Some regions may choose to use the oncall manager, others may choose to work with Senior SEs to action these, but getting the alert published near the beginning of each region's business day was the foundation for devising the process to action on it).
Given the need not to have a list generated from Zendesk under the bot's actioning, I appreciate this answer doesn't actually meet the criteria you've explained, but please read on - the next response might suggest a way an alert as a line within the existing Forest daily ping could be achieved while observing that decoupling...
The goal is to have a single line comment made listing the tickets in a comma separated fashion (with the IDs being clickable if possible).
Yes, that would work, though an alternative could be to have a clickable link that states how many VBTs there are to be actioned - e.g. a link to a generated issue with a title like "There are currently 7 VBTs needing action", where the issue (or other artefact) contains the list of individual tickets. That's not a requirement though, just expressing that it does not have to be a comma-separated list - the minimum needed in the alert is the count of how many VBTs there are, and the means to view the tickets included in that count (using a count of VBTs with a link to the list hadn't occurred to me previously).
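For illustration only, a minimal sketch of this "count plus link" alternative, assuming a GitLab issue is generated via the standard REST issues API and only its URL and the VBT count are posted in Slack; the project ID, token and ticket URL format are hypothetical:

```python
# Minimal sketch of the "count plus link" alternative (hypothetical project,
# token and ticket URL): create a GitLab issue listing the VBTs and return
# its URL, so Slack only needs the count and a single clickable link.
import os

import requests

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT_ID = os.environ["VBT_PROJECT_ID"]      # hypothetical tracking project
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}


def create_vbt_issue(ticket_ids):
    """Create an issue listing the VBTs and return its web URL."""
    description = "\n".join(
        f"- https://example.zendesk.com/agent/tickets/{tid}" for tid in ticket_ids
    )
    resp = requests.post(
        f"{GITLAB_API}/projects/{PROJECT_ID}/issues",
        headers=HEADERS,
        data={
            "title": f"There are currently {len(ticket_ids)} VBTs needing action",
            "description": description,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["web_url"]
```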
Do I have that all correct? If not, please let me know where I am off so I can adjust my understanding. If it is correct, I can make an issue to determine how to engineer the solution you are seeking.
Have a read of my responses and let's see if we have the same understanding yet.
An aside: I am in agreement with your observation that the problem here is not exclusively about lack of visibility, but rather a combination of needing timely attention and action; the goal in this first iteration is to implement an element of alerting, and I will work with the regional managers to devise an agreed method for giving these tickets timely attention and action. I do presently review the VBTs as I have time, and the vast majority of these alerts will be on APAC Monday (there would have been 11 today, though 1 of them may have gone out on AMER Friday) and very few will go to other regions I expect. But we need the safety net in place to catch these more actively - Justin's observation about attention and load is very valid. This is very much a first iteration to get the safety net in place. I am keen to catch these sooner than 2 days breached - Shaun responded to that in this comment and notes we'll look to work towards that in a later iteration.
I think we can make something doable, but the specifics of it might not align with the above statements, namely:
comes from the SGG slackbot
I would not advise this, as the bot is already accessing Google (Calendar), PagerDuty, and the support-team.yaml (admittedly that will change in the future). I am not quite comfortable with the bot/scripts also having admin capabilities on Zendesk global.
even if we make a gitlab issue
something would need admin access to generate the list in the issue, which means making one bot create issues (3 times a day) and another bot (the SGG slackbot) make a gitlab.com API call to locate the corresponding issue (which could be tricky to do correctly).
I think it is better here if we focus on the core of what is being asked without digging into the "how" (i.e. look into the what and why).
As the core is a Slack message containing a list of tickets (or a link to an artifact containing said list), I think this should be doable as something completely separate from other bots (that don't have the access to do so). The actual text of the message can be determined through the course of testing/development, so pinging managers (or not pinging managers) is doable.
My thought on the exact nature of the bot (if that is important) would be:
Runs 3 times a day
0700 UTC M-F for EMEA
1400 UTC M-F for AMER
2130 UTC M-F for APAC
Generates a message along the lines of:
Hey @person1 @person2 @person3, please review the following very breached FRT tickets: LINK LINK LINK LINK
Posts the generated message in the #spt_gg_forest
I think the above should cover the core of the ask here, would that be correct?
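As a minimal sketch of that shape of bot (three regional runs per day, a templated ping, a post to #spt_gg_forest), assuming cron-style scheduling and placeholder manager handles; the actual implementation was left to Support Operations:

```python
# Minimal sketch of the proposed bot shape (placeholder handles; scheduling
# shown as example cron entries rather than a long-running process):
#
#   0  7 * * 1-5  vbt_bot.py EMEA    # 0700 UTC
#   0 14 * * 1-5  vbt_bot.py AMER    # 1400 UTC
#   30 21 * * 1-5 vbt_bot.py APAC    # 2130 UTC
import sys

REGION_MANAGERS = {
    "EMEA": ["@person1", "@person2"],   # placeholder Slack handles
    "AMER": ["@person3", "@person4"],
    "APAC": ["@person5", "@person6"],
}


def build_message(region, ticket_links):
    mentions = " ".join(REGION_MANAGERS[region])
    if not ticket_links:
        return f"Hey {mentions}, there are no very breached FRT tickets right now."
    return (
        f"Hey {mentions}, please review the following very breached FRT tickets: "
        + " ".join(ticket_links)
    )


if __name__ == "__main__":
    # ticket_links would come from the Zendesk query sketched earlier, and the
    # message would be posted to #spt_gg_forest (e.g. via the webhook sketch).
    print(build_message(sys.argv[1], []))
```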
After discussion with support ops on what was feasible to implement while taking various factors into consideration, the bot has been built to ping at a similar time to the daily SGG Forest ping in each region and will list the FRT tickets that are 2+ days breached.
I have not yet raised MRs for handling this (I was caught a bit short and hadn't realised the bot was going live immediately).
@gitlab-com/support/managers I'm seeking clarification from Support-ops presently as to how the ping is set up and will work with you once I have the answer to that to seek to document this in the handbook. I'll do a notification issue once that's squared away.
A little more information for now:
There were many differing and sometimes conflicting preferences expressed through discussion in this issue. The bot means we still have plenty of flexibility for how we address these in each region.
Reminder: these pings should come through very, very rarely - they are intended to be a fail-state safety net to let us know of any FRT ticket that has been breached for more than 2 days. When these exist, it means that a customer with a:
Normal priority ticket (8 hour response time) has waited 7 times the SLA time and not yet had a response;
High priority ticket (4 hour response time) has waited 13 times the SLA time;
Low priority ticket has waited 2 times the SLA time.
This bot is designed to help us as managers be aware of the existence of these tickets and seek action on them without depending on passive checks of the All FRTs and emergencies view in Zendesk.
It's late on my Monday - I'll pick this up once I have clarification from support-ops and during slightly more sociable working hours, but I wanted to get this update here in case you didn't see my ping in Slack today when I realised that this may actually be live already.