SLA Exploration
This issue exists to capture the exploration of SLA functionality for Category:Service Desk.
We also want to capture how SLAs could fit into other areas of GitLab. Work items would support SLAs, so how do we cover most use cases? Can we reuse existing functionality (the incidents SLA feature), and how does that work with on-call schedules and escalation policies?
Existing functionality
We already have rudimentary SLA functionality for Incident Management, which is located in the Enterprise Edition. The general idea is to define a time-to-resolution target in minutes (in 15-minute increments).
In the incidents list view, a column displays the remaining time until breach in hours and minutes. If the SLA is missed, the missed::SLA label is added automatically (it gets created if it doesn't exist) and the text in the list view becomes Missed SLA. If the incident state is set to "resolved" in time, the text changes to Achieved SLA.
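The three states described above can be sketched as a small pure function. This is only an illustration of the rules as written here, not the code from the related files; the method name `sla_display_text`, its parameters, and the `hh:mm` output format are assumptions.

```ruby
# Hypothetical helper that mirrors the list-view text described above.
# due_at:      the SLA deadline (Time)
# resolved_at: when the incident was resolved, or nil if still open
def sla_display_text(due_at:, resolved_at: nil, now: Time.now)
  # Resolved at or before the deadline: the SLA was achieved.
  return 'Achieved SLA' if resolved_at && resolved_at <= due_at
  # Resolved too late, or still open after the deadline: the SLA was missed.
  return 'Missed SLA' if resolved_at || now >= due_at

  # Still open and in time: show the remaining hours and minutes.
  remaining_minutes = ((due_at - now) / 60).floor
  hours, minutes = remaining_minutes.divmod(60)
  format('%02d:%02d', hours, minutes)
end
```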
Related files:
- ee/app/assets/javascripts/issues/show/components/incidents/incident_sla.vue
- ee/app/assets/javascripts/vue_shared/components/incidents/service_level_agreement.vue
- ee/app/services/incident_management/incidents/create_sla_service.rb
- ee/app/services/incident_management/create_incident_sla_exceeded_label_service.rb
POC: Time-to-first-response timer on Service Desk list
I started the exploration with the time-to-first-response timer because it helps Service Desk agents find the most urgent ticket. The visual component makes it easy to scan the list of tickets.
The first-iteration POC is recorded in this MR, which adds an SLA timer for time-to-first-response (TTFR): Draft: Service Desk SLA timers [POC] (!153455 - closed)
First iteration
I introduced a hard-coded SLA (6 hours) and a static timer. The SLA target time was calculated in the frontend (once on page load).
The timer is red when the SLA target is missed, yellow if it's less than an hour to the SLA target, and green if there's more time left. The text is a short representation of remaining time like 6h, 30m or -1d.
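As a rough sketch of these rules (the POC implements them in the frontend; this is just the logic written out in Ruby), the color and the short label can be derived from the number of seconds left. The method names, the choice to treat exactly zero seconds as missed, and rounding down to whole units are assumptions.

```ruby
ONE_HOUR = 3600
ONE_DAY = 86_400

# :red when the target is missed, :yellow with less than an hour left,
# :green otherwise. Zero seconds left counts as missed (assumption).
def timer_color(seconds_left)
  return :red if seconds_left <= 0
  return :yellow if seconds_left < ONE_HOUR

  :green
end

# Short label such as "6h", "30m" or "-1d". Uses the largest unit that
# fits and rounds toward zero (assumption).
def timer_label(seconds_left)
  sign = seconds_left.negative? ? '-' : ''
  abs = seconds_left.abs.to_i
  if abs >= ONE_DAY
    "#{sign}#{abs / ONE_DAY}d"
  elsif abs >= ONE_HOUR
    "#{sign}#{abs / ONE_HOUR}h"
  else
    "#{sign}#{abs / 60}m"
  end
end
```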
I realized that we probably want to move SLA target calculation to the backend because we'll eventually allow more complex rules for that calculation.
Second iteration
I moved the SLA target timestamp calculation to the backend. It's still a hard-coded 6-hour offset from the created_at timestamp, but it could become more complex, especially once business hours and holidays are introduced. On ticket creation, we could calculate the time_to_first_response_breach_at and use that persisted time from then on. No need to recalculate it.
If you keep the list view open for a longer time, statically calculated timers become stale, so I added dynamic timers that update the timer value every 10 seconds. We can play with the interval.
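To make the difference concrete, here is a minimal sketch of both variants: the fixed 6-hour offset, and a business-hours-aware calculation. The business hours (Monday to Friday, 09:00 to 17:00 UTC), the method names, and the constants are assumptions for illustration; holidays are left out entirely.

```ruby
FIRST_RESPONSE_TARGET_SECONDS = 6 * 3600
BUSINESS_START_HOUR = 9   # hypothetical, UTC
BUSINESS_END_HOUR = 17    # hypothetical, UTC

# Current POC behavior: a fixed offset from created_at.
def time_to_first_response_breach_at(created_at)
  created_at + FIRST_RESPONSE_TARGET_SECONDS
end

def business_day?(time)
  !(time.saturday? || time.sunday?)
end

# Walks forward day by day and only consumes the target during business hours.
def breach_at_in_business_hours(created_at, target_seconds)
  cursor = created_at.getutc
  remaining = target_seconds
  loop do
    day_start = Time.utc(cursor.year, cursor.month, cursor.day, BUSINESS_START_HOUR)
    day_end = Time.utc(cursor.year, cursor.month, cursor.day, BUSINESS_END_HOUR)
    if business_day?(cursor) && cursor < day_end
      cursor = day_start if cursor < day_start
      available = day_end - cursor
      return cursor + remaining if remaining <= available

      remaining -= available
    end
    # Jump to midnight of the next calendar day.
    cursor = Time.utc(cursor.year, cursor.month, cursor.day) + 86_400
  end
end
```

For example, a ticket created on a Friday at 15:00 UTC would breach on the following Monday at 13:00 UTC under these assumed hours. Because the result depends only on the creation time and the configuration, it can be computed once and persisted.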
Third iteration (TODO)
I'll explore how we can add a generic system for processing SLAs that can be applied to specific work item types and legacy issues. As a first iteration, I'm thinking of a YAML configuration stored in the database. Storing it in the database avoids the overhead of pulling and syncing it from the repository, and makes it accessible to non-developer personas in the future because we can build a visual editor on top of the YAML.
So the first goal is to translate a basic time-to-first-response SLA into a YAML configuration and make the time accessible to the SLA timer in the frontend.
The second goal is to define another SLA.
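To give a feel for what such a configuration could look like, here is a sketch with two SLA definitions and a small loader. The schema is not decided; every key (`slas`, `name`, `work_item_types`, `target_minutes`) and every method name below is an assumption made up for illustration.

```ruby
require 'yaml'

# Hypothetical schema with two SLA definitions.
SLA_CONFIG = <<~CONFIG
  slas:
    - name: time_to_first_response
      work_item_types: [issue, ticket]
      target_minutes: 360
    - name: time_to_resolution
      work_item_types: [incident]
      target_minutes: 1440
CONFIG

SlaDefinition = Struct.new(:name, :work_item_types, :target_minutes, keyword_init: true)

# Parses the YAML and fails loudly (KeyError) if a required key is missing.
def load_sla_definitions(yaml)
  YAML.safe_load(yaml).fetch('slas').map do |entry|
    SlaDefinition.new(
      name: entry.fetch('name'),
      work_item_types: entry.fetch('work_item_types'),
      target_minutes: Integer(entry.fetch('target_minutes'))
    )
  end
end

# Selects the definitions that apply to a given work item type.
def slas_for(definitions, work_item_type)
  definitions.select { |d| d.work_item_types.include?(work_item_type) }
end

# Timestamp the frontend timer would count down to.
def breach_at(definition, started_at)
  started_at + definition.target_minutes * 60
end
```

The frontend would then only need the resulting breach timestamp per SLA, not the configuration itself.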
Fourth iteration (TODO)
This will have two goals:
- What happens when an SLA is missed? In Category:Incident Management we already have the functionality to add the missed::SLA label once the SLA is missed. How can we map this to the YAML configuration file so the user can decide which labels to apply?
- How can escalation policies be used in that context? Can we trigger an escalation policy when the SLA is missed?
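One possible shape for this, purely as a sketch: each SLA entry carries an `on_breach` section that lists labels to add and, optionally, an escalation policy to trigger. The keys `on_breach`, `add_labels`, and `escalation_policy`, the policy name `support-on-call`, and the method `breach_actions` are all hypothetical.

```ruby
require 'yaml'

# Hypothetical configuration: what to do once the SLA is missed.
BREACH_CONFIG = <<~CONFIG
  slas:
    - name: time_to_first_response
      target_minutes: 360
      on_breach:
        add_labels: ["missed::SLA"]
        escalation_policy: support-on-call
CONFIG

# Returns the list of actions to perform, or an empty list while the
# SLA has not been breached yet. Actions are plain [type, argument] pairs.
def breach_actions(sla_config, breach_at:, now:)
  return [] if now < breach_at

  on_breach = sla_config.fetch('on_breach', {})
  actions = on_breach.fetch('add_labels', []).map { |label| [:add_label, label] }
  policy = on_breach['escalation_policy']
  actions << [:trigger_escalation_policy, policy] if policy
  actions
end
```

Returning a list of actions instead of performing them directly keeps the decision separate from the side effects, so the same evaluation could feed label services and escalation policies alike.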
Fifth iteration (TODO)
Piece it all together. Lay out an iteration plan on how to build native SLA management into GitLab.
The bigger picture
In general, a Service Desk issue is just an issue whose service_desk_reply_to field contains an email address and whose author is the GitLab Support Bot (Users::Internal.support_bot). A while ago we added the Ticket work item type, but it is not used yet, so Service Desk issues have not been migrated to tickets. The devops::plan stage is currently working on feature parity between legacy issues and work items so that issues can be migrated to work items. Once that is achieved, Service Desk tickets could also be introduced.
Our goal should be to build all future features that could also be valuable for other issue types or work items in a way that lets us enable them for specific work item types. If we're planning SLA functionality, it should be generic and potentially support all work item types (and legacy issues). In other words: SLAs shouldn't be unique to incidents and/or Service Desk issues/tickets.
Types of SLAs in support
In support, you usually measure the following SLAs:
- Time-to-first-response (TTFR), which measures the time from ticket creation to the first public comment that sends an email to the requester.
- Time-to-next-response (TTNR), which measures the time from the first unanswered comment of the requester (or any external participant) to the next public comment from an agent that sends an email.
- Time-to-resolution (TTR), which measures the time from ticket creation to closing the issue (or marking the incident as resolved).
There are more metrics that are commonly used and defined, but these are the basic SLAs.
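The three definitions above can be expressed over a simple comment timeline. This is a sketch only: the `Comment` struct, its fields (`author_role`, `public_reply`, `created_at`), and the method names are assumptions and do not correspond to GitLab models. All results are in seconds.

```ruby
# Hypothetical comment record. author_role is :agent or :requester;
# public_reply is true when the comment sends an email to the requester.
Comment = Struct.new(:author_role, :public_reply, :created_at, keyword_init: true)

# TTFR: ticket creation to the first public agent comment (nil if none yet).
def time_to_first_response(ticket_created_at, comments)
  first = comments.sort_by(&:created_at).find { |c| c.author_role == :agent && c.public_reply }
  first && first.created_at - ticket_created_at
end

# TTNR: for each run of unanswered requester comments, the time from the
# first of them to the next public agent comment.
def times_to_next_response(comments)
  waiting_since = nil
  durations = []
  comments.sort_by(&:created_at).each do |c|
    if c.author_role == :requester
      waiting_since ||= c.created_at
    elsif c.public_reply && waiting_since
      durations << c.created_at - waiting_since
      waiting_since = nil
    end
  end
  durations
end

# TTR: ticket creation to close (nil while the ticket is still open).
def time_to_resolution(ticket_created_at, closed_at)
  closed_at && closed_at - ticket_created_at
end
```

Note that internal notes from agents do not stop any of the clocks in this sketch, because only public comments reach the requester.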
Questions I'm exploring right now
- What does a generic SLA system look like?
- How can we design such a system that provides value for our customers quickly?
- Can we reuse existing SLA functionality?
- What's the tiering? Will SLAs be a GitLab Premium feature?
- How can we ensure it's easy to set up common SLAs for Category:Service Desk and Category:Incident Management use cases?
Out of scope
What should we not focus on in this exploration?
- Reporting for SLA metrics. Reporting is an integral part of every support desk, and we can use the value stream dashboards today to build stages based on applied labels and see how long tickets stay in each stage. SLA metrics will definitely help in better understanding how your support organization is performing and where action is required, but this is a whole exploration in itself because these kinds of metrics are new to GitLab analytics.
