Responders need to be automatically notified when an alert is triggered. PagerDuty is one of the most common on-call and paging tools on the market and is used by many of our enterprise customers. We want to enable the creation of PagerDuty incidents from GitLab issues via the PagerDuty Incident Creation API.
Enable creation of PagerDuty incidents from a specific GitLab issue. This could be achieved via a quick action that sends a GitLab issue to PagerDuty and creates an incident.
Design
We are designing for the following workflow:
User creates an incident issue
They realize they need to page someone to address the incident
They utilize a slash command on the incident issue to create a PD incident (and thus page the appropriate people).
To allow this to happen, we expect to need a place to configure a PagerDuty integration and an appropriate slash command.
Configuration
As part of #119018 (closed), we're adding a tab for PagerDuty integrations. The plan is to add some additional introductory text and an additional field to this section to enable users to create PagerDuty incidents from GitLab issues. The required updates are highlighted in the following mock-up:
Slash command
The proposal is that users can create a PagerDuty incident by utilizing the slash command /pagerduty. Assuming the configuration has been correctly completed, typing in this command will automatically create a PagerDuty incident.
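As a rough illustration of the behaviour described above, the following sketch detects a `/pagerduty` command in a comment body. The function name and the rule that the command must sit alone on its own line are assumptions made for illustration; this is not how GitLab's quick-action parser is implemented.

```python
import re
from typing import Tuple

# Hypothetical rule: the command must be the only content on its line.
PAGERDUTY_COMMAND = re.compile(r"^[ \t]*/pagerduty[ \t]*$", re.MULTILINE)


def extract_pagerduty_command(note_body: str) -> Tuple[bool, str]:
    """Return (command_found, note_body_with_the_command_removed)."""
    found = PAGERDUTY_COMMAND.search(note_body) is not None
    remaining = PAGERDUTY_COMMAND.sub("", note_body).strip()
    return found, remaining
```

With this rule, a comment such as "Database is down" followed by `/pagerduty` on the next line would trigger the command, while a mention of `/pagerduty` in the middle of a sentence would not.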
We'll also need to ensure that, when the slash command is created, it's added to the prompt screen that appears as users type:
Permissions and Security
Documentation
Documentation Required. Please add a new section here.
Testing
What does success look like, and how can we measure that?
What is the type of buyer?
Links / references
At first glance, I started thinking about these questions:
How will the user learn about this feature?
Is there anything we can do aside from the documentation to make the feature discoverable and easy to use? For example, add some type of information message to the GitLab issue letting the user know about the new quick action?
Hmm. I am not sure. Can we ask around to other teams and see how they have publicized API features that their stages have built?
PagerDuty incidents trigger for their respective services using an integration ID. We currently trigger incidents for only a small handful of services, but we're planning on expanding the number of services, at which point we'll have a many-to-one relationship between PagerDuty services and GitLab projects.
If we were to reverse the workflow, and opening an issue in a project could trigger a PagerDuty incident, we would need to select from a set of PagerDuty services. I believe this could all be achieved using a single API key that has the proper permissions.
I've prioritized this issue for my team based on feedback from @brentnewton this morning during a meeting where I demoed alert/incident management (so we can plan the Simulation day). In the event someone needs to create an incident manually and page someone, I was thinking it would be helpful to be able to easily do this from within GitLab, leveraging the PagerDuty incident creation webhook.
Additionally, for GitLab customers who need a paging solution - we are currently NOT adding on-call schedule management or paging to GitLab. These are basic needs of a response team so I am working to alleviate that pain by providing some simple webhooks to use PagerDuty for these purposes.
Please share your thoughts and comments. Ideally this directly benefits your workflows and you can use it.
@sarahwaldner - trying to quickly get up to speed on this issue so we can have a plan for 13.2, if necessary. It sounds like this is the workflow we're hoping to achieve:
User manually creates an incident issue
They realize they need to page someone
They utilize a slash command on the incident issue to create a PD incident (and thus page the appropriate people).
To make this possible, I imagine we'd need the following:
Figure out where the PagerDuty webhook would be configured
Define the slash commands we would be using to create the PagerDuty incident
In terms of item 1: defining where the PagerDuty webhook would be configured. My hope is that we could do whatever configuration is required for both this issue and #119018 (closed) in a single PagerDuty section. We'll just need to figure out where we want it to go. As per discussion on #119018 (closed), all the integrations currently exist within Settings > Integrations. But, we're slowly moving all the Operations settings into Settings > Operations, so all the Operations settings items exist in the same place. Perhaps we could consider adding the PagerDuty integration/webhook settings within the already existing Incidents section on Settings > Operations. That way, we're continuing the push to have all the Monitor settings configurations happening on a single page, rather than on two separate pages. Thoughts?
For item 2, the slash command for creating a PagerDuty incident. Here are some example slash commands:
Perhaps we could do something like:
create_pd_incident
But do we need to append any information to that in order to create a useful incident? For example, do we need to add a title, an assignee, or any other information to ensure an incident is successfully created and the right person is paged? I'm not sure what information PagerDuty needs the user to provide to create an incident and page the appropriate people. I'm guessing this will be something we'll need engineering help to define? Or perhaps we can better define this piece as we work through development?
Let me know your thoughts and I'll continue refining the proposal here. Thanks!
Let's create a new PagerDuty integration section in Settings > Operations
Define the slash commands we would be using to create the PagerDuty incident
I think we want the name spelled out in the slash command so that it is clear what it is in the list of slash commands. /pagerduty or /pagerduty_incident or /send_to_pagerduty
We will need inputs. Let's pull in @AnthonySandoval and @dawsmith. Do you have an example payload of an alert that you send to PagerDuty to create incidents? We are trying to determine what inputs the user will need to enter when they are sending a GitLab issue to PagerDuty. I am guessing that, at a minimum, it will need a paging policy or the name of someone to route the incident to directly.
A simple /pagerduty should suffice. If we are able to wire the PagerDuty service to the project, we're all set. All of the other available data (payload) needed to populate the PagerDuty incident is in the issue.
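A minimal sketch of how the issue's own data could populate a trigger event, assuming the integration sends events through PagerDuty's Events API v2 (the thread does not settle which PagerDuty API would be used). The `Issue` type, its fields, and the `critical` default severity are hypothetical; `routing_key` stands in for the integration key of the PagerDuty service wired to the project.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    # Hypothetical subset of the fields available on a GitLab issue.
    iid: int
    title: str
    web_url: str
    project_path: str


def build_trigger_event(issue: Issue, routing_key: str, severity: str = "critical") -> dict:
    """Build an Events API v2 "trigger" event body from issue data."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Reusing one key per issue lets repeated commands map to the same alert.
        "dedup_key": f"{issue.project_path}#{issue.iid}",
        "payload": {
            "summary": issue.title[:1024],  # summary is limited to 1024 characters
            "source": issue.project_path,
            "severity": severity,  # one of: critical, error, warning, info
        },
        "links": [{"href": issue.web_url, "text": "GitLab issue"}],
    }
```

Under this assumption, the resulting dictionary would be sent as JSON in a POST request to `https://events.pagerduty.com/v2/enqueue`. The user types nothing beyond `/pagerduty`, which matches the point that everything else is already in the issue.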
@sarahwaldner - Responding to your comment above in a new thread to leave space for Anthony and Dave to respond to your previous question.
I'm trying to put together a design proposal for this issue and for #119018 (closed) simultaneously, as I'm hoping we can have configuration for both pieces occur in the same place.
We already have a proposal to re-organize the incidents section on Settings > Operations to include all the integrations required to streamline incidents. As part of that proposal, we had been talking about pulling the Grafana authorization for embedding charts into the Incidents section. Since we're now talking about adding in a PagerDuty integration as part of the incident toolkit, I'm wondering if we should start re-organizing the incidents section straightaway?
If we combine the Alertmanager, PagerDuty and Grafana integrations into a single Incidents section, it could look like this:
Incidents section collapsed
Alerts tab
PagerDuty tab
Grafana tab
Thoughts on this?
In terms of the fields that are required to allow users to create PD incidents from GitLab issues - I'm basing my thoughts on the current Slack notifications settings page, since I think the configuration on that page is what makes it possible for us to utilize slash commands to post things to Slack (though, of course, it'd be helpful to have engineering confirm this assumption).
On that page, it looks like we need, at a minimum, a toggle, a field to paste in a webhook URL, and a save button. Since we're planning on utilizing slash commands, I don't think we need any additional triggers here (a trigger would be, for instance, create an incident every time an issue is created, updated, etc - that seems like it's not a great idea here but, let me know if you disagree). So, my thought is that, to complete the work required for this issue, we'll need a toggle, a save button, and a URL field. I've included these in the design for the PagerDuty tab shown above. I've also added in a secret token field as I think we might need it for #119018 (closed) but, I'll start a separate discussion on that issue to confirm that piece.
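To make the proposed fields concrete, here is a minimal sketch of the settings as a data structure, assuming a toggle, a webhook URL, and an optional secret token. All names are hypothetical, and the validation rule (require an HTTPS URL once the toggle is on) is an assumption rather than a confirmed requirement.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class PagerDutySettings:
    # Hypothetical field names mirroring the toggle, URL field, and secret token.
    active: bool = False
    webhook_url: str = ""
    secret_token: Optional[str] = None

    def validation_errors(self) -> List[str]:
        """Return a list of problems; an empty list means the settings can be saved."""
        errors = []
        if self.active:
            parsed = urlparse(self.webhook_url)
            if parsed.scheme != "https" or not parsed.netloc:
                errors.append("webhook_url must be a valid HTTPS URL when the integration is active")
        return errors
```

Leaving the toggle off keeps the form saveable with an empty URL, so a project can prepare the section before the PagerDuty side is ready.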
Am I understanding how this would be set up correctly? Are there any other fields we'd need to ensure everything links properly between GitLab and PagerDuty? Do we need to rope someone in from engineering to do any verification of the fields required here, or of the approach we're considering?
Designs look fantastic. I love this direction and think that this is ready for engineering, barring any missing fields for authorizing the two tools and the integration.
Am I understanding how this would be set up correctly? Are there any other fields we'd need to ensure everything links properly between GitLab and PagerDuty?
Yes, your understanding of the workflow is spot on.
Do we need to rope someone in from engineering to do any verification of the fields required here, or of the approach we're considering?
Yes. @ClemMakesApps @crystalpoole, can you please assign someone from engineering to take a look and confirm what information a user needs to provide so that we can build a slash command that creates an incident in PagerDuty from a GitLab issue? Thank you!
Thanks, @sarahwaldner! I'll add the designs and a summary of what's been discussed into the issue description.
Should the label be changed to workflow:planning breakdown or workflow:refinement? I'm not sure which we're using right now to say, "we need engineers to investigate the proposal further." @ClemMakesApps, WDYT?
The designs were reviewed by engineering as part of #119018 (comment 361449699). The only possible "gotcha" as far as I can see is whether or not we need to add an assignee to ensure the correct person is paged when the PD incident is created. But, from this comment, it seems like simply adding a generic slash command to create the PD incident is sufficient for a first iteration. As such, I'll mark this ready for development.
Moving this issue to the backlog as we are prioritizing building on-call schedule management in place of PagerDuty. It can be reprioritized if the need arises.
@sarahwaldner Any update on this? Currently, when someone is listed as on-call in GitLab and an Incident is created, it sends an email to that person. But email is not a great way to alert an on-call person, as it's easy to get lost in all the other emails. It also doesn't help at 2am: nobody is going to wake up for an email. Having the ability to send a PagerDuty alert when a new incident is created would be amazingly helpful. Even being able to add a custom email address in this box would be better than nothing.
Thanks for commenting on this issue, @NeckBeardPrince!
Currently, when someone is listed as on-call in GitLab and an Incident is created, it sends an email to that person. But email is not a great way to alert an on-call person, as it's easy to get lost in all the other emails.
Definitely agree! Sending an email notification was our first pass at introducing some basic paging for on-call schedules. But, we are planning more robust paging options, including phone calls and text messages. Here's the epic tracking that work: &1438.
Would something like that help address your concerns?
I'll also loop in our new PM, @abellucci. Alana, we have an existing PagerDuty integration (which you can see in Settings > Monitor > Incidents). If we build on this integration a bit more, and allow people to create PagerDuty incidents from GitLab, people could essentially utilize PagerDuty for paging. Doing so might enable people to experiment with our incident management workflow while we work towards building out more robust paging options. Something to consider!
@ameliabauerly That epic seems pretty heavy on using Twilio for notifications, but I guess that's a step. Since we use PagerDuty for everything else, it would be great if we could just plug in a webhook, and we're set. Right now, I'm trying to come up with a way to create a PagerDuty alert when an Incident type or even an issue with a specific label is created. But that doesn't seem possible right now.
Gotcha, that makes sense @NeckBeardPrince! Yeah, I don't think that would be possible until we complete this issue. Thanks very much for this additional context. That's super helpful!