Processing incidents during a fire-fight requires responders to coordinate across multiple tools to evaluate different data sources. Collecting and assessing metrics, logs, and traces and sharing these with a response team is time-consuming and challenging. We've streamlined this workflow by providing drag & drop uploads for these screenshots in a new Metrics tab on Incidents. Aggregate and centrally locate all screengrabs of metrics so that team members can quickly access and reference important charts.
On incidents, we are going to introduce a tab to surface metrics for responders during triage and investigation. The user will be able to upload images for charts or links to metrics in this tab. This allows them to centrally locate important images.
Users can upload images and links to a tab called metrics on incidents.
Proposal
Allow users to upload screenshots of metrics from their monitoring tools into a tab on incidents called 'metrics'. Users can also optionally add a URL so that others can easily navigate back to the original dashboard.
Here's the full flow:
Drag and drop field appears on metrics tab
Add a URL modal appears after an image is either selected or dropped in the field
Image is uploaded - no link has been added
Image is uploaded - link has been added
Metric already present - upload field below
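The flow above can be sketched as a small data model: an uploaded image with an optional URL pointing back to the source dashboard. This is only an illustration under stated assumptions. The names `MetricImage`, `Incident`, `add_metric_image`, and `ALLOWED_EXTENSIONS` are hypothetical and do not come from the actual implementation, and the list of accepted file types is a guess.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import urlparse

# Hypothetical set of accepted image types; the issue does not specify one.
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif"}


@dataclass
class MetricImage:
    """One uploaded screenshot, optionally linked to its source dashboard."""
    filename: str
    url: Optional[str] = None  # filled in from the "Add a URL" modal, if used


@dataclass
class Incident:
    title: str
    metric_images: List[MetricImage] = field(default_factory=list)


def add_metric_image(incident: Incident, filename: str,
                     url: Optional[str] = None) -> MetricImage:
    """Attach a screenshot to the incident's metrics tab."""
    dot = filename.rfind(".")
    extension = filename[dot:].lower() if dot != -1 else ""
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {filename}")
    # The URL is optional; when present it must be a plain http(s) link.
    if url is not None:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not a valid http(s) URL: {url}")
    image = MetricImage(filename=filename, url=url)
    incident.metric_images.append(image)
    return image
```

Both states from the flow ("no link has been added" and "link has been added") map to the same record, with `url` either unset or set.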
UPDATE: We are NOT automatically surfacing metrics on incidents at this time. We are going to simply enable the upload of images. We MIGHT choose to enable the automatic surfacing of metrics from Prometheus in the future.
@ameliabauerly Totally fine! Thanks for creating this issue. I've catalogued it under &1494 so we can keep track of it. We can promote it to an epic once we've progressed a bit.
I also updated the problem to solve to be a bit more specific.
@sarahwaldner - since I know you are re-organizing some issues right now, I wanted to share what I had on this one so far. As mentioned on Slack, rather than continuing to invest in trying to display metrics and logs in the tabs on incidents and metrics automatically, I think we could just make it easier for people to add their own.
The first idea I've been playing with is to have a drag and drop field within the tabs so people can more easily add in screenshots of their logs/metrics. Something like this:
No metric available - drag and drop field automatically appears
A screenshot is added, and the user has the option to add more
In what you're seeing here, I'm following the patterns established by the design management features. One cool thing about design management is that it allows users to comment on images. I don't know that we'd get that functionality for free but, it means that we'd have an already existing pattern in place for extending this feature to allow for commenting on images of metrics and logs. This could be pretty powerful, as it would allow people to leave notes for their colleagues on their screenshots of alerts and logs as they are adding them. As a reminder, here's how commenting works in design management:
Essentially, users could click on any part of the image and leave a comment. That...could be pretty powerful in the incident management context.
WDYT?
Also, a question - do we just want to allow users to add a screenshot, or do we also want them to be able to add a link to another metric service as well? I notice that DataDog does both, but I suppose that's because users can link to the robust metrics within the DataDog system. I'm not sure if they allow links to external metrics? Anyway, here's how they do it:
I know adding links to external services creates security issues for us so, I'm thinking sticking with screenshots is a good first step here. But, I also wanted to check in to see what you thought?
@ameliabauerly I LOVE this. I love so many things about this. The workflow, the fact that we can use/copy existing UX in the product AND I think that this will be hugely beneficial to the SRE team.
Maybe we should make this issue scoped to just this idea?
Also, let's solicit feedback from SRE in Slack on this idea! I think that it will be helpful for them when adding Grafana screenshots.
@sarahwaldner, thanks for your feedback, glad you're liking the idea :)
Maybe we should make this issue scoped to just this idea?
I think that's a great idea. I've taken a pass at editing this issue's title and description to scope it just to the metrics tab, since I believe we are pausing on the logs tab for now. How does this seem?
A couple follow-up questions:
Just confirming - do you want to stick with uploading screenshots for now or are you thinking we should try for screenshots + links? I suppose this is also something we can get feedback on from the SRE team but, just curious if you had any initial thoughts here?
Also, I'm assuming we'd apply this new functionality in both alerts and incidents, as both have metrics tabs. Is this a correct assumption from your POV?
Also, let's solicit feedback from SRE in Slack on this idea! I think that it will be helpful for them when adding Grafana screenshots.
Cool! Will post a thread in the infrastructure lounge and see what they say :)
I think that's a great idea. I've taken a pass at editing this issue's title and description to scope it just to the metrics tab, since I believe we are pausing on the logs tab for now. How does this seem?
Looks great, thank you so much.
Just confirming - do you want to stick with uploading screenshots for now or are you thinking we should try for screenshots + links? I suppose this is also something we can get feedback on from the SRE team but, just curious if you had any initial thoughts here?
I want to do both, but let's tackle them one at a time, starting with the method the SRE team prefers. I am imagining that the link is just an input field that receives and displays a URL!
Also, I'm assuming we'd apply this new functionality in both alerts and incidents, as both have metrics tabs. Is this a correct assumption from your POV?
I got some relevant feedback from the SRE team when they reviewed my proposal in Slack:
My takeaway from these comments is that, while a screenshot works for a first iteration, longer-term, just a screenshot probably won't be sufficient. Adding a link to explain where the screenshot was from will be important/necessary context.
I think we can accommodate this need, though, if we extend the initial screenshot uploading to include the comments that are available with uploaded designs currently. I put together a quick prototype showing how that could work, and how utilizing image comments should allow us to meet the SRE team's need to have both an image and the reference of where it's from.
There are, of course, different ways we could solve this problem but I was trying to utilize the patterns we already have in place with design management so we're not completely re-inventing the wheel here. WDYT?
@ameliabauerly Amazing, thanks for gathering that feedback. Yes, I like starting with the drag and drop of the screenshot followed by addition of the link.
I do like the idea of being able to comment on screenshots BUT I do not like having to click on the screenshot to see the comments. I think all of that needs to be immediately available when someone clicks on the metrics tab. My thinking is that during a fire fight people do not want things to be hidden. Thoughts?
My thinking is that during a fire fight people do not want things to be hidden. Thoughts?
Yeah, that might be true, @sarahwaldner. That being said, I'm not actually sure if every comment on every image needs to be immediately visible in the metrics tab. It seems possible that people may want more information just on specific graphs (for instance, the most recent graph) and not really need to see all the comments on all the others. I also wonder if having all the comments visible is essentially creating a separate discussion feed? If so, is that something we definitely want to do?
The benefits of allowing users to click to add comments:
We're following the pattern that's already been established with design management. In that way, we're not building anything totally new and untested, nor are we asking users to learn any new patterns.
We're allowing people to add links or comments but keeping those additional conversations contained in a way that doesn't create a separate discussion tab.
Let me know your thoughts on the above. We can always make this issue about just uploading the image and create a follow-up issue to continue discussing how best to add the link/comments as part of a separate iteration if that makes more sense :)
I also wonder if having all the comments visible is essentially creating a separate discussion feed? If so, is that something we definitely want to do?
Oh yeah good point. No we do not want a separate discussion.
We're following the pattern that's already been established with design management. In that way, we're not building anything totally new and untested, nor are we asking users to learn any new patterns.
Hmmm. I am not sure that this is a benefit. Design management is an entirely different use case with different personas. The people "using" this functionality in design management are NOT the same as the people reviewing metrics and responding to incidents, so really, we are asking our personas to learn a new interaction because they will not have used this before.
We're allowing people to add links or comments but keeping those additional conversations contained in a way that doesn't create a separate discussion tab.
I could see this being true.
We can always make this issue about just uploading the image
Yes. Let's do this. I do not even think we need to create an issue about the comments part. This has not been requested yet. We can wait to hear this feedback before we create an issue for this.
Yes. Let's do this. I do not even think we need to create an issue about the comments part. This has not been requested yet. We can wait to hear this feedback before we create an issue for this.
Okay, thanks for your thoughts, @sarahwaldner! Since we have a plan for this first iteration I've marked this issue ready for dev. Since we're pausing on adding the metrics tab to incidents right now, should I just make this issue about alerts so we can at least implement this design there? Or, do you plan to backlog this feature for both incidents and alerts until a later date?
I am going to solicit feedback from the SRE team in a different comment. If this is really useful for them, then we will schedule it for 13.5. I think that this is going to be a nice-to-have and that we will put it in the %Backlog for implementation after a few rounds of on-call schedule management.
Apologies for changing priorities on you frequently in the last week. It has been hard for me to orient after the team realignment. I am doing my best not to waste your time on non-critical issues 😓
Amelia Bauerly changed title from Improve the metric and log tab experience on alerts and incidents to Allow users to upload images to the metrics tab on incidents and alerts
Amelia Bauerly changed the description
I need your assessment of how important this issue would be to your team.
We are adding a Metrics tab to alerts & incidents where you would be able to upload screenshots of charts from Grafana (or anywhere). They would be stored in an obvious place and be immediately accessible to responders during an incident.
I am working on critically prioritizing issues so that we can get you onto GitLab Incident management as soon as possible.
Would you consider this feature critical or a nice-to-have? (or not worth our time)
Our workflow related to adding metrics to incidents currently leverages comments to add a screenshot (and most times a URL to the source graph where the screenshot was taken). An example: gitlab-com/gl-infra/production#2528 (comment 397360159)
In the incident review (using the same issue) we then add them to the description or reference the comment via comment URL.
If we still have the above capability I mention, in the Incident Issue type, then I'd consider this feature a nice-to-have, but would definitely be looking forward to replacing our current workflow for this with a tab focused on metrics to alleviate them currently getting lost in fast moving incident issue comments.
I would like to call out being able to also add a URL as metadata for the uploaded metrics screenshot would be something I'd want to see with this feature.
Would you be able to dogfood this immediately? Or would you need to wait until we build on call schedule management into GitLab to be able to take advantage of this usability improvement?
As a FYI on this thread - I added a simple modal so users can introduce a URL as they are uploading the screenshot. This will hopefully help make this feature more useful to the team while at the same time avoiding the secondary discussion issue we were discussing above, @sarahwaldner. I've updated the screenshots in the issue description to reflect the changes :)
@sarahwaldner / @ameliabauerly - I'd agree with Brent. I guess I would call it a higher priority nice to have? Not quite critical as we have something that "works", but what you are proposing would greatly improve what we currently do.
Would you be able to dogfood this immediately? Or would you need to wait until we build on call schedule management into GitLab to be able to take advantage of this usability improvement?
We'd be able to dogfood this ahead of any oncall schedule management.
@sarahwaldner - I notice this one is scheduled for 13.6 but it isn't on the planning issue. Since it sounds like the SRE team would be able to dogfood this immediately, it seems like it'd be another great improvement we could add to incidents asap. Maybe it's already on your radar but flagging just in case :)
Sarah Waldner changed title from Allow users to upload images to the metrics tab on incidents and alerts to Upload metrics images to the metrics tab on incidents and alerts
Sarah Waldner changed the description
I think it's worth investigating re-using the design management vue component. It seems like it should be a simple refactor because it's mostly decoupled from the rest of design management. My proposal would be to move it under /vue_shared and re-use it from there.
Would either of you have capacity to look at a draft MR and validate my approach?
Do you have any reservations with the general proposal?
@tristan.read likewise, happy to give a first-round review if you like.
I think moving to vue_shared is a great idea. The design_dropzone.vue component is slightly coupled to design management, but given its architecture I think it won't be too complex to refactor (e.g. the slot should make the refactor on the design management end quite straightforward, I imagine). Ping me on anything you like!
@ameliabauerly @sarahwaldner A few questions on this:
It's my impression from reading the comments below that the main goal is to add this to Incidents so we can dog-food Incident management
I would suggest, as a goal to de-scope the first iteration, that we limit manual uploading of images to Incidents only.
Would that be ok?
We could then add the automatically generated metrics from the alerts into the Incident in a subsequent MR.
Also,
Do we want to add manual metric uploading to alerts at all? It seems like an odd workflow to me, but maybe not :)
^ To add to this - there is some complexity in the idea of allowing metric uploads for alerts. For a couple of reasons:
What happens if there are different metric images uploaded on the alert and the incident? It should be fine to display both in the incident, but I don't think we can go in the other direction and show incident-uploaded metrics in the alert.
Permissions. Incidents have different access to alerts/metrics. A user may have permission to see the incident but not the alert. It might be confusing to see uploaded metric images attached to the incident, but not those attached to the alert. As far as I can tell the design doesn't account for this case.
A few questions on this: It's my impression from reading the comments below that the main goal is to add this to Incidents so we can dog-food Incident management
I would suggest, as a goal to de-scope the first iteration, that we limit manual uploading of images to Incidents only.
Would that be ok?
Yes that's great. Let's scope this to ONLY metrics image uploading on incidents.
Do we want to add manual metric uploading to alerts at all? It seems like an odd workflow to me, but maybe not :)
Maybe. Depends on how people triage and what information is available in the alert. Let's hold off. Good call, thanks
What happens if there are different metric images uploaded on the alert and the incident? It should be fine to display both in the incident, but I don't think we can go in the other direction and show incident-uploaded metrics in the alert.
We are going to focus on incidents. I've updated the description to only be about incidents.
Permissions. Incidents have different access to alerts/metrics. A user may have permission to see the incident but not the alert. It might be confusing to see uploaded metric images attached to the incident, but not those attached to the alert. As far as I can tell the design doesn't account for this case.
Great point. We are only going to enable uploading of metric images to Incidents.
Overall
This issue is ONLY going to be about uploading images to incidents. I've removed mention of alerts from the description.
Edit/Delete: I think it makes sense to make edit access levels the same as editing the incident itself - Guest for own incidents, Reporter for others' incidents.
View: Following the logic above we can make it the same as incidents - Guest for private projects, Any for public projects.
My reasoning:
metrics are useful to get a full picture of the incident.
this new feature replaces the current process which involves pasting screenshots into the issue/incident description, so we're not lowering any permissions. Incidents may still be marked as confidential.
The one potential piece of confusion with this is that the interactive metrics charts are Reporter restricted and Alerts are Developer. To query metrics associated with an alert, the user would need to have Developer permissions.
This means that some users (Guests and Reporters) will see metric images uploaded to the incident, but not the interactive metric chart if there is one.
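The access levels proposed above can be written down as a minimal sketch. It assumes one reading of "Guest for own incidents, Reporter for others' incidents" (ownership is decided per incident, not per image), and it does not model confidential incidents. The function names are hypothetical and this is not the actual policy code.

```python
from enum import IntEnum
from typing import Optional


class Role(IntEnum):
    """Project roles in increasing order of access (subset, for illustration)."""
    GUEST = 10
    REPORTER = 20
    DEVELOPER = 30


def can_view_metric_image(role: Optional[Role], project_is_public: bool) -> bool:
    """View follows the incident: anyone on public projects, Guest and up on private ones."""
    if project_is_public:
        return True
    return role is not None and role >= Role.GUEST


def can_edit_metric_image(role: Optional[Role], is_own_incident: bool) -> bool:
    """Edit/delete follows the incident: Guest for own incidents, Reporter otherwise."""
    if role is None:
        return False  # non-members can never edit or delete
    required = Role.GUEST if is_own_incident else Role.REPORTER
    return role >= required
```

Under this sketch a Guest or Reporter can view uploaded images, while anything gated at Developer (such as alerts) stays out of reach, which is the mismatch described above.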
@tristan.read - your breakdown of permissions for editing/deleting and viewing makes sense to me. One quick question with regards to this:
Guest for own incidents, Reporter for others' incidents.
Does this mean that a Reporter can edit/delete a metric image that someone else has posted, or just that they can edit/delete an image they've posted on someone else's incident? I presume the latter but, wanted to confirm.
The one potential piece of confusion with this is that the interactive metrics charts are Reporter restricted and Alerts are Developer. To query metrics associated with an alert, the user would need to have Developer permissions. This means that some users (Guests and Reporters) will see metric images uploaded to the incident, but not the interactive metric chart if there is one.
Interesting! I don't know that this is a huge issue because I don't imagine that Guests/Reporters will be doing a lot of digging into metrics. They might, however, want to at least see them.
Based on Sarah's comment above, it sounds like we're not going to focus on porting the alert metric over to incidents right now anyway. But, in terms of figuring out the right experience for this metrics tab in the longer-term, adding a couple of thoughts/questions.
If alerts have a developer permissions level, I presume that means that a Guest/Reporter couldn't view the associated alert if they clicked through to it. But, I'm wondering why that permission level needs to automatically apply to interactive charts added to incidents, especially if a metric is ported over to the incident? I guess I'm wondering if, when a metric is added to the incident, we can't update the permissions for just the metric, so that all information visible in the incident is actually visible to all people who can view the incident. Is that super complicated? Or, are we concerned that displaying the interactive metric to someone with Reporter permissions levels might reveal sensitive information?
I guess I'm also curious as to what would happen if a user embedded the same metric from the alert manually in an incident comment. I can't remember the permissions levels for embeds but my assumption is that interacting with the metrics in this scenario isn't limited to Developers -- or is it?
My assumption is that it would be good to have all metrics display together in the same tab in the longer-term (including the "embedded" metric from the alert, any additional images of metrics a user chooses to add - everything), and that anyone who can view the incident should also be able to view the metrics. It seems like that's currently the case with embeds and images posted in issue comments right now (anyone who views the issue can view whatever metrics are posted to it), and I'm guessing we'd want to preserve that same visibility in the incident metrics tab. Are those fair assumptions, @sarahwaldner?
Again, this isn't a problem we need to solve for this first iteration, but it would be good if we had a vision for how this could work so that we can start working towards it 💪
This is complicated and I would like to simplify it. Please do not do ANY work related to pulling in Prometheus metrics. Please ONLY focus on creating an interface where someone can upload images and links.
Any images that are uploaded should have the same permissions as the associated incident.
@ameliabauerly I kind of followed your break down. I do not want to worry about displayed embedded metrics on incidents. We are not investing in the metric category right now.
@sarahwaldner gotcha 👍🏻. We'll avoid bringing in prometheus metrics. This will simplify permissions and the tab overall. Thanks for the clarification 🙂
How do you plan to persist the file on ObjectStorage?
I've followed the steps here to add ObjectStorage support to the uploader.
You can see my MR which I'd planned to be ready for review shortly at !46845 (merged).
How do you plan to persist the file on ObjectStorage? With the current technology you selected it will not be possible to make use of Sidekiq jobs.
We are limiting to a single file upload at once, and this is handled by saving the model and the file handling being taken care of by CarrierWave + the uploader.
Is there another way I should be doing this?
Are there plans to limit the upload size to keep the controller time under a reasonable threshold?
I've added a 1MB limit within the GraphQL resolver to avoid unnecessary uploads.
I've followed the steps here to add ObjectStorage support to the uploader.
That is a good starting point, but without direct upload support at workhorse level, you will end up uploading to object storage in the rails controller. This is something we invested a lot of time and effort to move away from.
I've added a 1MB limit within the GraphQL resolver to avoid unnecessary uploads.
It will keep the upload time a bit under control, but we will still accept all the incoming data and pay for the ingress traffic.
This limit will be enforced too late, the request will be processed by workhorse in any case and temporarily dumped on disk, only to be rejected by the rails controller.
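To illustrate the "too late" concern in general terms, here is a generic sketch of rejecting an oversized upload before or while the body is read, rather than after it has been fully received. It is not how Workhorse or the Rails controller enforce limits; `read_upload`, `UploadTooLarge`, and the 64 KiB chunk size are assumptions made for the example.

```python
import io
from typing import BinaryIO, Optional

MAX_UPLOAD_BYTES = 1 * 1024 * 1024  # the 1MB limit mentioned above


class UploadTooLarge(Exception):
    """Raised when an upload exceeds the configured size limit."""


def read_upload(stream: BinaryIO, declared_length: Optional[int],
                limit: int = MAX_UPLOAD_BYTES) -> bytes:
    """Read an upload body, rejecting oversized requests as early as possible."""
    # Cheapest check: reject on the declared size before reading any bytes.
    if declared_length is not None and declared_length > limit:
        raise UploadTooLarge(f"declared {declared_length} bytes, limit is {limit}")
    # The declared size may be missing or wrong, so also cap the actual read.
    received = bytearray()
    while True:
        chunk = stream.read(64 * 1024)
        if not chunk:
            break
        received.extend(chunk)
        if len(received) > limit:
            raise UploadTooLarge(f"body exceeded {limit} bytes")
    return bytes(received)
```

A check placed in a resolver runs only after all of this reading has already happened upstream, which is why it saves processing time but not ingress traffic.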
We are limiting to a single file upload at once, and this is handled by saving the model and the file handling being taken care of by CarrierWave + the uploader.
Is there another way I should be doing this?
Implementing direct upload is the proper way to handle this situation, but unfortunately, it cannot be implemented over a graphql query.
Will it be possible to have this as a regular rest API? In that case, workhorse could upload on object storage on rails' behalf.
Some thoughts @ameliabauerly, I'm interested to know what you think:
We probably want to allow users to link to images, so they can be referenced in the incident discussion. We can use the image itself as the link. The link would point to the url of the standalone file, and clicking it navigates to that image in the browser. Just like how images currently work in issue descriptions/discussions.
I'd imagine we'd also want to allow users to remove images. I was thinking a delete icon in the corner of the card:
Which would open up a confirmation dialog, prompting the user to confirm deletion of the image.
These could be follow-ups for sure. Though the delete one would be good to do soon from the perspective of abuse-mitigation.
We probably want to allow users to link to images, so they can be referenced in the incident discussion. We can use the image itself as the link.
I think this is a great idea, @tristan.read! I suppose we'll need to define behaviors for when the user has added a URL to the image using the modal, and when they haven't.
If they haven't added a URL, clicking on the image will open up the image in a browser tab, just like with regular images in issues.
If they have added a URL, would clicking on the image still open up the image in a new tab, or would it just open up the link that's been added? I suppose we could also have both behaviors present - clicking on the link in the header would open up whatever link they added in the modal while clicking on the image itself will open up the saved version of the image, just as it would in the first scenario. I can't decide, though, if it would be confusing to have two different click options in the same image. WDYT?
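The "both behaviors" option can be sketched as follows, purely to make the two click targets concrete. `MetricImageCard` and `click_targets` are hypothetical names, and nothing here reflects a decided design.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MetricImageCard:
    file_url: str                     # where the uploaded image itself is served from
    source_url: Optional[str] = None  # URL added in the modal, if any


def click_targets(card: MetricImageCard) -> Dict[str, Optional[str]]:
    """Return where each clickable part of the card navigates to."""
    return {
        # Clicking the image always opens the stored file, as in issue descriptions.
        "image": card.file_url,
        # The header links to the added URL; None means it renders as plain text.
        "header": card.source_url,
    }
```

With this split, the image behaves the same whether or not a URL was added, and only the header changes.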
I'd imagine we'd also want to allow users to remove images. I was thinking a delete icon in the corner of the card
Yeah, I agree this would be a nice feature! I had originally played around with a 'x' icon in the corner of the image as a way of removing it. I also played around with an icon button, a plain trash icon and a text button in my figma file😅 I didn't add the delete button into the issue because I wasn't convinced I had the right visual solution yet. But, in looking at your screenshot, I think something like that could work just fine!
It looks like your mock-up is using the small icon button. Do you think we could try the regular icon button to match the edit button above it? Something like this:
Would that work?
The delete option seems like it would be good to include in the current iteration, if we can. The image-as-link thing could come as part of a separate iteration, if needed. Up to you and how complicated it would be to implement :)
A reminder @seanarnold - since as far as I can tell delete functionality isn't in the current implementation of the API - we will each have a small piece of work to add this after the main MRs are merged.
I'll raise a follow-up for 13.7 so we can track it / account for it in planning.
Yeah please create one. I will hold off working on this until the existing MR is approved in case there are any unforeseen changes that need to be made.
We may need to change how we manage the image uploads, and move away from GraphQL, instead using Direct Upload. This has quite a few steps to implement, so this may bump out of 13.6 which will be unfortunate.
@seanarnold Thanks for the heads up. This will have an impact on the frontend too - the apollo graphql client does state management for the frontend so this will need to be switched to something else, probably vuex. Let me know what the plan is as soon as you know so I can start implementing the network calls 🙂.
@crystalpoole Still WIP - I have got image uploading working and now need to find a way to provide a public link for the uploaded images.
@tristan.read I can provide you with the API spec (and probably working code) tomorrow so you can start to make your changes 🙇.
@sarahwaldner No, it didn't get merged in time. But it will be available for the infra team to trial before 13.8, since it will be merged this week and will have a project-specific feature flag.
Myself and @seanarnold will focus on getting the delete functionality (MR here #291011 (closed)) merged and then we can remove the feature flag and release the feature as a whole.
Status update: Metric uploads have been enabled on gitlab.com for the gitlab-org/gitlab group, as previously discussed. Internal users may start using it if they wish. I also activated the feature for monitor/monitor-sandbox for testing. At the time of writing this was deployed to the canary environment only (activate canary via https://next.gitlab.com/); it is now available on production.
@tristan.read Thanks for the update! I was able to get a screenshot of the feature from the example - it looks great, I am so excited. The RPI is ready to go for when the other MRs are merged.