We cannot improve our model for Troubleshoot unless we know what is not working well today. Metrics and direct feedback from users are the way to learn about that.
We have a generic feedback form for users within Duo Chat, but there is no way for users to indicate that their feedback is about the RCA feature. Even as adoption of the feature increases, the volume of user feedback is not increasing.
The aim is to capture a proof of value metric for RCA. We don't have a way to learn from users whether the suggested response was helpful or not. There are some lagging indicators we could measure at a project level, but it would be great for users to have a way to give feedback immediately.
Proposal
After integration with Duo, ensure that feedback form input is tied to the RCA feature. We propose indicating in the form which feature (e.g. Troubleshoot) the user is providing feedback for.
Experiment plan by the PE team to test the hypothesis: a strategically placed mechanism that allows users to provide quick feedback on responses generated by RCA will increase the amount of feedback received for RCA.
Set up tracking for time to successful pipeline and time saved by using RCA, so we can triangulate these with direct feedback for a more rounded and credible inference in the future.
We also need to figure out how user reactions will be stored and how they can be viewed as reports later (a rough sketch follows).
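Purely as a sketch of what "stored and viewable as reports" could look like, assuming each quick reaction is recorded with a feature tag so RCA feedback can be counted on its own. The `DuoFeedbackEvent` record, its field names, and the feature keys are hypothetical, not GitLab's actual schema:

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DuoFeedbackEvent:
    """One quick-feedback reaction on an AI-generated response."""
    feature: str     # e.g. "troubleshoot_rca" vs. "duo_chat_general" (illustrative keys)
    reaction: str    # e.g. "helpful" / "not_helpful"
    project_id: int
    created_at: datetime


def feedback_report(events: list[DuoFeedbackEvent]) -> dict[str, Counter]:
    """Roll reactions up per feature so RCA feedback can be reported separately."""
    report: dict[str, Counter] = {}
    for event in events:
        report.setdefault(event.feature, Counter())[event.reaction] += 1
    return report


events = [
    DuoFeedbackEvent("troubleshoot_rca", "helpful", 42, datetime.now(timezone.utc)),
    DuoFeedbackEvent("duo_chat_general", "not_helpful", 42, datetime.now(timezone.utc)),
]
print(feedback_report(events))
# {'troubleshoot_rca': Counter({'helpful': 1}), 'duo_chat_general': Counter({'not_helpful': 1})}
```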
Intended users
Feature Usage Metrics
Does this feature require an audit event?
@rutshah the modal component is built in such a way that you can enter custom content specific to your feature. You can read about that in the Storybook documentation or see it in action in this MR.
I realize that currently we can't distinguish feedback messages by the chat capability the feedback is about. (E.g. our data should say that a given piece of feedback is about root-cause analysis.) To do so we need to do two things:
We should first try to measure how successful a feature is through a non-user-feedback metric (for example, % of suggestions accepted). User feedback should be an extra, not the primary method.
The current feedback mechanism in Duo Chat can be improved to simplify giving feedback - but I am not sure Katie can take this on soon. If another designer wanted to take that on, we would be happy to support that.
> We should first try to measure how successful a feature is through a non-user-feedback metric (for example, % of suggestions accepted). User feedback should be an extra, not the primary method.
@jackib This is in line with how @abellucci is approaching feedback for Vulnerability Resolution and Vulnerability Explanation. From 2 weeks ago in this thread:
> In my experience, users don't always fill out the form. Looking at Tableau for GitLab Duo, we have roughly 100 responses/month. Chat is available for all of our users, whereas VR will only be for Ultimate. There is value in direct customer feedback, but I think a better use of our time would be to instrument product usage data. Merged MRs speak for themselves, without asking the user to fill out a form.
> The current feedback mechanism in Duo Chat can be improved to simplify giving feedback - but I am not sure Katie can take this on soon. If another designer wanted to take that on, we would be happy to support that.
@jackib I want to take this up in 17.4. @Becka and I can collaborate if required.
Here is an earlier issue that has some context for how we got where we are. We do not have to stay with that approach, but I think the context will help you.
Not urgent or blocking, but you can check with Katie to see how this fits into the pattern library roadmap.
> The current feedback mechanism in Duo Chat can be improved to simplify giving feedback - but I am not sure Katie can take this on soon. If another designer wanted to take that on, we would be happy to support that.
> @jackib I want to take this up in 17.4. @Becka and I can collaborate if required.
@veethika @Becka Can you please tag me on the relevant issues where you are working this out? I'd love to stay in the loop on whatever mechanisms we are considering for tracking this style of user feedback. My primary concern is to make sure that whatever system we come up with is aligned with the long-term systems we are developing to measure proof-of-value metrics across Duo.
...on Duo Chat but removed it because people wanted to give more information, e.g. "the answer was helpful but incomplete". I tested a couple of different options, which led to the "give feedback to improve this answer" link at the bottom of the chat bubble. That link triggers a modal where each feature can put in custom questions.
@katiemacoy I'm bringing our conversation here (from Figma) so @rutshah and @rayana have more visibility.
We're thinking of the following reasons for introducing a quick feedback mechanism:
As Rutvik mentioned in the description of this issue: to capture a proof of value metric for RCA. We don't have a way to learn from users whether the suggested response was helpful or not. There are some lagging indicators we could measure at a project level, but it would be great for users to have a way to give feedback immediately.
After looking for an appropriate pattern I found rating to be the most fitting one, so our team can use this data as an indicator for tuning prompts going forward (sketched below). We of course see value in the form, but presenting it as the first step seems overwhelming for someone who only intends to share a quick reaction, and it also deviates from the pattern used in other AI services.
We're not receiving a meaningful amount of feedback. Usage is definitely increasing, but in our opinion the form isn't used as often as it could be for sharing feedback, and we want to make giving feedback simpler for our users.
I can get on a call with you to discuss our approach here and the options for testing these mechanisms with Duo users.
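To illustrate the "indicator for tuning prompts" idea mentioned above, here is a minimal sketch that assumes quick-reaction ratings are stored as (prompt version, 1-5 score) pairs; the version labels and scores are made up:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (prompt_version, rating) pairs collected from the quick-reaction UI.
ratings = [("v1", 2), ("v1", 3), ("v2", 4), ("v2", 5), ("v2", 4)]

by_version: dict[str, list[int]] = defaultdict(list)
for version, score in ratings:
    by_version[version].append(score)

# Average rating per prompt version, as a rough signal for whether a prompt change helped.
for version, scores in sorted(by_version.items()):
    print(f"{version}: avg rating {mean(scores):.2f} over {len(scores)} responses")
# v1: avg rating 2.50 over 2 responses
# v2: avg rating 4.33 over 3 responses
```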
@rutshah in my 1:1 with @veethika I shared how we are hoping to approach the measurement of outcomes for Duo Workflow.
- Overall value: ≥ X% increase in value compared to the non-AI approach.
  - Primary evaluation: avg. usefulness score from the AI UX survey.
  - Secondary evaluation: avg. effectiveness score from the AI UX survey.
- Speed: ≥ X% reduction in time to complete workflow Y compared to the non-AI approach (source: MR for AI feature benchmarks).
  - Primary evaluation: instrumentation.
  - Secondary evaluation: KLM (Keystroke-Level Model) applied by the UX DRI to predict a skilled user's error-free task time to within 10-20% of the actual time.
- Reliability: ≥ X% decrease in [something unwanted in the workflow, e.g. attempts to fix the pipeline] compared to the non-AI approach.
- Efficiency: ≥ X% decrease in [something that causes waste/inefficiency, e.g. cost of running pipelines] compared to the non-AI approach.
Some of this is based on how we define outcomes for Jobs To Be Done (speed, reliability, and efficiency dimensions). Most of these measurements are quantitative; only the "overall value" is qualitative. For the "overall value", something like the AI UX survey will likely be more reliable and efficient than in-app user feedback. But my understanding is that decision-makers are more influenced by quantifiable outcomes, not user-reported metrics. Happy to chat more about this if you'd like.
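For the quantitative dimensions above, the comparison boils down to a percent change against the non-AI baseline. A minimal sketch with made-up numbers (the 50 and 35 minute figures are purely illustrative):

```python
def percent_reduction(non_ai: float, ai: float) -> float:
    """Relative reduction versus the non-AI baseline, as a percentage."""
    return (non_ai - ai) / non_ai * 100


def percent_increase(non_ai: float, ai: float) -> float:
    """Relative increase versus the non-AI baseline, as a percentage."""
    return (ai - non_ai) / non_ai * 100


# Example: 50 minutes to a green pipeline without RCA, 35 minutes with RCA.
print(f"Speed: {percent_reduction(50, 35):.0f}% reduction in time to successful pipeline")
# Speed: 30% reduction in time to successful pipeline
```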
@rutshah I revisited the proof of value metrics listed in https://gitlab.com/gitlab-com/packaging-and-pricing/pricing-handbook/-/issues/542#ask and in hindsight I don't think cycle time and success ratio would be enough to provide a clear quantitative indicator of RCA's performance and usefulness. Cycle time has a lot of influencing factors, and the pipeline failed/success ratio doesn't really capture whether a pipeline went from failed to success. How about we include one of the following metrics, compared between RCA and non-RCA users, besides recording individual feedback?
Compare between RCA and non-RCA users:
- Avg. time to successful pipeline for each commit/event.
- Time to flip status from failed to success (sketched below). I assume that since we send a notification for the failure and then for the change in status to success, we can record the average time between the two statuses.
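A rough sketch of how that second metric could be computed from pipeline status-change timestamps; the event shape and the timestamps are hypothetical, not an existing GitLab API:

```python
from datetime import datetime, timedelta

# Hypothetical pipeline status-change events for one commit/MR, oldest first.
status_events = [
    ("failed", datetime(2024, 8, 1, 10, 0)),
    ("failed", datetime(2024, 8, 1, 10, 40)),
    ("success", datetime(2024, 8, 1, 11, 15)),
]


def time_failed_to_success(events: list[tuple[str, datetime]]) -> timedelta | None:
    """Duration from the first failure until the first subsequent success, if any."""
    first_failure = next((t for status, t in events if status == "failed"), None)
    if first_failure is None:
        return None
    for status, t in events:
        if status == "success" and t > first_failure:
            return t - first_failure
    return None


print(time_failed_to_success(status_events))  # 1:15:00
```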
What level are you proposing these metrics at (e.g. individual pipeline, project, namespace)?
My underlying value hypothesis for RCA has been that without RCA you will take X iterations to get to a successful pipeline, and with RCA you will take X-N iterations. So the metric that helps us quantify that reduction in iterations could be used.
> without RCA you will take X iterations to get to a successful pipeline
@rutshah from our research we learnt that the path to figuring out the right fix is a struggle in itself. The real value RCA adds is crunching 7 steps into one. So despite multiple iterations, it can make the time to success shorter. This is the reason I strongly feel we should measure time here instead of iterations.
> What level are you proposing these metrics at (e.g. individual pipeline, project, namespace)?
At the project level should be good. The dashboard I'm envisioning is a chart showing the average time to go from failed to success status for jobs in projects with RCA enabled vs. not enabled (see the sketch below). WDYT?
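As a sketch of the comparison behind that chart, assuming we already have per-project failed-to-success durations and an RCA-enabled flag; all numbers below are made up:

```python
from statistics import mean

# Hypothetical per-project samples: (rca_enabled, minutes from failed to success).
samples = [
    (True, 35), (True, 50), (True, 42),
    (False, 70), (False, 95), (False, 60),
]


def avg_minutes(rca_enabled: bool) -> float:
    """Average failed-to-success time for projects with or without RCA enabled."""
    return mean(minutes for enabled, minutes in samples if enabled is rca_enabled)


print(f"RCA enabled:  {avg_minutes(True):.1f} min avg failed -> success")
print(f"RCA disabled: {avg_minutes(False):.1f} min avg failed -> success")
```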
@veethika - Here's the current proposal on the different ways we can measure RCA (#480070), which includes what you have proposed.
Coming back to this issue - are we now concluding that we don't want to collect individual response feedback and will rely on other ways to understand the RCA impact?
@rutshah we are still adding the contextual feedback. I've arrived at a proposal that should work for us. I will share it with you today and update this issue.