Limit Infradev label to Severity 1 and Severity 2 issues
Why is this change being made?
I did a quick'n'dirty analysis of Infradev issues. More details are available on this Sisense board: https://app.periscopedata.com/app/gitlab/834407/Draft:-Infradev-Issues
Observations
- Outstanding (open) issues have kept growing since July 2020.
- The growth of S3+S4 outstanding issues outpaces S1+S2 issues.
- As of 2021-03-17, S1 = 6, S2 = 19, S3 = 26, S4 = 14.
- Average create-to-close time is 11 days and decreasing slightly.
Thoughts
- The Infradev process aims for prompt traction and fast resolution of critical production issues.
- It feels like the process has lost some of its edge in serving the above goals compared to its early days. This may be due to the sheer volume of S2/3/4 issues, which exceeds the capacity of “just do it”, so these issues had to fall back to normal scheduling.
- The decreasing create-to-close time didn’t help reduce outstanding issues.
Suggestions
- Scrub the existing outstanding issues and classify issues into 3 buckets by answering “when should it be resolved from the production perspective?” based on the Severity Guidelines and SLO Guidelines.
  - 1-week (Severity 1)
  - 1-month (Severity 2)
  - More than 1-month
- Set 1-week issues to S1/P1 and keep the Infradev label.
- Set 1-month issues to S2/P2 and keep the Infradev label.
- Remove Infradev labels from all other issues.
- Infradev labels are only applicable to Severity 1 and Severity 2 issues moving forward.
Based on the suggestions above, this MR opens a discussion to make the Infradev label applicable to Severity 1 and Severity 2 issues only.
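The triage rule in the suggestions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Issue` class, the `resolve_within_days` field (the answer to “when should it be resolved from the production perspective?”), the 7-day and 30-day cut-offs, and the label strings `infradev`, `severity::N` and `priority::N` are all hypothetical stand-ins, not part of any existing tooling.

```python
from dataclasses import dataclass, field
from typing import Set

INFRADEV = "infradev"  # assumed label name, for illustration only


@dataclass
class Issue:
    title: str
    # Hypothetical triage answer: how soon production needs this resolved.
    resolve_within_days: int
    labels: Set[str] = field(default_factory=set)


def _set_scoped(labels: Set[str], scope: str, value: int) -> Set[str]:
    """Return labels with any existing `scope::*` label replaced by `scope::value`."""
    kept = {label for label in labels if not label.startswith(scope + "::")}
    kept.add(f"{scope}::{value}")
    return kept


def triage(issue: Issue) -> Issue:
    """Apply the proposed three-bucket rule to one issue, in place."""
    if issue.resolve_within_days <= 7:       # 1-week bucket -> S1/P1
        level = 1
    elif issue.resolve_within_days <= 30:    # 1-month bucket -> S2/P2
        level = 2
    else:                                    # more than 1 month
        # Only the Infradev label is removed; severity is left untouched.
        issue.labels.discard(INFRADEV)
        return issue

    labels = _set_scoped(issue.labels, "severity", level)
    labels = _set_scoped(labels, "priority", level)
    labels.add(INFRADEV)  # keep (or ensure) the Infradev label
    issue.labels = labels
    return issue
```

For example, an issue that production needs within 5 days ends up labelled S1/P1 with the Infradev label kept, while one that can wait 90 days only loses the Infradev label.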
Author Checklist
- Provided a concise title for the MR
- Added a description to this MR explaining the reasons for the proposed change, per say-why-not-just-what
- Assign this change to the correct DRI
  - If the DRI for the page/s being updated isn’t immediately clear, then assign it to one of the people listed in the "Maintained by" section on the page being edited.
  - If your manager does not have merge rights, please ask someone to merge it AFTER it has been approved by your manager in #mr-buddies.
- If the changes relate to any part of the project other than updates to content and/or data files please make sure to ping @gl-static-site-editor (this requirement has been removed pending identification of a new DRI for the handbook) in a comment for a review and merge. For example changes to .gitlab-ci.yml, JavaScript/CSS/Ruby code or the layout files.
@cdu1 my main concern here is that a large number of "more than 1 month" issues still lead to a buildup of technical debt.
In fact, our current situation could be considered the aggregation of many "will fix later" technical debt issues. CTE namespace queries have been a well-known concern since 2018 (and there are issues about this going back that far). Likewise, the pipeline validation service was never considered something we needed to fix within one month, until suddenly the price of bitcoin shot up and we needed it urgently.
Technical debt, like financial debt, accrues over time. To extend the analogy, lots of small loans are no less expensive than one big loan, and are in some ways worse because they're less obvious and more hidden, making navigating through a major issue more perilous, like stepping through a minefield.
Additionally, smaller, less urgent infradev items will often prevent us from being able to execute workaround options for more serious issues.
For example, in our current situation, had better abuse controls been in place, we would have been able to use them to reduce the impact of the CTE namespace queries by applying more stringent controls through the pipeline validation service. But that too was a "more than 1 month" issue, so our options for workarounds are limited.
I think we should maintain an ability to have InfraDev issues be of any severity. As is pointed out in other comments, S3 and S4 issues can and do become S1 and S2 issues (or incidents) with time if not tended to as S3 and S4s.
Could we perhaps approach the challenge with these issues by:
- reinstating the InfraDev meeting
- adapting the work intake process for stage teams to include space for InfraDev issues
@cdu1 I agree with the concerns raised above. It seems that by limiting the Infradev process to S1 and S2 issues, we are removing the only avenue we currently have to triage S3s and S4s, and I don't see a replacement in your proposal. I wonder if, instead of this MR, we should bring this set of challenges into scope for the OKR we're working on to develop a SaaS-first framework. For example, how can we more generally reinforce the Infradev process, escalating S1s/S2s as needed while ensuring S3s and S4s get traction among stage groups and don't accumulate? Thoughts?
@marin @andrewn @brentnewton @awthomas I see your points and I completely agree that production issues deserve attention; nonetheless, we are working towards SaaS First. This MR stems rather from a pragmatic perspective of effectiveness and efficiency. As you can see from the charts at https://app.periscopedata.com/app/gitlab/834407/Draft:-Infradev-Issues, the growth of open S3 & S4 issues significantly outpaces that of S1 & S2, meaning that in practice priority was given to S1 & S2.
There appears to be a gap in the understanding of the Infradev process. When initially started, it was designed as a fast-track to get production issues resolved quickly by wiggling the development capacity for just-do-it without PM's involvement, if I recall correctly. This was also my understanding up to the point when I proposed this MR. On the other hand, it's clear from your comments that Infrastructure views the Infradev process as a place to raise visibility of production issues and ask for prioritization into the backlog.
Overall, I believe we are on the same page: get production issues resolved. Many ideas and improvements have been discussed and implemented to make the process more effective. For example, I added PM as stakeholders in !74622 (merged), and there were also discussions on how PM can accommodate production issues in planning in Product#2185 (closed).
With that being said, I'm closing this MR with the acknowledgement of different perspectives of this process. While I was looking at effectiveness and efficiency pragmatically, the Infrastructure team weighs the coverage and visibility. The good news is that with the active involvement of PM, we are in a good position to satisfy all the goals above.
> When initially started, it was designed as a fast-track to get production issues resolved quickly by wiggling the development capacity for just-do-it without PM's involvement, if I recall correctly.
I believe you are right, and this is what made the process quite successful in my view.
> With that being said, I'm closing this MR with the acknowledgement of different perspectives of this process. While I was looking at effectiveness and efficiency pragmatically, the Infrastructure team weighs the coverage and visibility. The good news is that with the active involvement of PM, we are in a good position to satisfy all the goals above.
Thanks for searching for the solution @cdu1, this is much appreciated.