Limit Infradev label to Severity 1 and Severity 2 issues

changed title from Limit Infradev issues to Severity 1 and Severity 2 to Limit Infradev label to Severity 1 and Severity 2 issues

changed the description

resolved all threads

added 1 commit

bc70611a - Apply 1 suggestion(s) to 1 file(s)

changed the description

S3 and S4 issues accumulate, and they do turn into higher-severity issues over time. If we only ever handle S1/S2s, we will always be handling issues after they have become critical.

I do not think this is a good move, so I am wondering: what is this change trying to resolve?

@cdu1 my main concern here is that a large number of "more than 1 month" issues still lead to a buildup of technical debt.

In fact, our current situation could be considered the aggregation of many "will fix later" technical debt issues. CTE namespace queries have been a well-known concern since 2018 (and there are issues about this going back that far). Likewise, the pipeline validation service was never considered something we needed to fix within one month, until suddenly the price of bitcoin shot up and we needed it urgently.

Technical debt, like financial debt, accrues over time. To extend the analogy, lots of small loans are no less expensive than one big loan, and are in some ways worse because they are less visible, which makes navigating a major issue more perilous, like stepping through a minefield.

Additionally, smaller, less urgent infradev items will often prevent us from being able to execute workaround options for more serious issues.

For example, in our current situation, had better abuse controls been in place, we could have used them to reduce the impact of the CTE namespace queries by applying more stringent controls through the pipeline validation service. But that too was a "more than 1 month" issue, so our options for workarounds are limited.

@cdu1

I think we should maintain an ability to have InfraDev issues be of any severity. As is pointed out in other comments, S3 and S4 issues can and do become S1 and S2 issues (or incidents) with time if not tended to as S3 and S4s.

Could we perhaps approach the challenge with these issues by:

Reinstating the InfraDev meeting
Adapting the work intake process for stage teams to include space for InfraDev issues

@cdu1 I agree with the concerns raised above. It seems that by limiting the infradev process to S1 and S2 issues, we are removing the only avenue we currently have to triage S3s and S4s, and I don't see a replacement in your proposal. I wonder if, instead of this MR, we should bring this set of challenges into scope for the OKR we're working on to develop a SaaS-first Framework. For example, how can we more generally reinforce the infradev process, escalating S1s/S2s as needed while ensuring S3s and S4s get traction among stage groups and don't accumulate? Thoughts?

@marin @andrewn @brentnewton @awthomas I see your points, and I completely agree that production issues deserve attention; nonetheless, we are working towards SaaS First. This MR stems rather from a pragmatic perspective of effectiveness and efficiency. As you can see in the charts at https://app.periscopedata.com/app/gitlab/834407/Draft:-Infradev-Issues, the growth of open S3 & S4 issues significantly outpaces that of S1 & S2, meaning that in practice priority was given to S1 & S2.

There appears to be a gap in the understanding of the Infradev process. When initially started, it was designed as a fast-track to get production issues resolved quickly by wiggling the development capacity for just-do-it without PM's involvement, if I recall correctly. This was also my understanding up to the point when I proposed this MR. On the other hand, it's clear from your comments that Infrastructure views the Infradev process as a place to raise the visibility of production issues and ask for prioritization into the backlog.

Overall, I believe we are on the same page: get production issues resolved. Many ideas and improvements have been discussed and implemented to make the process more effective. For example, I added PM as stakeholders in !74622 (merged), and there were also discussions about how PM can accommodate production issues in planning in Product#2185 (closed).

With that being said, I'm closing this MR with the acknowledgement of different perspectives of this process. While I was looking at effectiveness and efficiency pragmatically, the Infrastructure team weighs the coverage and visibility. The good news is that with the active involvement of PM, we are in a good position to satisfy all the goals above.

When initially started, it was designed as a fast-track to get production issues resolved quickly by wiggling the development capacity for just-do-it without PM's involvement, if I recall correctly.

I believe you are right, and this is what made the process quite successful in my view.

With that being said, I'm closing this MR with the acknowledgement of different perspectives of this process. While I was looking at effectiveness and efficiency pragmatically, the Infrastructure team weighs the coverage and visibility. The good news is that with the active involvement of PM, we are in a good position to satisfy all the goals above.

Thanks for searching for the solution @cdu1, this is much appreciated.

closed
