In addition to Development, Quality, and UX’s collaboration with PM (i.e. the Quad), we need prioritization that helps Infrastructure and Security.
These topics are already at the top of our prioritization table (https://about.gitlab.com/handbook/product/product-processes/#how-we-prioritize-work) but I consistently hear that we’re not executing enough of this work. Is there a place in this Quad OKR planning to cover those topics and get those departments’ voices heard? CC @johnathan Hunt @steve Loyd
Oh my, this is a great topic. I'd love to get infradev + security voices into the product dialogue more. I responded in this comment on how I could see it working, although I am interested in seeing the inputs.
Anoop Dawar changed title from Planning as a team - to Planning as a team - including Security and Infra
@adawar I think we really need to see Product play a more active role in prioritizing infradev issues within the context of workload management in Dev teams. By this I mean that we see EMs and their teams doing the impossible to get to these issues (something for which I am deeply thankful), but they generally need to find the space to do so. A deeper understanding of these issues should help make this a steady workflow that relies less on individuals nudging EMs and more on an overall rhythm.
@gerir - is there a particular pattern you are seeing when issues fall through the cracks or do not receive the required attention? For example, do you find it harder to engage on the urgent items, or are the S2/S3 ones where things tend to continue to slip or not be prioritized? I flipped through the infradev board, but that is a point-in-time snapshot for me.
@awthomas - what are your thoughts here, and what do you see?
A couple ideas which may or may not be any good:
Urgent issues: Could we dogfood GitLab's Incident functionality, and perhaps the pager support, to auto-notify the group in Slack and/or @mention the proper people based on the assigned label? Maybe it could follow up / escalate as well if not ACK'd. This is basically a light-weight "carry a pager" idea.
Lower priority issues: Create a planning issue template, which includes links to the relevant boards / queries for each of the product prioritization items. The goal is to remind the quad to go through these when they are reviewing candidates each milestone. We could also leverage a bot, similar to what we do for security, to remind the team of any SLOs.
Finally, and we may already have this: some way to easily view outstanding availability/security/etc. issues in aggregate, broken down by group and age, to see if we have any hotspots.
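To make the hotspot idea concrete, here is a minimal sketch of counting open issues per group and age bucket. The issue records, the `group::` label values, the dates, and the 30/90-day bucket boundaries are all made up for illustration; a real version would read issues from the tracker rather than a hard-coded list.

```python
from collections import Counter
from datetime import date


def hotspot_counts(issues, today, age_buckets=(30, 90)):
    """Count open issues per (group, age bucket).

    `issues` is a list of dicts with hypothetical keys "group" and "created".
    """
    young, old = age_buckets
    counts = Counter()
    for issue in issues:
        age_days = (today - issue["created"]).days
        if age_days <= young:
            bucket = f"0-{young}d"
        elif age_days <= old:
            bucket = f"{young + 1}-{old}d"
        else:
            bucket = f">{old}d"
        counts[(issue["group"], bucket)] += 1
    return counts


# Illustrative data only; not taken from any real board.
issues = [
    {"group": "group::access", "created": date(2021, 1, 5)},
    {"group": "group::access", "created": date(2020, 10, 1)},
    {"group": "group::gitaly", "created": date(2021, 3, 1)},
]
counts = hotspot_counts(issues, today=date(2021, 3, 15))
for (group, bucket), n in sorted(counts.items()):
    print(group, bucket, n)
```

A group with many issues in the oldest bucket would be a candidate hotspot to raise during milestone planning.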
I don't think the solution is as simple as just increasing the table in that section to 6 boxes from 4. The input/influence from Infra and Security is different from the existing quad, but does lack sufficient representation. Maybe also building out more of what and how the quad actually works can inform what we should change further. For example, I'm interested to understand where Infra should be more present in order to improve influence on planning.
@joshlambert My sense is that during planning, it may be difficult for PMs to weigh features that have a quantifiable impact on adoption against infradev/tech debt items where sometimes the risk of not tackling them is hard to quantify. I do think it is worth us thinking about standardized metrics we could adopt to better quantify the impact of some of these items so that PMs can make an 'apples-to-apples' comparison of tech debt and features during planning. Here are some examples of the types of metrics we could consider: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10982#quantifying-business-riskimpact. Would love to hear if there are other metrics we should consider as well.
That combined with accountability through OKRs/team goals around the operational aspects of features, and tracking of open items via dashboards/bots would go a long way towards solving it.
Perhaps we can start incorporating the Stage Group Dashboards or Error budget dashboards (if this is the correct dashboard) into planning. There are some pros and cons to this approach.
Pros
It gives us a big picture of the health of the application, feature, stage, etc
We are regularly and proactively looking at the health of our application
Cons
Individual issues are not identified, it requires further investigation
There's a lot of information on these dashboards and may require some refinement for planning purposes
Perhaps by incorporating these dashboards it will also inspire some monitoring features that we can add/dogfood into our application.
During planning, ensure that there is always a % of the work that focuses on tech debt/infradev. I will use the term tech debt in the rest of this example but we can use a different name as needed. In previous companies I've implemented loose agreements at approximately a 70/30 split (features to tech debt). One pro of this approach is that we are carving out time to be proactive in addressing some of this looming tech debt. During planning and grooming, we can discuss the impact of not doing this work and how it might negatively impact the product down the road. In this scenario, work is surfaced/identified by the engineers, placed in the backlog and labeled accordingly (e.g. ~"tech debt"). The engineering manager(s) then work with the team and product counterparts to prioritize and schedule this work, including Quad Planning perhaps. This type of coordination and communication should be taking place now anyway, but calling out the % set aside for tech debt can help to streamline this process.
There can be cons to this approach too, such as being rigid about the % split. The percentages should be viewed as guidelines and adjusted as a team.
This does not preclude or remove the need for the ~infradev process we have in place. Things will still arise that will require immediate attention.
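As a rough illustration of the split described above, the sketch below divides a milestone's capacity between feature work and tech debt. The function name, the use of "points" as the capacity unit, and the example numbers are assumptions for illustration; the share is a guideline to adjust per team, not a fixed rule.

```python
def split_capacity(total_points, debt_share=0.30):
    """Split a milestone's capacity between features and tech debt.

    `debt_share` is the fraction reserved for tech debt (0.30 for a 70/30 split).
    """
    if not 0 <= debt_share <= 1:
        raise ValueError("debt_share must be between 0 and 1")
    debt = round(total_points * debt_share)
    return {"features": total_points - debt, "tech_debt": debt}


# Hypothetical milestone with 40 points of capacity.
print(split_capacity(40))        # 70/30 guideline
print(split_capacity(40, 0.20))  # a team that reserves 20% instead
```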
I personally used this approach, @craig-gomes, while I was the PM for Release Management. So I'm generally supportive of it for product areas that are more mature.
This pattern may not be effective for new categories.
A potential explanation for an increase in infradev/security issues is that we've accumulated technical debt in many areas, which is now starting to compound negatively. As @jreporter pointed out, this is likely different across categories and we should acknowledge that.
I personally like Will Larson's model for engineering teams with four different states. My suspicion is that we are experiencing more teams falling behind or treading water, which then in turn leads to more of the issues we are discussing here. I think this is pretty normal given our scale and two immediate solutions would be to
Hire more folks in areas that really need more capacity
Increase focus and reduce concurrency for teams that are treading water
As PMs we can definitely help with the second item and advocate for the first one if needed. PMs should, in collaboration with the other stakeholders, define iterative improvements to address specific areas within their categories that have accumulated technical debt. This process is pretty similar to any other work that goes into the product development process. Rather than focusing on new features, this focuses on reducing debt and items like this should be validated, broken down and prioritized against other work. Maybe a 70-30 cut works for some teams, as @craig-gomes pointed out, maybe others require more or less. In any case, we should avoid situations in which value is delayed until "the large migration" is done a year later - usually we can iterate on high-value items faster.
PMs can also help to create focus in teams by reducing the amount of work that happens in parallel.
To stay with Will Larson's model, I see it as the quad DRIs' responsibility to ensure that teams move into the "innovating state", and that will take time. If we don't drive these changes through our overall teams, we may just end up playing whack-a-mole.
Firstly, we're seeing more infradev issues being created. February is also exceptionally high for the creation of new infradev issues. The reason for this might be the large number of incidents raised in January and February of this year since infradev issues are often raised as corrective actions for incidents.
Secondly, we are seeing fewer infradev issues closed on average each month. I don't have an answer to this.
It is possible that, with an accumulation of issues in a few stage groups, those groups are simply unable to prioritize these issues alongside the other work that needs to be done.
[EDIT] Changed the colour of the priority levels to match the label colours
What are we going to do about it?
Each team has a specific place in the devops lifecycle and has challenges that are unique to them. So it is hard to recommend a process that will meet everyone's needs. Teams need to be able to choose the methods that work for them. But that also means they need to know what success for their team looks like to know if their methods are effective. Do the current measures of success include anything about security, infrastructure, incidents, or GitLab.com?
@rnienaber Thank you so much for adding data to this!
> Firstly, we're seeing more infradev issues being created. February is also exceptionally high for the creation of new infradev issues. The reason for this might be the large number of incidents raised in January and February of this year since infradev issues are often raised as corrective actions for incidents.
That makes sense. Thanks!
> Secondly, we are seeing fewer infradev issues closed on average each month. I don't have an answer to this.
Me neither - a question is if infradev priority is sufficiently understood in all areas. Maybe there is an opportunity to highlight in a PM meeting so everyone knows why this is important.
@rnienaber the color coding in the Open issues by group is the inverse of our label colors, correct? So a P1 in your plot is green, whereas the label color is red priority1? My brain just needed to process this correctly
> But that also means they need to know what success for their team looks like to know if their methods are effective. Do the current measures of success include anything about security, infrastructure, incidents, or GitLab.com?
Not sure but @jreporter started #2095 (closed) which may make it easier to include these as KRs for groups with many infradev issues.
@fzimmer I was so focused on how many NULL entries there were that I didn't notice the colours were backwards! I've updated the image so that it doesn't trip others up too.
> Maybe there is an opportunity to highlight in a PM meeting so everyone knows why this is important.
I agree very much. I really believe that the Quad prioritizes the issues that they understand to be the most important and I think a lot can be done to improve how we communicate the impact an issue will have - beyond just labelling issues as P1 or S1. As much as people can try to explain importance better, it's also necessary for others to ask questions if they don't understand why something is important.
Do you think we aren’t labelling the same way?
It seems that the Infradev label has always been applied in the same way.
Perhaps we are identifying more issues?
The data shows that we are. But this needs to be seen in the context of the increasing rate of change. The MR rate below shows that we have steadily increased the number of MRs that we produce.
And the MTTP (Mean-Time-To-Production) rates show that we ship these changes to production at a faster rate. The first shows from November 2019 to February 2020, and the second shows from March 2020 to the current date.
This shows that we have changed how we put our work onto the production systems.
The question is, have we done enough to cope properly with the result of the increased rate of change?
> It seems that the Infradev label has always been applied in the same way.
Got it, @rnienaber. I was making sure we haven't added any GitLab Bot triaging or new automation. Thank you for this additional context; it is very helpful!
@gl-product - Please provide your feedback. Would like to listen to your ideas as we figure out a good way to include security and infra requests into our mainstream prioritization workflow.
@adawar @sloyd - My initial suggestion would be along the lines of what Steve suggested below. We've had a structure for assigning stable counterparts from Infrastructure to Product Groups (example) - but never done so beyond the manager level. Starting there would help bring a specific voice, with context, to each product group for the execution that impacts infrastructure and critical product abilities that are missing today.
This is at the heart of "DevOps" teams, and it would be an accomplishment if we modeled the kind of shared stable counterparts that we suggest are optimal models to our customers.
Doing so would have the slightly disturbing effect of turning our Quad into a Pentagram.
Note - We do this today with the Security team by having assigned App Sec Engineers for each group which might be why this issue's discussion has tended to focus on Infra, not Security issues.
@adawar For my group specifically, visibility isn't a problem. I'm aware of the issues since I've been tagged by the infra team (thank you) and have good information on business impact. The underlying cause for Access issues is GitLab.com growing rapidly and exposing issues scaling certain architecture components. We have to slot these issues along with all other workstreams so prioritization can be tough. I'm including at least one item per milestone to continue to make progress.
@adawar - For ~"group::gitaly", we're tagged directly so this has been standard operating procedure for us so far. I'd be happy to make things more official, but I don't currently think that we're missing prioritization for Infra and security requests.
@adawar From my perspective I would say that I've generally been aware of and tried to handle issues that are brought from those groups back to the groups I've been a PM of.
On the flip side, however, I don't know that PMs have enough insight into how to work with those teams when we do need assistance or how to properly get work scheduled with those teams if there is something we need.
@adawar - for ~"group::testing" we're dedicating 20% of time to ~feature::maintenance issues specifically but would probably assign infra issues there as well. For any security issues we follow the existing Priority/Severity guidelines.
re: understanding business impact - For me it's best to break these down into one of two categories. 1) Does this fix something broken/causing pain or 2) is this preventative. If the latter I ask for impact of not doing it and likelihood of experiencing the pain as well as LOE for the work. It may be worth a couple small issues being pulled in to prevent a potential week of downtime for instance.
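One way to picture the preventative case is a simple expected-cost comparison: likelihood of the pain times its impact, weighed against the level of effort. This is only a sketch; the function name and the numbers are hypothetical, and it assumes impact and LOE have been expressed in the same unit (for example, engineer-days of cost), which is itself a judgment call.

```python
def worth_scheduling(likelihood, impact_cost, loe_cost):
    """Compare the expected cost of not doing preventative work with its effort.

    likelihood:  estimated probability (0..1) of hitting the problem
    impact_cost: cost if it happens, in the same unit as loe_cost
    loe_cost:    level of effort to do the preventative work
    Returns (expected_cost, should_schedule).
    """
    if not 0 <= likelihood <= 1:
        raise ValueError("likelihood must be between 0 and 1")
    expected_cost = likelihood * impact_cost
    return expected_cost, expected_cost > loe_cost


# Hypothetical numbers: a 50% chance of an outage costing 5 days vs. 2 days of work.
print(worth_scheduling(0.5, 5, 2))
```

The estimates will be rough, so the result is better used to start the conversation with the requester than to make the decision on its own.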
Thus far I have found no ~infradev issues for my group so if there are any it would be an awareness issue.
Before this issue, I was only aware of ~bug, ~security and ~performance as items I needed to watch. I will add ~infradev to the list.
SCA specifically has one BE dedicated as the reaction rotation each milestone, and that person is responsible for triaging incoming bugs / security issues / dependency updates etc. and addressing the high priority ones.
Note: @gonzoyumo should we make sure we specifically add ~infradev to the bug triage process (step 4) or even as a step 5 as we now have an additional label to monitor?
This reaction rotation has usually let us address all P1 / S1 bugs and P1 / S1 security issues promptly, though P2 / S2 and below suffer unless I place more BE time against them.
We are of course RCAing items and working to improve testing to reduce bugs as well as part of a holistic path to have less incoming items.
As such, unless someone can indicate where and if we are failing to meet the requested priorities (where stability, security and bugs are above new features), or indicate that I must address S2/P2 more rapidly, I have no plans to change ~"group::composition analysis" at this time @adawar cc @david
I think @NicoleSchwartz has a great point above: we should consider including ~infradev as a triage bot label to watch so that we all get it included in the weekly bug report. Looking at the infradev label, it doesn't look like we consistently have bug or priority/severity set on them, so they may slip through the existing triage bot issues.
Thoughts before I go track down how to modify that triage bot?
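To show what "slipping through" could look like in practice, here is a small sketch that flags infradev issues with no severity or priority label. It is not the triage bot's actual logic; the `severity::` and `priority::` label prefixes and the issue records are assumptions for illustration.

```python
def missing_triage_labels(issues):
    """Return the IIDs of infradev issues lacking a severity or priority label.

    `issues` is a list of dicts with hypothetical keys "iid" and "labels".
    """
    flagged = []
    for issue in issues:
        labels = issue["labels"]
        if "infradev" not in labels:
            continue
        has_severity = any(label.startswith("severity::") for label in labels)
        has_priority = any(label.startswith("priority::") for label in labels)
        if not (has_severity and has_priority):
            flagged.append(issue["iid"])
    return flagged


# Illustrative issues only.
issues = [
    {"iid": 1, "labels": ["infradev", "severity::2", "priority::2"]},
    {"iid": 2, "labels": ["infradev", "performance"]},
    {"iid": 3, "labels": ["infradev", "severity::3"]},
    {"iid": 4, "labels": ["bug"]},
]
print(missing_triage_labels(issues))
```

A list like this could feed a dedicated infradev section in the weekly report, separate from the general bug section.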
@adawar I'd recommend a known working solution: ownership. Out of the many ideas provided here, @tmccaslin's is the closest: make the PMs own their area not only until a feature is released, but also in production.
In a sense this is a move towards a DevOps or SRE model, but it takes into account that this is not possible at GitLab as we have a monolith with a very decentralised organisation (pretty weird for Conway's Law).
@cdu1 oh I should have clarified, I believe that Infradev issues already have logic with severity applied due to them also having labels like ~availability, ~bug, ~performance attached to them. I’m looking at the board, and those issues in fact have multiple labels, so there would potentially be triage-bot conflicts if the Infradev label had its own triage bot severity/priority.
I did a few spot checks of infradev issues, and while some have ~missed SLO attached, it seems to predate the infradev label. It doesn't seem like triage bot is active on these.
I'd vote for a boring iteration of adding infradev to the triagebot, as weekly triage reports, if it isn't already.
> I believe that Infradev issues already have logic with severity applied due to them also having labels like ~availability, ~bug, ~performance attached to them.
This way we don't require additional effort on the Infra team to put a whole bunch of labels on issues when dealing with incidents. I think it's also worth breaking out infradev specifically in the weekly triage report, as the bug section can be large for some groups.
This is an interesting discussion, and there are a number of good ideas there which would allow us to resolve some of the issues we're seeing with infradev.
I do not believe that Stable Counterparts from Infrastructure into stage groups scales; we tried it before and it did not work. We next tried it on the management level, and that is also not scaling. The problem here is that the Infrastructure team is just too small to be able to run GitLab.com and participate in every single feature. The infradev process is something that we do as a reaction when a problem becomes too big to handle. Extending Quad Planning to Quad+2 will only help somewhat.
We do not have to run before we walk. We don't have to immediately create DevOps teams, because that change is a cultural shift more than anything. There is a step we can take before that which will naturally push us to improve, similar to how CD on .com made us think differently about shipping code.
I believe that the solution for our current situation is to start talking about how do we ensure that a severity 4 issue does not turn into severity 1 issue by applying a focus when necessary. Other companies larger than us already solved this through the concept of contracts between stakeholders where you have an indication on when you need to focus on improvements vs. shipping new features. At companies like Google, they call it error budgets.
> Perhaps we can start incorporating the Stage Group Dashboards or Error budget dashboards (if this is the correct dashboard) into planning. There are some pros and cons to this approach.
The Scalability team has been working on this approach with the goal that error budgets become a contract between Product and Engineering (Infra, Development, Quality, etc.) on when to focus on technical improvements vs. feature shipping. The stage group dashboards are built to inform individual feature teams on how their stuff works in production on the day to day, with the goal of placing budget spend on the stage group level.
What does this mean?
This means that if the team is not spending their allocated budget (their features work correctly), they can focus on feature shipping. When the budget spend increases, the teams have to focus on what is causing the spend, and this way they can target when to reach out to all stakeholders asking for help before issues develop into high severity urgent issues. This approach would give Product insight into how they should plan their prioritisation schedule proactively, rather than waiting for an incident or someone else to ask them to shift everything.
I urge us to think about how to adopt this approach, because this is the industry-standard way of dealing with the contention between feature shipping and product performance/security/stability.
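For anyone less familiar with the mechanics, here is a minimal sketch of how an error budget can be computed from a target and request counts. This is the generic request-based formulation, not necessarily how the Scalability team's dashboards calculate it; the 99.9% target, the request counts, and the "focus" rule at 100% spend are hypothetical.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarise error budget spend for one stage group over a period.

    slo_target:      fraction of requests expected to succeed, e.g. 0.999
    total_requests:  requests served in the period
    failed_requests: requests that failed in the period
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures > 0:
        budget_spent = failed_requests / allowed_failures
    else:
        budget_spent = float("inf") if failed_requests else 0.0
    # Assumed rule of thumb: once the budget is used up, shift focus to reliability.
    focus = "reliability" if budget_spent >= 1 else "features"
    return {
        "allowed_failures": allowed_failures,
        "budget_spent": budget_spent,
        "focus": focus,
    }


# Hypothetical month: 1,000,000 requests against a 99.9% target with 250 failures.
report = error_budget_report(0.999, 1_000_000, 250)
print(report)
```

In this made-up example about a quarter of the budget is spent, so the group could keep shipping features; a rising spend is the signal to start the conversation early.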
Who is the DRI for Error budget spend enforcement, @marin? I would also be interested in how we make sure that Product is proactively informed enough in the budgeting cycle to adequately prioritize scope. Are we relying on severity SLOs mainly?
@jreporter we did not define the "enforcer" as a person, but as our SLA for GitLab.com. Basically, Infrastructure keeps bringing up our availability and budget spend for GitLab.com and based on that teams should take their own actions.
We are still early in this journey as we are still building out the necessary dashboards, but I am of the opinion that once we have this completed, we can hold every stage accountable for their own spend and ask questions about why the situation is deteriorating (for example).
@marin ah, right, I think you might be hitting on what I am asking. Degradation of this metric would be lagging. Are there any leading indicators we can consider?
> Degradation of this metric would be lagging. Are there any leading indicators we can consider?
@jreporter Excellent observation and I am so happy you've asked this! We are building out a measure of "failures" with "mean time between failures". The objectives for that measurement are aspirational so we are holding ourselves to a higher standard than what we promise to our customers. If the service starts slipping, we will see that in the numbers there and it will give us an indication of where we need to focus next. The Scalability team is responsible for that metric and it is possible for the team to give a heads up prior to the budget going into the red.
Put the other way, while we measure SLA/error budget based on what we expect our customers to experience, we are holding ourselves to a higher standard so that we can get better at predicting.
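As a rough illustration of the "mean time between failures" idea, the sketch below averages the gaps between consecutive failure timestamps. It is a simplification (it measures time between failure starts and ignores how long each failure lasted), the timestamps are made up, and it is not a description of the measurement the Scalability team is building.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(failure_times):
    """Average gap between consecutive failures, or None with fewer than two."""
    if len(failure_times) < 2:
        return None
    ordered = sorted(failure_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical failure timestamps.
failures = [
    datetime(2021, 1, 1),
    datetime(2021, 1, 3),
    datetime(2021, 1, 9),
]
mtbf = mean_time_between_failures(failures)
print(mtbf)
```

A shrinking value over successive periods would be the kind of early signal discussed above, visible before the customer-facing budget goes into the red.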
Thanks @marin - the error budget framework is great. Do you think the stage-level error budget dashboards are ready to put more process around? We could start as lightweight as encouraging PMs to look at them prior to scheduling each milestone, along with infradev for example.
Another place to look for emerging performance problem detection would be in the #mech_symp_alerts channel. As the channel title states: "This channel contains alerts of badly performing endpoints in production. Help raise issues and squash them!"
This topic came up during the Verify DB Rapid Action because a number of the problems listed in the rapid action were previously known problems that were deprioritized. If the goal of this issue is to be truly proactive about lagging performance then #mech_symp_alerts would be a good channel to monitor to catch troublesome endpoints early.
> Thanks @marin - the error budget framework is great. Do you think the stage-level error budget dashboards are ready to put more process around? We could start as lightweight as encouraging PMs to look at them prior to scheduling each milestone, along with infradev for example.
@joshlambert The stage level error budgets are not too far off, and with the comment in #2185 (comment 527563723) we can prioritise it higher given the impact it can have.
@jreporter I've raised an epic where we can get started with introducing error budgets to the stage groups. gitlab-com/gl-infra&437 (closed). We can continue the conversation about linking error budgets to Product PI's on the related issue that you've raised: #2295 (closed)
Thank you team for all the feedback. I'm closing this issue now as we have made a few changes (see related MR's above) and we will monitor how these changes materially impact what we do.
@adawar - in case anyone has feedback related to these changes or this issue, I created this retrospective issue and have assigned it to you and @sloyd. We can close that issue next month if no feedback is received.