Exploring how to work with Stage Groups
Preface
I've been observing the conversations we've had recently with the stage groups concerning the Sidekiq workers that need attention. A common theme was that they didn't know the workers were performing poorly, or they thought it wasn't as bad as what we saw.
It got me thinking that if we want to shift "performance" or "scalability" right in the development cycle, we need to equip stage groups to think like a scalability team. But how do you do that across the entire product organization?
Then, @andrewn told me about the Google SRE book. I've only read the first chapter (danger!), but I have an idea.
The first part is that there is a reference to the CALMS framework (https://www.atlassian.com/devops#culture), and the key piece at the end is Sharing. In context, this is taken to mean sharing the responsibility of operating one's product, but the SRE book frames it as "Sharing and Collaboration".
So how do we share what we know, and how do we collaborate with the teams?
The second part is about how, at Google, the product teams need to "earn" SRE support. If your product is really successful, then you get SRE support as a result of your success. They will help take the product even further by bringing in certain aspects of scaling expertise.
So how do we target where our time is going to help product the most?
Also, I'm not saying that we don't help infrastructure - that is a separate (but related) concern. For now, this is just about working with stage groups.
The first question - how do we share what we know, and how do we collaborate with the teams?
- We have to understand what teams already use to get information about how their code runs on production. Are they using the graphs and dashboards? Do they know they have access? We have to get the teams to a baseline where they understand how to see their work running, and how to understand what they see.
- We have to make sure that conversations about what they see are regular and that they involve the two key decision-makers for a stage group - the Engineering Manager and the Product Manager. We need these two stakeholders to understand what they see, and to know how we (Scalability) interpret the same data.
- Once there is an understanding about how their systems are operating, we give them the tools and guidance to help them build products that scale.
So this is about setting up a relationship with a stage group, giving them information, and providing support.
In practice, this might mean creating a "pack" for each stage group that consists of their dashboards and advice tailored to their situation. It needs to include information about SLOs and error budgets, and explain all the concepts they need to understand how we (Scalability) determine what is important. We meet with them regularly, with decreasing frequency, to provide support but also to consistently send the same message - that we want to help them deliver products that users love, and users love products that are fast.
So how do we target where our time is going to help product the most?
I propose that we start by finding the teams who are the heaviest users of different aspects of production. We already know who those are for Sidekiq. Let's choose the top 3 groups to start. This is why I'm looking for lists of things and who owns them. I want to go to a stage group and say "here are all the things that you own, and this is what they look like in production", rather than going in piece by piece.
We could also start with group::access since they own authorized_projects.
That sounds like a lot of work...
Well, it is.
We're talking about setting up tailored relationships with the largest stage groups and providing them with information specific to them. And I'm hoping to set up something that is more than "we found this, you fix it" and lobbing things over the fence between teams. This is about building that bridge between Infrastructure and Product Development, and building bridges and relationships takes time.
I also don't have all the answers yet - this is just the idea in its rough form, and I'd welcome any feedback.