What worked for us this month to drive execution in August
In the Monitor:Health group, our throughput was VERY consistent through August. From the week of July 29 to the week of August 26, our throughput numbers were: 9, 11, 10, 10, 10. This is fantastic in terms of leveling out our effort through the month instead of having a high-pressure rush to get everything merged the week of feature freeze like we used to. I think this is due to our new continuous delivery process for gitlab.com and to the better, more consistent backlog grooming we are doing in our weekly meetings.
What did not work well
Our overall throughput number was down in August (41) compared to July (49). It is hard to pinpoint a reason for this, but some factors may have been that we were onboarding two new backend engineers and one new frontend engineer and a lot of our work in the earlier part of August was investigative work for larger issues that didn't lead directly to MRs.
What we can improve moving into September
Hopefully the investigative work we did in August will help us gain momentum in September. Our engineers who started at the end of July and the beginning of August are now ramping up quickly, and that should also be reflected in our throughput. We will have one more backend engineer joining midway through September, so that may push our throughput down a bit.
Another thing to note is that for %12.3 we have more spikes (which don't lead to merged MRs as an outcome), so I am not certain we will see higher throughput for September. Something to consider.
Not sure I understand the spikes concern. Ideally this would level out the following week. We are not measuring release to release. Also, we have seen a general trend of leveling.
For Monitor:Health, we have more spikes (PoC investigations) that are quite large in size. As a result, engineers may spend more time investigating rather than having MRs merged into the codebase (which would indirectly decrease our team's throughput). A spike here and there shouldn't have a big impact, but this release we have 3 spikes.
What worked for us this month to drive execution in August
Having a backend maintainer means it's easier for us to get important changes through review and merged quickly
What did not work well
Several issues that required deep analysis before we could ship anything
Not necessarily a bad thing, but EKS support is a new domain for us, so it involves gaining familiarity with a lot of new tooling and APIs in order to ship even minimal features
What we can improve moving into September
Hoping that September issues end up being more straightforward due to investigation done in August
Hoping we can avoid getting stuck on large issues with a long time to first merge. Breaking things down and merging more quickly might help productivity simply through motivation.
2/3 of FE team members away for a week at a conference
Increase in leave taken due to public holidays and general holiday time for the US region
(PS: The items listed in what did not work well are good things, which we should do and encourage; I'm merely listing them here as contributing factors to a lower throughput for Aug.)
@jeanduplessis I want to make sure we are clear that the items you listed are, as you said, good things we should do and encourage. I am updating our template to include a new category, Notes, where we can make highlights such as the ones you listed, as well as other impacting factors such as onboarding new engineers.
I still want us to be critical of what did not work well such as long review cycles, impact from long spikes/investigative work, etc. These are things we can work to improve.
Perhaps the section question can be something like "What negatively impacted execution and/or throughput?" - that way we're not making a value judgement by default.
Finding reviewers/codeowners for some smaller projects has been difficult and made merging slow. We created this handbook MR with more details to track: !29057 (merged)
The APM team as a whole had high variation in the number of MRs week to week from a high of 16 to a low of 3.
What we can improve moving into September
Doing more investigation on issues before each milestone so we are sure they are ready for work when we start working on them.
Notes
The backend APM team did have some extra vacation time in August; about 10% of our time was vacation.
The APM backend team has 2 new engineers planning to start at the end of September, which could impact September and October, but will hopefully make for strong months going forward.
What worked for us this month to drive execution in August
We have agreed on the APM Planning board as our SSOT for prioritized work. This accomplished a number of good outcomes:
Sparked conversations in our 1:1s, prompting engineers to take ownership of the issues they want to work on while staying aligned with product's priorities
Allowed us to get a jump start on future milestone work when we encountered wait times due to blockers.
What did not work well
We are still digging our way out of the varying issues arising from different permutations of enabled/disabled feature flags.
What we can improve moving into September
We should prioritize some time to implement end-to-end (and ideally snapshot) tests for the metrics and cluster health dashboards. This would help us ensure we aren't breaking functionality as we continue to add new features and keep us aware of our blind spots as we introduce potentially breaking changes. It would also help prevent us from merging changes behind a feature flag that would break the dashboard when enabled.
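For concreteness, here is a minimal sketch of what a flag-permutation snapshot test could look like. It assumes a Jest + @vue/test-utils setup; the `DashboardStub` component, the `glFeatures` provide/inject, and the `newPanelLayout` flag name are all made up for illustration and are not the real dashboard code.

```typescript
// Sketch only: snapshot the rendered dashboard with the feature flag both off
// and on, so a change merged behind a flag cannot silently break the
// flag-enabled rendering path. Component and flag names are hypothetical.
import Vue, { CreateElement, VNode } from 'vue';
import { shallowMount } from '@vue/test-utils';

// Tiny stand-in for the real metrics dashboard component.
const DashboardStub = Vue.extend({
  inject: ['glFeatures'],
  render(h: CreateElement): VNode {
    const flags = (this as any).glFeatures as { newPanelLayout: boolean };
    return h('div', [
      h('h1', 'Metrics dashboard'),
      flags.newPanelLayout
        ? h('section', 'new panel layout')
        : h('section', 'legacy panel layout'),
    ]);
  },
});

describe('metrics dashboard feature-flag permutations', () => {
  it.each([false, true])('renders consistently with newPanelLayout = %s', (enabled) => {
    const wrapper = shallowMount(DashboardStub, {
      provide: { glFeatures: { newPanelLayout: enabled } },
    });

    // One snapshot per flag state; a regression in either path shows up as a diff.
    expect(wrapper.html()).toMatchSnapshot();
  });
});
```

The same pattern extends to the end-to-end specs: run the scenario once per flag state so both rendering paths stay covered as features are added behind flags.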
Notes
Frontend had at least 17% of our time spent on vacation, plus another 8% of partial capacity due to JSConf.
For the Monitor:APM team I looked through the list of MRs for July and August to find how many were duplicated to apply the change to both CE and EE:
| Month  | Total MRs | CE/EE Dups | Unique Changes |
|--------|-----------|------------|----------------|
| July   | 41        | 7          | 34             |
| August | 42        | 7          | 35             |
Now that we have a single codebase, I would expect the total number of MRs to be closer to that "Unique Changes" count. For example, in September we had 33 MRs, which is close to the number of unique MRs from previous months.
That said, there are other factors we should consider as well. I think we will be more productive going forward since we don't have the process overhead of opening duplicate MRs.
Worth noting: for Configure in September, the Frontend team switched to labelling certain MRs as devops::configure only, without using the group:: label. This is because these issues are not related to either of the teams and are part of a global optimization effort. Additionally, since the FE team was working across both groups, this also made more sense than arbitrarily choosing a group. This apparently corresponded to 11 MRs that in previous months would have been added to the Configure Orchestration total.
After discussion in a 1:1 yesterday I've decided to ditch the separate labeling process we have in Configure FE and align it with the group throughput labelling.
Therefore going forward (and I'll try to update MRs retrospectively for 12.3) the team members will use their respective group label for MRs, regardless of whether they do stage work or not.
Onboarding and delays in merge time were contributing factors
We did not hit our FY20-Q3 goal; we achieved ~90% of the 0% goal
In FY20-Q4 we're focusing on clearer guidance to engineers on breaking up work into smaller Issues and MRs, coaching on this front, as well as introducing various methods to provide more opportunities for team members to execute on small deliverables to drive up throughput.