This is an asynchronous retrospective for the 13.12 release, following
the process described in the handbook.
This issue is private (confidential) to the Manage:Optimize group, plus anyone else
who worked with the group during 13.12, to ensure everyone feels
comfortable sharing freely. On 2021-05-26, in preparation for the R&D-wide
13.12 Retrospective, the issue will be opened up to the public, as long
as everyone is comfortable with this. You're free to redact any comments that
contain information that you'd like to stay private before that date.
Please look back at your experiences working on this release, ask yourself
what praise do you have for the group?,
what went well this release?, what didn’t go well this
release?, and what can we improve going
forward?, and honestly describe your thoughts and feelings below.
For each point you want to raise, please create a new discussion with the
relevant emoji, so that others can weigh in with their perspectives, and so that
we can easily discuss any follow-up action items in-line.
If there is anything you are not comfortable sharing here, please message your
manager directly. Note, however, that 'Emotions are not only allowed in
retrospectives, they should be encouraged', so we'd love to hear from you here
if possible.
Issues we shipped
More issues - this list only includes deliverables!
We did not enable the feature flag in %13.11 as planned, and we were surprised at the beginning of %13.12 when we realized it had slipped.
We merged the release post MR before the feature flag was enabled by default. As a result, users were told the feature was available but couldn't find it.
Our engineering efforts need to be more detail-oriented and proactive than this. I include myself in that – I didn't mark the feature flag enabling as a blocker on the release post MR, and I didn't prep @wortschi to handle his first release post before I went OOO. Let's all make an effort to double-check our work for the next few weeks.
Bigger picture, this many mistakes feels like more than a coincidence. Why were ALL of us failing at the same time? Normally when one of us misses something, another will notice. Was there something strange going on with us a few weeks ago? Contagious brainwaves? Alien abductions? I suspect the overlap between the instance-level and group-level features is partially to blame, because it was hard to track their statuses separately. But that wouldn't explain everything...
Thanks for adding this @djensen. It does feel like there was some communication missing or some misunderstandings here. Please let me know if there is something I can be doing differently to avoid situations like this in the future.
I opened the issue to track the group-level feature flag but didn't announce that it wasn't enabled by default by the end of 13.11.
Perhaps going forward we can discuss feature flag status issues and their outstanding requirements as part of our walk the board section in the weekly? This could help ensure that we're all on the same page and know what's still outstanding before opening a release post item?
discuss feature flag status issues and their outstanding requirements as part of our walk the board section in the weekly?
I think that's a good idea. The easiest way to make sure that happens is to make sure that feature flag rollout issues are prioritized as either ~priority::1 or ~priority::2, because we review every single one of those issues. Happily, that makes conceptual sense too - a feature flag enabling is a big deal, and we should have high confidence in it, meaning 1 or 2 is the appropriate label.
gitlab-org/gitlab#299606 (closed) states that we want to fully refactor Segments into "enabled Groups" before we enable it by default. The refactoring is scheduled for 14.0 as a breaking change for the API. All DevOps adoption pages use the same API so this is the reason why group-level wasn't enabled by default and was treated as beta too.
Literally every feature can initially be behind a feature flag, so we always need to think about it when writing a release post and considering a feature "done and released for general audience".
Don't merge release post until you see the feature working in production?
Be more explicit about what is beta and what is not. As it turned out, the fact that the 2 DevOps adoption pages shared the same API wasn't clear.
P.S. Entire "this feature is beta" thingy could be avoided if we agreed with Sid or any other person who has high influence on decisions before introducing entire Segments abstractions.
discuss feature flag status issues and their outstanding requirements as part of our walk the board section in the weekly?
The easiest way to make sure that happens is to make sure that feature flag rollout issues are prioritized as either ~priority::1 or ~priority::2
That's a good idea. Adding a reminder to our weekly agenda to check the status of feature flags as we walk the board doesn't seem too intrusive and could help us quickly identify similar situations in the future.
With regards to @pshutsin's suggestion, I think we should still allow merging release posts without the feature being deployed to production. By preventing release posts from being merged before a feature is in production, we would either give developers less time to implement a feature or hold off with release posts until the next milestone. I think neither option is ideal.
To be honest, I don't know the exact reason why the rollout of Group-level DevOps Adoption was causing so many problems, but I'm assuming that it was a combination of all the reasons mentioned above and the fact that probably nobody felt responsible for communicating the actual state of the feature flag. Thus, I'm thinking that we need to either define a DRI for a feature flag (e.g., the BE or FE dev) or keep the rollout issue as the Single Source of Truth as suggested by @ekigbo in gitlab-org/gitlab!60437 (comment 562260787).
Perhaps going forward we can discuss feature flag status issues and their outstanding requirements as part of our walk the board section in the weekly? This could help ensure that we're all on the same page and know what's still outstanding before opening a release post item?
@blabuschagne We (PMs) are required to have all release post items created by the 10th of each month. We have until the 17th to merge them. I think blocking the release post issue with the feature flag issue could help.
The easiest way to make sure that happens is to make sure that feature flag rollout issues are prioritized as either ~priority::1 or ~priority::2
@djensen Yes, I would support making feature flag issues ~priority::1
All DevOps adoption pages use the same API so this is the reason why group-level wasn't enabled by default and was treated as beta too.
I think this is the crux of the mix-up. I didn't realize that we were blocking group-level enable by default with the API refactoring because I didn't know that the instance-level API was so intertwined with the group-level feature.
I think we should still allow merging release posts without the feature being deployed to production
@wortschi This is the policy on when a release post can be merged. I'm sorry I wasn't around to better support you on this.
Once all content is reviewed and complete, add the ~"Ready" label and assign this issue to the Engineering Manager (EM). The EM is responsible for merging as soon as the implementing feature is deployed to GitLab.com, after which this content will appear on the GitLab.com Release page and can be included in the next release post. All release post items must be merged on or before the 17th of the month. If a feature is not ready by the 17th due date, the EM should push the release post item to the next milestone.
Entire "this feature is beta" thingy could be avoided if we agreed with Sid or any other person who has high influence on decisions before introducing entire Segments abstractions.
@pshutsin The decision to introduce Segments was made before I joined groupoptimize so I don't have much context except I think we were trying to address customer concerns about excluding certain projects or trying to map to their org structure. TBH if I was the PM at the time I don't think I would have predicted the amount of internal pushback we got on this, but after hearing the complexity argument I do get why we wouldn't want to do that. This was one of the motivators for me to set up those vision review sessions with Anoop so we can be more certain we're not introducing unsupported surprises.
@gitlab-org/manage/optimize I really appreciate the open discussion we're having around this and the willingness of team members to add their opinions. This is a sure sign that we can learn and get better together.
Great discussion! I really like the idea of assigning priority labels to feature flag enabling / removal. To me, it definitely helps mentally position feature flag work more like "first class" work, and should help ensure it's something we don't miss / forget in the process of building a new feature.
+1 also to adding feature flags to the agenda, great suggestion!
This is the policy on when a release post can be merged. I'm sorry I wasn't around to better support you on this.
Thanks for linking to this section. The following sentence caught my attention:
The EM is responsible for merging as soon as the implementing feature is deployed to GitLab.com.
This means we need to be very careful with last-minute merges. Merging into master doesn't guarantee that the change is deployed to production at the same time. There is typically some time between a merge into master and a production deployment, and we need to hold off on merging an RP until we have verified that the feature is actually deployed to .com. Just wanted to call this out here as I believe it's an important detail. We recently had a related problem when we merged a feature to let users schedule un-setting of their busy status (see https://gitlab.com/gl-retrospectives/manage/-/issues/77#note_541733446 for details).
Anyway, with our soft cutoff date this shouldn't cause too many issues in the future.
THANK YOU ALL for the very transparent discussion we had about how we can make fewer mistakes as a group. We've decided on some concrete steps:
Assigning a single engineer as the DRI on each "feature flag rollout" issue, so responsibility is clear.
Applying a P1/P2 priority label on each "feature flag rollout" issue, so they are part of every board-walking.
Setting a blocking code MR on every release post MR.
Those are process improvements. I suspect there are also improvements we can make to our feature refinement and workload balancing. I'd like to send a survey and get feedback from you all - I'll share a link soon.
Closing this because it has reached its due date. Please check out the 14.0 retrospective issue next!