This is an asynchronous retrospective for the 12.3 release, following
the process described in the handbook.
This issue is private (confidential) to the Configure team, plus anyone else
who worked with the team during 12.3, to ensure everyone feels
comfortable sharing freely. On 2019-10-19, in preparation for the engineering-wide
12.3 Retrospective, the issue will be opened up to the public, as long
as everyone is comfortable with this. You're free to redact any comments that
contain information that you'd like to stay private before that date.
Please look back at your experiences working on this release, ask yourself
👍 what went well this release?, 👎 what didn’t go well this
release?, and 📈 what can we improve going
forward?, and honestly describe your thoughts and feelings below.
For each point you want to raise, please create a new discussion with the
relevant emoji, so that others can weigh in with their perspectives, and so that
we can easily discuss any follow-up action items in-line.
If there is anything you are not comfortable sharing here, please message your
manager directly. Note, however, that 'Emotions are not only allowed in
retrospectives, they should be encouraged', so we'd love to hear from you here
if possible.
I think we sometimes end up losing multiple days getting something merged that is ready because we're not treating it as urgent. I don't want to assign blame but want to determine if my intuition here is right and if there are ways to avoid this.
At the time of writing this has not been merged yet. The code has been ready, and would have been useful, for a few days. There were some minor docs feedback points, but I'm inclined to think:
We should have merged before the docs review, given how small the docs changes were, and addressed that feedback retrospectively
The time taken to get the feedback and go back and forth with the maintainer should have been shorter
The interesting thing in this case is that this issue was considered a ~P1 operational issue and all the people reviewing were in a similar timezone.
Based on discussions with other managers and observations from #17 (closed), I wanted to start a conversation about deadlines.
I notice that the Configure team (compared to other teams, based on conversations I've had with other managers) places a relatively low emphasis on deadlines. In fact, we almost never put any energy into retrospecting on why our deadlines have been missed. This is generally how I am most comfortable working. It's not objectively a good or bad thing, but I'm curious to understand whether people are motivated by deadlines overall.
We could continue in the direction we're going, which is less emphasis on deadlines and more emphasis on working through prioritised lists as quickly as we can, but that may not be ideal for everyone's preferred working style.
Some questions:
Do people feel motivated or demotivated by deadlines?
Do people wish we were clearer about deadlines?
Do people have thoughts about the length of deadline windows (sprints) that are effective for them?
Do people feel that we should be retrospecting more heavily when we fail to meet deadlines?
For me feeling productive and getting things done is much more motivating than a deadline. Deadlines give me the impression that I should somehow be working harder or lowering the quality of my work.
I think all the long-running issues I've worked on involved pretty large amounts of investigation, and implementing a lot of different dead ends to see how things (don't) work. I wonder if, for these issues, having a deadline and slipping it could be a good trigger for the team to re-evaluate the issue, possibly even break it up. Something that hasn't been clear to me is whether, when an issue is taking this long, it's still a good idea to continue and whether it's still a priority. I know we're supposed to ask for help, but it can often feel like you're on the cusp of working out a solution, without realising that you're spinning your wheels.
Another thought is that deadlines might help reviewers and maintainers prioritise work that is otherwise handled in arbitrary order unless the MR author specifies a priority. I've had a couple of annoying situations where my older, trickier changes got reviewed after a newer trivial change, causing merge conflicts; I assume they were reviewed in order of ease rather than any sort of priority.
I find deadlines motivating and a good way of prioritizing work, but I think we should not push too hard to deliver within a deadline because it will reduce the quality of our work.
We can’t be 100% accurate about delivering before the deadline because, for some deliverables (the complex ones in particular), we are still discovering requirements during the development phase.
I also think there is value in retrospecting missed deadlines. Sharing our thoughts about the reasons a deliverable missed a deadline allows us to brainstorm as a team about what we could do differently next time.
I, for one, find it helpful to have a deadline and then set mini-deadlines/goals for myself during a particular milestone, because it gives me a sense of progress and a gauge for giving 'accurate' reports on work during 1:1s.
For instance, I set deadlines/goals for when I'd like to have MRs in review. When an MR is in review, ideally I'd have left myself ample time to address review feedback, and I can ultimately report back on how I feel about reaching the milestone deadline.
Do people feel that we should be retrospecting more heavily when we fail to meet deadlines?
I do like the idea of retrospecting missed deadlines, as @ealcantara mentioned, as I think it will give the team an opportunity to learn and (possibly) avoid missing deadlines in the future.
Personally, I don't find deadlines motivating at all.
I would prefer to focus on the highest-value issue/initiative and deliver 💪🏼, and to be as clear as possible about what the next steps are for things I'm working on (not always achieved!) and when those next steps can be expected.
I do support retrospecting if we have missed a deliverable for this milestone, though; we might learn something about what to do differently, or not.
It seems there is a fairly common trend across most teams where throughput in August was much lower than in July. For group::autodevops and configure, we can see that it was 98 in July and 75 in August.
Does anyone have some guesses as to what might be the cause? Was it something we were doing really well in July? Was it possibly related to taking more leave on average in August? Perhaps if everyone has some suggestions they can leave them here in the thread so we can try to find a pattern. I have a couple of things that I think contributed a little for our team.
Investigative work for DB load balancing was initially exploratory. It wasn't all that long before we were merging changes, though, so it's probably not a good example.
For frontend, there was more leave taken (public holidays and end of summer) as well as Mike and Enrique being away for a week at JSConf, which would have contributed to a reduction in throughput from FE.
There were 7 (merged) merge requests to https://gitlab.com/gitlab-com/gl-infra/gitlab-gcp-janitor in August that unfortunately aren't counted towards our total. None of these are large changes, and won't have contributed to more than a few days of cumulative work, but it all adds up.
I've introduced a regression but, luckily, I found it quickly because I was still working on another related MR. The regression wouldn't have been obvious to find, since its effect was to not remove the install pod after it finished installing. After acknowledging the MR, @tkuah searched for the error on Sentry and confirmed that the bug was indeed happening in production.
Since deployments to production are starting to happen on a shorter cadence, it becomes even more important that we acknowledge these regressions as fast as possible, assuming they can be detected (i.e. they raise some sort of error).
@tkuah and I started discussing on Slack what the best way is for us to catch a regression as soon as it reaches production. Sentry provides two charts in the overview tab: TRENDING ISSUES and NEW ISSUES. Perhaps it's a good idea to take a look at these at least once a week, although TRENDING ISSUES shows high-volume errors, which might drown out ours, since our features might be less used.
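To make this less of a manual eyeballing exercise, something like the sketch below could pull recent issues from the Sentry API once a week and flag anything first seen in the last few days. This is only an illustration: the organization/project slugs and the token environment variable are placeholders rather than our actual setup, and it only looks at the first page of results.

```python
# Sketch: list Sentry issues first seen recently, as an alternative to
# eyeballing the NEW ISSUES chart. Slugs and token env var are placeholders.
import os
from datetime import datetime, timedelta, timezone

import requests

SENTRY_API = "https://sentry.io/api/0/projects/{org}/{project}/issues/"
ORG = "example-org"                  # hypothetical organization slug
PROJECT = "example-project"          # hypothetical project slug
TOKEN = os.environ["SENTRY_TOKEN"]   # hypothetical auth token env var


def new_issues_since(days=7):
    """Return unresolved issues first seen within the last `days` days."""
    resp = requests.get(
        SENTRY_API.format(org=ORG, project=PROJECT),
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"query": "is:unresolved"},
    )
    resp.raise_for_status()
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    return [
        issue
        for issue in resp.json()  # first page only; good enough for a weekly glance
        if datetime.fromisoformat(issue["firstSeen"].replace("Z", "+00:00")) >= cutoff
    ]


if __name__ == "__main__":
    for issue in new_issues_since():
        print(issue["firstSeen"], issue["shortId"], issue["title"])
```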
Like @Alexand, I also introduced a regression and we only noticed because a user created an issue. I do not think this particular issue could have been easily detected on Sentry, but maybe some form of anomaly detection could have helped.
Is anyone aware of pre-existing logs or metrics that could have been useful to detect a problem with auto-build-image as soon as it went live? What about for other features for which we are responsible?
For example, for GitLab.com, having a metric for the total number of ADO build jobs, as well as a metric for the number of failed ADO build jobs, could have been useful to tell if the change was good. This would give us additional confidence when changing auto-build-image. Having diagnostic logs would then be instrumental to identifying the actual problems, but as much of the data is likely sensitive, I doubt we can log much besides build IDs.
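To make the idea concrete, here is a minimal sketch (not existing GitLab instrumentation) of what such a metric could look like if exported with the Python Prometheus client; the metric name, labels, and the record_build_result hook are all hypothetical.

```python
# Sketch: count Auto DevOps build jobs by outcome, so a jump in the failure
# ratio after an auto-build-image change would stand out on a dashboard.
import time

from prometheus_client import Counter, start_http_server

ado_build_jobs_total = Counter(
    "ado_build_jobs_total",
    "Auto DevOps build jobs, partitioned by final status.",
    ["status"],  # e.g. "success" or "failed"
)


def record_build_result(succeeded: bool) -> None:
    """Call this wherever build-job completion is observed (hypothetical hook)."""
    ado_build_jobs_total.labels(status="success" if succeeded else "failed").inc()


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    # Simulate a few completed jobs so the counter has something to show.
    for outcome in (True, True, False):
        record_build_result(outcome)
    while True:
        time.sleep(60)
```

A failure ratio such as `sum(rate(ado_build_jobs_total{status="failed"}[1h])) / sum(rate(ado_build_jobs_total[1h]))` could then be graphed or alerted on right after an auto-build-image release.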
@hfyngvason you and I discussed this the other day and I've created a follow-up issue at gitlab-org/gitlab#32969 (closed). I think we can work to scope this out better and get it prioritised, as this will definitely provide value to our users if we are detecting this stuff early.
@hfyngvason you can just add group::autodevops and kubernetes. There is a bot that is meant to infer the other one from this label. I don't know if the bot is working just yet but it doesn't really matter considering our accounting is happening based on group::autodevops and kubernetes.
The exception here is Configure Frontend, which needs to always mark MRs with devops::configure (in addition to the group label for stage work), as our team works across multiple groups and on work outside our stage (e.g. the Docs project, Working Groups and GitLab UI).
We've managed to get most of the work involved in completing the Cluster Environments feature merged iteratively and independently behind feature flags, which I think is great!
I thought it might be worth highlighting this, since independent delivery was an issue mentioned in a previous retrospective and it's the first time we're trying out the proposed process 🙂
Our board was updated to filter on group::autodevops and kubernetes.
We still have a habit of applying the ~Configure label to Configure issues (though MUCH less than before 👍). Let's make sure we're no longer using the ~Configure label, and instead only using the autodevops and group-scoped labels 🎉
📈 Board and ~workflow::staging, ~workflow::canary labels
I think our Board may not take into account the ~workflow::staging and ~workflow::canary labels. I remember an issue moving into the Open column because of this.
This is a shame, and I can't really think of what could be contributing to it. Maybe the numbers are normally low enough that this could just be a statistically normal event, but I wonder if there is something about how we work, or perhaps we've been focusing on this less.
We should try our best to keep encouraging community contributions and follow up really closely when we have new contributors starting work. If you are aware of any contributions that have been open for a while without much attention, let's raise their visibility and see if anyone can help finish them off.
Would the change in how the frontend team labels things (i.e. only using devops::configure for some of the work outside our scope, like docs and the design system) have contributed to this drop? Do we have any more ideas about what else could be contributing?
@DylanGriffith it will likely have had an impact on throughput stats. I ran the stats, and 11 of our 25 MRs for September are not related to stage work and thus will only have the devops::configure label.
Since we now have internal group assignments in the team, I could have the team use the group label for all MRs, regardless of which project they are in, if that would help create more consistent throughput stats for the groups?
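For anyone who wants to reproduce these kinds of numbers, a rough sketch of counting merged MRs per label via the GitLab API is below. The group path, token environment variable, and date window are placeholders, and it approximates the milestone window by creation date rather than merge date.

```python
# Sketch: count merged MRs in a group carrying a given set of labels.
import os

import requests

GITLAB_API = "https://gitlab.com/api/v4"
GROUP = "gitlab-org"                 # hypothetical group path for the example
TOKEN = os.environ["GITLAB_TOKEN"]   # hypothetical personal access token


def merged_mr_count(labels, created_after, created_before):
    """Count merged MRs in GROUP created in the window and carrying all labels."""
    page, total = 1, 0
    while True:
        resp = requests.get(
            f"{GITLAB_API}/groups/{GROUP}/merge_requests",
            headers={"PRIVATE-TOKEN": TOKEN},
            params={
                "state": "merged",
                "labels": ",".join(labels),
                "created_after": created_after,
                "created_before": created_before,
                "per_page": 100,
                "page": page,
            },
        )
        resp.raise_for_status()
        batch = resp.json()
        total += len(batch)
        if len(batch) < 100:
            return total
        page += 1


if __name__ == "__main__":
    window = ("2019-09-01T00:00:00Z", "2019-10-01T00:00:00Z")
    print("devops::configure:", merged_mr_count(["devops::configure"], *window))
    print("group label:", merged_mr_count(["group::autodevops and kubernetes"], *window))
```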
On 2019-10-19, in preparation for the engineering-wide 12.3 Retrospective, the issue will be opened up to the public, as long as everyone is comfortable with this. You're free to redact any comments that contain information that you'd like to stay private before that date.
Is it OK to make this issue non-confidential now? I don't see anything that prevents us from doing so.