
2021 Team Impact Overview

Background

Stepping into 2021, I was a little unsure about what exactly we were going to get up to. In January all I could see was Rate Limiting, and by the time the first quarter was finished we were already well on our way with plenty of other projects. August was a bit scattered as we prepared for an exciting October, and now we sit in December and take stock of how much we have accomplished. While exhausting, this is not an exhaustive list - just a few highlights and notable mentions from 2021.

Project Highlights

"Just" Rate Limiting

We opened the year with the Rate Limiting project still in progress. Having started in October 2020 off the back of 13 production incidents, we finally enabled this in Production in January. Concluding such a large project early in the year was a good start. It was also the start of my personal quest against the use of the word "just", because "just enabling rate limiting" turned into 4 epics and 33 issues of work. It was a smashing result though: we stopped the incidents related to rate limiting problems, and we did it without causing significant customer or engineering heartache, even though the combination of rules at various levels made Craig's head spin at times! The dry-run mechanism proved most useful, and it was a relief when we finally flipped the switch and didn't see an increase in support cases.
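For anyone who hasn't seen a dry-run rollout before, the idea is simple enough to sketch. The snippet below is only an illustrative Go sketch with made-up names and limits, not our actual implementation: in dry-run mode the limiter logs the requests it would have rejected instead of blocking them, which is what let us watch the rules in Production before enforcing them.

```go
package main

import (
	"fmt"
	"time"
)

// limiter is a hypothetical fixed-window rate limiter with a dry-run mode.
// It keeps a single window for all requests; the key is only used for logging.
type limiter struct {
	Limit   int
	Window  time.Duration
	DryRun  bool
	count   int
	started time.Time
}

// Allow reports whether a request should proceed. In dry-run mode it always
// returns true, but logs when the limit would have been exceeded.
func (l *limiter) Allow(key string) bool {
	now := time.Now()
	if now.Sub(l.started) > l.Window {
		l.started, l.count = now, 0 // start a new window
	}
	l.count++
	if l.count <= l.Limit {
		return true
	}
	if l.DryRun {
		fmt.Printf("dry-run: would have rejected %q (%d > %d per %s)\n",
			key, l.count, l.Limit, l.Window)
		return true
	}
	return false
}

func main() {
	l := &limiter{Limit: 3, Window: time.Minute, DryRun: true}
	for i := 0; i < 5; i++ {
		fmt.Println(l.Allow("GET /api/v4/projects"))
	}
}
```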

Revisiting Git Fundamentals

There were three projects related to Git and Gitaly performance.

The first resulted in a remarkable 50% reduction in CPU time on file-cny-01 by configuring CI to fetch with --no-tags. We also took a closer look at the performance of fetch/clone in Git when there are lots of refs, and submitted a patch that went into the Git 2.31.0 release.

Then we tried very hard to remove our dependence on the pre-clone script for gitlab-org/gitlab. In our adventures we found that there were clearly two different bottlenecks: one in generating the data, and another in transferring it off the servers. We put the pack-objects cache in place to handle the data generation problem, and we investigated further to find out why we couldn't transfer data off the server as fast as the network would allow.

So then we "just" replaced gRPC with plain TCP sockets, hit 3.9 GB/s, and dropped CPU utilization to 60%. A lot (looooot) of work and collaboration went into this with the Gitaly team. We had a few hiccups and a restart, but we've come out with a better way of working together. We've committed to making Gitaly aware earlier on of the work we're performing, and to bringing them in when we find problems rather than waiting until we have the solution too.
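To give a flavour of why plain sockets help, here is an illustrative Go sketch with made-up data (not Gitaly's actual code): with a raw TCP connection the pack data can be streamed straight onto the socket with io.Copy, with none of the per-message framing or serialization work that an RPC stream adds on both ends.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"strings"
)

// serve streams the contents of r straight onto the first accepted connection.
// io.Copy moves the bytes without any per-message framing or encode/decode.
func serve(ln net.Listener, r io.Reader) {
	conn, err := ln.Accept()
	if err != nil {
		return
	}
	defer conn.Close()
	io.Copy(conn, r) // in real life this would be the generated pack data
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// Stand-in for a generated packfile.
	pack := strings.NewReader("PACK... lots of object data ...")
	go serve(ln, pack)

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	n, _ := io.Copy(os.Stdout, conn)
	fmt.Printf("\nreceived %d bytes over a plain TCP socket\n", n)
}
```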

And what would a year be without a Redis and Sidekiq project?

We started with 99% CPU saturation and a lot of nervous pointing at the Tamland charts. And in the 5 months it took us to complete this work, we also saw a 25% increase in job throughput - if we hadn't done anything, Redis would have melted.

This is one of the most significant projects we did this year because it showed that we prevented multiple major S1 incidents. (Yes, I know we still caused an S1 while we were rolling this out, but more on that later.) Sometimes it can be hard to focus on project work because there are incident calls we can contribute to. But if we don't step back and focus on longer-term goals, projects like this one won't happen. I can't begin to imagine fielding multiple S1 incidents AND trying to do this project at the same time...

So yes, that S1. I'm glad we did a retrospective at the time, because I honestly can't remember too much. What I do know is that we fixed it quickly. We work on things that are so fundamental to the system that we are going to break things once in a while. We have to be comfortable taking some risk or we'll wrap ourselves in knots trying to make big changes. And I was particularly pleased with the resolution we came to, where we tested the new settings on one node for each change. It was the best balance between speed and risk reduction.

Also deserving of special mention is the work to reduce the size of the Sidekiq job payloads. It happened quietly and without incident. And now jobs are small and compressed :)
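The real implementation lives in the Rails application, but the general shape is easy to sketch. The Go snippet below is purely illustrative, with made-up names and thresholds: compress the job arguments when they cross a size limit, and mark the payload so the worker knows to decompress them before running.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"fmt"
	"strings"
)

// compressThreshold is an arbitrary illustrative cut-off; smaller payloads
// are enqueued as-is.
const compressThreshold = 1024

// job is a stand-in for a queued background job payload.
type job struct {
	Args       string
	Compressed bool
}

// newJob gzips the arguments when they exceed the threshold and records that
// fact so the worker knows to decompress before executing the job.
func newJob(args string) (job, error) {
	if len(args) < compressThreshold {
		return job{Args: args}, nil
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(args)); err != nil {
		return job{}, err
	}
	if err := zw.Close(); err != nil {
		return job{}, err
	}
	return job{Args: base64.StdEncoding.EncodeToString(buf.Bytes()), Compressed: true}, nil
}

func main() {
	big := strings.Repeat("some very repetitive job arguments; ", 100)
	j, err := newJob(big)
	if err != nil {
		panic(err)
	}
	fmt.Printf("original: %d bytes, enqueued: %d bytes (compressed=%v)\n",
		len(big), len(j.Args), j.Compressed)
}
```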

And of course, Error Budgets

We actually did it. We actually got the stage groups to think about reliability and performance. Yes, they're using 5 seconds as a threshold, and yes, they still need to opt in to custom stuff, and yes, there is so much more to do - but the foundation is there AND people are using it!
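For anyone new to the mechanism, the arithmetic behind the budgets is roughly this. The Go sketch below is only an illustration with made-up request counts and an assumed availability target and window (not the exact production queries): an operation counts as good if it succeeds and finishes under the threshold, and the budget is whatever the target leaves over.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative numbers only: an assumed availability target and reporting
// window, plus made-up operation counts for one stage group.
const (
	target = 0.9995              // hypothetical availability target
	window = 28 * 24 * time.Hour // hypothetical reporting window
)

func main() {
	var (
		total = 12_000_000.0 // operations in the window (made up)
		good  = 11_995_200.0 // succeeded AND finished under the threshold (made up)
	)

	availability := good / total
	budget := (1 - target) * window.Seconds()      // total error budget, in seconds
	spent := (1 - availability) * window.Seconds() // budget consumed by slow or failed operations

	fmt.Printf("availability: %.4f%%\n", availability*100)
	fmt.Printf("budget: %.0f s, spent: %.0f s, remaining: %.0f s\n",
		budget, spent, budget-spent)
}
```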

I've never worked at a company where this level of insight is available into how one's code runs on the production systems. Or where there is real data that can be used in a discussion about allocating time to performance work. This project was a major efficiency boost, and it came at the best possible time, when there was a drive across engineering to focus on reliability. Without this data, teams would have been left to figure it out on their own and guess at where to spend their time.

So much so that there is now a separate summary epic for the work in this category and a separate issue for a roadmap. I'm super excited to see where we can take this next year.

Looking ahead to 2022

  1. Growing the team. A whole team. That should keep us busy.
    1. No decisions have been made yet on how our team will turn into two teams. I plan to welcome the new EM to the existing team and figure out what their strengths are before allocating teams accordingly.
  2. Using Tamland (or another tool) as an indicator. We still rely heavily on human knowledge of how our systems are performing, and we need to develop metrics that show the team's contributions.
  3. Further improvements to Error Budgets, and alignment between this and the InfraDev process.
    1. With additional team capacity, we can take full ownership of the relationship between Infra and Dev and take some pressure off Andrew.

More detail on these items, and on how Scalability will develop within Platform, will be shared separately.

Finally...

Thank you all for a marvelous year. Looking forward to a productive 2022 😄
