
2020 team impact overview

With 2020 coming to a close, it is a good time to take stock of the year behind us. This issue is different from #743 (closed) in that it intends to summarise events rather than pick out a specific project.

Scalability team in 2020

The team was in forming mode for the first half of 2020. We went into the year with a single backend engineer (Bob), an interim manager (myself), and a distinguished engineer (Andrew) who was not officially part of the team but was driving the technical roadmap. By June 1st 2020, we were fully staffed. At the start of the year, the plan was to fully staff the team before building out Scalability 2. The pandemic had other plans for all of us!

Regardless, I hope the overview below provides a good read and stands as a testament to how much the team has contributed to its mission.

Background job processing improvements (also known as: how to replace wheels safely while also driving the car at 200 km/h and checking your phone for messages)

At first glance, the Sidekiq improvements might seem not to belong in a 2020 overview. After all, we started that project in 2019, and a lot of prep work was already in motion as team members started joining the team. However, as with any fundamental architecture work, the work expands to fill the available time. In the case of background jobs, though, that framing is a bit unfair given the profound impact they have on our platform. From the blog post published at the end of June 2020:

We reduced our Sidekiq fleet from 49 nodes with 314 CPUs, to 26 nodes with 158 CPUs. 

Pulling this out of context might have you questioning whether the impact was as large as the investment. Lining up everything that came with it:

  1. Migrating Sidekiq-cluster to Core and setting it as default
  2. Pairing with the Delivery team to unlock the Sidekiq migration
  3. Ensuring individual queues, such as authorised projects and reactive caching, meet their SLOs
  4. Ensuring Sidekiq jobs can be safely retried

shows real user impact and an improved general user experience. For example, the authorised projects queue would often build up, which manifested as users not having access as soon as it was granted. Or, if a job failed and was not retried, users could see a failure regardless of whether the issue was only a temporary glitch. This is all without taking into account that we made the development experience better, provided guard rails on how to run jobs at a large scale, and deduplicated jobs, reducing the need for larger infrastructure.
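To make the deduplication and safe-retry points concrete, here is a minimal, self-contained sketch of the idea in Go: skip enqueueing a job when an identical one is already pending, and require handlers to be idempotent so retries are safe. This is only an illustration of the pattern, not GitLab's actual implementation, which lives in Sidekiq middleware backed by Redis.

```go
package main

import (
	"fmt"
	"sync"
)

// dedupQueue drops a job when an identical one (same idempotency key) is
// already pending. This mirrors the idea behind Sidekiq job deduplication;
// it is not GitLab's middleware, which stores these keys in Redis.
type dedupQueue struct {
	mu      sync.Mutex
	pending map[string]bool
	jobs    []string
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{pending: map[string]bool{}}
}

// enqueue returns false when a job with the same key is already waiting,
// so duplicate work is never scheduled twice.
func (q *dedupQueue) enqueue(key string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.pending[key] {
		return false
	}
	q.pending[key] = true
	q.jobs = append(q.jobs, key)
	return true
}

// pop hands the next job to a worker and clears its pending marker. The
// handler itself still has to be idempotent, so that a retry after a
// failure cannot corrupt state.
func (q *dedupQueue) pop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.jobs) == 0 {
		return "", false
	}
	key := q.jobs[0]
	q.jobs = q.jobs[1:]
	delete(q.pending, key)
	return key, true
}

func main() {
	q := newDedupQueue()
	fmt.Println(q.enqueue("authorized_projects:user:42")) // true: scheduled
	fmt.Println(q.enqueue("authorized_projects:user:42")) // false: duplicate dropped
	if job, ok := q.pop(); ok {
		fmt.Println("processing", job)
	}
}
```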

Continuous profiling of Golang services

The continuous profiling project seemingly came out of nowhere. I want to highlight what an amazing collaboration this was between the backend engineers and the SREs on the team, exposed by the DEI, unlocking a powerful tool for everyone working on these services to get better at writing code (and not only at scale). The idea turned into a side project that opened up great possibilities for people who want to make improvements. It allows people on call to reduce mean time to detection, and developers to make pre-emptive improvements. Quite an impact, and a worthy time investment if you ask me.
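As a minimal sketch of the building block that continuous-profiling setups for Go services typically scrape, the standard library's net/http/pprof package exposes CPU and heap profiles over HTTP. The pipeline the team wired up around GitLab.com's services is not shown here.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// A continuous-profiling agent (or an engineer running `go tool pprof`)
	// can now scrape CPU and heap profiles from this endpoint on a schedule.
	log.Println("profiling endpoints available at http://localhost:6060/debug/pprof/")
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

From there, `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30` captures a 30-second CPU profile for analysis.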

Database connection pool optimisation

If you had talked with people before we decided to revisit it, this is the project that should not have been one at all. This is not to say that anyone made a wrong decision; rather, it highlights the power of a disagreement that can be revisited in light of new details. We seemingly had the configuration set exactly right, which did allow everything to operate as expected under normal circumstances. If circumstances were to change, we had the tools to change the configuration, but was the turnaround time acceptable? As stated by Sean, adding headroom and ensuring that we track metrics allows us to reduce the risk and think about the problem outside of an incident. Some would say that this is what scaling is all about.
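GitLab.com's connection pooling actually lives in the Rails application and PgBouncer, but the principle translates to any stack. Below is a small Go sketch, with placeholder numbers rather than production values, of configuring a pool with headroom and exposing the metrics worth tracking.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres driver chosen purely for this sketch
)

func main() {
	// sql.Open only prepares the pool; it does not connect yet, so this
	// sketch runs even without a reachable database.
	db, err := sql.Open("postgres", "postgres://localhost/example?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// Size the pool with headroom above observed peak usage rather than at
	// the theoretical minimum, so a change in circumstances does not force
	// an emergency reconfiguration. These numbers are placeholders, not the
	// values used on GitLab.com.
	db.SetMaxOpenConns(50)
	db.SetMaxIdleConns(10)
	db.SetConnMaxLifetime(30 * time.Minute)

	// db.Stats() exposes the metrics worth tracking continuously: open and
	// in-use connections, idle connections, and how long callers waited.
	s := db.Stats()
	log.Printf("open=%d in_use=%d idle=%d wait=%s",
		s.OpenConnections, s.InUse, s.Idle, s.WaitDuration)
}
```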

The impact of this project on GitLab.com is large, in that we significantly reduced the possibility of a customer-impacting issue. That the original incident had no customer impact was a lucky break, but the purposeful discussions that followed have certainly prevented real customer impact in the long run.

Deep dive into Redis

This might sound odd to people familiar with the project, but it was a highlight of the year for me personally. The blog post about the project is a great read and a great lesson in iteration. With the win from the background job processing epic under our belt, we were determined to repeat the same with Redis. After all, this service had been running rock solid for a few years after the many scaling issues we had back in 2017, so surely it was time for it to become a problem again? With that clear(?!) goal in mind, we set out to make an impact.

By the end of the three-month project, I would argue that we did far more than we expected. The technical benefits alone were substantial:

  1. Increased observability that allows us to find issues more quickly and to better understand how we utilise Redis from the application side (a minimal sketch of this kind of instrumentation follows this list)
  2. Sharing the knowledge with teams external to Scalability
  3. Performance improvements in the application at GitLab.com scale, and making others aware of the improvements they need to make
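As an aside on the observability point above, the sketch below shows one way application-side Redis instrumentation can look, using client hooks to time every command. It is written against the go-redis v8 client purely for illustration; GitLab's actual instrumentation is Ruby middleware feeding Prometheus, not this code.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/go-redis/redis/v8"
)

// latencyHook logs how long each Redis command takes, keyed by command name.
// Real instrumentation would feed a metrics system instead of the log.
type latencyHook struct{}

type startKey struct{}

func (latencyHook) BeforeProcess(ctx context.Context, cmd redis.Cmder) (context.Context, error) {
	return context.WithValue(ctx, startKey{}, time.Now()), nil
}

func (latencyHook) AfterProcess(ctx context.Context, cmd redis.Cmder) error {
	if start, ok := ctx.Value(startKey{}).(time.Time); ok {
		log.Printf("redis %s took %s", cmd.Name(), time.Since(start))
	}
	return nil
}

func (latencyHook) BeforeProcessPipeline(ctx context.Context, cmds []redis.Cmder) (context.Context, error) {
	return context.WithValue(ctx, startKey{}, time.Now()), nil
}

func (latencyHook) AfterProcessPipeline(ctx context.Context, cmds []redis.Cmder) error {
	if start, ok := ctx.Value(startKey{}).(time.Time); ok {
		log.Printf("redis pipeline of %d commands took %s", len(cmds), time.Since(start))
	}
	return nil
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	rdb.AddHook(latencyHook{})

	ctx := context.Background()
	rdb.Set(ctx, "greeting", "hello", time.Minute)
	rdb.Get(ctx, "greeting")
}
```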

Beyond those, I would argue that the impact on the team was even more important. We learned to question our approach to work:

  1. Ensure that what we do has impact every step of the way, not only at the end
  2. Keep scope creep under control: agree on exit criteria and spot-check them every step of the way
  3. If you don't know where to start, start from the events that have already happened and see how you can improve on them. Then make a projection based on the outcome of your first set of steps
  4. If you start losing sight of the goal, speak up "loudly"

My absolute favourite thought from this write-up is that we will always have "the Redis project" to compare against, as a baseline of something we don't want to repeat. For a project that generated so much real value, that "low" baseline is something I would take any time of the day.

Decouple Puma from Pages NFS

Speaking of impact every step of the way and keeping scope creep under control, let's talk about the Decouple Pages NFS project. The project came at a very inconvenient time and was not easy to connect with the team mission. However, in my view it was absolutely in line with the impact the team is supposed to be making. This is yet another project that directly helped the Delivery team: by isolating the dependency on the Pages NFS mounts to a single Sidekiq queue, it unlocked that team's migration of not only Sidekiq but also other services (Git HTTPS, web, and API). Where this project shone was that we stayed focused on delivering immediate impact, unlocking horizontal scaling possibilities for other services, while in parallel working directly with the Pages team to ensure that the new architecture can scale on GitLab.com. It might feel like a bit of a stretch to say this, but by doing this now we have most likely pre-empted a project that would inevitably have ended up on our plate in the mid term.

Feature category information

Recording feature category and Dashboards for Stage groups are projects that get a mention here because they will unlock a serious amount of value in 2021. The fact that we can attribute errors and latency to specific groups of features (and with that, to teams) allows us to work more closely with the groups whose features will benefit from improvements on GitLab.com. This means improvements for everyone working on GitLab, but also for everyone using GitLab.com. If you have read this far, you are probably rolling your eyes at the mention of Error Budgets, because you have probably heard me rant on about them (also, thank you for reading this far, I appreciate it very much). We didn't manage to make the two previous attempts stick, but the third attempt, which I have high hopes for, is in progress.
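To make the attribution idea concrete, here is a hedged sketch of what it enables: once errors and latency carry a feature category label, dashboards and error budgets can be sliced per stage group. The metric name, label value, and endpoint below are illustrative only, not GitLab's actual series.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestErrors counts errors attributed to the feature category that owns
// the failing endpoint, so dashboards can be sliced per stage group. The
// metric name and label values are illustrative only.
var requestErrors = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "app_request_errors_total",
		Help: "Request errors, attributed to the owning feature category.",
	},
	[]string{"feature_category"},
)

func main() {
	http.HandleFunc("/mirror", func(w http.ResponseWriter, r *http.Request) {
		// On failure, the error counts against the owning feature category.
		requestErrors.WithLabelValues("source_code_management").Inc()
		http.Error(w, "upstream unavailable", http.StatusBadGateway)
	})
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```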

What is lined up for 2021?

This issue came about as preparation for writing out a strategy for the team in 2021. The draft MR is an attempt to align with the department strategy, which is still being defined.

From that MR, the top of mind items for me are:

  1. Get better at measuring the impact of the team. While not perfect, getting some real numbers for Delivery was much simpler. The team has to have a PI as a north star to avoid focusing on the wrong things. Right now, setting the focus requires a large amount of effort.
  2. Stay ahead of demand by getting better at predicting it. This is hard, and often a frustrating experience, but I do not think there is a better-equipped team at this company to get a grip on it. This is the only way we will help not only the Reliability teams but also the rest of engineering ship better features for everyone.
  3. Continue delivering tangible value for GitLab.com and, indirectly, for our self-managed customers who run at a larger scale. It has been mentioned many times, but every improvement we make at scale for GitLab.com is a challenge our self-managed customers won't have to deal with.