2022 Team Impact Overview
Background
My goodness. Another year!
We have come a long way from where the team started with Bob, Marin and Andrew in 2020. Now we have two teams made up of two EMs and 14 engineers! This is definitely one of our biggest successes this year - building out the team with wonderful talent and figuring out how to work with each other. We also developed themes to help us group the types of work that we take on. We hope to use these themes more as we focus on our prioritization process next year.
Before we take a look at the projects, it's nice to remind ourselves about what we had set out to do this year: #1473 (closed)
Let's take a look at the work we accomplished.
Project Highlights
This isn't an exhaustive list; there are projects I haven't included here.
Giving stage groups control over their endpoint SLIs
At the start of the year, we delivered the projects that enable stage groups to set their own thresholds per endpoint and have these thresholds reflected in the Error Budgets. We then engaged with the stage groups to opt-in to these new calculations and by May, all groups had opted in and the legacy non-configurable apdex recordings had been removed. This has been our largest achievement with Error Budgets this year with the process remaining relatively stable since then.
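To make the mechanics concrete, here's a minimal sketch of per-endpoint apdex with configurable thresholds. The endpoint names and threshold values are made up for illustration - this isn't the actual Error Budget implementation:

```python
# Minimal sketch: apdex where each endpoint declares its own
# "satisfactory" latency threshold. All values are illustrative.
DEFAULT_THRESHOLD_S = 1.0

# Per-endpoint thresholds a stage group might opt into (hypothetical).
ENDPOINT_THRESHOLDS_S = {
    "GET /api/v4/projects": 0.5,
    "POST /api/v4/projects/:id/merge_requests": 5.0,
}

def apdex(endpoint: str, durations_s: list[float]) -> float:
    """Fraction of requests that met the endpoint's own threshold."""
    if not durations_s:
        return 1.0
    threshold = ENDPOINT_THRESHOLDS_S.get(endpoint, DEFAULT_THRESHOLD_S)
    satisfied = sum(1 for d in durations_s if d <= threshold)
    return satisfied / len(durations_s)

print(apdex("GET /api/v4/projects", [0.2, 0.4, 0.9]))  # 2/3 ≈ 0.67
```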
Taking ownership of Capacity Planning
We rapidly approached the scalability limits of Andrew-as-a-Service and brought Tamland and the Capacity Planning process into the Scalability team. The timing worked well and this helped to form our understanding of what the Projections team would be responsible for.
The process itself was highly manual and was taking engineers (mostly Bob) almost a full week to process the reports - and by that time, a new report had been generated and was ready for inspection! We put a lot of effort into automating what we could and streamlining the rest. By the end of the year, we had reduced the process to a few hours a week, with a triage rotation shared between the Projections team members.
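At its heart, the automatable part of capacity planning is forecasting: fit a trend to a saturation metric and flag components that will cross a threshold soon. Here's a rough sketch of that idea (a plain linear fit, not Tamland's actual forecasting models):

```python
import numpy as np

def days_until_saturation(daily_utilization: list[float],
                          threshold: float = 0.9) -> float | None:
    """Fit a linear trend to daily utilization samples (0..1) and
    estimate the days remaining until the threshold is crossed."""
    days = np.arange(len(daily_utilization))
    slope, intercept = np.polyfit(days, daily_utilization, 1)
    if slope <= 0:
        return None  # flat or improving: no projected crossing
    crossing_day = (threshold - intercept) / slope
    return max(crossing_day - days[-1], 0.0)

# e.g. disk utilization creeping up 0.5% per day from 70%:
samples = [0.70 + 0.005 * d for d in range(30)]
print(f"{days_until_saturation(samples):.0f} days to 90%")  # ~11 days
```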
Redis, redis, redis
Last year, this section was "Redis and Sidekiq" but I'm pleased to say that with the massive effort on Sidekiq in 2021, we ended up paying it very little attention this year. It just worked! Redis, however, has been somewhat of a struggle.
In January, our collaboration with the Memory team resulted in a new Redis instance for session keys to alleviate the pressure on redis-persistent. This removed 51% of the data storage from redis-persistent, along with a 30% drop in CPU utilization per node. This was a different way of working for us, but we saw that we could support another team making the changes to functionally partition a Redis instance. However, we already knew that a never-ending series of new Redis instances was not a viable long-term solution.
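"Functional partitioning" here just means routing one class of keys to its own instance. A simplified sketch of the idea using the Python redis client (the hostnames are made up):

```python
import redis

# Hypothetical hostnames, for illustration only.
persistent = redis.Redis(host="redis-persistent.internal", port=6379)
sessions = redis.Redis(host="redis-sessions.internal", port=6379)

def client_for(key: str) -> redis.Redis:
    """Send session keys to the dedicated instance; everything else
    stays on redis-persistent."""
    return sessions if key.startswith("session:") else persistent

def set_key(key: str, value: str, ttl_s: int) -> None:
    client_for(key).set(key, value, ex=ttl_s)
```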
At the end of 2021, we had concluded discussions on a strategy for Redis and chosen a path forward with Kubernetes. In April, we started researching how Redis on Kubernetes would work and developed a proof of concept that we could follow.
There was a request for a new Redis instance (for Registry) and we took the opportunity to deliver it on Kubernetes straight away so that we would have a production instance in place. We learned a lot in the process and tried to deliver a second Kubernetes Redis instance that would house the rate limiting keys. This showed us that the operability of Redis in Kubernetes is indeed easier than the VM installations, but the high traffic to the instance led to high CPU utilization - and despite our extensive benchmarking, we hadn't picked this up as a risk. Ultimately, we weren't going to gain enough headroom on this instance, so we reverted to VMs.
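Part of what made that instance so CPU-hungry is inherent to the workload: a Redis-backed rate limiter touches Redis on every request it evaluates. A minimal fixed-window sketch (hypothetical hostname and limits) shows the per-request cost:

```python
import time
import redis

r = redis.Redis(host="redis-ratelimiting.internal")  # hypothetical host

def allow_request(user_id: str, limit: int = 600, window_s: int = 60) -> bool:
    """Fixed-window counter: every single request costs an INCR,
    plus an EXPIRE on the first hit of each window."""
    window = int(time.time()) // window_s
    key = f"ratelimit:{user_id}:{window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_s)
    return count <= limit
```

At GitLab.com request volumes, that per-request INCR adds up quickly on a Redis process that executes commands on a single thread.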
Another Redis instance that caused us grief this year was redis-cache. To be fair, it had been causing us grief since at least March 2020. When the spikes started causing problems in the Error Budgets, we knew we had to act quickly and with clarity so that we didn't degrade trust in the Error Budgets themselves. This led to a notably detailed investigation to fully understand what was going on (resulting in an amazing blog post), which presented a number of options we could take. Frustratingly, the solution was incredibly boring: reduce the TTL! This has bought us enough headroom to implement a tactical solution to beat the new saturation warnings, as well as a long-term solution for how we handle all of these Redis instances.
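For reference, the "boring" fix amounts to nothing more than writing cache entries with a shorter expiry so that memory is reclaimed sooner. A sketch with illustrative TTLs (not the values we actually chose):

```python
import redis

cache = redis.Redis(host="redis-cache.internal")  # hypothetical host

OLD_TTL_S = 8 * 60 * 60  # illustrative: entries lingered for hours
NEW_TTL_S = 1 * 60 * 60  # illustrative: caps how long stale data is held

def write_cache(key: str, value: bytes) -> None:
    # Same write path; only the expiry changes. Shorter-lived keys
    # mean less resident data on the instance.
    cache.set(key, value, ex=NEW_TTL_S)
```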
And I lied - there were two Sidekiq projects, but the culprit was redis-sidekiq! In one, we started using a webhook instead of pushing to redis-sidekiq, and in the other we reduced the CPU utilization peaks from 60% to 40%, which was a nice win!
And Gitaly!
We ended up working on the fleet upgrade project to help get the Gitaly servers up to date. Saying that 85 servers were upgraded in 3 days makes this sound trivial, but a lot of time was spent making the process safe and repeatable, with as small a downtime window as possible.
We also delivered cgroups for Gitaly. While this was a follow-on from work by the Reliability team, it was significant for Scalability because individual file servers were suffering memory saturation from both abusive and normal workloads. Following the rollout, we've seen several incidents prevented and the alert frequency for Gitaly incidents drop significantly.
Introduction of cost data
Cost data was also an aspect of work introduced into the Scalability team this year. Blake has stepped into this challenge and is navigating the variety of systems, people, and processes needed to generate good information. Initially, this has focused on supporting Finance's need to label all of the production resources, but we look forward to bringing this back to application-level cost awareness next year.
Looking forward to 2023
- Creating teams and devising strategies has been a big part of 2022. In 2023, we need to focus on making our teams high-performing and turning these strategies and plans into tangible results.
- We have further work to do on how we work together and how we choose what to work on. Liam and I will spend (many) hours focusing on this. We also have additional support now with the introduction of a Product Director into our sub-department (hi Fabian!).
- To that end, we will explore the themes we created this year and try to shape our work with these in mind. They also support our existing categories of work and we need to convey clear messages for the direction we want those categories to take.
- Our most externally visible category remains the Error Budgets and in the new year we need to implement the indicator that we have been discussing. We should be able to use this as a guide to improve the quality of information represented in the Error Budgets and provide more detailed information to stage groups as a result.
- Rather than just having Capacity Planning be a thing that we process, we need to bring it in as an input to our work. We need to figure out how to use this as a prioritization tool and how to get other teams to do the same.
- Redis... we need to move from being "all things Redis" to being a team that enables other people to do Redis things. We need to provide scalability options and solutions and enable other teams to implement those options when they see the need arise.
As to "how", we'll figure that out.
In closing...
Thank you to everyone who contributed to these brilliant challenges. Let's see what 2023 has in store!