2023 Team Impact Overview
Background
This is the fourth annual overview issue!
It has been a significant year of change. We started the year needing to handle the Reduction in Force, followed by an exceptionally fast pivot to supporting AI initiatives. Then two of our longest-tenured team members (Sean and Jacob) decided it was time to move on, while another long-tenured team member (Andreas) decided it was time to come back! Plus Alejandro was then bitten by the AI bug, so he moved over to Development to work more closely with AI development projects. And just as we thought we were settling down in the run-up to the festive season, the Infrastructure re-org happened, and with it the inclusion and welcome of Observability and Practices into our group.
Oh, and we did some projects too.
Project Highlights
This isn't meant to be an exhaustive list.
Runway and support for Innovation
This one took us by surprise. We've been talking about the need for a Platform tool for a while, and when we suddenly had an ambitious target of supporting at least 40 new features, we decided that having a tool to support that innovation would be critical to meeting it. The target number of features we'd need to help into production came down as the AI landscape became clearer, but we continued building a tool to improve how we manage services for GitLab.com. We borrowed Graeme from the Delivery group, and we delivered Runway.
Runway now carries 100% of the production traffic for the AI Model Gateway, and we have made great progress towards moving both PVS and the External License Database over.
Capacity Planning
Tamland and the Capacity Planning process have come a long way and look very different from what was in place this time last year. The triage workload has come right down now that more of it is in the hands of the service owners. We've also tried to set up the process for Dedicated, which has not been as straightforward; hopefully we can work through that early in the new year. My personal favourite has been the additional information that is now visible on each graph, so it's clear to the user why a chart's trend suddenly changes.
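As an illustration of the kind of chart context this refers to (a minimal sketch, not Tamland's actual code), here is a small Python example that marks a known event on a saturation trend so the sudden change is explained on the graph itself. The data, dates, thresholds, and the "node pool resized" label are all fabricated for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Fabricated saturation series with a step change part-way through,
# standing in for a real forecasted resource metric.
dates = pd.date_range("2023-10-01", periods=90, freq="D")
saturation = np.clip(np.linspace(0.55, 0.80, 90) + np.random.normal(0, 0.01, 90), 0, 1)
saturation[45:] -= 0.20  # capacity was added, so utilisation drops suddenly

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(dates, saturation, label="disk utilisation")
ax.axhline(0.85, linestyle="--", color="red", label="soft threshold")

# The extra context: annotate the event that explains the sudden change,
# so a reader of the chart doesn't have to go digging for the reason.
event_date = dates[45]
ax.axvline(event_date, color="grey", linestyle=":")
ax.annotate("node pool resized (illustrative)",
            xy=(event_date, 0.6), xytext=(10, 10), textcoords="offset points")

ax.set_ylabel("saturation (0-1)")
ax.legend(loc="lower left")
fig.tight_layout()
plt.show()
```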
Redis
This one is very impressive. This year we reaped all of the rewards of having invested So Much Time into Redis over the past few years. There is absolutely no way we could have delivered this much without the groundwork that was laid. So for this one, I will list out the projects...
- RDB backups on primary nodes
- Redis-Sidekiq deduplication on the new Redis Cluster node
- Migrate exclusive lease keys from redis-persistent to redis-cluster-shared-state
- Functional partition for pub/sub
- Horizontally scale Redis Cache using Redis Cluster
- Functional partition for feature flags
- Rack::Attack Redis calls
- Functional partition for DB load balancing
- Reduce Redis sharding toil
- Functional partition for repository cache
- Redis Cluster for rate limiting
- Redis Cluster for GitLab Chat
Twelve projects. Mic drop.
Sidekiq
Similarly, our prior work on Sidekiq means it has remained a low-touch service again this year. We've made some changes to defer Sidekiq jobs via feature flags, adjusted routing rules, and stopped using namespaces.
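For anyone wondering what "deferring jobs via feature flags" looks like in practice, here is a minimal sketch of the pattern in Python. The flag name format, the worker name, the five-minute delay, and the in-memory flag store are all illustrative assumptions, not our actual implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Illustrative feature-flag store; in reality this would be backed by the
# application's feature-flag service rather than an in-memory set.
ENABLED_FLAGS = {"defer_jobs:slow_export_worker"}


def is_feature_enabled(flag: str) -> bool:
    return flag in ENABLED_FLAGS


@dataclass
class Queue:
    """A toy job queue that supports scheduling a job for later."""
    jobs: List[Tuple[float, str, dict]] = field(default_factory=list)

    def enqueue(self, worker: str, args: dict, delay_seconds: float = 0.0) -> None:
        self.jobs.append((time.time() + delay_seconds, worker, args))


def perform(queue: Queue, worker: str, args: dict, run: Callable[[dict], None]) -> None:
    """Run the job, unless a 'defer' flag for this worker is enabled.

    When the flag is on, the job is pushed back onto the queue with a delay
    instead of executing, which sheds load from the worker fleet without
    dropping work.
    """
    if is_feature_enabled(f"defer_jobs:{worker}"):
        queue.enqueue(worker, args, delay_seconds=300)  # try again in 5 minutes
        return
    run(args)


# Usage: with the flag enabled above, this job is rescheduled rather than run.
q = Queue()
perform(q, "slow_export_worker", {"project_id": 42}, run=lambda args: print("exporting", args))
print(q.jobs)
```

The point of the pattern is that turning a single flag on or off changes whether a class of jobs executes immediately or is pushed into the future, without a deploy.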
Error Budgets
And Error Budgets. These continued to be a useful tool for the stage groups, and we made some improvements to enrich Rails errors with endpoint information and to make the Sidekiq SLIs explorable.
The largest piece of still-ongoing work is improving the quality and trustworthiness of the data that goes into the Error Budgets. We had a worrying period of instability, but we are back on track now. Having recently combined with the Observability team, we are on a good path to wrap this up in the new year.
Research
We also put our knowledge to good use to help others understand saturation concerns and saturation events. These ranged from Gitaly disk space leaks, database lock saturation, and replication lag to gaps in our metrics. We also seemed to have a recurring theme of latency research, starting with cross-region latency for Cells and continuing with latency challenges in Code Suggestions.
Looking forward to 2024
Right at the end of the year we joined forces with some team members from the Reliability group, bringing Observability and Practices into Scalability. It's wonderful to have more knowledge join us, and I'm looking forward to seeing how we can harness all of this experience into truly excellent experiences for those who use our services.
We are also entering a period where things must change. Our customers and our community demand more. AI will continue to drive our focus. And we need to be able to scale to meet those needs.
So when we are all back in January, we'll be looking at our plans in Scalability: the 3-5 year goals, where we need to be in 12 months, and how that feeds back into our OKRs for Q1.
Already we are considering how our Observability stack needs to change to handle all of our platforms and their growth, along with how we continue to send relevant planning information to our stage groups and service owners.
Alongside that, we need to see how we can better serve the stage groups who need to onboard new features into Production. We need to continue to work on enabling them to have ownership of their features and services.
Thank you all for a tremendous year; I'm looking forward to working with you all again in the new year.