# Scaling GitLab's SaaS Platforms
Scaling GitLab's SaaS platforms is a continuous task. We create issues and epics as we encounter things, with the goal of addressing issues based on the [Scalability team Work Prioritization process](https://about.gitlab.com/handbook/engineering/infrastructure/team/scalability/#work-prioritization-process). Below you can find the issues, epics, and boards that drive our work.

Summary of Scalability issues not in epics: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/538

## Project Work

### :stop_button: Triage

Epics that are currently being triaged

| **Topic** | **Summary** |
|-----------|-------------|
| [Roadmap - Scalability:Practices](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1202) <br/> ~"team::Scalability-Practices" | |
| [Code Suggestions Infrastructure Improvements](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1008) <br/> ~"team::Frameworks (deprecated)" | |

### :white_check_mark: Completed Work

Items that have been completed

<details>

| **Topic** | **Started** | **Ended** | **Summary** |
|-----------|-------------|-----------|-------------|
| [Scalability Completed Epics 2020](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1059) <br/> ~"group::scalability" | 2020-01-01 | 2023-06-27 | <br/><br/>**Nested Epics: 16**<br/><br/>• https://gitlab.com/groups/gitlab-org/-/epics/3980+ **2020-09-21**: All Pages NFS access on GitLab.com now happens in Sidekiq, not Puma, and we've removed the feature flags. All Pages disk access from Puma will raise an error in any environment (development, staging, etc.). Pages NFS has been unmounted from the Web and API fleet in production.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/215+ **2020-08-07**: We now have a sensible default, which GitLab.com is using. We have the option to add more connections through an environment variable.
We created saturation metrics and alerts to monitor the pool usage, and added more detailed thread-level metrics to help debug if we do have issues in the future. This should prevent us from exhausting the database connection pool in the future - and if we do exhaust it, debugging will be much easier. When we started this project (and recording metrics), we peaked at a fully saturated (100%) connection pool. This meant that jobs would wait before getting a connection from the pool. This could cause a job to be retried if it needed to wait for longer than 5s. This slowdown isn't directly user-visible, but it's wasting resources. After the last change was deployed on 2020-07-29, we're peaking at 60% utilization of the connection pool from Sidekiq usage.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/181+ **2020-05-21**: `sidekiq-cluster` is now the default for GDK and self-hosted instances, which brings the environment that developers work on in line with what is run in production.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/309+ **2020-09-03**: We investigated several instances where we saw heavy usage of Redis. These have now been closed or moved to `gitlab-org/gitlab` and handed to the relevant stage group.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/96+ **2020-06-11**: It is now possible for application developers to define the requirements of their workers based on the job's latency requirements, priority, and resource needs. Based on the characteristics defined, these background jobs are routed appropriately to fleets that are provisioned to best serve these needs. Visibility into these queues has been enhanced, and we used that increased visibility to improve problematic background job queues.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/327+ **2020-11-18**: There are no more active `BadRequest` or `invalid byte sequence` reports in Sentry.
We'll never cover all cases, but these were the most common ones we saw during incidents. In the future, when automated traffic like this hits in bulk, we'll be rejecting the requests with a 400 response rather than an error. This will cause the invalid requests not to count towards our apdex and will hopefully prevent alerts. These have been addressed using Rack middlewares, high up the stack, hopefully catching a lot of invalid requests across all endpoints.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/269+ **2020-11-20**: Feature category information is available for each Rails controller and each API endpoint. This is surfaced in the [web dashboard](https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1). Changes have been communicated to teams. This paves the way for custom dashboards and alerting for stage groups that will be addressed in future projects.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/285+ **2020-10-06**: We have addressed a series of problems where data in redis-persistent wasn't correct. Cleaning this up removes noise from the data so that people investigating issues are less likely to be led astray.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/265+ **2020-09-28**: This project is completed. - Redis guidelines have been added to the development documentation: https://gitlab.com/gitlab-org/gitlab/blob/master/doc/development/redis.md - Redis training videos were created: - https://youtu.be/BBI68QuYRH8 - Redis Slowlog - https://youtu.be/jw1Wv2IJxzs - Redis observability in Sidekiq logs - https://youtu.be/Uhdj19Dc6vU - Sorting Rails log entries by number of Redis calls - Redis guide for SREs created: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/redis/redis-survival-guide-for-sres.md<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/286+ **2020-09-02**: This project has been completed.
We were able to reduce the SMEMBERS calls from an average weekday rate of 175 to a new average rate of approximately 30 calls. This should help prevent outages like the incident from January where Redis became saturated.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/211+ **2020-06-25**: We rescoped this epic to solely be about providing a validator within the application to catch cases where a command would not be compatible with Redis Cluster. We do not currently plan to enable Redis Cluster on GitLab.com, but we want to reserve the option for the future. The validator is enabled in development and test modes. Currently known cross-slot commands are allow-listed with `Gitlab::Instrumentation::RedisClusterValidator.allow_cross_slot_commands`. Development documentation is available at https://docs.gitlab.com/ee/development/redis.html#multi-key-commands, and we've created an issue to make CI build trace chunks compatible with this before we enable that on GitLab.com: https://gitlab.com/gitlab-org/gitlab/-/issues/224171<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/212+ **2020-08-07**: We've finished this project. We now have much more detailed structured logging (time, calls, bytes written, and bytes read per instance), the slowlog available through Kibana, and latency alerts for all three Redis instances. We have also made it easier to perform analysis on the entire keyspace of a Redis instance.
We'll be addressing some issues we found in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/286 and https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/285, and passing issues to stage groups in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/265.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/207+ **2020-05-21**: We addressed the 12 workers that were spending the most time on work that could be considered "duplicate", making sure they are idempotent and could be deduplicated.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/200+ **2020-07-03**: The Continuous Go Profiler has been enabled for Workhorse, Gitaly, and GitLab Pages. Memory and CPU profiler data is now available through the Stackdriver UI. Development teams are aware of the new tool and have access to documentation and a video tutorial on how to use it.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/191+ **2020-05-14**: We split the work that was running on this queue and moved different workloads onto more appropriate queues. We kept the CPU-bound work on this queue and moved items with low urgency (or external dependencies) to a separate worker. Moving the large workload of reactive caching to the existing `low-urgency-cpu-bound` queue saturated the fleet and led to growing queue lengths. The fleets were reprovisioned accordingly.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/176+ **2020-05-15**: Ownership for the remainder of this work lies with the stage group. We're watching the work that group::access is doing on this queue and supporting where necessary.<br/> | | [Scalability Completed Epics 2021](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1058) <br/> ~"group::scalability" | 2021-01-01 | 2023-06-27 | <br/><br/>**Nested Epics: 15**<br/><br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/463+ <br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/468+ **2021-07-12**: This project is complete.
We introduced additional panels to the stage group dashboards to give teams an indication of where their error budgets are being spent.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/462+ **2021-07-08**: After trace chunks were moved to their own Redis instance we saw a [5% reduction in CPU usage and a 40-50% reduction in network traffic](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/462#note_627163760) from redis-persistent.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/490+ **2021-09-22**: During a [discussion](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1177#note_669959820), we concluded that the job compression capability benefits self-managed instances, but job rejection does not: losing jobs is far more dangerous than having some large jobs. Therefore, we tweaked job compression to [allow compression without rejection](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/69792). The application setting conversion MR was merged. The limit was reset to 0 because there was no elegant way to deploy the application setting transition *and* set the limit to 5MB at the same time; that's why we added setting the limit as part of this change issue. Rejection had only been enabled recently, and before the reset there were only a handful of rejections per week, so we concluded that temporarily disabling rejection while waiting for this change to roll out was acceptable. We disabled rejection for 5 days. The final changes to clean up the environment variables and set the limit back to 5MB were deployed to both staging (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5476) and production (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5475).
All the metrics and logs indicate the compression is functioning well.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/372+ **2021-04-21**: Once the cache is enabled on the file-hdd nodes the production roll-out is complete https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4280. We will be closing the epic without achieving our original goal of not having to rely on the pre-clone script anymore for `gitlab-org/gitlab`. At the time we came up with this project, there had been recent production incidents that clearly showed there was a bottleneck in the ability of a single Gitaly server to _generate_ enough Git fetch response data. The cache we built seems to address the "generating problem" well. However, we now see a bottleneck in how much Git fetch response data we can _transfer_ off of the Gitaly server. Without the pre-clone script, the amount of data we need to transfer goes up and we hit this new and not-quite-understood-yet bottleneck. What we did achieve is to reduce near-incidents and occasional incidents caused by CI activity on `gitlab-org/gitlab` forks, which do not benefit from the pre-clone script. These used to hit the "generating bottleneck" and now they no longer do. Their volume is low enough not to hit the "transfer bottleneck".<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/406+ **2021-06-01**: This epic is now closed. We continue to engage with the stage groups regarding their dashboards as error budgets become more widely adopted.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/400+ **2021-03-16**: By looking at data collected during production incidents we discovered two performance problems in Git on repositories with many refs (gitlab-org/gitlab has 500K refs, most of them hidden from users). We discovered and documented an initial workaround: configuring CI to fetch with `--no-tags`. 
Looking deeper, we then found ways to improve the performance of Git itself, which we submitted to the Git mailing list. These improvements were added to Git 2.31.0. We modified Omnibus and CNG to be able to ship these performance patches ahead of the Git 2.31.0 release. We wrapped up the project by removing the custom `--no-tags` CI setting on gitlab-org/gitlab and gitlab-com/www-gitlab-com because it is better to not have custom settings and rely on Git itself. The unnecessary server-side work eliminated by these changes amounted to about 50% of the Git CPU time on file-cny-01 at the time we found the problem.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/382+ **2021-01-27**: The initial scoring for the maturity model is in the handbook: https://about.gitlab.com/handbook/engineering/infrastructure/service-maturity-model/. We've shared this information via the week-in-review and in the #infrastructure-lounge Slack channel. We've also created https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/398 to flesh out the unimplemented items in the model (these do not affect the level of any service at the moment, so we left them out of the first version).<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/360+ **2020-01-13**: - We've generated all the dashboards based on the stage groups in stages.yml; they are available in Grafana: https://dashboards.gitlab.net/dashboards/f/stage-groups/stage-groups - The dashboards contain basic information: request rates, error rates, sidekiq-jobs and sidekiq error rates.
- We've added documentation on how to read the metrics we've included by default on all the dashboards: https://docs.gitlab.com/ee/development/stage_group_dashboards.html#metrics-panels - We've added more documentation on how to customize the dashboard: https://docs.gitlab.com/ee/development/stage_group_dashboards.html#how-to-customize-the-dashboard - We'll work with a single stage group to have a starting point: &388 - We've advertised the existence of these dashboards: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/666 Next up regarding the dashboards is &388. A short demo video of this: https://youtu.be/xB3gHlKCZpQ<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/341+ **2021-06-07**: We enabled Rate Limiting in enforcing mode in January. Since then, we have had only one production incident from automated traffic.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/469+ **2021-09-09**: This project is completed. Peak CPU saturation was reduced from 99% to below 90%, and regular CPU saturation dropped from consistently above 90% to often below 80%. The project took 5 months to complete and over this time we have also seen an increase of 25% in Sidekiq job throughput due to the natural growth of the system. We also expect CPU saturation to grow more slowly in relation to job volume in the future.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/437+ **2021-07-12**: We successfully introduced Error Budgets to the stage groups and gathered support from both the Engineering and Product Managers. We updated the stage group dashboards to display the error budget information as well as created dashboards in Sisense that will be used for Performance Indicators.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/526+ **2021-10-13**: This is complete.
We've observed CPU saturation on the redis-cache primary drop from ~90% to ~70% due to this work, with no reported user impact.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/557+ **2021-12-10**: We will be moving redis instances into Kubernetes and have started on a project to do this for all but the `redis-sidekiq` instance. For that instance, we will have a separate project to create zonal clusters to address scaling concerns there.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/398+ **2021-09-17**: This project is completed. It's easier to update the handbook page now and there is a monthly pipeline to keep this up to date. We also implemented automation for the dependency graph.<br/> | | [Scalability Completed Epics 2022](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1057) <br/> ~"group::scalability" | 2022-01-01 | 2023-06-27 | <br/><br/>**Nested Epics: 26**<br/><br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/198+ **2020-04-24**: This epic is ready for work, but we need to wait until it's closer to 14.0 to start.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/841+ **2022-11-22**: Both changes are now in production. 
The combined impact has been to reduce redis-sidekiq CPU utilization peaks from 60% to 40%.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/784+ **2022-09-26**: Judging by [Tamland](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1901#note_1114215376), we have achieved a 5-10 percentage point drop in Redis CPU utilization.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/741+ **2022-08-10**: The documentation is now live at https://docs.gitlab.com/ee/development/uploads/working_with_uploads.html.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/742+ **2022-07-13**: All requests with MIME multipart uploads that get pre-processed by Workhorse now first get checked to see if they are from a signed-in user.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/734+ **2022-12-07**: The `background_upload` configuration setting and all code that relies on it are gone.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/652+ **2022-05-10**: The feature flag has been removed. From GitLab 15.0 on, all gitlab-shell <-> gitaly connections use the sidechannel transport, and git fetch via SSH uses sidechannels.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/644+ **2022-05-10**: MailRoom webhook delivery strategy was deployed to all environments. All issues in this epic are resolved. I'll follow up on other issues outside of this epic. - :white_check_mark: Pre - :white_check_mark: Gstg - :white_check_mark: Gprd<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/619+ **2022-11-10**: This epic has been completed. We delivered a new Redis deployment on Kubernetes, `redis-registry-cache`. This deployment is used by ~"group::container registry", who are seeking to increase their usage over the coming months. This work is a first step in migrating stateful production workloads to Kubernetes. By doing so, we eliminate toil related to managing VMs, and pave the way for future migrations to Kubernetes.
It also makes it much easier to provision new Redis deployments (see https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/782). Utilization is still very low at this time as only 0.2% of registry traffic is making use of the cache thus far. This is expected to increase substantially as additional code paths are rolled out in the registry code. See the [`redis-registry-cache` dashboard](https://dashboards.gitlab.net/d/redis-registry-cache-main/redis-registry-cache-overview?orgId=1).<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/601+ **2022-04-20**: The OS upgrade is complete on the production servers. 85 servers were upgraded over 3 days. Cleanup tasks are done. The process has been documented for future upgrades, so that it can be executed by the relevant teams when needed.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/618+ **2022-12-01**: Closing this epic in favour of other open initiatives (as per https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/618#note_1193050512). The final scope of this epic was to deliver an instance of Redis on Kubernetes, which was delivered in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/619 with `redis-registry`. There was also an attempt to migrate another instance (`redis-ratelimiting`) in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/782, which, although unsuccessful, provided useful insights for running Redis on Kubernetes.
We are now focussed on other Redis initiatives in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/823 (long-term scalability) and https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/857 (a tactical solution to provide immediate headroom).<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/600+ <br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/579+ **2022-02-09**: With the Redis Session instance live and stable, [documentation](https://docs.gitlab.com/ee/development/redis/new_redis_instance.html#proposed-solution-migrate-data-by-using-multistore-with-the-fallback-strategy) updated, and the new instance resized, we consider this project successfully done. Measured impact ([more details](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1396#note_815472671)): - We saved about 51% of persisted data storage for the main Redis instance ([source](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1396#note_816555277)) - The total rate of commands dropped to 55% of what it was previously ([source](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1396#note_816538973), [operation graph](https://dashboards.gitlab.net/d/redis-main/redis-overview?orgId=1&from=now-90d&to=now&viewPanel=68)) - Network traffic is reduced by ~40% ([network in](https://dashboards.gitlab.net/d/redis-main/redis-overview?orgId=1&from=now-90d&to=now&viewPanel=74), [network out](https://dashboards.gitlab.net/d/redis-main/redis-overview?orgId=1&from=now-90d&to=now&viewPanel=72)) - We reduced memory footprint and CPU usage per node by ~30%.
([RSS memory](https://dashboards.gitlab.net/d/redis-main/redis-overview?orgId=1&from=now-90d&to=now&viewPanel=88), [CPU](https://dashboards.gitlab.net/d/redis-main/redis-overview?orgId=1&from=now-90d&to=now&viewPanel=70)) This resulted in reducing the current instance size (and cost) from: **c2-standard-30** (30 vCPUs, 120 GB, $902/mo \*3) to **c2-standard-16** (16 vCPUs, 64 GB, $481/mo \*3)<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525+ **2021-01-21**: Everything is in place for stage groups to opt in to the new SLI. For stage groups, the `puma-apdex` component still exists while they are adjusting urgencies. But for service level monitoring, we are fully relying on `rails_requests-apdex`, which is backed by the new metrics. One bug affecting the new SLI (https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1500) was discovered thanks to groups opting in and has been fixed. When all groups have opted in at the end of the quarter (hopefully) (https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1343), we'll be able to remove the separate recordings for `puma-apdex` used by the error budgets for stage groups (https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1481). Next, we're going to improve the observability by building a focused dashboard for this: &664<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/296+ <br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/676+ **2022-04-01**: We've made a series of improvements to our interview processes and content. We'll continue to iterate, but the initial burst of changes is complete.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/650+ **2022-04-05**: The last piece of this epic is complete!
A tour of how to access and use the periodic profiling data is here: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1433#note_901717231<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/664+ **2022-03-24**: Dashboards have been created for all groups, and are available with a link from their stage group dashboard: ![image](/uploads/997fd654ad624795655e77879d26e620/image.png) We now also have a tag per group in Grafana, allowing groups to see all their dashboards in one view using a filter: https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&refresh=5m&search=open&tag=stage_group:Package The documentation for all stage group availability numbers has been updated and is available here: https://docs.gitlab.com/ee/development/stage_group_observability/. The documentation specific to the dashboard is here: https://docs.gitlab.com/ee/development/stage_group_observability/dashboards/error_budget_detail.html Possible follow-up work: 1. scalability#1476 Add detail to puma-errors 1. &700 Make Sidekiq apdex & errors explorable through the dashboard 1. &665 Finish adding the GraphQL SLI 1. &615 Alerts for stage groups<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/250+ **2020-05-19**: Created Epic and added the work resulting from https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/68<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/613+ <br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/790+ **2022-12-**: We have a demo with the MVC that we want to share at https://gitlab-com.gitlab.io/gl-infra/platform/stage-groups-index/. This updates daily. We are collecting feedback for possible next steps in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2056.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/810+ **2022-10-20**: The goal of this project was to have labelled resources available in the billing reports specifically for GitLab Pages. 
* We have data from the load balancers (bandwidth, instance cost) for pages (most of the costs) from all of gstg, pre, and gprd showing up in the billing reports. * We have data from the cloud storage buckets for the gstg, pre, and gprd environments showing up in the billing reports. * There may be some bits and pieces inside of the GKE cluster that could be more directly attributed to pages, but that will get sorted out as part of the GKE cost reporting work in the future.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/614+ **2023-02-22**: We have promoted this to the rest of the stage groups as part of the [monthly error budget report](https://gitlab.com/gitlab-org/error-budget-reports/-/issues/19) and have 11 stage groups signed up in total. We have acted on the [final pieces of feedback](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2092) and this epic is now closed.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/728+ **2022-07-13**: We have refined parts of the capacity planning process and automated items where possible. The documentation has been updated to match what we are now doing because this is a controlled activity for SOC 2 compliance.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/686+ **2022-04-20**: Access was restored for all users who should have access to the continuous profiler tools. We also added support for the container registry and KAS.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/587+ **2022-07-15**: All services achieve Level 1 in the Service Maturity Model<br/> | | [Scalability Completed Epics 2023](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1060) <br/> ~"group::scalability" | 2023-01-01 | 2023-07-04 | **2023-07-03**: <br>This epic is used to collect all projects that have moved to a ~"workflow-infra::Done" state. 
This helps reduce the number of epics directly owned by https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/148, thus delaying hitting the [limit](https://docs.gitlab.com/ee/user/group/epics/manage_epics.html#multi-level-child-epics). This epic is `closed` so that it appears in the [:white_check_mark: Completed Work](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/148#white_check_mark-completed-work) section of the parent epic.<br><br/><br/>**Nested Epics: 33**<br/><br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1012+ **2023-12-06**: This epic can be closed. The original purpose of this epic was to improve the precision of our metrics and eliminate the gaps. The work that we completed in Q3 and the first part of Q4 has mostly accomplished this, with a 3 - 4 digit precision for all metrics and a ruler_evaluation success rate of 99.979% for the past 30 days. As part of the work to resolve the gaps, we have discovered that the metrics environment has several significant scalability issues that we need to resolve. This has been helped by the new re-org and we have a plan going forward. In order to keep each epic focused, we are going to close out this epic and will address the scalability issues for the metrics environment in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1107.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1056+ **2023-10-18**: This project is completed. We added the ability to fine tune forecasts by adding external information to Tamland, added more detailed information to the forecasts and the issues, and refined the process further.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1062+ **2023-11-30**: All Fastly sites have either been decommissioned or migrated to Cloudflare. 
The Fastly account has been closed.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1095+ **2023-11-29**: The infrastructure expectations have been completed and a new runner-manager is set up in the GCP project `gitlab-qa-runners-2`. This runner manager is managed by `chef-repo`, similar to how we do [Distribution](https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/blob/master/roles/build-trigger-runner-manager-gitlab-org.json). Test Platform will pick up the next steps of deploying runners and track that in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1095.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1005+ **2023-09-25**: * We are reorganizing some of the AI issues/epics to represent the current state of work. One outcome is that the ~"group::ai framework" group will be responsible for prioritizing AI initiatives and will reach out to Infrastructure when needed. It is not efficient for Infrastructure to try to keep track of all AI initiatives (and beyond). Instead, we will add improved guidance for when Infrastructure should be engaged: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2526. Further, ~"group::scalability" are continuing to invest in Runway (https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/969) and will try to direct future engagements through this channel.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/934+ <br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/431+ **2023-10-18**: Chef VMs were updated in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16444. The application clean-up was completed in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/132738. The epic is ready to be marked as ~"workflow-infra::Done" and closed. In summary, this epic was initially created to support &423 and picked up again recently for saturation prevention.
As part of &423, the initial work changed [BatchPopQueueing](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/990) and [LimitedCapacity::JobTracker](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/924). The more recent work [migrated the duplicate jobs workload](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/431#note_1582594368) into `redis-cluster-queues-meta`. This both provided CPU headroom and unblocked potential future efforts in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2541#note_1595103776.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1094+ **2023-10-18**: [OKR Progress:](https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/3492) 92% :white_check_mark: :rocket: The project has been completed. We gained 10% of CPU headroom in redis-persistent from the migration. Graph and detailed results (together with pub-sub migrations) can be found [here](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2155#note_1602403030). This epic can be closed.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1066+ **2023-10-25**: [OKR Progress:](https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/3492) 95% The clean-up for the workhorse-related and actioncable-related components has been completed. The epic is ready to be marked as ~"workflow-infra::Done" and closed. Both workhorse and actioncable workloads were migrated from ~"Service::Redis" into ~"Service::RedisPubsub". In addition, this epic delivered the first Redis instance to be running in a GKE cluster with Dataplane V2 (with help from ~"team::Foundations").<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1064+ **2023-10-18**: - Readiness review is complete. - This epic is complete.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/700+ **2023-08-14**: Note for Grand Review: The epic is done and can be closed.
The last piece of the awareness work, the [docs for stage groups](https://docs.gitlab.com/ee/development/application_slis/sidekiq_execution.html), is up. Key changes and benefits were announced in this week's Engineering Week in Review doc. Summary of the impacts: * Stage groups can now see successes and failures (as apdex and error ratio) per worker in the Application SLI Violations dashboard - https://dashboards.gitlab.net/goto/WrYKTre4g?orgId=1. * Infrastructure and Development now use the same definition for sidekiq execution, where before sidekiq execution and queuing were mixed for Infrastructure. * Infrastructure now has separate alerting for sidekiq queuing; this SLI is owned by Infrastructure, not stage groups. When stage groups meet the execution SLI, Infrastructure should be able to meet the queueing one. * ~2 million fewer metrics emitted from GitLab.com. More histogram metrics can be removed in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2297#list-of-metrics. The future plan for self-managed and Dedicated is discussed in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2474.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/878+ **2023-08-22**: The clean-up is completed and the `redis-cache` instance is removed. To summarise the result of this epic: - All cache-related workloads (rate-limiting, cache, repository-cache, chat, and feature-flag) for GitLab Rails are now Redis Cluster compatible. - `redis-cache` has been replaced with [`redis-cluster-cache`](https://dashboards.gitlab.net/d/redis-cluster-cache-main/redis-cluster-cache-overview?orgId=1) for over a month with plenty of CPU headroom. 
- [GDK](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2032) and the [GitLab repo's CI pipeline](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2327) have been updated to improve the developer experience of working with Redis Cluster.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/979+ **2023-06-21**: The migration of the feature flag workload into `redis-cluster-feature-flag` was completed on 8th June and there has been a [~15% drop in primary CPU saturation ratio](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/979#note_1424024997) on `redis-cache`. The [code/envvar clean-up is completed](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2378). This epic is ready to be closed.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1004+ **2023-06-07**: :rocket: The feature has been used to mitigate a real [S2 incident](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/14758) where a Sidekiq worker was causing `patroni-ci` CPU saturation ([Slack link for the feature flag toggle](https://gitlab.slack.com/archives/C05B215VB0D/p1686043155377219)). This effectively also closes the original [infradev issue](https://gitlab.com/gitlab-org/gitlab/-/issues/408520#note_1420770682). With that, this epic can also be closed.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1010+ **2023-06-07**: In summary, we provisioned a new Redis Cluster for the AI chat feature in ~7 business days, including an upstream patch to our charts repo. To facilitate future provisioning efforts, the findings from this epic are summarised into a [provisioning guide](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/redis/provisioning-redis-cluster.md). It is ready and being actively referenced for &979. All issues closed. 
Epic can be closed as part of Grand Review.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/932+ **2022-03-29**: After https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2245, all our Redis services are on level 3 (except for redis-registry-cache, which needs more product-side documentation).<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/937+ **2023-03-20**: This epic is used to collect all projects that have moved to a ~"workflow-infra::Cancelled" state. This helps reduce the number of epics directly owned by https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/148, thus delaying hitting the [limit](https://docs.gitlab.com/ee/user/group/epics/manage_epics.html#multi-level-child-epics). This epic is `closed` so that it appears in the [:x: Cancelled](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/148#x-cancelled) section of the parent epic.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/918+ **2023-05-02**: The problems with these metrics were caused by increased cardinality from the introduction of a new service that would serve Rails traffic. So the first step in this project was to reduce the overall cardinality so that the metrics for stage group error budgets would correctly aggregate again. Then we improved part of the implementation of how these metrics are recorded by separating them by environment. This allowed us to change the evaluation strategy from `abort` to `warn`, meaning rule evaluation would fail and we'd be notified, instead of it silently being incorrect. While we built the method to do this, we also applied it to other autogenerated recording rules in Thanos. We also improved the SLIs for rule evaluation so that we are notified when this happens again, instead of being notified by users noticing incorrect data. We've decided to cancel the addition of a new saturation point that would try to predict when a recording rule has too many high-cardinality metrics. 
We did this because we noticed that our new SLI (https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2204) would alert us of problems, and because there's more work planned with regard to capacity planning in Kubernetes (https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2271#note_1369996867) that would likely improve things for Thanos & Prometheus running there as well. This project is done: error budgets in the last 2 reports have been reliable, and we put SLIs in place to alert us when recordings start to fail again.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/877+ **2023-02-21**: We shifted the load balancing workload to the new redis-db-load-balancing shard in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8393 with no major hiccups. As a result, the Hard Threshold and Saturation date predictions were changed to `No forecast`, with CPU saturation now peaking at ~60%. See https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/877#note_1281868605 for more data on the impact.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/899+ **2023-03-08**: @cfeick: The epic was completed by its due date. Below is a summary of impact: - **Service Ownership**: Established and assigned ownership for every service in the service catalog. Collaborated w/ newly formed Reliability teams to set expectations for service owners. Unlocked future opportunities for providing value to service owners, e.g. auto-assigning capacity planning issues, service maturity model, etc. - **Service Catalog Validation**: Used JSON Schema to validate the service catalog. Added validations for every field in the service catalog. Consolidated duplicate validation efforts. - **Service Catalog Documentation**: Used JSON Schema to annotate the service catalog. Added descriptions for every field in the service catalog. Added documentation on service catalog tooling. 
- **Service Label Automation**: Autogenerated scoped service labels for every service in the service catalog. Updated runbooks documentation to use service labels. - **Service Catalog Tech Debt**: Removed 2,000K lines of unused fields and inaccurate data from the service catalog. Simplified the service catalog format to reduce cognitive overhead. Prevented future proliferation of unstructured fields with validation.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/897+ **2023-03-09**: This is closed in favour of https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/928+ and https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/929+, which are there to capture the actual work we're going to do.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/896+ **2023-05-31**: The related Reliability issue has been completed. I am waiting to confirm if the change issues can also be closed out. ([comment](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17075#note_1414432549))<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/886+ **2023-03-02**: We have achieved our first goal &887 of removing the need for Omnibus and Charts changes. The most recent functional sharding exercise &877 already used the work of &887. Overall, &877 went quite fast and we feel we don't need to invest more in making the process faster at this time.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/875+ **2023-01-30**: @cfeick: The epic was completed by its due date. Below is a summary of impact: - **Scalability Facilitation**: Capacity planning issues generated by Alertmanager lacked the ability to deduplicate alerts and provide metadata. Results include functionality to [programmatically manage issues via Tamland](https://gitlab.com/gitlab-com/gl-infra/tamland#architecture), removal of duplicate issues, and unlocked opportunities for future enhancements, such as applying service labels (e.g. `~service::Foo`), assigning issues to engineers/teams, and more. 
- **Reduced Toil**: Capacity planning issues require investigations by engineers on rotation. Results include issues with rich metadata to provide additional context, such as [saturation forecast labels](https://about.gitlab.com/handbook/engineering/infrastructure/capacity-planning/#saturation-labels), violation forecast labels, improved description format w/ dates/images, and issues linked by service. Initial feedback from 2 engineers suggests these changes are a productivity boost. - **Prioritization Framework**: Capacity planning is a nuanced activity that depends on factors like scalable resources (e.g. `non_horizontal` vs. `horizontal`) and saturation forecast threshold (e.g. `100% saturation` vs. `hard SLO violation`). Results include an initial iteration of a project board based on the [Eisenhower matrix](https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/boards/5273449), functionality to sort issues by [prioritized label](https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/?sort=label_priority&state=opened), and [documentation](https://about.gitlab.com/handbook/engineering/infrastructure/capacity-planning/#prioritization-framework) based on discussion.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/857+ **2023-02-17**: All cleanup tasks, including documentation, have been completed. We gained nearly 30% CPU for redis-cache as part of this effort. [We split redis-repository-cache from redis-cache on 2023-01-31](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/860).<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/823+ **2023-04-27**: This epic is complete! This week we finished the wrap-up tasks (decommissioning old VMs, tidying the configs, updating docs, etc.). As previously noted, the cluster has been healthy and stable since rollout. However, the apdex metric is currently inaccurate. It shows a much worse value than the real one. 
We are taking that as a follow-up task outside of this epic, since it appears to be a general problem affecting several apdexes. More details below. We incidentally discovered what appears to be a defect in the apdex metric. This defect makes the apdex appear erratic when it is really much more stable and healthy. For example, yesterday's apdex jitter ranged between 98.4% (bad) and 102% (nonsense). In contrast, the correct apdex calculation -- which we get by recomputing the apdex from its raw underlying metrics (Redis request durations) -- shows at worst four nines: 99.997% of requests were fast enough to meet the apdex thresholds. We will continue investigating this measurement error outside of this epic (in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2319), since we think it probably affects other apdex calculations too.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/821+ **2023-03-03**: This project is complete: ops.gitlab.net is used solely for querying Thanos and populating the cache. All use of that cache, including generating the website and managing capacity planning issues, happens in a GitLab.com pipeline.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/811+ **2023-04-25**: Completed. All label drift has been remediated in production and we have a check in CI (using Checkov) that will confirm that gl_product_category is set on all applicable resources going forward.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/756+ **2023-04-05**: This epic is completed. We have added table size, cost, and bloat ratio to the stage group index subpages, along with documentation on the choices made. We have also gotten feedback from the stage groups and will use that to help prioritize the next tasks. 
[You can see an example of the full results on the source code page.](https://gitlab-com.gitlab.io/gl-infra/platform/stage-groups-index/source-code.html)<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/743+ **2023-03-23**: Due to a connection pool saturation issue (@smcgivern wrote a great explanation about it in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2107#note_1308936990), it's not possible to tighten the request urgency for the endpoint [Repositories::GitHttpController#info_refs](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2107). For this reason, I'm setting a new due date to investigate that endpoint in May, by when the connection pool saturation should be resolved; with that done, we can close this epic.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/663+ **2023-05-04**: The stage groups now have [a detailed dashboard](https://dashboards.gitlab.net/d/general-rails-endpoints-violations/general-rails-endpoints-violations?orgId=1) to inspect the error budget ratios, both overall and per endpoint. It will be announced in [the error budget report](https://gitlab.com/gitlab-org/error-budget-reports/-/issues/23) \o/<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/596+ **2023-05-02**: We have wrapped up the project with the [last MR merged](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111675) to [reintroduce routing rules by default](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1491) for self-managed instances in 16.0. The queue selectors were deprecated in 15.9, with removal in 17.0. Removal progress will be tracked in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2220.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/566+ **2023-02-02**: We rolled out the new backup script on production in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8273 with no hiccups. 
The rollout yielded the benefits we expected: we saw the periodic Apdex dips disappear right as we rolled out the change, which correlated with a decrease in load per core on that cluster. The end result was that we stopped seeing the lump of requests taking exactly 5 seconds that was caused by the resource saturation that RDB snapshotting incurred. For the full post-evaluation, see https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1254. To close the epic, we updated our documentation, metrics and alerting to aid EOCs in case of failure in the newly introduced snapshotting scripts.<br/> | | [Scalability Completed Epics 2024 - Practices](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1265) <br/> ~"team::Scalability-Practices" | 2023-08-07 | 2024-03-11 | <br/><br/>**Nested Epics: 12**<br/><br/>• https://gitlab.com/groups/gitlab-com/gl-infra/platform/runway/-/epics/1+ **2024-05-03**: **Closing summary** The project milestones have been marked as complete and all issues have been marked as done, so this **epic is complete**. During Grand Review, please close the epic. **Problem statement** The Pipeline Validation Service (PVS) is a Cloud Run service that was deployed through `gcloud` commands in `.gitlab-ci.yml` (with CI/CD variables for secrets) to a single environment (production) with no SLI/SLO monitoring. **Project Summary** By onboarding PVS onto Runway, we address the above-mentioned shortcomings: the service is now deployed to two environments (staging & production), with improved monitoring & alerting, tightened security (an internal-only facing service plus the use of Vault for secrets), and better documentation by leveraging the standard [Runway docs](https://runway.gitlab.com). 
**Project impact** | Before | After | | ------ | ------ | | Support for a single environment (production) | Deployments promoted through staging then production | | Single tag for images | Proper revision tagging for all container images | | No active SLIs/SLOs | More metrics, dashboards, and alerts | | Minimal saturation monitoring | Capacity planning for all Cloud Run resources | | Minimal documentation on infra | Leverage [Runway documentation](https://runway.gitlab.com) | | GitLab CI variables for secrets | Runway uses Vault as a secrets engine | | Publicly exposed load balancer | Internal-only facing load balancer accessed via VPC peering | As part of this project, we established VPC peering between GitLab Staging & Production GCP projects to Runway, and we added support for deploying an internal load balancer, so any Runway service can now deploy an internal load balancer reachable from the Rails side over private network paths. **Project final status update** Thank you @cfeick for the collaboration, @kwanyangu for your support and the Govern:Anti-Abuse team for working together to get this over the line. Thanks also go out to several SREs who helped with reviewing MRs & providing feedback.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1105+ **2024-06-12**: Project was completed with a spillover into Q2, and KR progress was marked as 100% completed. Below is the closing summary:<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1222+ **2024-05-01**: The project has been completed with a KR completion ratio of 100%. Closing summary below. **Problem description** Before this work, Runway had numerous rough edges, such as manual tasks during the deployment process, that were error prone and slowed down development. A secondary goal was to act as @fforster's starter project. 
**Changes overview** In this quarter, we implemented various improvements: we moved CI templates into the `runwayctl` repository and turned down the `ci-tasks` repo, removing a manual deployment step and allowing for atomic changes to Runway. Several static checks, e.g. *yamllint* and *CI lint*, and dynamic checks, e.g. dry-run deployments, have been added to the merge request pipelines, reducing mistakes and allowing us to move faster. Ultimately, the testing improvements allowed for a timely delivery of multi-region support (gitlab-com/gl-infra&1206). **What's next** We aim to improve the deployment process to deploy region-by-region, with proper canarying and automatic roll-backs on error. This will also allow for per-region configuration, which is a feature that has been requested by our users. **Thanks** Shout out to @igorwwwwwwwwwwwwwwwwwwww for being a great onboarding buddy, and @cfeick for his many reviews and great recommendations! :tada: Grand Reviewers, please do the honors of closing this epic.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1220+ **2024-05-02**: The project has been completed by the estimated due date of April 30th, 2024. The KR progress has achieved 100% completion. Below is the closing summary. **Project problem**: Sidekiq performs the bulk of the database-heavy workload and is a source of pressure for patroni and pgbouncer. Any inefficient Sidekiq workload that involves holding database transactions for an extended period of time could cascade into a severe incident like https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17504. **Project summary**: The epic is completed. GitLab Rails now emits application SLI metrics, which are tracked in the patroni and patroni-ci dashboards. The SLIs do not contribute to the patroni/patroni-ci apdex at the moment, but are viewable on the [service dashboards](https://dashboards.gitlab.net/d/patroni-main/patroni3a-overview?orgId=1&viewPanel=1142257756&from=now-7d&to=now/m). 
The application SLIs can be broken down by feature category and worker, enabling stage groups to narrow down on workloads that have a higher tendency to violate the threshold for database transactions. This is not the final outcome, as we intend to follow up on this as described below. **Follow-ups**: Fine-tuning the SLI to incorporate it into patroni's service apdex and stage-group attribution in error budgets will be covered in another [epic](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/951) when we have collected more data (looking at weeks/months to cross-reference incidents, if any, to examine how apdex/error budgets would have differed). See https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1220#note_1814298341.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1218+ **2024-05-15**: The project has been completed by the estimated due date of May 15th, 2024. The KR progress has achieved 100% completion. **This project can be marked as completed and closed during the grand review.** Below is the closing summary.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1206+ **2024-05-01**: **The project has been completed by the estimated due date of April 30th, 2024. The KR progress has achieved 100% completion. Below is the closing summary.** **Project problem** Previously, Runway services were deployed to a single region, `us-east1`. The problems with single-region deployments are latency and availability. The opportunities of multi-region deployments are decreasing latency by serving traffic from the nearest regional backend, and increasing availability by supporting region failover during outages. **Project summary** Runway Multi-Region is now Generally Available (GA) and meets production-readiness standards for scalability and observability. Runway provides functionality to provision and serve traffic from multiple regions in GCP Cloud Run. Available locations include up to 40 regions closest to customers. 
**Project impact** Runway has delivered a stable, observable, and scalable long-term platform solution that powers multiple services deployed on the paved road in SaaS Platforms. To reiterate: the results are entirely automated and reusable outside the context of the AI Gateway, including the upcoming Cells Topology service. In fact, Runway’s documentation is itself a [multi-region deployment](https://gitlab.com/gitlab-com/gl-infra/platform/runway/docs/-/blob/master/.runway/runway.yml?ref_type=heads#L11-13) used for dogfooding. By rolling out the `us-east4`, `europe-west2`, and `asia-northeast3` regions for the AI Gateway, Runway has delivered the [first multi-region service in production](https://console.cloud.google.com/run?project=gitlab-runway-production&pageState=(%22cloudRunServicesTable%22:(%22f%22:%22%255B%257B_22k_22_3A_22Name_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22ai-gateway_5C_22_22_2C_22s_22_3Atrue_2C_22i_22_3A_22name_22%257D%255D%22))) at GitLab. We've enabled AI Gateway maintainers to self-serve additional regions as the service grows, without infrastructure being a bottleneck. As the Stable Counterpart for the AI Gateway, we’ve also updated the AI Gateway to route Vertex AI requests to the nearest region based on the Cloud Run region. As a result, the AI Gateway has full end-to-end multi-region support for both Cloud Run and Vertex AI. **Project artifacts** To self-serve multi-region deployments, refer to the documentation: https://docs.runway.gitlab.com/guides/multi-region/. 
The platform experience is as simple as adding a few lines to the `runway.yml` [service manifest](https://gitlab-com.gitlab.io/gl-infra/platform/runway/runwayctl/manifest.schema.html#spec_regions), which will configure a [global load balancer](https://dashboards.gitlab.net/d/ai-gateway-main/ai-gateway3a-overview?orgId=1&viewPanel=94) to route HTTP requests to the nearest [regional service](https://dashboards.gitlab.net/d/ai-gateway-regional/ai-gateway3a-regional-detail?from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=PA258B30F88C30650&var-environment=gprd&orgId=1&viewPanel=3897771623). **Project final status update** The project milestones and exit criteria have been marked as complete. During Grand Review, please close the epic. In the spirit of iteration, GA is just the beginning and we will continue to make investments for service owners by improving and expanding platform capabilities. Thank you to all participants for your contributions. A project of such complexity and urgency could not have achieved customer results so quickly without collaboration and transparency from many team members across the Scalability group, SaaS Platforms, Test Platforms, and the AI-powered stage.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1236+ **2024-04-11**: All clean-up is done. ~"Service::RedisRepositoryCache" has been removed and **this epic is ready to be marked as completed and closed**. To summarise: ~"Service::RedisClusterRepoCache" is now serving repository-cache related traffic ([dashboard](https://dashboards.gitlab.net/d/redis-cluster-repo-cache-main/redis-cluster-repo-cache3a-overview?orgId=1&var-PROMETHEUS_DS=PA258B30F88C30650&var-environment=gprd&var-shard=All&from=now-6h&to=now)). The repository cache workload is now horizontally scalable.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/944+ **2024-01-10**: **Done** The charts MR is merged (https://gitlab.com/gitlab-org/charts/gitlab/-/merge_requests/3479#note_1701872604). 
To summarise the work done in this epic: we have deprecated the use of `redis-namespace` in Sidekiq for gitlab-rails only (note that mailroom still uses `redis-namespace` for arbitration). We have also released the deprecation successfully on Omnibus and Charts to allow zero-downtime upgrades from 16.4 to 16.8. This unblocks the Sidekiq 7 upgrade (&941) and horizontally scaling Sidekiq (https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2541). This epic can be marked ~"workflow-infra::Done" and closed.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1055+ **2024-01-31**: [OKR Progress:](https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/3492) 100% The application and runbook clean-up has been completed (https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2651). We are ready to mark this epic as ~"workflow-infra::Done" and close it. **Results** We now have ~"Service::RedisClusterSharedState" serving the shared state workload from the monolith, replacing Redis persistent or ~"Service::Redis". It serves as a "catchall" instance from the application standpoint, with the most varied workload and the most demanding persistence/consistency requirements compared to cache and rate-limiting. 
With this, about ~50% of Redis traffic (requests per second) in GitLab Rails is now served by a Redis Cluster ([source](https://thanos-query.ops.gitlab.net/graph?g0.expr=sum(rate(gitlab_redis_client_requests_total%7Benv%3D%22gprd%22,%20storage%3D~%22cache%7Cchat%7Cfeature_flag%7Cqueues_metadata%7Crate_limiting%7Cshared_state%22%7D%5B10m%5D))%20/%20sum(rate(gitlab_redis_client_requests_total%7Benv%3D%22gprd%22%7D%5B10m%5D))&g0.tab=0&g0.stacked=0&g0.range_input=2d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g1.expr=sum(rate(redis_commands_total%7Benv%3D%22gprd%22,%20type%3D~%22redis-cluster.*%22%7D%5B10m%5D))%20/%20sum(rate(redis_commands_total%7Benv%3D%22gprd%22%7D%5B10m%5D))&g1.tab=0&g1.stacked=0&g1.range_input=2d&g1.max_source_resolution=0s&g1.deduplicate=1&g1.partial_response=0&g1.store_matches=%5B%5D)) and is horizontally scalable when CPU and memory saturation is forecast (pending https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1105). The current state of ~"Service::Redis": it contains the KAS and buffered counter workloads. There is an epic, https://gitlab.com/groups/gitlab-org/-/epics/9412, for providing KAS with its own Redis. **Follow-up work by Scalability** As explained in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1055#note_1691838229, the buffered counters workload still remains on ~"Service::Redis". Work to migrate it will commence separately and be tracked in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2688. Once the KAS and buffered counter workloads are migrated out of ~"Service::Redis", we can decommission it.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/941+ **2024-03-06**: The `redis` gem upgrade has been successfully deployed in a CR (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17640). This epic is considered done. 
We have upgraded both `sidekiq` and `redis` gems to the latest major version, unblocking scaling efforts like [horizontally scaling sidekiq](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1218) and [scaling out Redis Cluster](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1105). Having waited 1 week to observe the upgrade in production, we can proceed to close this epic as part of the grand review.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/platform/runway/-/epics/2+ **2024-01-16**: Epic is ~"workflow-infra::Cancelled". For context, refer to closing comment: https://gitlab.com/groups/gitlab-com/gl-infra/platform/runway/-/epics/2#note_1729731018.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1200+ **2024-01-31**: The epic was completed, and all the goals were achieved in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1200#note_1755128908. **Results** Successfully enhanced security and traceability for CustomerDot database and Rails access. [Teleport was deployed in staging and production](https://gitlab.com/gitlab-org/customers-gitlab-com/-/blob/main/doc/setup/teleport.md) to ensure all direct interactions with the database are attributable to specific users. A [process was established to link all database changes to corresponding change issues](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/customersdot/overview.md), ensuring full traceability. An emergency change process was also implemented, allowing for retroactive approvals and documentation of changes made outside the Merge Request (MR) process. Additionally, proactive monitoring was set up to detect unauthorized database activities, with audit logs directed to the SIRT team. 
SSH access to CustomerDot VMs is blocked for all users except for SREs.<br/> | | [Scalability Completed Epics 2024 - Observability](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1266) <br/> ~"group::Observability" | 2023-09-14 | 2024-03-11 | <br/><br/>**Nested Epics: 6**<br/><br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1107+ **2024-05-24**: Mimir is now the production source of metrics. All human-facing metrics have been moved to Mimir with minimal impact. We have a few automated systems that we will be handling as part of the epic to decommission the Thanos environment (along with the other ones we will likely find as part of that work). We have achieved 4 digits of precision and are no longer suffering from metrics gaps. [Graphs and details above.](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1107#project-impact----technical) The walkthrough video is available on YouTube: https://youtu.be/q0oXICKiNXg **Grand Reviewers, this epic is ready to be CLOSED!** :tada: **Shout Outs** To Andrew for creating all of this in the first place and giving us a base to work from! To Bob and Matt for starting this project with me a year ago when we thought it was just going to be refactoring apdex calculations, and again to Bob for sharing all his knowledge with Marco and Hercules so that we're better able to support this system. To the entire Scalability:Observability team for persevering through the longest project that many of us have ever seen, with twists and turns that no one expected, while we kept alive a metrics system that was on its last breath. Every single person on the team touched this project in some fashion or another. Particular credit to Nick, who started this project in Reliability:Observability, and to Marco and Hercules, who took on a mountain of jsonnet tech debt and whittled it down to a molehill. To the Runway and Observability teams, who showed how much easier it is to onboard to the new system. 
And to the entire Scalability and Platform leadership team for allowing us the time and space to finish this off correctly and get our metrics into a good place to build upon for the future. **The original problem** We discovered in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1012 that the metrics stack consisting of Thanos and Prometheus was suffering from a large number of scale-related issues. These issues manifested in three main ways: metrics gaps when recording rules fail, an increased number of metrics incidents, and increased management workload for the Observability teams. We attempted to improve our existing environment by [deploying sharded Thanos ruler and Prometheus remote write](https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/24590), but in September we discovered [significant bugs](https://github.com/prometheus/prometheus/issues/7912) and [performance problems](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24515#note_1599511675) with those systems that made it worthwhile to investigate a different solution. **A brief description of the changes made** We [deployed Mimir](https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/24606) as our new global view for GitLab.com metrics. Mimir is an open-source, horizontally scalable, highly available, multi-tenant TSDB for long-term storage of Prometheus metrics. As part of this work, we also [refactored how we write metrics into our systems](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1232), migrating from a mix of Prometheus instances and Chef relabelling to a Kubernetes-based system with Prometheus agents and Terraform labels. We also [refactored the recording rules within the metrics system](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1233) to make them all run within one environment, run faster, and have less impact on the system. This work had a huge technical and team impact. 
[More details on the outcome can be found above](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1107#outcome).<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1230+ **2024-04-24**: We've used the [data gathered](https://gitlab.com/gitlab-com/gl-infra/capacity-planning-trackers/gitlab-com/-/issues/1768#precision-and-rated-per-week) here to inform project plans for improving forecasting overall. Quality-wise, a short summary is that "it's fine except for known-bad cases". The known-bad cases are typically systemic, and if we improve conceptually, we will be able to provide better forecasts for these components. Many components get good and useful forecasts, and this does in fact drive early detection and prevention of issues on GitLab.com (examples [one](https://gitlab.com/gitlab-com/gl-infra/capacity-planning-trackers/gitlab-com/-/issues/1668#note_1807225359), [two](https://gitlab.com/gitlab-com/gl-infra/capacity-planning-trackers/gitlab-com/-/issues/1712), and [three](https://gitlab.com/gitlab-com/gl-infra/capacity-planning-trackers/gitlab-com/-/issues/1767), all in the last few months). This leads us to the following work items, which we're also discussing in the context of [upcoming OKRs and projects](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3406): * https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1287+ * https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/946+ In areas with good forecast quality, we are considering involving Service Owners as the primary owners of capacity warnings; see https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1289+. 
In areas where we still need to improve forecasting conceptually, we may need to delay this and continue to have Scalability review the capacity warnings produced (or disable forecasting for these components entirely where it doesn't make sense; cf. the [elevator pitch in this epic](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1287#elevator-pitch)).<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1237+ **2024-04-17**: The last piece of work for https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3403+ has been merged. This epic can be closed with ~"workflow-infra::Done" applied.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/928+ <br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1167+ **2024-02-07**: I am closing this epic. Many of the initial questions raised by this project have been answered. Additionally, ~"group::scalability"'s involvement with Cells has been further formalized since this issue was created. As an example, we have an OKR to create an Observability Blueprint for Cells: https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/6071.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1164+ **2023-12-06**: Rolling this into the larger epic https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1107, which addresses both the infrastructure work to roll out Mimir and the runbooks effort to move the recording rules from Prometheus to Mimir.<br/> | | [Counterpart: Infrastructure Support for Product Analytics](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1284) <br/> ~"team::Scalability-Practices" | 2024-04-01 | 2024-09-02 | **2024-08-29**: <br>The project has been completed. 
The counterpart team has received infrastructure, resources, and monitoring, and we are now awaiting full adoption.<br><br>Below is the closing summary.<br><br> | | [Counterpart: Implement immutable audit trail for SRE support on CustomersDot](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1261) <br/> ~"team::Scalability-Practices" | 2024-05-01 | 2024-08-09 | **2024-08-09**: <br>- All issues are done.<br>- `GitLab: CDOT Infra (GCP) ITGC FY25 SOX Walkthrough` scheduled for 2024-08-27<br>- Project can be closed<br> | | [Improve the Runway deployment process](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1329) <br/> ~"team::Scalability-Practices" | 2024-05-01 | 2024-09-05 | **2024-09-04**: <br>This project has been completed and can be closed during the Grand Review. A big shout out to @marcogreg who came in as a borrow from ~"team::Scalability-Observability" and helped ship awesome results :rocket:<br><br>Below is a summary of the results<br><br> | | [Decommission old metrics infrastructure](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1344) <br/> ~"group::Observability" | 2024-05-27 | 2024-07-26 | **2024-07-26**: <br>Closing status: The Thanos/Prometheus stack has now been decommissioned. A summary of cost savings can be found in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3679+. This can be marked as ~"workflow-infra::Done" and closed as part of the Grand Review.<br><br> | | [Separate environments for staging and production AI Infrastructure](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1332) <br/> ~"team::Scalability-Practices" | 2024-05-27 | 2024-12-13 | **2024-12-11**: <br>All remaining issues have been closed out. [Documentation has been updated](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/merge_requests/1741). 
This epic is complete.<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1332#note_2251899591_<br><!-- STATUS NOTE END --><br> | | [Expand Platform Engineering to more runtimes](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1330) <br/> ~"team::Scalability-Practices" | 2024-06-05 | 2024-08-09 | **2024-08-07**: <br>**🤔 Problem description:**<br><br>Figure out how GitLab's internal developer platform, Runway, can be evolved to meet the needs of self-managed customers, and build a prototype.<br><br>Runway is an internal developer platform allowing teams at GitLab to easily run services on GCP's Cloud Run. However, some large self-managed customers are unable to use SaaS-style services, e.g. due to regulatory requirements, and are asking to run AI Gateway themselves. The goal is to create a sharable artifact, e.g. a Helm chart, while maintaining the convenience and safety for deployments operated by GitLab.<br><br>**🏗️ What we did:**<br><br>* Turned up a GitLab instance "by hand", including runners on a Kubernetes cluster and the GitLab agent for Kubernetes, to better understand the end-user perspective.<br>* Published [Runway for Satellite Services Vision](https://docs.runway.gitlab.com/reference/blueprints/satellite-services-vision/), outlining our long term vision and the value provided by increments on the path towards that vision.<br>* Published [Runway for GKE Blueprint](https://docs.google.com/document/d/1Q25BxXJU1CQaez8sl0emcMKgwMgYKSC7km6YC4wVklU/edit?usp=sharing), outlining an implementation of the first iteration described in the vision document.<br>* Built a prototype using the self-managed GitLab test instance and GKE cluster, and Flux.<br>* Contributed some [Flux fixes](https://github.com/fluxcd/source-controller/pull/1529) upstream.<br>* Added common CI tasks, allowing Runway services to easily generate a Helm chart. 
The generated Helm chart is based on `runway-base`, providing the common structure for all Runway-managed services.<br>* Added deployment CD jobs for generating Kubernetes resource manifests from the Helm chart and uploading them to an OCI repository.<br>* Created shared Runway GKE clusters using the Runway Provisioner.<br><br>**💡 Findings:**<br><br>* There is ample appetite in GitLab for a platform giving teams the option to run services instead of adding to the monolith.<br>* The KISS principle requires a conscious effort in this space. Complexity creeps in with every step.<br>* Helm templates are horrible to debug.<br>* Helm charts don't handle custom resource definitions well.<br>* The multi-level kustomization scheme proposed by https://github.com/fluxcd/flux2-multi-tenancy is impossible to reason about, making it unworkable in practice.<br><br>**🔜 Next steps:** (see gitlab-com/gl-infra/platform/runway&7)<br><br>* Identify an (internal) customer with which to build a pilot.<br>* Implement the identified gaps to deploy the "example service" to GKE via Runway.<br>* Identify and close gaps to unblock adoption by the pilot customer.<br>* Time permitting, work on the first GA features (progressive deployment, monitoring, automatic rollbacks, …)<br><br>**:tanuki-colored-heart: Thanks**<br><br>* @schin1 for the many, many code reviews and great questions<br>* @pguinoiseau for his help in creating the shared Runway GKE clusters<br>* @f_santos for sharing his insight, particularly on Helm releases, Flux, and their shortcomings<br>* @andrewn for his guidance and encouragement<br>* @swiskow for the product-side support<br>* @igorwwwwwwwwwwwwwwwwwwww and @gsgl for their input on the blueprint<br><br>**:checkered_flag: Grand review:** please do the honors of closing this epic.<br><br> | | [Runway Maintainership Program](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1331) <br/> ~"team::Scalability-Practices" | 2024-06-12 | 2024-08-01 | **2024-07-31**: <br>Previously, Runway was 
created by various engineers through the borrow process and lacked formal maintainers.<br><br>In this project, we defined [Runway maintainers](https://handbook.gitlab.com/handbook/engineering/projects/#runway-reconciler) for engineering projects, configured [reviewer roulette](https://gitlab-org.gitlab.io/gitlab-roulette/?currentProject=runway-reconciler) for code reviews, created an [issue template](https://gitlab.com/gitlab-com/gl-infra/platform/runway/team/-/issues/new?issuable_template=maintainer) for accelerated onboarding, and documented the process for [maintainership](https://docs.runway.gitlab.com/team/maintainership/).<br><br>The impact of this project is that anyone can now contribute to Runway and become a maintainer: https://gitlab.com/gitlab-com/gl-infra/platform/runway/runwayctl#how-to-become-a-project-maintainer.<br> | | [Dynamic dimensional expansion of saturation forecasts for capacity planning purposes](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1287) <br/> ~"group::Observability" | 2024-06-19 | 2024-11-01 | **2024-10-29**: <br>:white_check_mark: Project is done! A note with [closing remarks](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1287#note_2179498636) has been added with more details of upcoming work and follow-ups.<br><br>See you on the next project :wave:<br> | | [Create technical blueprint for logging](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1334) <br/> ~"group::Observability" | 2024-06-24 | 2024-08-09 | **2024-07-31**: <br>:tada: Grand reviewers, this epic can be closed!<br><br>The logging blueprint has been merged and can be found in the handbook at https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/logging/. 
We'll be working on fleshing out the next steps and prioritizing them.<br> | | [Runway: Jobs](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1352) <br/> ~"team::Scalability-Practices" | 2024-07-01 | 2024-09-27 | **2024-09-25**: <br> | | [Measure cost of metrics stack](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1341) <br/> ~"group::Observability" | 2024-07-01 | 2024-09-05 | **2024-09-04**: <br>Grand Reviewers, the last part of this epic [has been completed](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3722). This epic can be moved to ~"workflow-infra::Done" and closed. :tada:<br><br> | | [Audit production usage of Loki](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1363) <br/> ~"group::Observability" | 2024-07-08 | 2024-07-29 | **2024-07-25**: <br>The architecture diagram can be found at https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/loki?ref_type=heads#architecture. Initial findings can be viewed at https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1363#note_1994227605 (same as `Status 2024-07-17`). 
This can be marked as ~"workflow-infra::Done" and closed as part of the Grand Review.<br><br> | | [Decouple Runway's deployment dependencies from gitlab.com](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1348) <br/> ~"team::Scalability-Practices" | 2024-08-01 | 2024-09-12 | **2024-09-11**: <br>Runway documentation is updated with onboarding [guide](https://runway-docs-4jdf82.runway.gitlab.net/guides/deploying-from-ops/) and [reference](https://runway-docs-4jdf82.runway.gitlab.net/reference/deploying-from-ops/) docs.<br><br>**Runway now supports availability-critical services by being able to fully decouple from gitlab.com in the event of an outage and safely deploy from ops.gitlab.net.** Day-to-day developer experience is unhindered, as developers can continue to create merge request pipelines and view dry-run jobs on gitlab.com without needing ops access.<br><br>As discussed with `@daveyleach`, we will onboard topology-service at a [more appropriate phase](https://gitlab.com/gitlab-com/gl-infra/platform/runway/team/-/issues/356#note_2097476983) since it is not needed at the experimental stage. I suspect there will be cruft/teething issues when we do migrate topology-service to ops.gitlab.net, but there should be nothing major since the critical blockers like canonical-to-ops push mirror behaviours, canonical dry-runs, and actually deploying from ops.gitlab.net are in place.<br><br>This epic is ready to be marked as done and closed.<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1348#note_2099074450_<br><!-- STATUS NOTE END --><br> | | [Address expensive unused metrics to optimize Mimir costs](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1362) <br/> ~"group::Observability" | 2024-08-01 | 2024-10-25 | **2024-10-23**: <br>[Closing Summary](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1362#note_2170394926)<br><br>This is ready to be closed post Grand Review. 
(Not sure if it needs to be left open for it to show up there.)<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1362#note_2172271372_<br><!-- STATUS NOTE END --><br><br> | | [Counterpart: AI Framework - Runway Contributions to Duo Workflow Service](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1390) <br/> ~"team::Scalability-Practices" | 2024-08-12 | 2024-08-22 | **2024-08-21**: <br>Duo Workflow has been deployed to both staging & production. We're still waiting for the [OKTA Group](https://gitlab.com/gitlab-com/business-technology/change-management/-/issues/1173) to be created so that secrets can be managed by the service owners, but it's not a blocker.<br><br>Grand Reviewers, please close this epic.<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1390#note_2064383369_<br><!-- STATUS NOTE END --><br> | | [Stabilize Sentry](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1401) <br/> ~"group::Observability" | 2024-09-16 | 2024-11-08 | **2024-11-06**: <br>This is ready to be closed in the Grand Review; you can read the closing status [below](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1401#note_2199546148).<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1401#note_2195591892_<br><!-- STATUS NOTE END --><br> | | [[Scalability::Practices] Improve availability of Sidekiq to >99.95%](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1219) <br/> ~"group::scalability" | | 2024-06-07 | **2024-06-04**: <br>The project has been completed and can be closed during the Grand Review.<br><br>Below is the closing summary.<br><br> | </details> ### :x: Cancelled Epics that were cancelled <details> | **Topic** | **Ended** | **Summary** | |-----------| ----------| ------------ | | [[POC] Improved Log shipping experience through Loki](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1037) | 2024-06-25 | **2023-02-07**: <br>~"workflow-infra::Stalled"<br><br>Define our near-term 
direction in this issue: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2761<br><br>We have agreed that we should pause the Loki effort until we have a better understanding of overall team priorities. There are lots of moving parts at the moment, and it seems our biggest immediate focus should be on improving the stability of our monitoring stack. It doesn't make sense to have just one person working on the Loki effort, nor can we justify ramping up investment right now.<br><br>That doesn't mean we won't come back to it, but we need to better understand how it fits in longer term; this will in part be answered by our FY25 plan https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2672.<br> | | [Sidekiq and Database Performance Optimization](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1395) | 2024-11-12 | **2024-10-23**: <br>- [Separate pgbouncer pools](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1682) has been [rolled out to production on 2024-10-14](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18674), so we now have separate pgbouncer pools for urgent vs. non-urgent workloads.<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1395#note_2172271521_<br><!-- STATUS NOTE END --><br> | </details> ### :rotating_light: Epics that need attention These linked epics are not in the correct state or are missing a workflow label <details> | **Topic** | **Links** | **Reason** | |-----------|-----------|-------------| | [Practices::Runners(SaaS) No Components out of date](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1121) <br/> @rehab (+0) <br/> team::Scalability-Practices | Epic has ~"workflow-infra::Stalled" but is closed | | [Migrate actioncable workload into a dedicated Redis instance](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1302) <br/> (+0) <br/> team::Scalability-Practices | Epic has ~"workflow-infra::Proposal" but is closed | | [Roadmap for 
Observability](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1241) <br/> (+0) <br/> group::Observability | Labeling problem: epic has no workflow label | | [[DRAFT] Scalability deliverables for Cells 1.0 GA](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1198) <br/> (+0) <br/> group::scalability | Epic has ~"workflow-infra::Proposal" but is closed | </details> ## Open Work outside of Scalability Issue Tracker Because issues are not accessible to epics/boards in different groups, the epics and issue boards below are used to track work that the Scalability team creates outside its own tracker. - Board showing all gitlab-org issues: https://gitlab.com/groups/gitlab-org/-/boards/1477989 - Summary of issues outside of the Scalability Issue Tracker: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/430