Production Engineering Ops
:mega: **This epic is no longer in use following [changes to the group team structure](https://docs.google.com/document/d/1Mi45xQnKOgkD47WLnz4JUMFZOBeb32zxDuk4OfUiSDQ/edit?usp=sharing). For Incident Management and Disaster Recovery projects and updates, please see https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1676+.**

## Overview

<!-- STATUS BEGIN -->

This epic is the high-level epic for all project work across the Ops team.

## Project Work

### :white_check_mark: Completed Work

Items that have been completed

<details>

| **Topic** | **Started** | **Ended** | **Summary** |
|-----------| ------------| ----------| ------------ |
| [FY25 Zonal outage resilience](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1250) <br/> ~"group::Networking & Incident Management" | 2024-02-27 | 2024-07-31 | **2024-07-24**: Final Status Update:<br><br>1. **Original Problem**: The GitLab.com DR plans for recovery from a zonal outage were unknown, undocumented, and untested.<br>2. **Changes Made**: Game days were used as a driving force to build up processes and test them in staging. Measurements were taken to create a framework to plan responses and measure our improvements per component.<br>3. **Impact**: Several components (Patroni, PGBouncer, and HAProxy) were moved from `No Confidence` to `Low Confidence` and some (Gitaly and Regional Clusters) were moved to `Medium Confidence`. [Reference Spreadsheet](https://docs.google.com/spreadsheets/d/16AVXetqTae2eTarJIg9CGJkvRrsz3Fh9RdFZ-0b48nY/edit?usp=sharing). Our verification of processes in staging helped provide data for our [FY25 ISCP Test Report Approval](https://gitlab.com/gitlab-com/gl-security/security-assurance/team-commercial-compliance/iscp-and-bcp/-/issues/7). We now have actionable production change issues that can be used to guide SREs to recover component services during a zonal outage.<br>4. This epic and its two sub-epics can be closed out after the review. Sub-Epics: &1285 &1264<br>5.
**Thanks**: Thank you to everyone on the Ops team who ran gamedays; having different people run them was very helpful.<br><br><br><br/><br/>**Nested Epics: 1**<br/><br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1285+ **2024-07-24**: Final Status Update: 1. **Original Problem**: The gameday process was simple and revolved around VM provisioning exclusively. 2. **Changes Made**: Gameday processes were broken out into components that could be run in parallel and performed in a smaller, focused change window. Processes were tested and improved with a goal of closely emulating the actual steps required for a production recovery. Recordings and time measurements were added. 3. **Impact**: Several systemic problems were solved. Anything that failed during a gameday was improved and reworked so it would not fail during a production recovery. Bootstrap scripts, Terraform modules, and Terraform process changes were all improved. The confidence of several phase 1 components was improved and reflected in the &1250 upstream epic. 4. This epic can be closed after the review.<br/> |
| [Plan Incident Management Process Improvements](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1256) <br/> ~"group::Networking & Incident Management" | 2024-02-28 | 2024-05-13 | **2024-05-01**: <br>Final Status Update:<br><br>1. The incident management process has some gaps which we have been intending to address for some time. This epic tries to collect some of that work into a manageable plan, so that we can make things easier for on-call engineers while at the same time tightening up controls to make sure things don't get missed.<br>2. We've gone through existing issues and consolidated them. We've surveyed the team and collected input and feedback. We've summarized that into what we believe are the focus areas where we can make the most difference, both in the short term and in terms of long-term process changes.
Survey results can be found [here](https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/25247#note_1890630850).<br>3. With this, we should be able to focus next quarter's work on a specific set of tasks which will reduce the cognitive load on the EOCs, bring new EOCs up to speed faster, shift some of our incident management and review tasks to the teams who need to be most aware of their service's incidents, and improve our automation around incident management.<br>4. Q2 epics are: https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1299, https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1307, https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1315<br>5. This epic can be closed once the review is complete.<br><br><br/><br/>**Nested Epics: 3**<br/><br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1309+ <br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1262+ **2024-04-17**: Drafting OKRs in https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/25264 - Currently leaning towards focusing on the Incident Review Process as the next step.<br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1271+ **2024-05-01**: 1. Moved all issues to one of three epics.
These are the alert updates that we plan to do, the runbook reorganization that we can consider next, and the backlog epic for issues that we want to keep but are not ready to start working on immediately.<br/> |
| [Data Preservation and Restores](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1279) <br/> ~"group::Networking & Incident Management" | 2024-03-27 | 2024-10-11 | **2024-10-08**: <br>**Final Status Update**<br><br>- The original problem we were trying to solve was to reduce the excessive load on the infrastructure team servicing various types of data restore requests.<br>- The initial plan was to implement [Export on Delete functionality](https://gitlab.com/gitlab-org/gitlab/-/issues/427227), but after months of discussion on that issue, we decided on another path. Instead, we have made process changes to limit the number and complexity of these requests and direct them to the Support Team. Between SIRT and Support self-servicing the requests, we are no longer overloading the Reliability teams. [More details are in the issue...](https://gitlab.com/groups/gitlab-com/gl-infra/-/work_items/1279#note_2147191995)<br>- The new handbook page is: https://internal.gitlab.com/handbook/engineering/infrastructure/customer-data-requests/<br>- **This epic can be closed** during the Grand Review, after appropriate celebration and festivities<br>- There were far too many people involved in this to do individual shoutouts, but the Ops team is extremely appreciative of all of the contributions, and the long push towards getting this resolved!<br><br> |
| [Produce Requirements List of Improvements to our Incident Tooling](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1315) <br/> ~"group::Networking & Incident Management" | 2024-05-16 | 2024-07-26 | **2024-07-24**: <br>**Final Status update** - We have gathered requirements and identified the gaps between what we have and what we need.
We now have [a proposal summarizing the next decision](https://docs.google.com/document/d/1vwE75g1oG5pLCcPEFcLuY-zN43eJOWx5Oh7pHo3J_oo) (build vs. buy). This epic can be closed, and we will start a new one for next quarter based on which path we decide to take.<br><br> |
| [Define Processes for incident reviews to be done by service owners](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1307) <br/> ~"group::Networking & Incident Management" | 2024-05-16 | 2024-08-09 | **2024-07-31**: Final Status Update:<br><br>1. **Problem Statement**: Our current incident review process focuses primarily on immediate corrective actions instead of longer-term learning and improvements. Additionally, the process is primarily driven by the EOC and IMOC, with service owners playing a secondary role.<br>2. **Changes Made**: We moved from the EOC being the default DRI to having the incident reviews be owned by the team that owns the service. We removed many of the prompts in the incident review template which had built up over time and were making the reviews feel like filling out a form with information which already existed elsewhere. We shifted to a narrative format so the reviews feel more like a blog post than a checklist. We then clarified which tasks are to be done by the EOC and which by the service owner.<br>3. **Impact**: We now have assurance that the team which can most effectively address the cause of an incident fully understands what happened, since they are the ones who wrote up the narrative. This has already reduced some toil for the EOC and cleared up some confusion around roles and responsibilities. All feedback so far has been extremely positive.<br>4. This epic can be closed out after the review.<br>5. **Thanks**: Thanks to the Ops team and everyone else who joined in the bi-weekly incident response conversations.
These conversations helped us clarify who needs to do which parts of the review process, and helped the final result take shape.<br><br> |
| [Create guidelines for alert Playbooks](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1299) <br/> ~"group::Networking & Incident Management" | 2024-05-16 | 2024-08-09 | **2024-07-31**: Final Status Update:<br><br>1. **Problem Statement**: The state of our runbooks has been one of the biggest complaints of EOCs. Not being able to quickly find information on unfamiliar alerts has been causing stress and dread among engineers on call.<br>2. **Changes Made**: We have introduced the concept of Alert Playbooks, in addition to the existing Runbooks. Every alert must now have an associated playbook with an explanation, service ownership info, and other metadata about the alert. We've created a template laying out a common format and common set of expected information that an EOC may need for any alert. Using this template, we have created an initial set of Playbooks for the most frequent alerts over the 90 days before starting. Additionally, we have laid out a process for service owners to create a Playbook to go with each alert that they are responsible for.<br>3. **Impact**: The most frequent response by people seeing these for the first time has been "Where have you been all of my life?" There is broad agreement that this will reduce the stress of being on call and result in faster resolution of unfamiliar incidents.<br>4. This epic can be closed out after the review.<br>5.
**Thanks**: Thanks to everyone on the ops team for stepping up and creating a bunch of Playbooks, as well as providing feedback along the way that helped refine the template and the process.<br><br> |
| [Counterpart: Runners - VPC peering solution for CI SaaS Hosted Runners](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1386) <br/> ~"group::Networking & Incident Management" | 2024-08-01 | 2024-12-06 | **2024-12-03**: <br>CAN BE CLOSED.<br><br>:tada: **achievements**:<br>1. Evaluate and choose a networking design that scales.<br>2. Migrate shards that need immediate scaling (`private`) to the shared VPC.<br>3. Document the process of migrating existing shards to the new design.<br>4. Take actions to resolve other saturation issues in other pressured shards (`-org`).<br>5. Introduce members from the Ops team to the hosted runner's scaling process.<br><br>:arrow_forward: **next**:<br>- follow-up epics:<br>1. Hosted Runners service documentation :point_right: &1457.<br>2. Consolidate the configs of 3 shards in terraform (combining resources in an easy-to-scale format) :point_right: &1458.<br>3. Migrate the rest of the shards (10) to the new network design using the documented process :point_right: &1459.<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1386#note_2237241716_<br><!-- STATUS NOTE END --><br> |
| [Evaluate Third Party Incident Management Tooling](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1374) <br/> ~"group::Networking & Incident Management" | 2024-08-01 | 2025-01-15 | **2025-01-14**: <br>We're waiting for IT to get back to us regarding the removal of the incident.io application from the sandbox Slack so we can close this epic.
There's nothing more to do here on our side.<br><br>EDIT: All integrations have been removed and we can now close the epic.<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1374#note_2295204703_<br><!-- STATUS NOTE END --><br><br/><br/>**Nested Epics: 2**<br/><br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1428+ <br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1383+ <br/> |
| [FY25 Q3 Establish GitLab.com patching processes](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1380) <br/> ~"group::Networking & Incident Management" | 2024-08-01 | 2024-11-12 | **2024-11-06**: <br> |
| [FY25 Q3 Zonal Outage Resilience](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1378) <br/> ~"group::Networking & Incident Management" | 2024-08-01 | 2024-11-13 | |
| [FY25 Q4 GitLab.com Patching & OS modernization](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1441) <br/> ~"group::Networking & Incident Management" | 2024-11-01 | 2025-01-24 | **2025-01-21**: <br>**Final status update**<br><br>**Original problem summary**<br><br>At the beginning of Q4, our various static VM fleets supporting GitLab.com required manual work assignment for any found security issues, and rectifying these was consistently a manual process. Additionally, our stagnant Chef infrastructure was presenting a major blocker for deploying new operating systems that would get continued security updates in the future.<br><br>**Changes made**<br><br>- Implemented a [patching-notifier](https://gitlab.com/gitlab-com/gl-infra/ops-team/toolkit/patching-notifier) service to automatically assign GitLab issues to service owners when security problems are detected with their systems.<br>- Created a [patch-automation](https://gitlab.com/gitlab-com/gl-infra/ops-team/toolkit/patch-automation) framework to automate patching and reboot processes for systems.
The Ops team set up automations for 3 initial types of VMs:<br>- haproxy<br>- CI Runners managers<br>- bastion hosts<br>- Added Chef 16 and Ubuntu 22.04 support to commonly used cookbooks.<br><br>**Impact**<br><br>- patching-notifier has so far created 11 issues for security issues found on our VM fleets, 6 of which have already been addressed.<br>- patch-automation has been used to automate patching and reboots of at least 88 systems in GPRD. We look forward to seeing more services extend this framework with additional automations.<br>- With all commonly deployed cookbooks now supporting Chef 16, and having passed kitchen tests on Ubuntu 22.04, we are unblocked on any future efforts to deploy instances on more modern OSs.<br><br>---<br><br>We've completed all tasks associated with this epic and it can now be closed!<br><br>Huge thanks to @astarovoytov for helping to get the patch-automation CI pipelines up and running and @thisisshreya for leading the effort to onboard various systems into the patching-notifier workflows! :tada:<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1441#note_2306786429_<br><!-- STATUS NOTE END --><br> |
| [FY25 Q4 Zonal and Regional Outage Resilience](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1440) <br/> ~"group::Networking & Incident Management" | 2024-11-01 | 2025-02-11 | **2025-01-28**: <br>We are still investigating how we can best contribute to enabling [GEO for Cells](https://gitlab.com/groups/gitlab-org/-/epics/16339#note_2306085402).
[Preparations for Q1 gameday assignments](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1506) is progressing and initial issues are created for assignments.<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1440#note_2317168175_<br><!-- STATUS NOTE END --><br> |
| [SaaS Hosted Runner - Terraform environment consolidation](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1458) <br/> ~"group::Networking & Incident Management" | 2024-11-28 | 2025-01-10 | **2024-12-24**: <br>Closing status:<br><br>**Original problem**: Terraform environments for `ci-private`, `r-saas-l-l-amd64` and `r-saas-l-m-amd64` Runner shards deviated from the established environment configuration patterns used across other environments.<br><br>**Changes made**: Terraform environment consolidation for `ci-private`, `r-saas-l-l-amd64` and `r-saas-l-m-amd64` shards has been completed. With this, we simplified the Terraform projects for the shards, which will help to reduce errors and simplify scaling in the future.<br><br>This epic can be closed :tada:<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1458#note_2272719586_<br><!-- STATUS NOTE END --><br> |
| [.com Hosted Runner - Migrate shards to the shared VPC](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1459) <br/> ~"group::Networking & Incident Management" | 2024-12-02 | 2025-01-10 | **2024-12-17**: <br>**This epic can be closed!** :tada: (cc @kkyrala @rnienaber @marin @fzimmer - sorry, not sure who to mention to close this eventually, so mentioning everyone I can think of).<br><br>Shout out to @swainaina for his great collaboration on this epic, thank you!
:star: :collaboration:<br><br> |
| [SaaS Hosted Runners - improve Runbooks](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1457) <br/> ~"group::Networking & Incident Management" | 2024-12-29 | 2025-04-11 | **2025-04-08**: <br>When this epic started, we aimed to address outdated, disorganized, and hard-to-search documentation related to Runners. We have significantly improved the organization, accuracy, and usability of the Runners documentation, making it easier to find information and respond to alerts effectively, and we feel we can close the epic as documentation is an ongoing process.<br><br>I wish to thank @rehab, who helped review and confirm the accuracy of the updates, as well as add updates and create new documentation to help address areas that were not previously documented.<br><br><br>:tada: **achievements**:<br><br>1. **Documentation Restructuring**:<br>* Restructured and updated the Runner Introduction section ([MR #8321](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/8321))<br>* Restructured and updated the runner troubleshooting runbooks for easier updates and navigation ([MR #8644](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/8644))<br>* Updated the deployment documentation for CI Runners :point_right: https://gitlab.com/gitlab-com/runbooks/-/merge_requests/8738<br>* Added documentation on creating New Shards :point_right: https://gitlab.com/gitlab-com/runbooks/-/merge_requests/8740<br>* Created dedicated documentation for scaling existing shards :point_right: https://gitlab.com/gitlab-com/runbooks/-/merge_requests/8743<br>* Added documentation on VPC peering solution for CI SaaS Hosted Runners :point_right: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/25949<br>* Updated the shard names :point_right: https://gitlab.com/gitlab-com/runbooks/-/merge_requests/8749<br>2.
**Alert Playbooks Creation**:<br>* Added playbook for `CiRunnersServiceQueuingQueriesDurationApdexSLOViolation` alert ([MR #8339](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/8339))<br>* Added playbook for `CiRunnersServicePollingErrorSLOViolation` alert ([MR #8438](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/8438))<br>* Worked on `CiRunnuerJobsApdexSLOViolationSingleShard` alert playbook ([MR #8500](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/8500))<br>* Reviewed other alerts and determined that some had very few recent occurrences, so alert playbooks for them were not necessary at this point :point_right: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/26162<br>3. **Documentation Cleanup**:<br>* Removed outdated information and alerts that are no longer relevant<br>* Migrated networking runbook content to more appropriate locations<br>* Added improved introductory sections compiled from various runbooks<br><br>:issue-blocked: **blockers**:<br>- None<br><br>:arrow_forward: **next**:<br>- Moved this :point_right: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/38734 to the runners team issue tracker<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1457#note_2438858243_<br><!-- STATUS NOTE END --><br> |
| [FY26 Q2 DR Gameday Work](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1576) <br/> ~"group::Networking & Incident Management" | 2025-05-01 | 2025-07-25 | **2025-07-22**: <br>:tada: **achievements**:<br><br>- We are happy to close this epic and move the remaining work to the Q3 epic<br>- Work has started there, and the first gameday was executed this week<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1576#note_2642449819_<br><!-- STATUS NOTE END --><br> |
| [Implementing automation between incident.io and Gitlab via woodhouse](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1592) <br/> ~"group::Networking & Incident Management" |
2025-05-29 | 2025-08-01 | **2025-07-29**: <br>:tada: **achievements**:<br><br>We have successfully implemented [**automatic attaching of related incident issues to incident issues in GitLab.com**](https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/26860) in Production ([example](https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/1466#note_2654040296)).<br>Additionally, we have added a configuration that allows us to **view metrics on [Grafana](https://gitlab.com/gitlab-com/gl-infra/woodhouse/-/merge_requests/670)**. **We can now proceed to close this epic** since we have delivered all the planned automation features between GitLab and incident.io.<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1592#note_2655166043_<br><!-- STATUS NOTE END --><br> |

</details>

<!-- STATUS END -->