GitLab SaaS Reliability - work queue - FY24 (#898) · Epics · GitLab Infrastructure Team

GitLab SaaS Reliability - work queue - FY24

The GitLab Reliability team consists of SREs and DBREs assigned to five teams.

### :books: References and helpful links

- [Getting assistance](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/#getting-assistance)
- [How we work](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/#how-we-work)
- [How work is associated to this epic](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/issues.html#epics)
- [Workflow labels explained](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/issues.html#workflow-labels)

### :pushpin: Teams

| Team | Links | Members | EM |
| --- | --- | --- | --- |
| [Database Reliability](/groups/gitlab-com/gl-infra/-/epics/898#database-reliability-teamdatabase-reliability) | [`#g_infra_database_reliability`](https://gitlab.slack.com/archives/C02K0JTKAHJ) [ℹ️ handbook](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/database-reliability.html) | @bshah11 @alexander-sosna @rhenchen.gitlab | @kwanyangu |
| [Foundations](/groups/gitlab-com/gl-infra/-/epics/898#foundations-teamfoundations) | [`#g_infra_foundations`](https://gitlab.slack.com/archives/C0313V3L5T6) [ℹ️ handbook](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/foundations.html) | @mchacon3 @miladx @T4cC0re @f_santos @pguinoiseau @sarahwalker | @amoter |
| [General](/groups/gitlab-com/gl-infra/-/epics/898#general-teamgeneral) | [`#g_infra_general`](https://gitlab.slack.com/archives/C04MH2L07JS) [ℹ️ handbook](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/general.html) | @devin @ahanselka @cmcfarland @anganga @gsgl @ayeung | @afappiano |
| [Observability](/groups/gitlab-com/gl-infra/-/epics/898#observability-teamobservability) | [`#g_infra_observability`](https://gitlab.slack.com/archives/C0496692EHY) [ℹ️
handbook](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/observability.html) | @craig @knottos @nduff | @dawsmith | | [Practices](/groups/gitlab-com/gl-infra/-/epics/898#practices-teampractices) | [`#g_infra_practices`](https://gitlab.slack.com/archives/C04M6HVAY49) [ℹ️ handbook](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/practices.html) | @sxuereb @nnelson @ahmadsherif @rehab @fshabir | @kwanyangu | | [Management Activities](/groups/gitlab-com/gl-infra/-/epics/898#project-work) | [`#reliability_lounge`](https://gitlab.slack.com/archives/C03QC5KNW5N) | @alanrichards @amoter @dawsmith @afappiano @kwanyangu @jarv | @alanrichards | _The text above is automatically generated from [epic-issue-summaries](https://gitlab.com/gitlab-com/gl-infra/epic-issue-summaries/-/blob/master/teams/reliability.rb)_ ## Database Reliability ~"team::Database Reliability" ### :hourglass: Work In Progress These epics are currently in progress ~"workflow-infra::In Progress" | **Topic** | **Start Date** | **Target End Date** | **Summary** | |-----------|----------------|-------------------------------|-------------| | [GitLab.com replica database access for Trust and Safety automation](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/923) @bshah11 (+0 participants) | 2023-03-01 | 2023-03-31 | **2023-06-01**: We are working on a rollout strategy with a grace period where we will monitor the workload closely to avoid impact on patroni production clusters - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17499#note_1415143644 , https://gitlab.com/gitlab-com/gl-security/security-change-management/-/issues/9#note_1415145611 Production Teleport cluster was migrated from a SPoF VM `teleport-01-inf-gprd` at `teleport.gprd.gitlab.net:3080` to the new Kubernetes HA deployment at `production.teleport.gitlab.net`- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/13135 SRE has implemented the Machine ID to 
establish connectivity between `glsec-trust-safety-live` and production environments - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17497#note_1402887590 We have identified an overall strategy for the Trust and Safety application to access necessary data from Patroni Main and CI Clusters. An alternative to establishing VPC peering via Teleport's Machine ID was validated by SRE. We have already validated the process to grant limited access to specific tables of the Main patroni cluster and tested it in the `gprd` environment - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17497#note_1404714038 . For now, we are planning to go with a Patroni backup node's hostname/ip for `trust-and-safety` team to access the data. cc @kwanyangu @gitlab-com/gl-security/security-operations/trust-and-safety | ### :white_check_mark: Completed Work Items that have been completed ~"workflow-infra::Done" <details> | **Topic** | **Started** | **Ended** | **Summary** | |-----------| ------------| ----------| ------------ | | [Hardware upgrade for Patroni nodes - GitLab.com](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/851) | 2022-12-01 | 2023-10-12 | **2023-06-09**: We completed the hardware upgrade of our `patroni-main` and `patroni-ci` and all nodes are now running over N2 hardware. We could scale down the `patroni-ci` cluster to 3 active nodes (and 1 backup node) as planned to reduce costs. However, we could not scale down the `patroni-main` cluster to less than 6 active nodes (and 1 backup node) because with a smaller amount of nodes our read-only workload is facing `lwlock lock_manager` saturation as detailed at https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/18934#note_1423061617. 
The plan was to reduce `patroni-main` to 4 active nodes to improve our ROI, given that the active replicas' [CPU usage with 6 nodes is less than 40% on average, with occasional spikes that reach 50%](https://thanos-query.ops.gitlab.net/graph?g0.expr=avg(instance%3Anode_cpu_utilization%3Aratio%7Benv%3D%22gprd%22%2Cenvironment%3D%22gprd%22%2Ctype%3D%22patroni%22%7D)%20by%20(fqdn)%20and%20on%20(fqdn)%20pg_replication_is_replica%3D%3D1%20&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g0.end_input=2023-06-09%2001%3A34%3A24&g0.moment_input=2023-06-09%2001%3A34%3A24), meaning that we are not making use of around 50% of the resources. |

</details>

### :rotating_light: Epics that need attention

These linked epics are not in the correct state or missing a workflow label

<details>

| **Topic** | **Links** | **Reason** |
|-----------|-----------|-------------|
| [PostgreSQL upgrade to major version 14](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/642) @alexander-sosna (+0) team::Database Reliability | Epic has ~"workflow-infra::In Progress" but is closed |

</details>

## Foundations ~"team::Foundations"

### :white_check_mark: Completed Work

Items that have been completed ~"workflow-infra::Done"

<details>

| **Topic** | **Started** | **Ended** | **Summary** |
|-----------| ------------| ----------| ------------ |
| [Address all Foundations owned Med severity Pentest findings.](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/916) | 2022-11-01 | 2023-12-19 | **2023-07-26**: See this week's update for the service account keys rotation [here](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16689#note_1464734683). This huge piece of work is progressing, and is the only remaining pentest work to be done.
Now focusing again on [switching to Workload Identity for the GitLab.com deployment](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/22409) and updating the CustomersDot projects. There are however some technical blockers for the remaining work: Workload Identity not fully supported by GitLab in Kubernetes, OAuth access tokens not supported by `gsutil`, Chef secrets in Vault complicated to update... | | [Secret management improvements](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/843) | 2022-11-01 | 2023-02-21 | **2023-02-17**: https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/843#note_1282495927 | | [Kubernetes GitOps Proof of Concept](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/829) | 2022-11-01 | 2023-02-21 | **2023-02-03**: https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/829#note_1264235448 | | [Foundations: FY2023 Q2 Cloudflare savings](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/976) | 2023-05-01 | 2023-08-07 | **2023-07-07**: Order has been signed to retain critical features we are using, discussion and investigation continues around bandwidth usage and forecast. | | [Implement GitOps for a Foundation's owned service](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1030) | 2023-05-22 | 2023-07-27 | **2023-07-27**: Validated Flux security (SAs, roles, etc), service is secure and uses least privilege permissions. This concludes the OKR for Q2. Scoping CI and production readiness for Q3. **Nested Epics: 1** • https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/993+ | | [Chef-Server to Cinc-Server migration](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/907) | | 2023-04-25 | **2023-04-25**: `chef.gitlab.com` has been successfully migrated from a legacy Chef v12 instance to a new instance running cinc-server v14.16.19. Work on this epic is now complete. 
We have performed the following cleanup tasks after the Chef to CINC migration: - Update documentation to point to new CINC server fqdn instead of legacy chef server. - Remove test and legacy chef instances. **Nested Epics: 2** • https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/910+ **2023-04-21**: Chef to CINC cutover maintenance is now completed: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8744 We will be using the CINC Server (cinc-01-inf-ops.c.gitlab-ops.internal) as a replacement of Chef moving forward. Planning and execution of the Chef data migration final cutover to CINC Server. • https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/909+ **2023-04-13**: - Successfully migrated Chef users, acls, organizations, cookbooks, vault and databags from test instance `chef2` to `cinc` server using knife-ec-backup. - Tested knife authentication with CINC after user has been migrated from chef2. Users can reuse their existing pem files, and won't need to reissue any credentials to authenticate with CINC. - Interacting with cinc server with knife works as expected. - We tested bootstrapping a new VM using CINC. - Tested running chef-client on an existing VM while using cinc. 
|
| [Terraform GitLab Infrastructure groups and projects](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/799) | | 2023-05-02 | **2023-05-02**: * `infra-mgmt` project setup, validated and operational, managing the `gitlab-com/gl-infra` group, infrastructure team groups, and a few projects on both `gitlab.com` and `ops.gitlab.net` * All current Vault-enabled projects have been imported, more infrastructure subgroups and projects are being imported a few at a time * Documentation has been added on how to use and contribute to `infra-mgmt` * Long running issues removed from the epic (import and update of all Infrastructure projects) |
| [Migrate Reliability owned Chef secrets from GKMS to Vault](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/915) | | 2023-05-12 | **2023-05-12**: The only remaining Foundations-owned secret to be migrated is HAProxy. It has already been migrated in staging, and production will be done upon completion of the HAProxy upgrade, to be completed later this month. As such, we are closing this epic. |
| [Address all Compliance related work for Foundations owned services](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/917) | | 2023-12-19 | **2023-06-12**: For the [last open compliance issue](https://gitlab.com/gitlab-com/gl-security/security-assurance/observation-management/-/issues/958), we are now unblocked from IT's side. The next step is to terraform the required groups and permissions. |
| [Project: Teleport High Availability](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/786) | | 2023-12-21 | **2023-10-24**: The main objective of this epic is complete.
- Production migrated - Postgres CAs updated - Cleanup old Teleport VMs - Upgraded to Teleport v13 in staging and production - Remove old Okta apps (pending AR) - Slack plugin Helm Chart improvements upstream |

</details>

### :rotating_light: Epics that need attention

These linked epics are not in the correct state or missing a workflow label

<details>

| **Topic** | **Links** | **Reason** |
|-----------|-----------|-------------|
| [Foundations: FY23 Q2 GCP Cloudspend savings](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/977) @mchacon3 (+0) team::Foundations | Epic has ~"workflow-infra::In Progress" but is closed |
| [Current-state OS evaluation and OS upgrade plans for Foundations' owned VMs](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/952) @amoter (+0) team::Foundations | Epic has ~"workflow-infra::In Progress" but is closed |
| [HAProxy Upgrades (1.8 LTS -> 2.8 LTS)](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/839) @miladx (+0) team::Foundations | Epic has ~"workflow-infra::In Progress" but is closed |

</details>

## Project Work

### :hourglass: Work In Progress

These epics are currently in progress ~"workflow-infra::In Progress"

| **Topic** | **Start Date** | **Target End Date** | **Summary** |
|-----------|----------------|-------------------------------|-------------|
| [FY24 Q2: Reliability::General: Non-OKR work](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1043) @afappiano (+0 participants) ~"team::Ops" | 2023-05-01 | 2023-08-01 | **2023-06-06**: |
| [FY24 Q2: Reliability::General: 2 unsupported customer facing services through production readiness](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1015) @afappiano (+0 participants) ~"team::Ops" | 2023-05-01 | 2023-07-31 | **2023-06-09**: We may be able to start on Package Cloud within the next 2 weeks and have begun defining work efforts.
**Nested Epics: 1** • https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/922+ **2023-09-26**: - packages.gitlab.com was migrated to k8s on `2023-09-18` ([CR](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16271)) | | [FY24 Q2: Reliability::General: Reduce hosting costs by 3%](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1014) @afappiano (+0 participants) ~"team::Ops" | 2023-05-01 | 2023-07-31 | **2023-06-22**: Test site for Fastly-->CloudFlare migration is up, see: cf.about.gitlab.com. Cutover is still scheduled for 2023-07-05. We still hope to make some progress on the [DB savings issue](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17480) but that will not get started until after the Fastly-->CloudFlare work is completed. | | [FY24 Q2: Reliability::General: >99.95% availability for primary services (excluding git access & CI Runners & Sidekiq)](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/985) @afappiano (+0 participants) ~"team::Ops" | 2023-05-01 | 2023-07-31 | **2023-05-17**: We've had a few incidents contribute negatively to our availability numbers but we are not sure these reflect actual customer experience. We'll resolve that question and others as we follow up on all service related incidents focusing on the more recent ones first. We will continue to iterate on this and add work efforts here as they are identified. **Nested Epics: 2** • https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/998+ • https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/997+ | | [Reliability Management Activities](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/854) @alanrichards (+0 participants) ~"team::Reliability" | | 2023-01-31 | **Nested Epics: 1** • https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/852+ **2022-11-24**: Written materials are approaching finalization, and met with L&D this week to get an overview of LevelUp functionality. 
Next: create slide deck |

### :soon: Ready

Linked epics that are ready to start ~"workflow-infra::Ready"

| **Topic** |
|-----------|
| [Rotate Secrets and Create Tooling in Support of Secrets Rotation](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/443) ~"team::Reliability" |
| [Audit and cleanup metric cardinality and series data](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/920) ~"team::Reliability-Observability" |
| [Partition Prometheus in Kubernetes](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/623) ~"team::Reliability-Observability" |

### :arrow_forward: Next

These are the epics we will be focusing on next ~"workflow-infra::Proposal"

| **Topic** | **Target Start Date** | **Summary** |
|-----------|-----------------------|-------------|
| [Change/Incident Management Refinement](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/961) @rehab (+0 participants) ~"team::Ops" | 2023-05-01 | **2023-05-13**: This epic doesn't currently have enough resources to drive it forward; individual efforts to fix things on an issue-by-issue basis are welcome. |
| [Improve Consul health checks for database nodes](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/689) (+0 participants) ~"team::Reliability" | | **2022-11-23**: Marking this as a proposal since we don't have any work associated yet.
| ### :anchor: Stalled These epics are stalled ~"workflow-infra::Stalled" | **Topic** | **Start Date** | **Target End Date** | **Summary** | |-----------|----------------|-------------------------------|-------------| | [GitLab.com Fleet OS Upgrades](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/231) @devin (+0 participants) ~"team::Ops" | | | **2022-11-23**: Marking this epic as stalled until a new DRI is assigned **Nested Epics: 2** • https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/680+ • https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/645+ | | [Optimize Gitaly storage costs](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/780) @jarv (+0 participants) ~"team::Reliability" | | | **2022-08-11**: As discussed in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16163#note_1058712350 we will be unable to do this migration without taking a large amount of downtime. For now it is stalled as I have moved onto work with Disaster Recovery. | ### :white_check_mark: Completed Work Items that have been completed ~"workflow-infra::Done" <details> | **Topic** | **Started** | **Ended** | **Summary** | |-----------| ------------| ----------| ------------ | | [2022 Annual 3rd Party Pentest Mitigations](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/834) ~"team::Ops" | 2022-11-01 | 2023-02-07 | **2022-12-14**: Current focus is on six remaining of nine total Medium priority issues. Work on https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16687 is underway. | | [Migrate services on EOL GKE clusters on 1.21 to new clusters](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/895) ~"team::Ops" | 2023-02-08 | 2023-02-22 | **2023-03-03**: All services have been migrated and the old cluster shut down. This work is complete. -------- We have a few GKE clusters which will be officially end of life on January 31st 2023. They are running GKE 1.21 and need to be upgraded. 
| | [FY24-Q2: Reliability::General Ensure all components of all General owned services are up to date](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/975) ~"team::Ops" | 2023-05-01 | 2023-11-05 | **2023-07-31**: | Service | Notes | Status for this OKR | | ------ | ------ | ------------------- | | api | frequently deployed with the rest of the application via the release process, so I cannot imagine how this would get out of date | :white_check_mark: Done | | camoproxy | we rolled our own Helm chart for this so it's not getting picked up by the Renovate bot. the version of `go-camo` used is a little old ([2.4.3](https://github.com/cactus/go-camo/releases) vs 2.4.0) | :white_check_mark: [Done](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/merge_requests/2639) | | contributors | it's just a dashboard | :white_check_mark: [Redirect moved to CF](https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/6374), pending [ownership update](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/6083) | | forum | [hosted by Discourse](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/139) so unlikely be out of date | :white_check_mark: Done | | nginx | frequently deployed with the rest of the application via the release process, so I cannot imagine how this would get out of date | :white_check_mark: Done | | web | frequently deployed with the rest of the application via the release process, so I cannot imagine how this would get out of date | :white_check_mark: Done | | git | frequently deployed with the rest of the application via the release process, so I cannot imagine how this would get out of date | :white_check_mark: Done | | mailroom | frequently deployed with the rest of the application via the release process, so I cannot imagine how this would get out of date | :white_check_mark: Done | | web-pages | frequently deployed with the rest of the application via the release process, so I cannot imagine how this would get 
out of date | :white_check_mark: Done | | plantuml | managed in [`tanka-deployments`](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/tanka-deployments) so unlikely to be out of date | :white_check_mark: Done | | registry | frequently deployed with the rest of the application via the release process, so I cannot imagine how this would get out of date | :white_check_mark: Done | | sidekiq | frequently deployed with the rest of the application via the release process, so I cannot imagine how this would get out of date | :white_check_mark: Done | | version | our team has a process to ensure it's updated every month | :white_check_mark: Done | | about | isn't there a team dedicated to this? | :question: Not done | | design | our team has a process to ensure it's updated every month | :white_check_mark: Done | | woodhouse | covered by Renovate bot, but there are a few MRs outstanding for it | :white_check_mark: All outstanding Renovate MRs merged and deployed | | kas | frequently deployed with the rest of the application via the release process, so I cannot imagine how this would get out of date | :white_check_mark: Done | | websockets | frequently deployed with the rest of the application via the release process, so I cannot imagine how this would get out of date | :white_check_mark: Done | | gitlab-com-artifact-registry | this was only recently spun up, presumably as part of the DR project. it's a [GCP offering](https://cloud.google.com/artifact-registry) so nothing to do here. | :white_check_mark: Done | | gitlab-com-pkgs | this was only recently spun up, presumably as part of the DR project. it's literally just a GCS bucket so nothing to do here. 
| :white_check_mark: Done | | packagecloud | our team was doing some work on this very recently, but there is no process for keeping it up to date (see https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24053) | :x: will likely become out of date soon | | internal-api | frequently deployed with the rest of the application via the release process, so I cannot imagine how this would get out of date | :white_check_mark: Done | | | [FY24 Q2: Support Development of AI Features](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/984) ~"team::Ops" | | 2023-10-25 | **2023-05-24**: @devin continues work to support AI in conjunction with @rnienaber's folks. Epic for that is here: https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1005 | | [Move ops.gitlab.net out of us-east1 for DR](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/921) ~"team::Ops" | | 2023-10-26 | **2023-06-07**: Moving this out of ~"workflow-infra::In Progress". All critical tasks are completed. @ayeung and @gsgl will continue to make opportunistic progress where possible but this is no longer a primary focus for the team. | | [Tasks for the squad->team transition](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/888) ~"team::Reliability" | 2023-01-30 | 2023-02-08 | | | [Thanos stability improvements (migrate to helm and consolidate stores)](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/813) ~"team::Reliability-Observability" | 2023-01-01 | 2023-05-03 | **2023-04-19**: 1. Alert volume [still acceptable](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/813#note_1359408727) 2. Closed [relabeling change](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8700) so the February move to ruler cluster in ops GKE does not affect queries poorly. 3. 
Moved some issues from here to https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/623 for focus in Q2 | | [FY2024 Q1 - Elastic Search / Logging Improvements](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/904) ~"team::Reliability-Observability" | 2023-04-01 | 2023-05-03 | **2023-04-26**: 1. Closed out https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17336 with notes about plans to upgrade Elastic next Quarter 2. Alerts [remained at acceptable levels](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/904#note_1369502707) | | [FY2024Q2 - Update Observability Owned cookbooks to use Vault backend.](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/953) ~"team::Reliability-Observability" | 2023-05-01 | 2023-07-13 | | | [FY2024-Q2 - OS upgrade or replace Observability owned VMs on 16.04](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/957) ~"team::Reliability-Observability" | 2023-05-01 | 2023-08-11 | | | [FY2024Q2 - Observability Tooling Spend and Toil Analysis](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/974) ~"team::Reliability-Observability" | | 2023-11-23 | | | [Runner SaaS Infra Operations](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/858) ~"team::Reliability-Practices" | 2022-08-18 | 2023-02-15 | **2023-01-12**: - Updating the default ruby image for docker jobs was rolled out. 
- |
| [OS upgrade of Praefect proxy](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/862) ~"team::Reliability-Practices" | 2023-01-16 | 2023-03-29 | **2023-03-29**: The project is done; check the [results](#results) |
| [Adjust/lower arbitrary CI limits from data observations](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/906) ~"team::Reliability-Practices" | 2023-02-15 | 2023-03-28 | **2023-03-28**: All changes in this Epic are now complete :tada: |
| [Support SaaS MacOS release](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/945) ~"team::Reliability-Practices" | 2023-03-23 | 2023-05-05 | **2023-05-05**: Access to AWS was granted, and steps for future ARs are documented in the runbooks. |
| [Private runners scaling](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1023) ~"team::Reliability-Practices" | 2023-05-24 | 2023-05-25 | **2023-05-25**: The Runners are now unpaused on all the instances and are serving traffic. We have effectively added an additional 50% of the previous capacity. :tada: |
| [[CI Runners] create Small fleet](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1047) ~"team::Reliability-Practices" | | 2023-08-04 | **2023-08-04**: The old infrastructure has now been decommissioned; this epic is complete :tada: |
| [Scale gitlab-org shared runners](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1021) ~"team::Reliability-Practices" | | 2023-06-07 | **2023-06-07**: The [scalability saturation issue](https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/560) was auto-closed :tada: - with the completion of the last issue in this epic, this can now be closed. |
| [GPU enabled SaaS Linux runners](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/805) ~"team::Reliability-Practices" | | 2023-06-22 | **2023-06-22**: The follow-up cleanup child epic is now complete; this can be marked as ~"workflow-infra::Done".
**Nested Epics: 1** • https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1011+ **2023-06-22**: The `medium` infrastructure is now live and receiving traffic :tada: - The `large` infrastructure has been decommissioned; this epic is now complete :tada: |

</details>

### :x: Cancelled

Epics that were cancelled ~"workflow-infra::Cancelled"

<details>

| **Topic** | **Ended** | **Summary** |
|-----------| ----------| ------------ |
| [Q4 Observability Decrease Monitoring Notifications](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/864) | 2023-02-14 | **2022-11-17**: - Migrated Prometheus data stores to use SSDs to prevent future incidents with disk latency. |

</details>

### :rotating_light: Epics that need attention

These linked epics are not in the correct state or missing a workflow label

<details>

| **Topic** | **Links** | **Reason** |
|-----------|-----------|-------------|
| [FY24 Q2: Improve availability of Sidekiq to >99.95% and include as weighted primary service](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/986) @afappiano (+0) team::Ops | Epic has ~"workflow-infra::In Progress" but is closed |
| [Q4 Observability Help SIRT Onboard application logs](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/833) @cmcfarland (+0) team::Reliability-Observability | Labeling problem, epic has no workflow label |
| [Define next steps for Gitaly repository load balancing](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/845) @steveazz (+0) team::Reliability-Practices | Epic has ~"workflow-infra::Proposal" but is closed |
| [Migrate Inactive Repositories to HDD Gitaly nodes](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/902) (+0) team::Reliability-Practices | Epic has ~"workflow-infra::Ready" but is closed |
| [Significant increase in Gitaly single node incidents](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/991) @fshabir (+0) team::Reliability-Practices | Epic has ~"workflow-infra::In Progress" but is closed |
| [Give SSH Access to the Gitaly Team](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/812) @steveazz (+0) team::Reliability-Practices | Epic has ~"workflow-infra::Proposal" but is closed |
| [Close gap between staging and production for Gitaly](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/874) (+0) team::Reliability-Practices | Epic has ~"workflow-infra::Proposal" but is closed |
| [Proposal: Improving alert mapping for incidents](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/775) @jarv (+0) | Epic has ~"workflow-infra::Proposal" but is closed |

</details>

---

:robot: Generated by automation from: https://gitlab.com/gitlab-com/gl-infra/epic-issue-summaries/-/jobs/7343070510
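
The "Epics that need attention" tables flag epics whose state and workflow label disagree. The rule behind those reasons can be sketched roughly as follows. This is an illustrative Python sketch only, not the actual `epic-issue-summaries` code: the `Epic` class, the `attention_reason` function, and the assumption that only `Done` and `Cancelled` are acceptable labels on a closed epic are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

WORKFLOW_PREFIX = "workflow-infra::"

# Assumption: these are the only workflow labels a closed epic may carry.
TERMINAL_LABELS = {"workflow-infra::Done", "workflow-infra::Cancelled"}


@dataclass
class Epic:
    """Hypothetical, minimal stand-in for an epic record."""
    title: str
    state: str  # "opened" or "closed"
    labels: List[str] = field(default_factory=list)


def attention_reason(epic: Epic) -> Optional[str]:
    """Return the reason an epic needs attention, or None if it looks consistent."""
    workflow = [label for label in epic.labels if label.startswith(WORKFLOW_PREFIX)]
    if not workflow:
        return "Labeling problem, epic has no workflow label"
    if epic.state == "closed":
        # A closed epic with a non-terminal workflow label is inconsistent.
        for label in workflow:
            if label not in TERMINAL_LABELS:
                return f'Epic has ~"{label}" but is closed'
    return None


if __name__ == "__main__":
    example = Epic("HAProxy Upgrades", "closed", ["workflow-infra::In Progress"])
    print(attention_reason(example))
```

Under these assumptions, resolving a flagged epic means either reopening it or relabelling it with a terminal workflow label.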
