[Deprecated] GitLab SaaS Reliability - work queue
## **This epic is for FY23 and earlier; please use https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/898+**

The GitLab Reliability team is composed of SREs and DBREs assigned to five teams.

### :books: References and helpful links

- [Getting assistance](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/#getting-assistance)
- [How we work](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/#how-we-work)
- [How work is associated to this epic](TBD)

### :pushpin: Teams

| Team | Slack | Members | EM |
| --- | --- | --- | --- |
| [Database Reliability](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/database-reliability.html) | `#g_infra_database_reliability` | @Finotto @bshah11 @alexander-sosna @rhenchen.gitlab | @dcurlewis |
| [Foundations](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/foundations.html) | `#g_infra_foundations` | @mchacon3 @miladx @T4cC0re @f_santos @pguinoiseau @sarahwalker | @amoter |
| [General](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/general.html) | `#g_infra_general` | @devin @ahanselka @cmcfarland @anganga @gsgl @ayeung | @afappiano |
| [Observability](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/observability.html) | `#g_infra_observability` | @craig @cindy @mwasilewski-gitlab @knottos @nduff | @dawsmith |
| [Practices](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/practices.html) | `#g_infra_practices` | @steveazz @nnelson @ahmadsherif @rehab @fshabir | @kwanyangu |
| [Management Activities](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/854) | `#reliability_lounge` | @alanrichards @dcurlewis @amoter @dawsmith @afappiano @kwanyangu @jarv | @alanrichards |

_The text above is automatically generated from
[epic-issue-summaries](https://gitlab.com/gitlab-com/gl-infra/epic-issue-summaries/-/blob/master/teams/reliability.rb)_

## Project Work

### :white_check_mark: Completed

Work Items that have been completed ~"workflow-infra::Done"

<details>

| **Topic** | **Started** | **Ended** | **Summary** |
|-----------| ------------| ----------| ------------ |
| [Redis 6 upgrade](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/395) <br/> ~"team::Reliability" | 2021-01-14 | 2021-07-08 | |
| [Upgrade to terraform v1.0](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/466) <br/> ~"team::Reliability" | 2021-05-01 | 2022-11-23 | **2021-08-18**: All modules and environments related to `gitlab-com-infrastructure` have been updated to Terraform v1.0.4. |
| [Postgres 12 upgrade: Post-upgrade](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/464) <br/> ~"team::Reliability" | 2021-09-12 | 2022-12-12 | **2021-07-28**: We still have some issues to address here. For example: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13328.<br/><br/>**Nested Epics: 1**<br/><br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/578+ <br/> |
| [Upgrade packages.gitlab.com instance to stop publish jobs from failing](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/592) <br/> ~"team::Reliability" | 2021-11-01 | 2022-12-15 | **2022-10-03**: Instance was upgraded to c5a.8xlarge - https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/592#note_1123122210 |
| [Add a Google Cloud CDN in front of the Registry service](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/657) <br/> | 2022-01-06 | 2022-01-31 | **2022-01-28**: CDN switchover complete |
| [Gitaly cgroups operational improvements](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/831) <br/> ~"team::Practices" | 2022-10-01 | 2023-01-11 | **2022-01-11**: This project focused on day 2 operations after we rolled out cgroups in production in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/344+. There were several distractions and interruptions while working on this project, specifically ~"work::incident" and ~"work::general" issues.<br/><br/>Some of the biggest achievements were a [70% reduction on metric cardinality](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16777#results) for monitoring cgroups and multiple changes upstreamed to the `gitaly` source code, such as [logging which cgroup was used for a git command](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16456#results-1). This was the first time the Gitaly Stable Counterpart upstreamed code to Gitaly itself, and it served as a good stepping stone for continuing to build SRE knowledge of Gitaly. |
| [Consul upgrade](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/844) <br/> ~"team::Foundations" | 2022-11-01 | 2022-12-19 | **2022-12-20**: - Consul migrated and upgraded in all environments<br/>- Improved reliability (e.g. autoscaling, DNS failures, backups)<br/>- Completed Readiness review<br/>- Documented future improvements |
| [OS Ubuntu upgrade of PGBouncer nodes](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/835) <br/> | 2022-11-01 | 2022-12-13 | **2022-12-13**: With the successful execution of the Prod CR, all pgbouncer nodes in GPRD are now running Ubuntu 20.04, which brings this project to a close. |
| [Multiple Linux Runner Sizes](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/768) <br/> | | 2022-09-27 | **2022-09-27**: This epic is completed. |
| [Use the internal network for Git traffic between runners and GitLab.com](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/658) <br/> | | 2022-03-21 | **2022-03-15**: VPC Peering in production is working as expected. All production runner shards are now using the internal endpoints for Git-over-HTTPS communications. |
| [Expanding list of team members in IMOC rotation](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/506) <br/> ~"team::Reliability" | | 2022-11-23 | |
| [Conduct Praefect readiness review](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/503) <br/> ~"team::Reliability" | | 2022-11-23 | **2021-06-30**: Praefect readiness review completed |

</details>

### :x: Cancelled

Epics that were cancelled ~"workflow-infra::Cancelled"

<details>

| **Topic** | **Ended** | **Summary** |
|-----------| ----------| ------------ |
| [Jaeger Distributed Tracing Enabled for GitLab.com](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/210) | 2022-04-04 | |
| [Database cluster creation and Gitaly cluster dogfooding](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/492) | 2022-11-23 | **2021-12-01**: The work in this epic was cancelled due to priority shifts to focus on Gitaly cluster dogfooding |
| [Expand internal dogfooding of Gitaly Cluster.](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/640) | 2022-11-23 | **2022-01-06**: Cancelled due to blocking issues that prevent us from using Gitaly Cluster on .com |
| [Refresh SRE hiring loop](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/496) | 2022-11-23 | |
| [Migrate Patroni Cluster to Ubuntu 18.04](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/502) | 2021-06-29 | |
| [Sharding POC environment](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/471) | 2021-07-07 | |
| [Database host management with Ansible](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/478) | 2021-07-26 | |
| [Design and implement infrastructure changes necessary to support 2x the Q1 peak for Gitaly.](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/501) | 2022-11-23 | **2021-06-23**: Looking for guidance on the acceptable percentage increase in monthly spend, as well as a discussion on a dramatic shift to this OKR. |
| [Design and implement infrastructure changes necessary to support 2x the Q1 peak for the primary database.](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/500) | 2022-11-23 | **2021-06-23**: Looking for guidance on the acceptable percentage increase in monthly spend, as well as a discussion on a dramatic shift to this OKR. |

</details>