[Deprecated] GitLab SaaS Reliability - work queue
## **This epic is for FY23 and earlier, please use https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/898+**
The GitLab Reliability team comprises SREs and DBREs assigned to five teams.
### :books: References and helpful links
- [Getting assistance](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/#getting-assistance)
- [How we work](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/#how-we-work)
- [How work is associated to this epic](TBD)
### :pushpin: Teams
| Team | Slack | Members | EM |
| --- | --- | --- | --- |
| [Database Reliability](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/database-reliability.html) | `#g_infra_database_reliability` | @Finotto @bshah11 @alexander-sosna @rhenchen.gitlab | @dcurlewis |
| [Foundations](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/foundations.html) | `#g_infra_foundations` | @mchacon3 @miladx @T4cC0re @f_santos @pguinoiseau @sarahwalker | @amoter |
| [General](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/general.html) | `#g_infra_general` | @devin @ahanselka @cmcfarland @anganga @gsgl @ayeung | @afappiano |
| [Observability](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/observability.html) | `#g_infra_observability` | @craig @cindy @mwasilewski-gitlab @knottos @nduff | @dawsmith |
| [Practices](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/practices.html) | `#g_infra_practices` | @steveazz @nnelson @ahmadsherif @rehab @fshabir | @kwanyangu |
| [Management Activities](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/854) | `#reliability_lounge` | @alanrichards @dcurlewis @amoter @dawsmith @afappiano @kwanyangu @jarv | @alanrichards |
_The text above is automatically generated from [epic-issue-summaries](https://gitlab.com/gitlab-com/gl-infra/epic-issue-summaries/-/blob/master/teams/reliability.rb)_
## Project Work
### :white_check_mark: Completed Work
Items that have been completed ~"workflow-infra::Done"
<details>
| **Topic** | **Started** | **Ended** | **Summary** |
|-----------| ------------| ----------| ------------ |
| [Redis 6 upgrade](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/395) <br/> ~"team::Reliability" | 2021-01-14 | 2021-07-08 | |
| [Upgrade to terraform v1.0](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/466) <br/> ~"team::Reliability" | 2021-05-01 | 2022-11-23 | **2021-08-18**: All modules and environments related to `gitlab-com-infrastructure` have been updated to Terraform v1.0.4. |
| [Postgres 12 upgrade: Post-upgrade](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/464) <br/> ~"team::Reliability" | 2021-09-12 | 2022-12-12 | **2021-07-28**: We still have some issues to address here. For example https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13328.<br/><br/>**Nested Epics: 1**<br/><br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/578+ <br/> |
| [Upgrade packages.gitlab.com instance to stop publish jobs from failing](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/592) <br/> ~"team::Reliability" | 2021-11-01 | 2022-12-15 | **2022-10-03**: Instance was upgraded to c5a.8xlarge - https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/592#note_1123122210 |
| [Add a Google Cloud CDN in front of the Registry service](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/657) <br/> | 2022-01-06 | 2022-01-31 | **2022-01-28**: CDN switchover complete |
| [Gitaly cgroups operational improvements](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/831) <br/> ~"team::Practices" | 2022-10-01 | 2023-01-11 | **2022-01-11**: This project focused on day 2 operations after we rolled out cgroups in production in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/344+. There were a number of distractions and interruptions while working on this project, specifically ~"work::incident" and ~"work::general" issues.<br/><br/>Some of our biggest achievements were a [70% reduction in metric cardinality](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16777#results) for monitoring cgroups and upstreaming multiple changes to the `gitaly` source code, such as [logging which cgroup was used for a git command](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16456#results-1). This was the first time the Gitaly Stable Counterpart upstreamed code to Gitaly itself, and it served as a good stepping stone for continuing to build SRE knowledge of Gitaly. |
| [Consul upgrade](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/844) <br/> ~"team::Foundations" | 2022-11-01 | 2022-12-19 | **2022-12-20**: - Consul migrated and upgraded in all environments<br/>- Improved reliability (e.g. autoscaling, DNS failures, backups)<br/>- Completed Readiness review<br/>- Documented future improvements |
| [OS Ubuntu upgrade of PGBouncer nodes](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/835) <br/> | 2022-11-01 | 2022-12-13 | **2022-12-13**: With the successful execution of the Prod CR, all pgbouncer nodes in GPRD are now running Ubuntu 20.04, bringing this project to a close. |
| [Multiple Linux Runner Sizes](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/768) <br/> | | 2022-09-27 | **2022-09-27**: This epic is completed. |
| [Use the internal network for Git traffic between runners and GitLab.com](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/658) <br/> | | 2022-03-21 | **2022-03-15**: VPC Peering in production is working as expected. All production runner shards are now using the internal endpoints for Git-over-HTTPS communications. |
| [Expanding list of team members in IMOC rotation](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/506) <br/> ~"team::Reliability" | | 2022-11-23 | |
| [Conduct Praefect readiness review](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/503) <br/> ~"team::Reliability" | | 2022-11-23 | **2021-06-30**: Praefect readiness review completed |
</details>
### :x: Cancelled
Epics that were cancelled ~"workflow-infra::Cancelled"
<details>
| **Topic** | **Ended** | **Summary** |
|-----------| ----------| ------------ |
| [Jaeger Distributed Tracing Enabled for GitLab.com](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/210) | 2022-04-04 | |
| [Database cluster creation and Gitaly cluster dogfooding](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/492) | 2022-11-23 | **2021-12-01**: The work in this epic was canceled due to priority shifts to focus on Gitaly cluster dogfooding |
| [Expand internal dogfooding of Gitaly Cluster.](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/640) | 2022-11-23 | **2022-01-06**: Canceled due to blocking issues that prevent us from using Gitaly Cluster on .com |
| [Refresh SRE hiring loop](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/496) | 2022-11-23 | |
| [Migrate Patroni Cluster to Ubuntu 18.04](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/502) | 2021-06-29 | |
| [Sharding POC environment](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/471) | 2021-07-07 | |
| [Database host management with Ansible](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/478) | 2021-07-26 | |
| [Design and implement infrastructure changes necessary to support 2x the Q1 peak for Gitaly.](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/501) | 2022-11-23 | **2021-06-23**: Looking for guidance on the acceptable percentage increase in monthly spend, as well as a discussion of a dramatic shift to this OKR. |
| [Design and implement infrastructure changes necessary to support 2x the Q1 peak for the primary database.](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/500) | 2022-11-23 | **2021-06-23**: Looking for guidance on the acceptable percentage increase in monthly spend, as well as a discussion of a dramatic shift to this OKR. |
</details>