Production Engineering Foundations
:mega: **This epic is no longer in use following [changes to the group team structure](https://docs.google.com/document/d/1Mi45xQnKOgkD47WLnz4JUMFZOBeb32zxDuk4OfUiSDQ/edit?usp=sharing). For Networking related projects and updates, please see https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1676+. For Fleet Management related projects and updates, please see https://gitlab.com/groups/gitlab-com/gl-infra/platform/runway/-/epics/14+**

## Overview

<!-- STATUS BEGIN -->

The Foundations team is responsible for building, running, and owning the lifecycle of the core infrastructure for GitLab.com. This epic is the high-level epic for all project work across the Foundations team.

## Project Work

### :soon: Ready

Linked epics that are ready to start

| **Topic** |
|-----------|
| [Phase 2.1: Simplify RackAttack Application Rate Limiting Configuration](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1511) <br/> ~"group::Networking & Incident Management" |

### :stop_button: Triage

Epics that are currently being triaged

| **Topic** | **Summary** |
|-----------|-------------|
| [Grouping: Access, Connectivity, Bastions and Consoles Backlog](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/337) <br/> ~"group::Networking & Incident Management" | |
| [Foundations: Review and plan low pentest findings](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/978) <br/> ~"group::Networking & Incident Management" | **2023-12-20**: <br>We asked AppSec if there are any SLAs or prioritization for low findings. The conclusion was that these can be placed in the backlog and potentially have no action taken and be closed eventually. One comment summarized<br><br>> If it’s a vulnerability that required a patch or reconfiguration yes we should have consistent SLA expectations.
But if it’s something that requires a business process or technical architectural change then it’s not likely we would hit the defined SLA<br><br>Next steps are to verify that all open issues are not vulnerabilities, in which case we will close them and close this epic.<br> | | [Dynamic Ingress for Gitlab.com services in Non Prod environments.](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1157) <br/> ~"group::Networking & Incident Management" | **2024-02-05**: <br>- Work on this epic is wrapping up as we head into new commitments for Q1.<br>- Remaining work will be considered ~"workflow-infra::Stalled".<br>- Work on this epic was mostly focused on identifying an ingress solution for multi-cluster services.<br>- Currently [we have identified that the cloud-native ingress architecture](https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/24969#note_1755773471) will be composed of GKE Gateway as frontend ingress and Istio Gateways as the only backends for it. We will implement all routing to services at the Istio layer using Istio VirtualService definitions.
This will allow better integration with Flagger whenever Delivery decides to pursue this for managing deployments.<br>- Next steps would be to complete setting up the above-mentioned infrastructure on the `pre` and `gstg` environments and cut over traffic from HAProxy to the GKEGateway+Istio ingress.<br> | | [Draft: Implement robust, dynamic ingress for .com](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1138) <br/> ~"group::Networking & Incident Management" | **2023-10-23**: <br>Epic created to serve as a parent epic for all Dynamic ingress work, the first portion of which will be implemented in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1157.<br> | | [Expand Teleport To All/Most Services](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/759) <br/> ~"group::Networking & Incident Management" | | ### :white_check_mark: Completed Work Items that have been completed <details> | **Topic** | **Started** | **Ended** | **Summary** | |-----------| ------------| ----------| ------------ | | [Migrate all remaining Reliability Owned CI secrets to Vault](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/913) <br/> ~"group::Networking & Incident Management" | 2023-02-16 | 2025-05-28 | **2025-05-28**: <br>Closing the remaining issues, as they have been sitting in the backlog for too long, many are out of date, and we don't have the bandwidth for them.<br><br>CI secrets are being progressively migrated to Vault as projects are being imported into `infra-mgmt` with Vault enabled to remediate expired tokens, and also as they are being migrated to `common-ci-tasks`.<br> | | [Make Cloudflare configuration auditable and transparent and bring deprecations up to date](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1180) <br/> ~"group::Networking & Incident Management" | 2023-11-01 | 2024-02-09 | **2024-02-05**: <br>Work on this epic is now complete: all resources are managed in Terraform, processes are documented, and we've audited the rules down from 85 to 60.
Remaining Cloudflare improvements have been moved to the ongoing Improvements epic https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1131, some of which will fall under the next quarter's OKR https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1210<br><br><br><br/><br/>**Nested Epics: 3**<br/><br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1141+ <br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1187+ <br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/982+ <br/> | | [Consolidate resource access and user provisioning in Teleport](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1142) <br/> ~"group::Networking & Incident Management" | 2023-11-01 | 2024-02-09 | **2024-02-08**: <br>All Google Compute instances (VMs) that Chef manages in the `pre` and `gstg` environments are now accessible and managed in [Staging Teleport](https://staging.teleport.gitlab.net/). Similarly, all Google Compute instances (VMs) that Chef manages in the `gprd` environment are now accessible and managed in [Teleport Production](https://production.teleport.gitlab.net/).<br><br>Furthermore, we have enabled *Enhanced Session Recording* (eBPF) on console nodes, which enables more powerful auditing and monitoring tools for the security teams.<br><br/><br/>**Nested Epics: 1**<br/><br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1143+ <br/> | | [Foundations - incoming requests work - 2024](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1194) <br/> ~"group::Networking & Incident Management" | 2024-01-01 | 2025-03-04 | **2025-02-26**: <br>Two requests were opened this week to be triaged: [1](https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/26309), [2](https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/26314)<br><br>**Cloudflare**<br><br>Access requests:<br><br>- @ayeung https://gitlab.com/gitlab-com/team-member-epics/access-requests/-/issues/34147 - CorpSec group change in Vault<br>- @sabrams 
https://gitlab.com/gitlab-com/team-member-epics/access-requests/-/issues/34954 - single user Cloudflare access<br><br>Support requests:<br><br>- @ayeung, @donnaalexandra https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/26220 - Customer investigation for blocked traffic, escalated to a Cloudflare ticket and was ultimately closed due to the customer no longer experiencing the issue.<br>- @pguinoiseau https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/25994 - a long running investigation for a customer experiencing SSH issues. This week it was re-escalated and we added some information to a Cloudflare ticket that we are now waiting on.<br><br>**Security**<br><br>Ops instance:<br><br>- @pguinoiseau - Enable TLS for ops instance - https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/26148. This was moved from the KTLO epic, while it relates to KTLO, it was created and requested by an outside team.<br><br>SIRT:<br><br>- @ayeung - https://gitlab.com/gitlab-com/gl-infra/infrastructure-lounge-slack-issue-tracker/-/issues/98 - adding a Cloudflare block for a SIRT incident. This also led to https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19368, which we then helped mitigate.<br><br>Teleport:<br><br>- @mchacon3 - Troubleshooting teleport access for a Gitaly team member - https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5627<br><br><!-- STATUS NOTE END --><br> | | [Foundations - "Keep the Lights On" General Operations Work - 2024](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1291) <br/> ~"group::Networking & Incident Management" | 2024-01-01 | 2025-03-04 | **2025-02-25**: <br>**Teleport**<br><br>- CorpSec handover<br><br>- @sabrams - We met and discussed the future ownership ([agenda notes](https://docs.google.com/document/d/1P9n_pSJZVHzGCzAlDMLMaMRrg5scZymX-Cg2lyUoLB8/edit?usp=sharing)). 
CorpSec will be opening a handbook update to finalize the agreement - https://gitlab.com/gitlab-com/gl-infra/mstaff/-/issues/291. A brief summary:<br><br>- CorpSec will own all Business/admin aspects and user/system provisioning.<br>- Production Engineering SREs will continue to field the teleport requests in slack until CorpSec works to move forward automation as part of the [Project Zero Console](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1426#note_2348753312) which they have agreed to take ownership of.<br>- Foundations will be available for maintenance work only which will be followed by CorpSec to keep the K8s cluster running. This is due to CorpSec no longer having staff with skills needed to maintain the infrastructure. If new or changed functionality is needed, CorpSec will invest in training or migrating off of K8s to a solution that they have more experience with.<br><br>- @ayeung - Supporting tasks for the transition - https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/26252<br><br>**config-mgmt**<br><br>- @pguinoiseau - Upgrade Google provider to v6 in config-mgmt: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/26249<br><br>**Renovate** (these are MRs that needed additional work to merge)<br><br>- @mchacon3 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/10340<br><br>**Rate limiting**<br><br>- @donnaalexandra - working to re-enable authenticated web API and web rate limits - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19084<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1291#note_2364870874_<br><!-- STATUS NOTE END --><br> | | [Foundations - Meta work](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1290) <br/> ~"group::Networking & Incident Management" | 2024-01-01 | 2025-10-15 | **2024-07-18**: <br>Q3 planning is underway in https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/25581.<br> | | [WAF is applied 
across all Saas Platforms](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1210) <br/> ~"group::Networking & Incident Management" | 2024-02-03 | 2024-05-07 | **2024-05-01**: <br>This quarter the Foundations team set out to understand the requirements for other teams to consume Cloudflare WAF. With the Blueprint and PoC both under final review, we've made a plan for bringing the Cloudflare WAF to Dedicated and are beginning the [execution of that in Q2](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1286). :construction_worker:<br><br>Now that the Blueprint and PoC issues have been closed, this Epic is complete and can be closed :partyparrot: :raised_hands:<br><br>A special shoutout to @mchacon3 :bow:, who has done an incredible job understanding the DNS requirements of Cloudflare and building out a sandbox PoC, and to @cmiskell :kiwi: for the advice and guidance on how to integrate Cloudflare into Dedicated :map:<br><br/><br/>**Nested Epics: 1**<br/><br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1246+ <br/> | | [Dedicated is protected with a WAF](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1286) <br/> ~"group::Networking & Incident Management" | 2024-05-01 | 2024-08-09 | **2024-08-08**: <br>After discussing many options, the team has come to a decision on how to move forward with Cloudflare IP Access rules. 
Big thanks to everyone who contributed, the [plan is summarised](https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/5775#note_2024885379) here.<br><br>Although we did not succeed in enabling the WAF for the first customer, Foundations did succeed in creating a module for WAF rules that was successfully consumed by Dedicated tenants (a giant shoutout to @pguinoiseau and @tkhandelwal3 who both nailed it!).<br><br>To keep things organized, we moved the last remaining issues from this epic to https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1359 where we will support completing the migration and continue iterating on the WAF module, helping to deliver anticipated features such as support for BYOD and Pages.<br><br>We also succeeded in collaborating with Environment Automation to move all Tenant DNS to be managed via Cloudflare :tada: This smooth transition would not have been possible without @mchacon3. Thanks Marcel!!<br><br>This epic is now done and ready to be closed :tada: :done:<br> | | [Teleport Backups](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1357) <br/> ~"group::Networking & Incident Management" | 2024-07-01 | 2024-08-09 | **2024-07-31**: <br>The epic is completed. We take daily backups and run automated tests to verify the Teleport Firestore backups.<br><br>![Screenshot_2024-07-31_at_7.12.57_PM](/uploads/698e24644e0964e7daaddeeb64e3658d/Screenshot_2024-07-31_at_7.12.57_PM.png)<br><br>![Screenshot_2024-08-01_at_2.44.31_PM](/uploads/bf15e8982eeda7a6d5844144ba98c3a0/Screenshot_2024-08-01_at_2.44.31_PM.png)<br><br>![Screenshot_2024-08-01_at_2.44.56_PM](/uploads/4cbd024d1e5cd3f7cf0e4b08b4b88d4e/Screenshot_2024-08-01_at_2.44.56_PM.png)<br><br><details><br><summary>2024-07-17</summary><br>All infrastructure resources (servers and databases), roles, and configurations moved from the staging Teleport instance to the production Teleport instance. I also revised all Teleport runbooks and beefed them up with the new information.
Next and last, I will be working on setting up a scheduled pipeline to restore the Firestore backups for the staging instance.<br></details><br><br><details><br><summary>2024-07-10</summary><br>To support the basic backup process, I created a daily schedule to back up the Firestore database. I was able to move all non-prod servers and databases from the staging Teleport cluster to the production Teleport cluster. I will continue with moving the non-prod configurations to the production Teleport cluster this week too. After this work is completed, I can start working on an automated scheduled job to test the disaster recovery process.<br></details><br><br><details><br><summary>2024-07-01</summary><br>Epic created - this work is being urgently prioritized to meet compliance expectations.<br></details><br> | | [WAF is available as consumable infrastructure for internal services](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1359) <br/> ~"group::Networking & Incident Management" | 2024-08-07 | 2024-12-06 | **2024-11-26**: <br>:tada: Cloudflare WAF has [successfully been enabled](https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/6784) on professional_teal_sparrow (Dedicated customer C1)! :tada:<br><br>This concludes Foundations' role in supporting the Dedicated WAF development and rollout and this epic can now be closed!<br><br>**Epic Highlights**<br><br>This story begins back in February when Foundations first spent [FY25Q1 onboarding to Dedicated and putting together a blueprint](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1210) for how a WAF could be delivered to Dedicated architecture.
In [FY25Q2, Foundations developed a fully automated approach](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1286) to migrate Dedicated DNS to Cloudflare and inject a module that would enable the WAF.<br><br>- After facing a big surprise at the end of Q2 in learning that Cloudflare WAF custom rules do not affect Spectrum SSH traffic, we had to quickly pivot to find new solutions in FY25Q3.<br>- Stakeholders from Environment Automation and Foundations worked closely with Cloudflare to develop a workaround (known as [Option A](https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/merge_requests/3937)). But as it took shape, it became apparent this solution was unmaintainable and introduced too much risk in being so fragile and difficult to understand.<br>- We then moved on to what was known as [Option B](https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/merge_requests/3912), moving the IP allowlists to nginx. This was a safer and better understood approach, but required more thorough testing.<br>- There were more surprises along the way in learning that PrivateLink presented its own challenges with Option B, but the cross-team engineers involved carried on and worked with the customer to plan and execute a safe and well monitored rollout.<br><br>@mchacon3 and @mhuseinbasic carried a large portion of the development of these solutions and we would not have been able to deliver this key feature to a top customer if it weren't for their perseverance and hard work! 
:bow:<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1359#note_2227504853_<br><!-- STATUS NOTE END --><br><br/><br/>**Nested Epics: 1**<br/><br/>• https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1351+ <br/> | | [Improve rate limiting interface between SREs and Support](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1278) <br/> ~"group::Networking & Incident Management" | 2024-08-07 | 2024-11-01 | **2024-10-30**: <br>:spaghetti: [**Simplifying Rate Limiting Configuration**](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/rate_limiting_simplification/)** Design Doc has been accepted!**<br><br>This has been an incredible learning experience, thank you to everyone who has reviewed and provided feedback! We're excited to make progress on [Phase 1: simplifying our edge network and bypass configuration](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1434) as we move into Q4. :yay-frog:<br><br>:bar_chart: **We have Application Rate Limiting Metrics!**<br><br>The [Rate Limiting Dashboard](https://dashboards.gitlab.net/d/rate-limiting-rate-limiting_overview/rate-limiting3a-rate-limiting3a-overview?orgId=1&from=now-12h&to=now&timezone=browser) (scroll to the bottom to see shiny new metrics) now shows the rate each throttle is applied! This is supplemented by log links to enable further investigation when needed.<br><br>:sparkles: **This first epic for rate limiting improvements can now be closed** :sparkles:<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1278#note_2184727488_<br><!-- STATUS NOTE END --><br> | | [Phase 1.1: Simplify Rate Limiting Bypass Configuration for GitLab.com](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1434) <br/> ~"group::Networking & Incident Management" | 2024-11-06 | 2025-02-21 | **2025-02-18**: <br>This epic is officially complete! 
:tada:<br><br>This work has significantly simplified the management of rate limit bypasses, improved visibility into bypass usage, and set the stage for more effective rate limit management in the future, which ultimately contributes to improved availability and stability of our platform.<br><br>:sparkles: **Achievements** :sparkles:<br><br>* Consolidated all GitLab.com rate limit bypass configuration in one place (instead of 7).<br>* All IPs on these lists have been attributed to either a customer, vendor, or a GitLab IP; the majority of which were unattributed at the beginning of this epic.<br>* Across at least 7 separate lists, we had 981 IPs listed. Consolidating these lists reveals we actually have 484 IPs on the bypass list: 253 for customers and 231 for internal services<br>* Cleaned up redundant allowlists, conducted an initial audit of IPs on the allowlist:<br>* Follow up issues where applicable have been created in: https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/374+<br>* Set up the [Bypass Policy in the Handbook](https://handbook.gitlab.com/handbook/engineering/infrastructure/rate-limiting/bypass-policy/), and updated [rate-limiting runbooks](https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/rate-limiting) to reflect the updated process.<br>* We fixed one of the path-based bypass rules that we found had an incorrect Regular Expression.<br><br>![image](/uploads/b789f64865f6f4947a487293e3796433/image.png)<br><br>:rabbit: **Hurdles Encountered** :rabbit:<br><br>* There were 7 different locations for configuring bypasses, not 3 like initially thought.<br>* Changes across Cloudflare, HAProxy (and Chef), and Vault required careful ordering of operations to prevent customer impact.<br>* Attributing IPs involved digging into old issues, commit histories, manual searches for IPs, and a lot of investigation.<br>* We encountered a few unexpected behaviours with Cloudflare:<br>* Limits on the number of IPs per rule<br>* Validating the order of rules 
worked as expected<br>* We accidentally caused an [incident](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19246) when trying to remove some of the path-based bypasses from the HAProxy configuration, causing some customers to be unexpectedly rate limited. We adjusted our approach to instead move the bypasses to Cloudflare, like we had done for the other bypasses.<br><br>:track_next: **What's Next** :track_next:<br><br>* https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1452+<br>* In the second (and final) part of Phase 1, we will be standardizing how we configure Cloudflare firewall rules (WAF) across GitLab.com, Dedicated, and Cloud Connector, and laying the foundations for [unlocking rate limiting for Runway services](https://gitlab.com/gitlab-com/gl-infra/platform/runway/team/-/issues/28#note_2346041342).<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1434#note_2352627986_<br><!-- STATUS NOTE END --><br> | | [Further improve rate limiting interface between SREs and Support](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1427) <br/> ~"group::Networking & Incident Management" | 2024-11-18 | 2025-04-04 | **2025-04-01**: <br>:tada: **This epic is ready to close** :tada:<br><br>**Achievements:**<br><br>:collaboration: Provisioned Support Engineering with access to Cloudflare Analytics, enabling them to triage networking related issues quickly, reducing the need for SRE support.
This is something that was proposed almost two years ago, and we've finally been able to provision this access and have seen the benefits of this already.<br><br>:mag: Created [Troubleshooting Guidelines](https://handbook.gitlab.com/handbook/engineering/infrastructure/rate-limiting/troubleshooting/) to empower Support Engineers, and other team members at GitLab with the ability to diagnose rate limiting related issues while providing an escalation path to SREs where necessary.<br><br>:books: Provided documentation on [Managing Limits](https://handbook.gitlab.com/handbook/engineering/infrastructure/rate-limiting/managing-limits/) to provide a centralised place for GitLab team members to reference when trying to understand when, where, and how to configure rate limits for all our product offerings.<br><br>Thank you to @sarahwalker for your support on this epic.<br><br>Also, huge thanks to @cleveland @ahergenhan @kenneth and the other Support Engineers that we've collaborated with on this effort. :partyparrot:<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1427#note_2427597172_<br><!-- STATUS NOTE END --><br> | | [Phase 1.2: Standardize Cloudflare WAF Configuration](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1452) <br/> ~"group::Networking & Incident Management" | 2025-02-17 | 2025-05-23 | **2025-05-20**: <br>:partyparrot: **Closing status update!** :partyparrot:<br><br>At the beginning of the quarter, we had several wildly different sets of Cloudflare WAF custom and rate limiting rules scattered around the place. GitLab.com was using one set, Dedicated tenants were using a derivative set, and who knows what staging.gitlab.com or ops.gitlab.net were doing? :shrug: These rulesets were all managed piecemeal and none of them followed a consistent naming convention that made any sense to readers. 
:weary:<br><br>As part of this epic, we sought to update the [`cloudflare-waf-rules`](https://gitlab.com/gitlab-com/gl-infra/terraform-modules/cloudflare/cloudflare-waf-rules) Terraform module with the latest and greatest WAF rules from the cutting edge (that is, GitLab.com). This required us to do a thorough audit and determine which rules were worth bringing into the module, as well as removing rules that had outlived their usefulness. The Terraform module was updated with the resulting blessed set of rules. After that, we rolled out the Terraform module to be used in the `gprd`, `gstg` and `ops` environments. Now all 3 environments share the same base set of rules. :success:<br><br>We also established a [naming convention](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/cloudflare/intro.md#waf-rules-naming-convention) for WAF rules going forward. The convention will be enforced by CI for any updates to `config-mgmt` and `cloudflare-waf-rules`. :cop:<br><br>As we've been positioning `cloudflare-waf-rules` to be used as a component across all our different platforms, we built everything with this in mind, so the module is flexible enough for unusual use cases while opinionated enough that using the supplied defaults will confer excellent protection. To validate this, we worked with the [Cloud Connector](https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/26570) and [Dedicated](https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/26365) teams to get them started using the radically revised version of the `cloudflare-waf-rules` module. :dogfood: Big shoutouts to those teams for their openness and patience! :collaboration:<br><br>With all that done, we are good to close this epic in the next Grand Review. 
:bow:<br><br>Brought to you by @sarahwalker and @ayeung<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1452#note_2513381742_<br><!-- STATUS NOTE END --><br> | | [Increase GKE node IP subnet size and migrate to GKE Dataplane V2](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1606) <br/> ~"group::Networking & Incident Management" | 2025-07-16 | 2025-08-01 | **2025-07-30**: <br>:tada: **achievements**:<br><br>* The `gprd-us-east1-c` GKE cluster has been successfully rebuilt, without any significant issues.<br>* **Update:** the `gprd-us-east1-d` GKE cluster has been successfully rebuilt as well!<br>* See [closing status update](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1606#note_2661215025) :tada:<br><br>:arrow_forward: **next**:<br><br>* ~~The rebuild of the final cluster `gprd-us-east1-d` will be done this Thursday.~~<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1606#note_2658153346_<br><!-- STATUS NOTE END --><br> | | [Check and remediate impact of Bitnami chart/docker image changes](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1650) <br/> ~"group::Networking & Incident Management" | 2025-07-28 | 2025-08-24 | **2025-08-19**: <br>:checkered_flag: **Closing status update**:<br><br>:tada: We are no longer using container images from `docker.io/bitnami` in any of our GKE clusters, keeping our pipelines and workloads safe after August 28th when Bitnami will unpublish their opensource container images.<br><br>:white_check_mark: This was achieved in several ways:<br>* replacing the Bitnami container images and/or Helm charts with alternatives when possible<br>* decommissioning unused Helm releases (Clickhouse PoC)<br>* updating the image repository references to use the `docker.io/bitnamilegacy` registry (or an internal mirror of) temporarily when alternatives are not yet available or not yet possible to migrate to for the following services:<br>* `ops.gitlab.net`'s Redis instance, 
currently using the built-in `bitnami/redis` chart dependency from the GitLab chart<br>* the Release environments' Redis and PostgreSQL instances, using the built-in Bitnami chart dependencies from the GitLab chart as well<br>* Sentry, with several built-in Bitnami chart dependencies<br>* the PubSub and Registry Cache Redis deployments, using an old version of the `bitnami/redis` chart and currently frozen until replaced/decommissioned.<br><br>:arrow_forward: The migration away from those remaining Bitnami charts and images will be addressed in separate issues in their own time:<br>* `ops.gitlab.net`'s Redis instance will be replaced with a Redis alternative (TBD): https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1671<br>* the Release environments will be addressed in https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/21439<br>* Sentry is [being addressed upstream](https://github.com/sentry-kubernetes/charts/issues/1828), if we don't [decommission it](https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4285) before then<br>* the PubSub and Registry Cache Redis deployments will be moved back to VMs in the coming months: https://gitlab.com/groups/gitlab-com/gl-infra/data-access/durability/-/epics/24<br>* the built-in Bitnami dependencies in the GitLab Helm charts are being discussed in https://gitlab.com/gitlab-org/charts/gitlab/-/issues/6089<br><br>_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1650#note_2693839379_<br><!-- STATUS NOTE END --><br> | | [Use Flux as GitOps solution for infrastructure workloads](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1468) <br/> ~"group::Networking & Incident Management" | | 2025-04-17 | | | [Consider the future of Network Policy and our use of GKE](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/856) <br/> ~"group::Networking & Incident Management" | | 2025-02-26 | | | [Foundations Compliance issues](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1182) <br/> 
~"group::Networking & Incident Management" | | 2024-12-26 | **2024-12-26**: <br>All issues are now closed, so this can be closed too.<br> |

</details>

### :x: Cancelled

Epics that were cancelled

<details>

| **Topic** | **Ended** | **Summary** |
|-----------| ----------| ------------ |
| [Foundations technical debt](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1191) | 2025-03-17 | **2025-03-17**: <br>Epic closed in favor of the KTLO epic. See https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1531+<br> |
| [Documentation validation and refactoring to make it self serve](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1177) | 2024-04-02 | **2023-12-09**: This Epic is currently not being worked on, as the DRIs are still onboarding<br> |

</details>

### :rotating_light: Epics that need attention

These linked epics are not in the correct state or missing a workflow label

<details>

| **Topic** | **Links** | **Reason** |
|-----------|-----------|-------------|
| [Foundations corrective actions](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1550) <br/> (+0) <br/> group::Networking & Incident Management | Epic has ~"workflow-infra::Triage" but is closed |
| [Networking & Incident Management - Vendor Relations](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1389) <br/> @sabrams (+0) <br/> group::Networking & Incident Management | Labeling problem, epic has ~"workflow-infra::Waiting" |

</details>

<!-- STATUS END -->
epic