Container Registry: Database Load Balancing (DLB)
## Context

At GitLab, high availability, fault tolerance, and efficient resource utilization are paramount. Database Load Balancing (DLB) addresses these needs for services with a database (DB) backend by actively distributing queries across multiple backend hosts.

## Problem

All DB queries are currently directed to the primary node. This was the path of least resistance when we first released the metadata DB, and the load since then has not been a concern. However, this is starting to change now that we have released more features that rely on the metadata DB, especially those that involve heavy queries, such as storage usage calculations. See the [motivation section of the specification](https://gitlab.com/gitlab-org/container-registry/-/blob/5de4f99564d91ff7b29411ece7f5ead7191a429b/docs/spec/gitlab/database-load-balancing.md#motivation) for additional details.

## Goal

Implement a DLB feature to ensure efficient utilization of database resources, enhanced fault tolerance, and improved system performance and reliability. More details are in the [goals section of the specification](https://gitlab.com/gitlab-org/container-registry/-/blob/5de4f99564d91ff7b29411ece7f5ead7191a429b/docs/spec/gitlab/database-load-balancing.md#goals).

## Specification

A technical specification is available at [docs/spec/gitlab/database-load-balancing.md](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/gitlab/database-load-balancing.md?ref_type=heads).

## Implementation Plan

This effort consists of two main _sequential_ phases. Tasks listed in the respective phases should be completed in the listed order.

Given the length of this effort, we have not raised all epics/issues ahead of time. This avoids having to rewrite, reorder, or close issues as we progress and refine the scope and our understanding of the next tasks. We will raise them gradually, at least one milestone before they are scheduled.

### Phase 1

https://gitlab.com/groups/gitlab-org/-/epics/14105+

<details><summary>Click to expand</summary>

This phase includes:

- Implementation of all the *minimum required* planned features;
- The required Distribution (self-served) and Infrastructure (partially self-served) changes;
- Rollout on GitLab.com using DLB for a _single_ API endpoint. The selected endpoint is the most pressing one in terms of resource saturation (storage calculations) *and* the lowest in risk (not part of the push/pull flow).

#### Development

- [x] New configuration settings ([spec](https://gitlab.com/gitlab-org/container-registry/-/blob/5de4f99564d91ff7b29411ece7f5ead7191a429b/docs/spec/gitlab/database-load-balancing.md#configuration)) - https://gitlab.com/gitlab-org/container-registry/-/issues/1275
- [x] Load balancing - https://gitlab.com/gitlab-org/container-registry/-/issues/1276
  - [x] Support fixed list of read-only hosts
  - [x] Abstract read-write (default) and read-only pools
  - [x] Select replica using round-robin
- [x] Service discovery ([spec](https://gitlab.com/gitlab-org/container-registry/-/blob/5de4f99564d91ff7b29411ece7f5ead7191a429b/docs/spec/gitlab/database-load-balancing.md#service-discovery)) - https://gitlab.com/gitlab-org/container-registry/-/issues/1291
  - [x] Look up DNS record for replicas
  - [x] Resolve FQDNs and probe IP addresses
- [x] Fault tolerance ([spec](https://gitlab.com/gitlab-org/container-registry/-/blob/5de4f99564d91ff7b29411ece7f5ead7191a429b/docs/spec/gitlab/database-load-balancing.md#fault-tolerance)) - https://gitlab.com/gitlab-org/container-registry/-/issues/1292
  - [x] Refresh replica list periodically
  - [x] Fall back to the primary server on replica unavailability
- [x] Primary sticking ([spec](https://gitlab.com/gitlab-org/container-registry/-/blob/5de4f99564d91ff7b29411ece7f5ead7191a429b/docs/spec/gitlab/database-load-balancing.md#primary-sticking)) - https://gitlab.com/gitlab-org/container-registry/-/issues/1306
  - [x] Record repository-scoped primary LSN on writes
  - [x] Compare replica LSN on reads to pick target node
- [x] Observability ([spec](https://gitlab.com/gitlab-org/container-registry/-/blob/5de4f99564d91ff7b29411ece7f5ead7191a429b/docs/spec/gitlab/database-load-balancing.md#observability)) - https://gitlab.com/gitlab-org/container-registry/-/issues/1316
  - [x] Emit relevant log entries
  - [x] Expose relevant Prometheus metrics
- [x] Traffic split for GitLab API ([spec](https://gitlab.com/gitlab-org/container-registry/-/blob/5de4f99564d91ff7b29411ece7f5ead7191a429b/docs/spec/gitlab/database-load-balancing.md#gitlab-api))
  - [x] Use read-only replicas for `GET /gitlab/v1/repositories/<path>/` endpoint - https://gitlab.com/gitlab-org/container-registry/-/issues/1317

#### Distribution

- [x] Add new configuration settings to GitLab Helm chart (self-serve)

#### Infrastructure and Release

- [x] Enable in pre-production/staging (self-serve)
- [x] Update Grafana Database Detail dashboard with new application-side metrics (self-serve)
- [x] Create new alerts based on the new application-side metrics (self-serve)
- [x] Create Runbook on how to debug (self-serve)
- [x] Production readiness review
- [x] Enable in production (self-serve)

</details>

### Phase 2

<details><summary>Click to expand</summary>

This phase includes the development of all the remaining features and the gradual rollout of traffic splitting for the remaining API endpoints (ordered by criticality, ascending).

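To illustrate the routing decision that traffic splitting applies to each read endpoint, the following is a minimal sketch combining three of the Phase 1 tasks: round-robin replica selection, LSN comparison for primary sticking, and fallback to the primary. It is not the registry's actual implementation. The names `Balancer`, `Node`, and `PickForRead` are hypothetical, and the LSN is simplified to a plain integer where larger means later.

```go
package main

import "sync/atomic"

// LSN is a simplified, hypothetical stand-in for a PostgreSQL log sequence
// number. Larger values mean later positions in the write-ahead log.
type LSN uint64

// Node is a hypothetical database host with its last known replay position.
type Node struct {
	Name      string
	ReplayLSN LSN
	Healthy   bool
}

// Balancer holds the read-write primary and the read-only replica pool.
type Balancer struct {
	Primary  *Node
	Replicas []*Node
	next     atomic.Uint64 // round-robin cursor shared across requests
}

// PickForRead returns a replica that is healthy and has replayed at least
// minLSN (assumed here to be the LSN recorded for the target repository on
// its most recent write). Each replica is tried at most once, in round-robin
// order. If none qualifies, the primary is returned.
func (b *Balancer) PickForRead(minLSN LSN) *Node {
	n := uint64(len(b.Replicas))
	for i := uint64(0); i < n; i++ {
		// Add returns the incremented value, so subtract 1 to get
		// the slot this attempt claimed.
		idx := (b.next.Add(1) - 1) % n
		r := b.Replicas[idx]
		if r.Healthy && r.ReplayLSN >= minLSN {
			return r
		}
	}
	return b.Primary
}
```

With two healthy replicas, consecutive calls alternate between them. A replica that is unhealthy or behind `minLSN` is skipped, and the primary serves the read only when no replica qualifies or the pool is empty.
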
#### Infrastructure and Release

- [x] Provision new Redis Cluster dedicated to load balancing

#### Development

- [x] Expose settings for dedicated load balancing Redis setup
- [x] Fault tolerance ([spec](https://gitlab.com/gitlab-org/container-registry/-/blob/5de4f99564d91ff7b29411ece7f5ead7191a429b/docs/spec/gitlab/database-load-balancing.md#fault-tolerance))
  - [x] Refresh replica list on network errors immediately
  - [x] Expire potential stale connections gracefully
- [x] Replication lag awareness ([spec](https://gitlab.com/gitlab-org/container-registry/-/blob/5de4f99564d91ff7b29411ece7f5ead7191a429b/docs/spec/gitlab/database-load-balancing.md#replication-lag))
  - [x] Actively monitor replicas' lag
  - [x] Quarantine replicas exceeding thresholds
- [x] Observability ([spec](https://gitlab.com/gitlab-org/container-registry/-/blob/5de4f99564d91ff7b29411ece7f5ead7191a429b/docs/spec/gitlab/database-load-balancing.md#observability))
  - [x] Emit remaining log entries
  - [x] Expose remaining Prometheus metrics
  - [x] Update Grafana dashboard with newly exposed application-side metrics
  - [x] Create alerts based on the newly exposed application-side metrics
  - [x] Extend health check endpoint
- [x] Traffic split for GitLab API ([spec](https://gitlab.com/gitlab-org/container-registry/-/blob/5de4f99564d91ff7b29411ece7f5ead7191a429b/docs/spec/gitlab/database-load-balancing.md#gitlab-api))
  - [x] Use read-only replicas for `GET /gitlab/v1/repository-paths/<path>/repositories/list/` endpoint
  - [x] Use read-only replicas for `GET /gitlab/v1/repositories/<path>/tags/list/` endpoint
- [x] Split traffic for OCI Distribution API ([spec](https://gitlab.com/gitlab-org/container-registry/-/blob/5de4f99564d91ff7b29411ece7f5ead7191a429b/docs/spec/gitlab/database-load-balancing.md#oci-distribution-api))
  - [x] Use read-only replicas for `GET /v2/<name>/tags/list` endpoint
  - [x] Use read-only replicas for `GET/HEAD /v2/<name>/blobs/<digest>` endpoint
  - [x] Use read-only replicas for `GET /v2/<name>/blobs/uploads/<reference>` endpoint
  - [x] Use read-only replicas for `PATCH /v2/<name>/blobs/uploads/<reference>` endpoint
  - [x] Use read-only replicas for `GET|HEAD /v2/<name>/manifests/<reference>` endpoint

</details>

## Status

https://gitlab.com/groups/gitlab-org/-/epics/8591#note_2843301277

## Owners

* Team: [Container Registry](https://handbook.gitlab.com/handbook/engineering/development/ops/package/container-registry)
* Most appropriate Slack channel to reach out to: `#g_container-registry`
* Best individual to reach out to: @jdrpereira / @suleimiahmed
* PM: @trizzi
* EM: @crystalpoole