
Distributed Reads Mainly Target Primary

Promoted to &6733

Traffic to the https://gitlab.com/gitlab-com/www-gitlab-com repository on GitLab.com is growing, and we're starting to see load issues related to this.

Some recent issues include:

Plus more PagerDuty alerts...


gitlab-com/www-gitlab-com is hosted on Gitaly Cluster, with 3 nodes available to serve read traffic. If this traffic were well distributed across the 3 nodes, it's likely we'd be able to scale up to meet demand.

Unfortunately, traffic is being unevenly distributed, with the primary being a hotspot...

Here are traffic patterns over a 4 week period:

image

https://thanos-query.ops.gitlab.net/graph?g0.range_input=4w&g0.max_source_resolution=1h&g0.expr=sum%20by%20(fqdn)%20(rate(grpc_server_handled_total%7Bgrpc_method%3D%22PostUploadPack%22%2C%20env%3D%22gprd%22%2C%20shard%3D%22praefect%22%7D%5B1d%5D))&g0.tab=0
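For readers who can't open the Thanos link, the query behind the chart can be recovered from the URL itself. The sketch below only percent-decodes the `g0.expr` parameter of the link above; the helper name `extract_promql` is made up for illustration and is not part of any GitLab tooling.

```python
from urllib.parse import urlparse, parse_qs

# Chart URL copied verbatim from this issue.
CHART_URL = (
    "https://thanos-query.ops.gitlab.net/graph?g0.range_input=4w"
    "&g0.max_source_resolution=1h"
    "&g0.expr=sum%20by%20(fqdn)%20(rate(grpc_server_handled_total%7B"
    "grpc_method%3D%22PostUploadPack%22%2C%20env%3D%22gprd%22%2C%20"
    "shard%3D%22praefect%22%7D%5B1d%5D))&g0.tab=0"
)


def extract_promql(url: str) -> str:
    """Return the percent-decoded g0.expr parameter of a Thanos graph URL.

    Hypothetical helper, written only for this example.
    """
    params = parse_qs(urlparse(url).query)
    return params["g0.expr"][0]


# Prints the PromQL expression: the per-node (fqdn) rate of handled
# PostUploadPack gRPC calls, using a 1-day rate window.
print(extract_promql(CHART_URL))
```

The decoded expression is `sum by (fqdn) (rate(grpc_server_handled_total{grpc_method="PostUploadPack", env="gprd", shard="praefect"}[1d]))`.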

Request

  1. Distribute PostUploadPack more evenly between nodes.
  2. Provide metrics that make Praefect's routing logic observable. If such metrics or logs already exist, improvements to the Praefect dashboards that illustrate this logic would be appreciated.
    1. Examples: Why is Praefect using the primary? Are both replicas stale?

Verification

  1. Open https://thanos-query.ops.gitlab.net/graph?g0.range_input=4w&g0.stacked=0&g0.max_source_resolution=1h&g0.expr=sort_desc(%0A100*sum%20by%20(fqdn)%20(rate(grpc_server_handled_total%7Bgrpc_method%3D%22PostUploadPack%22%2C%20env%3D%22gprd%22%2C%20shard%3D%22praefect%22%7D%5B1d%5D))%0A%2F%20ignoring(fqdn)%20group_left()%0Asum(rate(grpc_server_handled_total%7Bgrpc_method%3D%22PostUploadPack%22%2C%20env%3D%22gprd%22%2C%20shard%3D%22praefect%22%7D%5B1d%5D))%0A)&g0.tab=0
  2. This chart shows the percentage of PostUploadPack requests going to each server. Ideally, these values should be close together.
  3. At the time of writing, 66% of requests are hitting the primary server, and the replicas are each handling only around 16%.
  4. If one server is consistently being used much more than the others, this may indicate the problem still exists.
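The check described in the steps above can be sketched in a few lines. This is a minimal illustration, not an existing alert or dashboard rule: the node names, the sample rates (picked to roughly match the 66% / 16% split reported here), and the 20-point `max_spread` threshold are all assumptions.

```python
def request_shares(rates):
    """Convert per-node request rates into percentages of the total."""
    total = sum(rates.values())
    if total == 0:
        return {node: 0.0 for node in rates}
    return {node: 100.0 * rate / total for node, rate in rates.items()}


def is_skewed(shares, max_spread=20.0):
    """Return True if the busiest and least busy nodes differ by more
    than max_spread percentage points (the threshold is an assumption)."""
    if not shares:
        return False
    return max(shares.values()) - min(shares.values()) > max_spread


# Hypothetical PostUploadPack rates (requests/second) per node.
rates = {"primary": 8.0, "replica-1": 2.0, "replica-2": 2.0}
shares = request_shares(rates)

for node, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {share:.1f}%")
print("skewed:", is_skewed(shares))
```

With evenly distributed traffic, each of the 3 nodes would sit near 33% and `is_skewed` would return `False`.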

image

Edited by Mark Wood