Distributed Reads Mainly Target Primary
Promoted to &6733
Traffic to the https://gitlab.com/gitlab-com/www-gitlab-com repository on GitLab.com is growing, and we're starting to see load issues related to this.
Some recent issues include:
- gitlab-com/gl-infra/production#4909 (closed)
- gitlab-com/gl-infra/production#4893 (closed)
- gitlab-com/gl-infra/production#4853 (closed)
Plus more PagerDuty alerts.
gitlab-com/www-gitlab-com is hosted using Gitaly Cluster. There are 3 nodes available to serve read traffic. If this traffic were well distributed across the 3 nodes, it's likely we'd be able to scale up to meet demand.
Unfortunately, traffic is being unevenly distributed, with the primary being a hotspot.
Here are traffic patterns over a 4 week period:
Request
- Distribute PostUploadPack more evenly between nodes.
- Provide observability into the metrics that help illustrate the routing logic in Praefect. If these metrics/logs already exist, improvements to the Praefect dashboards that illustrate this logic would be appreciated.
  - Examples: why is Praefect using the primary? Are both replicas stale?
Verification
- Open https://thanos-query.ops.gitlab.net/graph?g0.range_input=4w&g0.stacked=0&g0.max_source_resolution=1h&g0.expr=sort_desc(%0A100*sum%20by%20(fqdn)%20(rate(grpc_server_handled_total%7Bgrpc_method%3D%22PostUploadPack%22%2C%20env%3D%22gprd%22%2C%20shard%3D%22praefect%22%7D%5B1d%5D))%0A%2F%20ignoring(fqdn)%20group_left()%0Asum(rate(grpc_server_handled_total%7Bgrpc_method%3D%22PostUploadPack%22%2C%20env%3D%22gprd%22%2C%20shard%3D%22praefect%22%7D%5B1d%5D))%0A)&g0.tab=0
- This chart shows the percentage of PostUploadPack requests going to each server. Ideally, these values should be close together.
- At time of writing, 66% of requests are hitting the primary server and the replicas are only handling around 16% each.
- If one server is consistently being used much more than the others, this may indicate the problem still exists.
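
The query behind the chart is URL-encoded in the link above, which makes it hard to read or reuse. As a convenience, here is a minimal Python sketch that decodes it using only the standard library. The helper name `extract_promql` is hypothetical; the URL is copied verbatim from the verification step.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied verbatim from the verification step above.
THANOS_URL = (
    "https://thanos-query.ops.gitlab.net/graph?g0.range_input=4w&g0.stacked=0"
    "&g0.max_source_resolution=1h&g0.expr=sort_desc(%0A100*sum%20by%20(fqdn)%20"
    "(rate(grpc_server_handled_total%7Bgrpc_method%3D%22PostUploadPack%22%2C%20"
    "env%3D%22gprd%22%2C%20shard%3D%22praefect%22%7D%5B1d%5D))%0A%2F%20"
    "ignoring(fqdn)%20group_left()%0Asum(rate(grpc_server_handled_total%7B"
    "grpc_method%3D%22PostUploadPack%22%2C%20env%3D%22gprd%22%2C%20"
    "shard%3D%22praefect%22%7D%5B1d%5D))%0A)&g0.tab=0"
)


def extract_promql(url: str) -> str:
    """Return the percent-decoded g0.expr query parameter (hypothetical helper)."""
    params = parse_qs(urlsplit(url).query)
    return params["g0.expr"][0].strip()


if __name__ == "__main__":
    # Prints the query in readable, multi-line form.
    print(extract_promql(THANOS_URL))
```

Decoded, the expression divides each server's PostUploadPack request rate (grouped by `fqdn`, averaged over `[1d]`) by the total rate across all servers and multiplies by 100, so each series is that server's percentage share of requests.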
Edited by Mark Wood

