Skip to content
Snippets Groups Projects

doc: Debugging Gitaly

Merged Andras Horvath requested to merge ahorvath-debugdoc-20240628 into main
All threads resolved!
---
title: "Debugging the Gitaly service"
---
## About this document
This document is intended for **Gitaly engineers**, to become familiar with GitLab's production layout and gain the ability to effectively debug production problems. While the focus is on SaaS, many of the skills transfer also to debugging self-managed instances.
## Generic GitLab background
Skim / read the following, focusing on an overview then on Gitaly:
- [Production Architecture](../../../infrastructure/production/architecture/)
- [Monitoring](../../../../engineering/monitoring/#monitoring)
Other useful links:
- [Product sections, stages, groups, and categories](../../../../../product/categories/)
- [Features by Group](../../../../../product/categories/features/)
### Gitaly specific background
- Familiarize yourself with Gitaly's [README](https://gitlab.com/gitlab-org/gitaly/-/blob/master/README.md?ref_type=heads)
- Take a look at [SRE's runbooks](https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/gitaly)
### Gitaly in Production
Both `gitlab.com` and Dedicated use Gitaly in "sharded" mode, that is, without Praefect (Gitaly Cluster).
## Monitoring dashboards
We have some useful pre-built monitoring dashboards on GitLab's internal Grafana instance. All dashboards are listed in [this folder](https://dashboards.gitlab.net/dashboards/f/gitaly/gitaly-service). Please note that some of them are fairly outdated.
The following dashboards are most common:
- [Gitaly: Overview](https://dashboards.gitlab.net/d/gitaly-main/gitaly3a-overview?orgId=1&var-PROMETHEUS_DS=default&var-environment=gprd&var-stage=main). This dashboard contains cluster-wide aggregated metrics. It is used to determine the overall health of the cluster and make it easy to spot any outlier node.
- [Gitaly: Host details](https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly3a-host-detail?orgId=1). This dashboard contains more detailed metrics of a particular node.
- [Gitaly Housekeeping statistics](https://dashboards.gitlab.net/d/Z2xwZIP7k/gitaly-housekeeping-statistics?orgId=1&refresh=5m). This dashboard shows detailed operational information of [Gitaly housekeeping feature](https://docs.gitlab.com/ee/administration/housekeeping.html).
- [Gitaly: Rebalance dashboard](https://dashboards.gitlab.net/d/gitaly-rebalancing/gitaly3a-rebalance-dashboard?from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=default&var-environment=gprd&var-fqdn=gitaly-cny-01-stor-gprd.c.gitlab-production.internal&orgId=1): This dashboard shows the relative balance between Gitaly nodes. It is used to determine when we need to relocate the repositories of a node to others.
A Gitaly dashboard could be either auto-generated or manually drafted. We use Jsonnet (a superset of JSON) to achieve dashboards-as-code. The definitions of such dashboards are located [in this folder](https://gitlab.com/gitlab-com/runbooks/-/tree/master/dashboards/gitaly?ref_type=heads). Recently, that's the recommended way to manage an observability dashboard. It allows us to use GitLab's built-in libraries, resulting in a highly standardized dashboard.
A standardized dashboard should have a top-level section containing environment filters, node filters, and useful annotations such as feature flag activities, deployments, etc. Some dashboards have an interlinked system that connects Grafana and Kibana with a single click.
Such dashboards usually include two parts. The second half contains panels of custom metrics collected from Gitaly. The first half is more complicated. It contains GitLab-wide indicators telling if Gitaly is "healthy" and node-level resource metrics. The aggregation and calculation are sophisticated. In summary, those dashboards tell us if Gitaly performs well according to predefined [thresholds](https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/services/gitaly.jsonnet), . We could contact [Scalability:Observability Team](../../../team/scalability/observability/) for any questions.
![Gitaly Debug Indicators](gitaly-debug-indicators.png)
Some examples of using built-in dashboards to investigate production issues, from an Engineer's point of view:
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18156#note_1965772736
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/15980#note_1457815084
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/23532#note_1374642198
## Gitaly's Prometheus metrics
A panel in a dashboard is a visualization of the aggregated version of underlying metrics. We use [Prometheus](https://prometheus.io/docs/introduction/overview/) to collect metrics. To simplify, the Gitaly server exposes an HTTP server ([code](https://gitlab.com/gitlab-org/gitaly/-/blob/master/internal/cli/gitaly/serve.go#L514)) that allows Prometheus instances to fetch metrics periodically.
In a dashboard, you can click on the top-right hamburger button and choose "Explore" to get access to the underlying metrics. Or you could use [the Explore page](https://dashboards.gitlab.net/explore) to play with metrics.
![Gitaly Debug Explore](gitaly-debug-explore.png)
Unfortunately, we don't have a curated list of all Gitaly metrics as well as their definition. So, you might need to look up their definition at multiple places. Here is [the list of all Gitaly-related metrics](https://dashboards.gitlab.net/explore?schemaVersion=1&panes=%7B%22pum%22%3A%7B%22datasource%22%3A%22mimir-gitlab-gprd%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22group+by%28__name__%29+%28%7B__name__%3D%7E%5C%22.*gitaly.*%5C%22%2C+job%21%3D%5C%22prometheus%5C%22%7D%29%22%2C%22range%22%3Atrue%2C%22instant%22%3Atrue%2C%22datasource%22%3A%7B%22type%22%3A%22prometheus%22%2C%22uid%22%3A%22mimir-gitlab-gprd%22%7D%2C%22editorMode%22%3A%22code%22%2C%22legendFormat%22%3A%22__auto%22%7D%2C%7B%22refId%22%3A%22B%22%2C%22expr%22%3A%22group+by%28__name__%29+%28%7Btype%3D%5C%22gitaly%5C%22%2C+job%21%3D%5C%22prometheus%5C%22%7D%29%22%2C%22range%22%3Atrue%2C%22instant%22%3Atrue%2C%22datasource%22%3A%7B%22type%22%3A%22prometheus%22%2C%22uid%22%3A%22mimir-gitlab-gprd%22%7D%2C%22editorMode%22%3A%22code%22%2C%22legendFormat%22%3A%22__auto%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-1h%22%2C%22to%22%3A%22now%22%7D%7D%7D&orgId=1). There are some sources
- Node-level or environmental metrics. Those metrics are powered by other systems that host the Gitaly process. They are not exposed by Gitaly but are very useful, for example: CPU metrics, memory metrics, or cgroup metrics.
- Gitaly-specific metrics. Those metrics are accounted for directly in the code. Typically, they have `gitaly_` prefixes.
- Aggregated metrics, such as combining different metrics or downsizing metrics due to high cardinality issues. The list of Gitaly's aggregated metrics is listed [in this file](https://gitlab.com/gitlab-com/runbooks/-/blob/master/mimir-rules/gitlab-gprd/gitaly/gitaly.yml).
![Gitaly Debug Metric Lists](gitaly-debug-list-metrics.png)
In the code, you'll see something like the following. Any registered metrics are available when Prometheus scrapes from the endpoint. Tracing those instances, you could find the usage of Gitaly-specific metrics.
```go
repoCounter := counter.NewRepositoryCounter(cfg.Storages)
prometheus.MustRegister(repoCounter)
packObjectsServedBytes = promauto.NewCounter(prometheus.CounterOpts{
Name: "gitaly_pack_objects_served_bytes_total",
Help: "Number of bytes of git-pack-objects data served to clients",
})
```
A metric has a set of labels. GitLab adds the following set of labels to all metrics:
- `env` or `environment`: the environment, including but not limited to `gprd`, `gstg`, `ops`, to name a few.
- `fqdn`: the fully qualified domain name. As Gitaly runs on VMs now, this label is equivalent to the identity of the hosting node.
- `region` and `zone`: the region and zone of the node.
- `stage`: the current stage of the process, either `main` or `cny`.
- `service`/`type`: for Gitaly, it's always `gitaly`.
In the future, when Gitaly runs on K8s, we properly have more K8s-specific labels.
The query uses [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/) language. Some examples:
- [Calculate the rate (ops/s) of pack-refs housekeeping task by node](https://dashboards.gitlab.net/explore?schemaVersion=1&panes=%7B%22xxn%22:%7B%22datasource%22:%22PA258B30F88C30650%22,%22queries%22:%5B%7B%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22PA258B30F88C30650%22%7D,%22exemplar%22:true,%22expr%22:%22sum%28rate%28gitaly_housekeeping_tasks_total%7Benvironment%3D%5C%22gprd%5C%22,%20housekeeping_task%3D%5C%22packed_refs%5C%22%7D%5B$__rate_interval%5D%29%29%20by%20%28fqdn%29%20%3E%200%22,%22hide%22:false,%22interval%22:%22%22,%22legendFormat%22:%22%7B%7Bhousekeeping_task%7D%7D%22,%22refId%22:%22B%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-6h%22,%22to%22:%22now%22%7D%7D%7D&orgId=1).
- [Calculate the dropped pack-objects/RPC requests due to limited in the last 2 days](https://dashboards.gitlab.net/explore?schemaVersion=1&panes=%7B%22rmc%22:%7B%22datasource%22:%22mimir-gitlab-gprd%22,%22queries%22:%5B%7B%22expr%22:%22sum%28rate%28gitaly_pack_objects_dropped_total%7Benv%3D%5C%22gprd%5C%22,environment%3D%5C%22gprd%5C%22,type%3D%5C%22gitaly%5C%22%7D%5B$__rate_interval%5D%29%29%20by%20%28fqdn,%20reason%29%20%3E%200%5Cn%22,%22format%22:%22time_series%22,%22interval%22:%22$__interval%22,%22intervalFactor%22:1,%22legendFormat%22:%22Pack-objects%20%7B%7Bfqdn%7D%7D%20%7B%7Breason%7D%7D%22,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22mimir-gitlab-gprd%22%7D,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D,%7B%22refId%22:%22B%22,%22expr%22:%22sum%28rate%28gitaly_requests_dropped_total%7Benv%3D%5C%22gprd%5C%22,environment%3D%5C%22gprd%5C%22,type%3D%5C%22gitaly%5C%22%7D%5B$__rate_interval%5D%29%29%20by%20%28fqdn,%20reason%29%20%3E%200%22,%22range%22:true,%22instant%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22mimir-gitlab-gprd%22%7D,%22editorMode%22:%22code%22,%22legendFormat%22:%22Requests%20%7B%7Bfqdn%7D%7D%20%7B%7Breason%7D%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-2d%22,%22to%22:%22now%22%7D%7D%7D&orgId=1)
- [Calculate inflight commands of gitaly-cny node](https://dashboards.gitlab.net/explore?schemaVersion=1&panes=%7B%22imy%22:%7B%22datasource%22:%22mimir-gitlab-gprd%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22gitaly_commands_running%7Benv%3D%5C%22gprd%5C%22,%20fqdn%3D%5C%22gitaly-cny-01-stor-gprd.c.gitlab-production.internal%5C%22%7D%22,%22range%22:true,%22instant%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22mimir-gitlab-gprd%22%7D,%22editorMode%22:%22code%22,%22legendFormat%22:%22__auto%22%7D%5D,%22range%22:%7B%22from%22:%22now-30d%22,%22to%22:%22now%22%7D%7D%7D&orgId=1). As you can see, there was as peak on 2024-06-17. It was when [this incident](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18156) occurs.
## Debugging and performance testing tools
+1
- [gprcurl](https://github.com/fullstorydev/grpcurl): `curl` like tool but for gPRC
- [grpcui](https://github.com/fullstorydev/grpcui): lightweight `postman` like tool for gPRC
- [hyperfine](https://github.com/sharkdp/hyperfine): a performance tool that can benchmarks over time
- hyperfine can be used together with grpcurl to check the response time of a gPRC call
### strace
`strace(1)` a gitaly process:
```shell
strace -fttTyyy -s 1024 -o /paht/filename -p $(pgrep -fd, gitaly)
```
Or wrap a process to make it easy to strace, especially if it then spawns more processes:
```shell
#!/bin/bash/sh
echo $(date)" $PPID $@" >> /tmp/gitlab-shell.txt
exec /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-orig "$@"
# strace -fttTyyy -s 1024 -o /tmp/sshd_trace-$PPID /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-orig
```
[`strace` parser](https://gitlab.com/gitlab-com/support/toolbox/strace-parser) is useful to make the results more readable.
### fast-stats
[fast-stats](https://gitlab.com/gitlab-com/support/toolbox/fast-stats) is a useful tool developed by Support to quickly pull statistics from GitLab logs.
#### Examples
To find in one interval of 60m duration what the top methods called are from the gitaly logs.
```shell
fast-stats --interval 60m --limit 1 var/log/gitlab/gitaly/current
```
To find the top 10 User, Project, Client by Duration calling that method:
```shell
grep PostUploadPackWithSidechannel var/log/gitlab/gitaly/current | ~/bin/fast-stats --interval 60m top
```
## Log analysis
Kibana (Elastic) Dashboards
- [gstg](https://nonprod-log.gitlab.net/app/r/s/J0jWx)
- [gprd](https://log.gprd.gitlab.net/app/r/s/XuXAI)
## Capacity management
Gitaly team is responsible for maintaining reasonable serving capacity for gitlab.com.
We get alerts from Tamland if capacity runs low, see [this issue comment](https://gitlab.com/gitlab-com/gl-infra/capacity-planning-trackers/gitlab-com/-/issues/1666#note_1786916965).
[Capacity planning](../../../team/scalability/observability/capacity_planning/) documentation explains how this works in general.
Loading