
doc: Debugging Gitaly

Merged Andras Horvath requested to merge ahorvath-debugdoc-20240628 into main
A Gitaly dashboard could be either auto-generated or manually drafted.
A standardized dashboard should have a top-level section containing environment filters, node filters, and useful annotations such as feature flag activities, deployments, etc. Some dashboards have an interlinked system that connects Grafana and Kibana with a single click.
Such dashboards usually include two parts. The second half contains panels of custom metrics collected from Gitaly. The first half is more complicated: it contains GitLab-wide indicators that tell whether Gitaly is "healthy", plus node-level resource metrics. The aggregation and calculation are sophisticated. In summary, those dashboards tell us whether Gitaly performs well according to predefined [thresholds](https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/services/gitaly.jsonnet). Contact the [Scalability:Observability Team](../../../team/scalability/observability/) with any questions.
![Gitaly Debug Indicators](gitaly-debug-indicators.png)
## Gitaly's Prometheus metrics
A panel in a dashboard is a visualization of the aggregated version of underlying metrics. We use [Prometheus](https://prometheus.io/docs/introduction/overview/) to collect metrics. To simplify, the Gitaly server exposes an HTTP server ([code](https://gitlab.com/gitlab-org/gitaly/-/blob/master/internal/cli/gitaly/serve.go#L514)) that allows Prometheus instances to fetch metrics periodically.
In a dashboard, you can click the top-right hamburger button and choose "Explore" to access the underlying metrics. Or you can use [the Explore page](https://dashboards.gitlab.net/explore) to experiment with metrics.
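For example, Gitaly's gRPC instrumentation exposes the standard `grpc_server_handled_total` counter, so a starting-point query in Explore might look like the following sketch (the `type="gitaly"` label matcher is an assumption; adjust to the labels your environment exposes):

```promql
# Per-RPC completion rate over the last 5 minutes, grouped by method.
sum by (grpc_method) (
  rate(grpc_server_handled_total{type="gitaly"}[5m])
)
```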
Unfortunately, we don't have a curated list of all Gitaly metrics.
- Node-level or environmental metrics. Those metrics are powered by other systems that host the Gitaly process. They are not exposed by Gitaly but are very useful, for example: CPU metrics, memory metrics, or cgroup metrics.
- Gitaly-specific metrics. Those metrics are accounted for directly in the code. Typically, they have `gitaly_` prefixes.
- Aggregated metrics, such as combinations of different metrics or downsized metrics that work around high-cardinality issues. Gitaly's aggregated metrics are listed [in this file](https://gitlab.com/gitlab-com/runbooks/-/blob/master/mimir-rules/gitlab-gprd/gitaly/gitaly.yml).
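Per the second category above, Gitaly-specific metrics carry the `gitaly_` prefix. Assuming a node where Gitaly's Prometheus listener is on its Omnibus default address (`localhost:9236`; yours may be configured differently), you can list them straight from the exporter:

```shell
# List the distinct gitaly_* metric names exposed by the local exporter.
# The port is the Omnibus default prometheus_listen_addr; adjust as needed.
curl -s http://localhost:9236/metrics \
  | grep -E '^gitaly_' \
  | cut -d'{' -f1 | cut -d' ' -f1 \
  | sort -u
```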
![Gitaly Debug Metric Lists](gitaly-debug-list-metrics.png)
- [hyperfine](https://github.com/sharkdp/hyperfine): a command-line benchmarking tool that measures a command's runtime over repeated runs
- hyperfine can be used together with grpcurl to check the response time of a gRPC call
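A hedged sketch of that combination follows. The Gitaly address, the `gitaly.ServerService/ServerInfo` RPC, and the auth header are assumptions to adapt for your node; in particular, Gitaly's real auth tokens are HMAC-based, so the bearer value shown is illustrative only:

```shell
# Benchmark a single Gitaly RPC end-to-end: hyperfine repeats the grpcurl
# call and reports mean/min/max latency. Address, token, and RPC name are
# illustrative; adjust them for your environment.
GITALY_ADDR=localhost:8075

hyperfine --warmup 3 \
  "grpcurl -plaintext \
    -H 'authorization: Bearer <token>' \
    $GITALY_ADDR gitaly.ServerService/ServerInfo"
```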
### strace
Attach `strace(1)` to a running Gitaly process:
```shell
strace -fttTyyy -s 1024 -o /path/to/output -p $(pgrep -fd, gitaly)
```
Or wrap the binary so it is easy to strace, especially if it then spawns more processes (the example below wraps `gitlab-shell` after renaming the original to `gitlab-shell-orig`):
```shell
#!/bin/sh
echo "$(date) $PPID $*" >> /tmp/gitlab-shell.txt
exec /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-orig "$@"
# strace -fttTyyy -s 1024 -o /tmp/sshd_trace-$PPID /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-orig
```
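The wrapper pattern can be exercised locally before touching a production binary. A self-contained sketch, where all paths are throwaway stand-ins rather than real GitLab paths:

```shell
#!/bin/sh
# Demo of the wrapper pattern: replace a binary with a shim that logs
# every invocation, then exec's the renamed original.
set -e
dir=$(mktemp -d)

# Stand-in for the renamed original binary (like gitlab-shell-orig).
cat > "$dir/tool-orig" <<'EOF'
#!/bin/sh
echo "original ran with: $*"
EOF
chmod +x "$dir/tool-orig"

# The wrapper, installed under the original name: log, then hand off.
cat > "$dir/tool" <<EOF
#!/bin/sh
echo "\$(date) \$PPID \$*" >> "$dir/invocations.log"
exec "$dir/tool-orig" "\$@"
EOF
chmod +x "$dir/tool"

"$dir/tool" upload-pack myrepo   # prints: original ran with: upload-pack myrepo
cat "$dir/invocations.log"       # shows the logged invocation
```

Because the wrapper ends with `exec`, the original binary keeps the same PID and exit code, so callers are unaffected.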
[strace-parser](https://gitlab.com/gitlab-com/support/toolbox/strace-parser) is useful for making the results more readable.
## Log analysis
Kibana (Elastic) Dashboards
## Capacity management
The Gitaly team is responsible for maintaining reasonable serving capacity for GitLab.com.
We get alerts from Tamland when capacity runs low; see [this issue comment](https://gitlab.com/gitlab-com/gl-infra/capacity-planning-trackers/gitlab-com/-/issues/1666#note_1786916965) for an example.
[Capacity planning](../../../team/scalability/observability/capacity_planning/) documentation explains how this works in general.