
doc: Debugging Gitaly

Merged Andras Horvath requested to merge ahorvath-debugdoc-20240628 into main
A Gitaly dashboard could be either auto-generated or manually drafted.
A standardized dashboard should have a top-level section containing environment filters, node filters, and useful annotations such as feature flag activities, deployments, etc. Some dashboards have an interlinked system that connects Grafana and Kibana with a single click.
Such dashboards usually include two parts. The second half contains panels of custom metrics collected from Gitaly. The first half is more complicated: it contains GitLab-wide indicators that tell whether Gitaly is "healthy", plus node-level resource metrics. The aggregation and calculation are sophisticated. In summary, those dashboards tell us whether Gitaly performs well according to predefined [thresholds](https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/services/gitaly.jsonnet). Contact the [Scalability:Observability Team](../../../team/scalability/observability/) with any questions.
![Gitaly Debug Indicators](gitaly-debug-indicators.png)
## Gitaly's Prometheus metrics
A panel in a dashboard is a visualization of the aggregated version of underlying metrics. We use [Prometheus](https://prometheus.io/docs/introduction/overview/) to collect metrics. To simplify, the Gitaly server exposes an HTTP server ([code](https://gitlab.com/gitlab-org/gitaly/-/blob/master/internal/cli/gitaly/serve.go#L514)) that allows Prometheus instances to fetch metrics periodically.
In a dashboard, you can click the top-right hamburger button and choose "Explore" to access the underlying metrics. Or you can use [the Explore page](https://dashboards.gitlab.net/explore) to experiment with metrics.
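For example, Gitaly's gRPC instrumentation exposes the standard `grpc_server_handled_total` counter, so a starting-point query in Explore might look like the following sketch (the `type="gitaly"` label matcher is an assumption; adjust to the labels your environment exposes):

```promql
# Per-RPC completion rate over the last 5 minutes, grouped by method.
sum by (grpc_method) (
  rate(grpc_server_handled_total{type="gitaly"}[5m])
)
```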
Unfortunately, we don't have a curated list of all Gitaly metrics.
- Node-level or environmental metrics. Those metrics are powered by other systems that host the Gitaly process. They are not exposed by Gitaly but are very useful, for example: CPU metrics, memory metrics, or cgroup metrics.
- Gitaly-specific metrics. Those metrics are accounted for directly in the code. Typically, they have `gitaly_` prefixes.
- Aggregated metrics, such as combinations of different metrics or downsized metrics that work around high-cardinality issues. Gitaly's aggregated metrics are listed [in this file](https://gitlab.com/gitlab-com/runbooks/-/blob/master/mimir-rules/gitlab-gprd/gitaly/gitaly.yml).
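Per the second category above, Gitaly-specific metrics carry the `gitaly_` prefix. Assuming a node where Gitaly's Prometheus listener is on its Omnibus default address (`localhost:9236`; yours may be configured differently), you can list them straight from the exporter:

```shell
# List the distinct gitaly_* metric names exposed by the local exporter.
# The port is the Omnibus default prometheus_listen_addr; adjust as needed.
curl -s http://localhost:9236/metrics \
  | grep -E '^gitaly_' \
  | cut -d'{' -f1 | cut -d' ' -f1 \
  | sort -u
```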
![Gitaly Debug Metric Lists](gitaly-debug-list-metrics.png)
- [hyperfine](https://github.com/sharkdp/hyperfine): a command-line benchmarking tool that measures a command's runtime over repeated runs
- hyperfine can be used together with grpcurl to check the response time of a gRPC call
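A hedged sketch of that combination follows. The Gitaly address, the `gitaly.ServerService/ServerInfo` RPC, and the auth header are assumptions to adapt for your node; in particular, Gitaly's real auth tokens are HMAC-based, so the bearer value shown is illustrative only:

```shell
# Benchmark a single Gitaly RPC end-to-end: hyperfine repeats the grpcurl
# call and reports mean/min/max latency. Address, token, and RPC name are
# illustrative; adjust them for your environment.
GITALY_ADDR=localhost:8075

hyperfine --warmup 3 \
  "grpcurl -plaintext \
    -H 'authorization: Bearer <token>' \
    $GITALY_ADDR gitaly.ServerService/ServerInfo"
```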
### strace
Attach `strace(1)` to a running Gitaly process:
```shell
strace -fttTyyy -s 1024 -o /path/to/output -p $(pgrep -fd, gitaly)
```
Or wrap the binary so it is easy to strace, especially if it then spawns more processes (the example below wraps `gitlab-shell` after renaming the original to `gitlab-shell-orig`):
```shell
#!/bin/sh
echo "$(date) $PPID $*" >> /tmp/gitlab-shell.txt
exec /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-orig "$@"
# strace -fttTyyy -s 1024 -o /tmp/sshd_trace-$PPID /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-orig
```
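The wrapper pattern can be exercised locally before touching a production binary. A self-contained sketch, where all paths are throwaway stand-ins rather than real GitLab paths:

```shell
#!/bin/sh
# Demo of the wrapper pattern: replace a binary with a shim that logs
# every invocation, then exec's the renamed original.
set -e
dir=$(mktemp -d)

# Stand-in for the renamed original binary (like gitlab-shell-orig).
cat > "$dir/tool-orig" <<'EOF'
#!/bin/sh
echo "original ran with: $*"
EOF
chmod +x "$dir/tool-orig"

# The wrapper, installed under the original name: log, then hand off.
cat > "$dir/tool" <<EOF
#!/bin/sh
echo "\$(date) \$PPID \$*" >> "$dir/invocations.log"
exec "$dir/tool-orig" "\$@"
EOF
chmod +x "$dir/tool"

"$dir/tool" upload-pack myrepo   # prints: original ran with: upload-pack myrepo
cat "$dir/invocations.log"       # shows the logged invocation
```

Because the wrapper ends with `exec`, the original binary keeps the same PID and exit code, so callers are unaffected.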
[strace-parser](https://gitlab.com/gitlab-com/support/toolbox/strace-parser) is useful for making the results more readable.
## Log analysis
Kibana (Elastic) Dashboards
## Capacity management
The Gitaly team is responsible for maintaining reasonable serving capacity for GitLab.com.
We get alerts from Tamland when capacity runs low; see [this issue comment](https://gitlab.com/gitlab-com/gl-infra/capacity-planning-trackers/gitlab-com/-/issues/1666#note_1786916965) for an example.
[Capacity planning](../../../team/scalability/observability/capacity_planning/) documentation explains how this works in general.