diff --git a/.gitlab/CODEOWNERS b/.gitlab/CODEOWNERS index dd67c0fbbc2a119e6be311f4bc2201c6580a306e..c29e0b6aeacf12dda6f628b408a45388439c4063 100644 --- a/.gitlab/CODEOWNERS +++ b/.gitlab/CODEOWNERS @@ -190,7 +190,7 @@ /content/handbook/engineering/infrastructure/core-platform/systems/cloud-connector/ @pjphillips /content/handbook/engineering/infrastructure/core-platform/systems/distribution/ @cdu1 @plu8 /content/handbook/engineering/infrastructure/core-platform/systems/geo/ @luciezhao -/content/handbook/engineering/infrastructure/core-platform/systems/gitaly.md @andrashorvath @jcaigitlab +/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/ @andrashorvath @jcaigitlab /content/handbook/engineering/infrastructure/emergency-change-processes.md @marin @meks /content/handbook/engineering/infrastructure/incident-management/ @dawsmith @kwanyangu @marin /content/handbook/engineering/infrastructure/performance-indicators.md @cdu1 @marin @meks @vincywilson @rymai diff --git a/content/handbook/engineering/infrastructure/core-platform/systems/gitaly.md b/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/_index.md similarity index 99% rename from content/handbook/engineering/infrastructure/core-platform/systems/gitaly.md rename to content/handbook/engineering/infrastructure/core-platform/systems/gitaly/_index.md index 6639329ca47a0e3f029775eec140c4174cb8cff8..52ea404d0ee4ac88965870995f2bceadebdd804a 100644 --- a/content/handbook/engineering/infrastructure/core-platform/systems/gitaly.md +++ b/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/_index.md @@ -360,6 +360,7 @@ after the 18th that issue will be used to discuss what was brought up. 
### Useful links +- [Debugging Gitaly](debug.md) - [Actual pending Infradev issues](https://10az.online.tableau.com/#/site/gitlab/views/DraftInfrastructureEmbeddedDashboard/InfrastructureEmbeddedDashboard?:iid=1) (sort by group, focus on gitaly) - [Out of SLO Infradev issues](https://10az.online.tableau.com/#/site/gitlab/workbooks/2219735/views) - [Error budget](https://dashboards.gitlab.net/d/stage-groups-detail-gitaly/stage-groups-gitaly-group-error-budget-detail?orgId=1&from=now-28d%2Fm&to=now) diff --git a/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/debug.md b/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/debug.md new file mode 100644 index 0000000000000000000000000000000000000000..f10a9c3c1a82962bc80f81629dd99d7480c16326 --- /dev/null +++ b/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/debug.md @@ -0,0 +1,156 @@ +--- +title: "Debugging the Gitaly service" +--- + +## About this document + +This document is intended for **Gitaly engineers**, to become familiar with GitLab's production layout and gain the ability to effectively debug production problems. While the focus is on SaaS, many of the skills transfer also to debugging self-managed instances. 
+
+## Generic GitLab background
+
+Skim or read the following, starting with the overview and then focusing on Gitaly:
+
+- [Production Architecture](../../../infrastructure/production/architecture/)
+- [Monitoring](../../../../engineering/monitoring/#monitoring)
+
+Other useful links:
+
+- [Product sections, stages, groups, and categories](../../../../../product/categories/)
+- [Features by Group](../../../../../product/categories/features/)
+
+### Gitaly-specific background
+
+- Familiarize yourself with Gitaly's [README](https://gitlab.com/gitlab-org/gitaly/-/blob/master/README.md?ref_type=heads)
+- Take a look at the [SRE runbooks](https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/gitaly)
+
+### Gitaly in Production
+
+Both `gitlab.com` and Dedicated use Gitaly in "sharded" mode, that is, without Praefect (Gitaly Cluster).
+
+## Monitoring dashboards
+
+We have some useful pre-built monitoring dashboards on GitLab's internal Grafana instance. All dashboards are listed in [this folder](https://dashboards.gitlab.net/dashboards/f/gitaly/gitaly-service). Note that some of them are fairly outdated.
+
+The following dashboards are the most commonly used:
+
+- [Gitaly: Overview](https://dashboards.gitlab.net/d/gitaly-main/gitaly3a-overview?orgId=1&var-PROMETHEUS_DS=default&var-environment=gprd&var-stage=main). This dashboard contains cluster-wide aggregated metrics. It is used to determine the overall health of the cluster and makes it easy to spot any outlier node.
+- [Gitaly: Host details](https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly3a-host-detail?orgId=1). This dashboard contains more detailed metrics for a particular node.
+- [Gitaly Housekeeping statistics](https://dashboards.gitlab.net/d/Z2xwZIP7k/gitaly-housekeeping-statistics?orgId=1&refresh=5m). This dashboard shows detailed operational information about the [Gitaly housekeeping feature](https://docs.gitlab.com/ee/administration/housekeeping.html).
+- [Gitaly: Rebalance dashboard](https://dashboards.gitlab.net/d/gitaly-rebalancing/gitaly3a-rebalance-dashboard?from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=default&var-environment=gprd&var-fqdn=gitaly-cny-01-stor-gprd.c.gitlab-production.internal&orgId=1). This dashboard shows the relative balance between Gitaly nodes. It is used to determine when we need to move repositories from one node to others.
+
+A Gitaly dashboard can be either auto-generated or written by hand. We use Jsonnet (a superset of JSON) to achieve dashboards-as-code. The definitions of such dashboards are located [in this folder](https://gitlab.com/gitlab-com/runbooks/-/tree/master/dashboards/gitaly?ref_type=heads). This is now the recommended way to manage an observability dashboard. It allows us to use GitLab's built-in libraries, resulting in a highly standardized dashboard.
+
+A standardized dashboard should have a top-level section containing environment filters, node filters, and useful annotations such as feature flag activities and deployments. Some dashboards are interlinked so that you can jump between Grafana and Kibana with a single click.
+
+Such dashboards usually have two parts. The first half is the more complicated one: it contains GitLab-wide indicators that tell whether Gitaly is "healthy", plus node-level resource metrics. The aggregation and calculation behind it are sophisticated. The second half contains panels of custom metrics collected from Gitaly. In summary, those dashboards tell us whether Gitaly performs well according to predefined [thresholds](https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/services/gitaly.jsonnet). Contact the [Scalability:Observability Team](../../../team/scalability/observability/) with any questions.
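To make "an indicator checked against a predefined threshold" concrete, here is a minimal Go sketch. It assumes the indicator is a plain error ratio (failed requests divided by total requests). The `Indicator` type, the `example_rpc` name, and the `0.001` limit are hypothetical and are not taken from the metrics catalog, whose real aggregation is more involved.

```go
package main

import "fmt"

// Indicator is a hypothetical, simplified service-level indicator:
// failed requests out of total requests over some time window.
type Indicator struct {
	Name   string
	Failed float64
	Total  float64
}

// ErrorRatio returns the fraction of failed requests,
// or 0 when there was no traffic in the window.
func (i Indicator) ErrorRatio() float64 {
	if i.Total == 0 {
		return 0
	}
	return i.Failed / i.Total
}

// Healthy reports whether the error ratio is at or below the allowed maximum.
func (i Indicator) Healthy(maxErrorRatio float64) bool {
	return i.ErrorRatio() <= maxErrorRatio
}

func main() {
	// Illustrative limit only; this is not Gitaly's real threshold.
	const maxErrorRatio = 0.001

	sli := Indicator{Name: "example_rpc", Failed: 3, Total: 10000}
	fmt.Printf("%s error ratio=%.4f healthy=%t\n",
		sli.Name, sli.ErrorRatio(), sli.Healthy(maxErrorRatio))
}
```

The real dashboards evaluate such comparisons continuously over Prometheus data rather than on a single pair of counts; the sketch only shows the shape of the check.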
+
+![Gitaly Debug Indicators](gitaly-debug-indicators.png)
+
+Some examples of using built-in dashboards to investigate production issues, from an engineer's point of view:
+
+- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18156#note_1965772736
+- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/15980#note_1457815084
+- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/23532#note_1374642198
+
+## Gitaly's Prometheus metrics
+
+A panel in a dashboard is a visualization of aggregated underlying metrics. We use [Prometheus](https://prometheus.io/docs/introduction/overview/) to collect metrics. Put simply, the Gitaly server exposes an HTTP server ([code](https://gitlab.com/gitlab-org/gitaly/-/blob/master/internal/cli/gitaly/serve.go#L514)) that allows Prometheus instances to fetch metrics periodically.
+
+In a dashboard, you can click the top-right hamburger button and choose "Explore" to get access to the underlying metrics. Alternatively, use [the Explore page](https://dashboards.gitlab.net/explore) to experiment with metrics.
+
+![Gitaly Debug Explore](gitaly-debug-explore.png)
+
+Unfortunately, we don't have a curated list of all Gitaly metrics and their definitions, so you might need to look up definitions in multiple places.
Here is [the list of all Gitaly-related metrics](https://dashboards.gitlab.net/explore?schemaVersion=1&panes=%7B%22pum%22%3A%7B%22datasource%22%3A%22mimir-gitlab-gprd%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22group+by%28__name__%29+%28%7B__name__%3D%7E%5C%22.*gitaly.*%5C%22%2C+job%21%3D%5C%22prometheus%5C%22%7D%29%22%2C%22range%22%3Atrue%2C%22instant%22%3Atrue%2C%22datasource%22%3A%7B%22type%22%3A%22prometheus%22%2C%22uid%22%3A%22mimir-gitlab-gprd%22%7D%2C%22editorMode%22%3A%22code%22%2C%22legendFormat%22%3A%22__auto%22%7D%2C%7B%22refId%22%3A%22B%22%2C%22expr%22%3A%22group+by%28__name__%29+%28%7Btype%3D%5C%22gitaly%5C%22%2C+job%21%3D%5C%22prometheus%5C%22%7D%29%22%2C%22range%22%3Atrue%2C%22instant%22%3Atrue%2C%22datasource%22%3A%7B%22type%22%3A%22prometheus%22%2C%22uid%22%3A%22mimir-gitlab-gprd%22%7D%2C%22editorMode%22%3A%22code%22%2C%22legendFormat%22%3A%22__auto%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-1h%22%2C%22to%22%3A%22now%22%7D%7D%7D&orgId=1). The metrics come from several sources:
+
+- Node-level or environmental metrics. These are produced by the systems that host the Gitaly process rather than by Gitaly itself, but they are very useful. Examples include CPU, memory, and cgroup metrics.
+- Gitaly-specific metrics. These are instrumented directly in the Gitaly code. Typically, they have a `gitaly_` prefix.
+- Aggregated metrics, which combine different metrics or reduce metrics that have high cardinality. Gitaly's aggregated metrics are listed [in this file](https://gitlab.com/gitlab-com/runbooks/-/blob/master/mimir-rules/gitlab-gprd/gitaly/gitaly.yml).
+
+![Gitaly Debug Metric Lists](gitaly-debug-list-metrics.png)
+
+In the code, you'll see something like the following. Any registered metric is available when Prometheus scrapes the endpoint. By tracing where these metrics are registered and updated, you can find how Gitaly-specific metrics are used.
+
+```go
+repoCounter := counter.NewRepositoryCounter(cfg.Storages)
+prometheus.MustRegister(repoCounter)
+
+packObjectsServedBytes = promauto.NewCounter(prometheus.CounterOpts{
+	Name: "gitaly_pack_objects_served_bytes_total",
+	Help: "Number of bytes of git-pack-objects data served to clients",
+})
+```
+
+A metric has a set of labels. GitLab adds the following labels to all metrics:
+
+- `env` or `environment`: the environment, for example `gprd`, `gstg`, or `ops`.
+- `fqdn`: the fully qualified domain name. Because Gitaly currently runs on VMs, this label identifies the hosting node.
+- `region` and `zone`: the region and zone of the node.
+- `stage`: the current stage of the process, either `main` or `cny`.
+- `service`/`type`: for Gitaly, it's always `gitaly`.
+
+In the future, when Gitaly runs on K8s, we will probably have more K8s-specific labels.
+
+Queries are written in the [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/) language. Some examples:
+
+- [Calculate the rate (ops/s) of pack-refs housekeeping task by node](https://dashboards.gitlab.net/explore?schemaVersion=1&panes=%7B%22xxn%22:%7B%22datasource%22:%22PA258B30F88C30650%22,%22queries%22:%5B%7B%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22PA258B30F88C30650%22%7D,%22exemplar%22:true,%22expr%22:%22sum%28rate%28gitaly_housekeeping_tasks_total%7Benvironment%3D%5C%22gprd%5C%22,%20housekeeping_task%3D%5C%22packed_refs%5C%22%7D%5B$__rate_interval%5D%29%29%20by%20%28fqdn%29%20%3E%200%22,%22hide%22:false,%22interval%22:%22%22,%22legendFormat%22:%22%7B%7Bhousekeeping_task%7D%7D%22,%22refId%22:%22B%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-6h%22,%22to%22:%22now%22%7D%7D%7D&orgId=1).
+- [Calculate the pack-objects and RPC requests dropped due to limiting in the last 2 days](https://dashboards.gitlab.net/explore?schemaVersion=1&panes=%7B%22rmc%22:%7B%22datasource%22:%22mimir-gitlab-gprd%22,%22queries%22:%5B%7B%22expr%22:%22sum%28rate%28gitaly_pack_objects_dropped_total%7Benv%3D%5C%22gprd%5C%22,environment%3D%5C%22gprd%5C%22,type%3D%5C%22gitaly%5C%22%7D%5B$__rate_interval%5D%29%29%20by%20%28fqdn,%20reason%29%20%3E%200%5Cn%22,%22format%22:%22time_series%22,%22interval%22:%22$__interval%22,%22intervalFactor%22:1,%22legendFormat%22:%22Pack-objects%20%7B%7Bfqdn%7D%7D%20%7B%7Breason%7D%7D%22,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22mimir-gitlab-gprd%22%7D,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D,%7B%22refId%22:%22B%22,%22expr%22:%22sum%28rate%28gitaly_requests_dropped_total%7Benv%3D%5C%22gprd%5C%22,environment%3D%5C%22gprd%5C%22,type%3D%5C%22gitaly%5C%22%7D%5B$__rate_interval%5D%29%29%20by%20%28fqdn,%20reason%29%20%3E%200%22,%22range%22:true,%22instant%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22mimir-gitlab-gprd%22%7D,%22editorMode%22:%22code%22,%22legendFormat%22:%22Requests%20%7B%7Bfqdn%7D%7D%20%7B%7Breason%7D%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-2d%22,%22to%22:%22now%22%7D%7D%7D&orgId=1)
+- [Calculate in-flight commands on a gitaly-cny node](https://dashboards.gitlab.net/explore?schemaVersion=1&panes=%7B%22imy%22:%7B%22datasource%22:%22mimir-gitlab-gprd%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22gitaly_commands_running%7Benv%3D%5C%22gprd%5C%22,%20fqdn%3D%5C%22gitaly-cny-01-stor-gprd.c.gitlab-production.internal%5C%22%7D%22,%22range%22:true,%22instant%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22mimir-gitlab-gprd%22%7D,%22editorMode%22:%22code%22,%22legendFormat%22:%22__auto%22%7D%5D,%22range%22:%7B%22from%22:%22now-30d%22,%22to%22:%22now%22%7D%7D%7D&orgId=1). As you can see, there was a peak on 2024-06-17.
That was when [this incident](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18156) occurred.
+
+## Debugging and performance testing tools
+
+- [grpcurl](https://github.com/fullstorydev/grpcurl): a `curl`-like tool for gRPC
+- [grpcui](https://github.com/fullstorydev/grpcui): a lightweight `postman`-like tool for gRPC
+- [hyperfine](https://github.com/sharkdp/hyperfine): a command-line benchmarking tool
+  - hyperfine can be used together with grpcurl to check the response time of a gRPC call
+
+### strace
+
+`strace(1)` a gitaly process:
+
+```shell
+strace -fttTyyy -s 1024 -o /path/filename -p "$(pgrep -fd, gitaly)"
+```
+
+Or wrap a process in a script to make it easier to strace, especially if it then spawns more processes. To trace, swap the `exec` line for the commented-out `strace` line:
+
+```shell
+#!/bin/sh
+echo "$(date) $PPID $*" >> /tmp/gitlab-shell.txt
+exec /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-orig "$@"
+# strace -fttTyyy -s 1024 -o /tmp/sshd_trace-$PPID /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-orig "$@"
+```
+
+The [`strace` parser](https://gitlab.com/gitlab-com/support/toolbox/strace-parser) makes the results more readable.
+
+### fast-stats
+
+[fast-stats](https://gitlab.com/gitlab-com/support/toolbox/fast-stats) is a useful tool developed by Support to quickly pull statistics from GitLab logs.
+
+#### Examples
+
+To find the top methods called within a 60-minute interval in the Gitaly logs:
+
+```shell
+fast-stats --interval 60m --limit 1 var/log/gitlab/gitaly/current
+```
+
+To find the top 10 users, projects, and clients by duration calling a given method (here `PostUploadPackWithSidechannel`):
+
+```shell
+grep PostUploadPackWithSidechannel var/log/gitlab/gitaly/current | ~/bin/fast-stats --interval 60m top
+```
+
+## Log analysis
+
+Kibana (Elastic) dashboards:
+
+- [gstg](https://nonprod-log.gitlab.net/app/r/s/J0jWx)
+- [gprd](https://log.gprd.gitlab.net/app/r/s/XuXAI)
+
+## Capacity management
+
+The Gitaly team is responsible for maintaining reasonable serving capacity for gitlab.com.
+
+We get alerts from Tamland if capacity runs low. For an example, see [this issue comment](https://gitlab.com/gitlab-com/gl-infra/capacity-planning-trackers/gitlab-com/-/issues/1666#note_1786916965).
+
+The [Capacity planning](../../../team/scalability/observability/capacity_planning/) documentation explains how this works in general.
diff --git a/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/gitaly-debug-explore.png b/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/gitaly-debug-explore.png
new file mode 100644
index 0000000000000000000000000000000000000000..be15a45426c79005dd2c450726f89960c1189c83
Binary files /dev/null and b/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/gitaly-debug-explore.png differ
diff --git a/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/gitaly-debug-indicators.png b/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/gitaly-debug-indicators.png
new file mode 100644
index 0000000000000000000000000000000000000000..803b26a9939856654bbc6a1e0df2c3b9e982a78a
Binary files /dev/null and b/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/gitaly-debug-indicators.png differ
diff --git a/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/gitaly-debug-list-metrics.png b/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/gitaly-debug-list-metrics.png
new file mode 100644
index 0000000000000000000000000000000000000000..71fcffc7f1b3d18abef4424f5d4bf58a283848a3
Binary files /dev/null and b/content/handbook/engineering/infrastructure/core-platform/systems/gitaly/gitaly-debug-list-metrics.png differ