Tracing: Have a deeper look into Git process with trace2
Gitaly offloads a lot of complicated operations to git processes. When something goes wrong, it's quite hard to debug deeper into Git processes. In local environment, it's easy to plug some performance tools to measure how git process performs. It's not the case for the production environment. We can use some exposed logs, metrics, or similar statistical data. I wish we had a full descriptive picture, step by step, of an action. Distributed tracing may solve this problem. However, it has two problems:
- Most spans stop at git-spawning steps.
- Tracing captures by sampling (this can be solved in #4808 (closed))
Git bundles a tool called Trace2. This tool enables Git processes to print underlying significant events and corresponding measurements. It collects the activities of child processes as well. Obviously, it adds some overhead to git processes and maybe some risks when passing the events in a fd. It may not very wise to enable Trace2 for all Git processes. However, it's a perfect fits for distributed tracing. We are sampling data at a very low rate. The exported events are also convenient to transform into Tracing spans.
So, the initial idea is to enable Trace2 for sampled trace only (blocked by Tracing: Add span's sampling status check to la... (#4805 - closed)). Apart from being used by distributed tracing, trace2 is also useful to export some underlying metrics, such as pack-objects (#4699 (closed)). To serve more purposes in the future, it makes sense to convert the trace2 events to an internal representation. This data structure should provide some extension hooks.
Example of trace2 output:
16:59:25.088843 common-main.c:50 | d1 | main | version | | | | | 2.39.1
16:59:25.089198 common-main.c:51 | d1 | main | start | | 0.003062 | | | git fetch origin master
16:59:25.089396 repository.c:136 | d1 | main | def_repo | r1 | | | | worktree:/Users/qmnguyen/www/gitlab-development-kit/gitaly
16:59:25.089532 git.c:461 | d1 | main | cmd_name | | | | | fetch (fetch)
16:59:25.089738 read-cache.c:2478 | d1 | main | region_enter | r1 | 0.003603 | | index | label:do_read_index .git/index
16:59:25.089858 cache-tree.c:628 | d1 | main | region_enter | r1 | 0.003724 | | cache_tree | ..label:read
16:59:25.089898 cache-tree.c:630 | d1 | main | region_leave | r1 | 0.003764 | 0.000040 | cache_tree | ..label:read
16:59:25.089907 read-cache.c:2424 | d1 | main | data | r1 | 0.003772 | 0.000169 | index | ..read/version:2
16:59:25.089911 read-cache.c:2426 | d1 | main | data | r1 | 0.003777 | 0.000174 | index | ..read/cache_nr:1579
16:59:25.089915 read-cache.c:2483 | d1 | main | region_leave | r1 | 0.003781 | 0.000178 | index | label:do_read_index .git/index
16:59:25.091633 builtin/fetch.c:1646 | d1 | main | region_enter | r1 | 0.005499 | | fetch | label:remote_refs
16:59:25.091659 run-command.c:722 | d1 | main | child_start | | 0.005523 | | | ..[ch0] class:transport/ssh argv:[ssh -o SendEnv=GIT_PROTOCOL git@gitlab.com 'git-upload-pack '\''gitlab-org/gitaly.git'\''']
16:59:28.492940 connect.c:167 | d1 | main | data | | 3.406799 | 3.401300 | transfer | ..negotiated-version:2
16:59:29.287857 builtin/fetch.c:1649 | d1 | main | region_leave | r1 | 4.201707 | 4.196208 | fetch | label:remote_refs
16:59:29.310263 run-command.c:722 | d1 | main | child_start | | 4.224124 | | | [ch1] class:? argv:[git rev-list --objects --stdin --not --all --quiet --alternate-refs]
16:59:29.556997 run-command.c:979 | d1 | main | child_exit | | 4.470853 | 0.246729 | | [ch1] pid:86460 code:0
16:59:29.557046 builtin/fetch.c:1353 | d1 | main | region_enter | r1 | 4.470911 | | fetch | label:consume_refs
From gitlab.com:gitlab-org/gitaly
* branch master -> FETCH_HEAD
16:59:29.557774 builtin/fetch.c:1357 | d1 | main | region_leave | r1 | 4.471639 | 0.000728 | fetch | label:consume_refs
16:59:29.887522 run-command.c:979 | d1 | main | child_exit | | 4.801369 | 4.795846 | | [ch0] pid:86455 code:0
16:59:29.889130 run-command.c:1753 | d1 | main | region_enter | | 4.802994 | | submodule | label:parallel/fetch max:1
16:59:29.889251 run-command.c:1790 | d1 | main | region_leave | | 4.803116 | 0.000122 | submodule | label:parallel/fetch
16:59:29.889282 run-command.c:722 | d1 | main | child_start | | 4.803144 | | | [ch2] class:? argv:[git maintenance run --auto --no-quiet]
16:59:29.911777 common-main.c:50 | d2 | main | version | | | | | 2.39.1
16:59:29.912529 common-main.c:51 | d2 | main | start | | 0.010722 | | | /opt/homebrew/Cellar/git/2.39.1/libexec/git-core/git maintenance run --auto --no-quiet
16:59:29.912827 repository.c:136 | d2 | main | def_repo | r1 | | | | worktree:/Users/qmnguyen/www/gitlab-development-kit/gitaly
16:59:29.912989 git.c:461 | d2 | main | cmd_name | | | | | maintenance (fetch/maintenance)
16:59:29.913700 git.c:721 | d2 | main | exit | | 0.011896 | | | code:0
16:59:29.913708 trace2/tr2_tgt_perf.c:216 | d2 | main | atexit | | 0.011903 | | | code:0
16:59:29.914019 run-command.c:979 | d1 | main | child_exit | | 4.827880 | 0.024736 | | [ch2] pid:86462 code:0
16:59:29.914067 git.c:721 | d1 | main | exit | | 4.827933 | | | code:0
16:59:29.914080 trace2/tr2_tgt_perf.c:216 | d1 | main | atexit | | 4.827945 | | | code:0