2019-07-24-three-new-support-tools.html.md.erb 19.9 KB
Newer Older
Sara Kassabian's avatar
Sara Kassabian committed
1
---
2
title: "We've open sourced 3 tools to help troubleshoot system performance"
Sara Kassabian's avatar
Sara Kassabian committed
3 4 5
author: Sara Kassabian, Will Chandler
categories: engineering
image_title: '/images/blogimages/open_source_tools.jpg'
6
description: "Say hello to the open source tools our Support team is using to better summarize customer performance data  and find out how they can help you."
Sara Kassabian's avatar
Sara Kassabian committed
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
tags: open source, features, inside GitLab
twitter_text: "The @gitlab Support team gets more efficient with three new open source tools built to enhance GitLab performance"
featured: yes
postType: content marketing
ee_cta: false
---

Our self-managed customers often encounter issues related to performance, or the time it takes to execute something. In the past, the [Support team](/handbook/support/) had to pull data from disparate sources and cobble it together in order to analyze performance-related issues.

“We’re dealing with someone else’s computer on support, so we have to be able to handle environments with limited observability,” says [Will Chandler](/company/team/#wchandler), senior support engineer. “We’re at the mercy of their infrastructure. That’s why the team has made tools to reduce the friction.”

“With [GitLab.com](/pricing/), we have all of this fancy tooling that helps us collect performance data,” says [Lee Matos](/company/team/#leematos), support engineering manager. “But when we’re working with customers, we need to be ready to bring lightweight tools that don’t require a lot of setup that we can use based on what they have in place.”

The Support team is working on becoming more data driven by using three new tools designed to aggregate and summarize performance data for self-managed customers. A focus on data-driven decision-making improves the customer relationship and demonstrates our commitment to making performance a key feature of GitLab.

Rebecca Dodd's avatar
Rebecca Dodd committed
22
We'll look at three open source tools created by GitLab Self-Managed Support. Strace parser is a general tool that could be of use to anyone, while JSON Stats and GitLabSOS are tailored to GitLab, but could be easily modified.
23

24
## 1. [Strace parser](https://gitlab.com/gitlab-com/support/toolbox/strace-parser)
Sara Kassabian's avatar
Sara Kassabian committed
25

26 27
[Strace](https://gitlab.com/strace/strace) is a commonly used debugging and diagnostic tool in Linux that captures information about what’s happening inside processes running on our customers’ environments.

28
Unlike [newer](http://man7.org/linux/man-pages/man1/perf.1.html) and [more powerful](https://github.com/iovisor/bpftrace) tracing tools, strace adds [significant overhead to a process](http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html). However, strace is generally available even on very old versions of Linux.
Sara Kassabian's avatar
Sara Kassabian committed
29

Rebecca Dodd's avatar
Rebecca Dodd committed
30
An strace of a single-threaded program is linear, but following the threads of execution quickly gets difficult when there are many processes being captured. At GitLab Support we are typically tracing [Unicorn](https://bogomips.org/unicorn/) workers or [Gitaly](https://gitlab.com/gitlab-org/gitaly), which are highly concurrent, resulting in hundreds of process IDs being traced and hundreds of thousands of lines of output from traces only a few seconds long.
Sara Kassabian's avatar
Sara Kassabian committed
31

32
Will built [strace parser](https://gitlab.com/gitlab-com/support/toolbox/strace-parser) for these types of use cases. Strace parser summarizes the most meaningful processing data delivered by an strace in a more accessible format, allowing users to find the critical section sections of the data quickly.
Sara Kassabian's avatar
Sara Kassabian committed
33

34
The next two examples are from a GitLab customer that was using a very slow file system to host their .gitconfig file, which was a major performance bottleneck. But it was not immediately clear what was happening from the perspective of a user trying to troubleshoot. By running an strace on Gitaly, we were able to get a better understanding of why the system was so slow.
Sara Kassabian's avatar
Sara Kassabian committed
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

```
3694  13:45:06.207369 clock_gettime(CLOCK_MONOTONIC, {3016230, 201254200}) = 0 <0.000015>
3694  13:45:06.207409 futex(0x7f645bb49664, FUTEX_WAIT_BITSET_PRIVATE, 192398, {3016230, 299906871}, ffffffff <unfinished ...>
3542  13:45:06.209616 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.005236>
3542  13:45:06.209639 futex(0x1084ff0, FUTEX_WAKE, 1) = 1 <0.000023>
3510  13:45:06.209673 <... futex resumed> ) = 0 <0.002909>
3542  13:45:06.209701 futex(0xc420896548, FUTEX_WAKE, 1 <unfinished ...>
3510  13:45:06.209710 pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
16780 13:45:06.209740 <... futex resumed> ) = 0 <0.002984>
3542  13:45:06.209749 <... futex resumed> ) = 1 <0.000043>
16780 13:45:06.209776 pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL <unfinished ...>
3542  13:45:06.209787 futex(0xc420053548, FUTEX_WAKE, 1 <unfinished ...>
16780 13:45:06.209839 <... pselect6 resumed> ) = 0 (Timeout) <0.000056>
3544  13:45:06.209853 <... futex resumed> ) = 0 <0.003148>
3542  13:45:06.209861 <... futex resumed> ) = 1 <0.000069>
3510  13:45:06.209868 <... pselect6 resumed> ) = 0 (Timeout) <0.000151>
3544  13:45:06.209915 epoll_ctl(4<anon_inode:[eventpoll]>, EPOLL_CTL_DEL, 181<UNIX:[164869291]>, 0xc42105bb14 <unfinished ...>
16780 13:45:06.210076 write(1<pipe:[55447]>, "time=\"2019-02-14T18:45:06Z\" level=warning msg=\"health check failed\" error=\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\" worker.name=gitaly-ruby.4\n", 170 <unfinished ...>
3544  13:45:06.210093 <... epoll_ctl resumed> ) = 0 <0.000053>
3542  13:45:06.210101 futex(0x1089020, FUTEX_WAIT, 0, {0, 480025102} <unfinished ...>
3510  13:45:06.210109 pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
16780 13:45:06.210153 <... write resumed> ) = 170 <0.000064>
3544  13:45:06.210163 close(181<UNIX:[164869291]> <unfinished ...>
```

This strace delivers more than 300,000 lines about the different Gitaly processes running on this customer’s GitLab environment, making it challenging to decipher the flow of execution.
{: .note.text-center}

64
“In this case, we can use strace-parser to say, ‘Just give me all the files that were opened, and sort them by how long it took to open,’” says Will.
Sara Kassabian's avatar
Sara Kassabian committed
65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86

```
$ strace-parser trace.txt files --sort duration

Files Opened

      pid      dur (ms)       timestamp            error         file name
  -------    ----------    ---------------    ---------------    ---------
    24670      5203.999    13:45:16.152985           -           /efs/gitlab/home/.gitconfig
    24859      5296.580    13:45:23.367482           -           /efs/gitlab/home/.gitconfig
    24584      5279.810    13:45:09.286019           -           /efs/gitlab/home/.gitconfig
    24666      5276.975    13:45:16.079697           -           /efs/gitlab/home/.gitconfig
    24667      5255.649    13:45:16.101009           -           /efs/gitlab/home/.gitconfig
    14871      2594.364    13:45:18.762347           -           /efs/gitlab/home/.gitconfig
    24885      2440.635    13:45:26.224189           -           /efs/gitlab/home/.gitconfig
    24886      2432.980    13:45:26.231009           -           /efs/gitlab/home/.gitconfig
    24656        55.873    13:45:15.916836        ENOENT         /nfs/gitlab/gitdata/repositories/group/project.git/objects/info/alternates
    24688        42.764    13:45:21.522789        ENOENT         /nfs/gitlab/gitdata/repositories/group/project.git/objects/info/alternates
     3709        39.631    13:45:07.816618           -           /efs/gitlab/home/.gitconfig
    24583        37.959    13:45:09.218283           -           /efs/gitlab/home/.gitconfig
```

87
By summarizing the data in this way, we see multiple files that took 2-5 seconds to open, which is several orders of magnitude slower than expected.
Sara Kassabian's avatar
Sara Kassabian committed
88 89
{: .note.text-center}

90
“If it’s a particularly busy server and we’re performing these actions 50 times a second, 100 times a second, that adds up really fast,” says Will. “Strace-Parser lets you drill down quickly, and say, ‘OK, this specific thing we’re doing is super slow.’”
Sara Kassabian's avatar
Sara Kassabian committed
91

92
### Get a closer look at processes using strace-parser
Sara Kassabian's avatar
Sara Kassabian committed
93

94
Strace-Parser can also be used to drill down into details of a process.
Sara Kassabian's avatar
Sara Kassabian committed
95

96
The previous output showed PID 24670 is one of the slower processes, so we use the parser to understand how this slow call impacted the performance of the process overall.
Sara Kassabian's avatar
Sara Kassabian committed
97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157

```
$ strace-parser trace.txt pid 24670

PID 24670

  271 syscalls, active time: 5303.438ms, user time: 34.662ms, total time: 5338.100ms
  start time: 13:45:16.116671    end time: 13:45:21.454771

  syscall                 count    total (ms)      max (ms)      avg (ms)      min (ms)    errors
  -----------------    --------    ----------    ----------    ----------    ----------    --------
  open                       29      5223.073      5203.999       180.106         0.031    ENOENT: 9
  read                       25        46.303        28.747         1.852         0.031
  access                     11         6.948         4.131         0.632         0.056    ENOENT: 3
  lstat                       6         5.116         2.130         0.853         0.077    ENOENT: 4
  mmap                       32         3.868         0.485         0.121         0.028
  openat                      2         3.757         2.934         1.878         0.823
  fstat                      28         3.395         0.272         0.121         0.033
  munmap                     11         2.551         0.929         0.232         0.056
  rt_sigaction               59         2.548         0.121         0.043         0.024
  close                      22         2.375         0.279         0.108         0.032
  mprotect                   14         0.927         0.174         0.066         0.032
  execve                      1         0.621         0.621         0.621         0.621
  brk                         6         0.595         0.210         0.099         0.046
  stat                        8         0.388         0.082         0.048         0.027    ENOENT: 3
  getdents                    4         0.361         0.138         0.090         0.044
  rt_sigprocmask              3         0.141         0.059         0.047         0.040
  write                       1         0.101         0.101         0.101         0.101
  dup2                        3         0.090         0.032         0.030         0.026
  arch_prctl                  1         0.077         0.077         0.077         0.077
  getrlimit                   1         0.062         0.062         0.062         0.062
  getcwd                      1         0.044         0.044         0.044         0.044
  set_robust_list             1         0.035         0.035         0.035         0.035
  set_tid_address             1         0.032         0.032         0.032         0.032
  setpgid                     1         0.030         0.030         0.030         0.030
  ---------------

  Program Executed: /opt/gitlab/embedded/bin/git
  Args: ["--git-dir" "/nfs/gitlab/gitdata/repositories/group/project.git" "cat-file" "--batch-check"]

  Parent PID:  3563

  Slowest file open times for PID 24670:

    dur (ms)       timestamp            error         file name
  ----------    ---------------    ---------------    ---------
    5203.999    13:45:16.152985           -           /efs/gitlab/home/.gitconfig
       5.420    13:45:16.143520           -           /nfs/gitlab/gitdata/repositories/group/project.git/config
       2.959    13:45:21.372776           -           /efs/gitlab/home/.gitconfig
       2.934    13:45:21.401073           -           /nfs/gitlab/gitdata/repositories/group/project.git/refs/
       2.736    13:45:21.417333        ENOENT         /nfs/gitlab/gitdata/repositories/group/project.git/info/grafts
       2.683    13:45:21.421558           -           /nfs/gitlab/gitdata/repositories/group/project.git/objects/b7/ef5eba3a425af1e2a9cf6f51cb87454b6e1ad1
       2.430    13:45:21.407170        ENOENT         /nfs/gitlab/gitdata/repositories/group/project.git/objects/info/alternates
       0.992    13:45:21.420213        ENOENT         /nfs/gitlab/gitdata/repositories/group/project.git/shallow
       0.823    13:45:21.405535           -           /nfs/gitlab/gitdata/repositories/group/project.git/objects/pack
       0.275    13:45:21.380382           -           /nfs/gitlab/gitdata/repositories/group/project.git/config
```

The output shows the time this process spent working was dominated by the slow file open. This data points the Support team in the right direction for fixing the underlying issue.
{: .note.text-center}

158 159
Strace itself has the `-c` flag which provides a similar summary, but its utility is limited when multiple processes are traced as it cannot break out per-process statistics.  Strace-Parser breaks these down to the PID level, and can also include the details of parent and child processes on demand.

Sara Kassabian's avatar
Sara Kassabian committed
160 161 162 163
“In this case Will has identified an interesting area for our customer and then very quickly anchored it in the fact that when we look at that one spot it was slow,” says Lee. “When we’re debugging, having this data available really helps us pinpoint the problem for our customers so we can give them answers.”

The typical GitLab deployment has many different processes and services running at a time, which can create dozens of different child processes, so there is a large surface area for potential errors or slowness to occur.

164
Strace-Parser is an open source, generic tool that anyone can use to better understand their strace data.
Sara Kassabian's avatar
Sara Kassabian committed
165

166
## 2. [JSON Stats](https://gitlab.com/gitlab-com/support/toolbox/json_stats)
Sara Kassabian's avatar
Sara Kassabian committed
167 168 169 170 171 172 173 174 175 176 177 178 179

Will also built [JSON Stats](https://gitlab.com/gitlab-com/support/toolbox/json_stats), a script that pulls performance statistics for different logs from the customer’s GitLab environment and summarizes the results in an easy-to-interpret table.

```
METHOD                             COUNT     RPS     PERC99     PERC95     MEDIAN         MAX        MIN          SCORE    % FAIL
FetchRemote                         2542    0.17  962176.08  130154.88   36580.23  4988513.00    1940.45  2445851585.19      1.06
FindAllTags                         5200    0.34   30000.37   11538.63    1941.84    30006.23     252.10   156001924.68      1.63
FindCommit                          3506    0.23   20859.98   16622.78   10841.86    30001.59    2528.67    73135073.75      0.23
FindAllRemoteBranches               1664    0.11   20432.93   12996.75    8606.60   405503.94    1430.84    34000396.10      0.00
AddRemote                           2603    0.17   10001.03    8094.97     825.46    10007.46     228.13    26032673.70      3.00
FindLocalBranches                   2535    0.16   10004.68   10002.90    9051.91    10036.16    1260.89    25361871.05     34.32
```

180
This output shows that we’re calling the “FindLocalBranches” service 2500+ times, and it’s failing 34% of the time.
Sara Kassabian's avatar
Sara Kassabian committed
181 182
{: .note.text-center}

183
The Support team can use JSON Stats to ground their findings in evidence when evaluating overall performance for a customer. It's the same concept as strace-parser. Can we pivot the information in a way that it clearly becomes meaningful data?
Sara Kassabian's avatar
Sara Kassabian committed
184

185
“It’s a quick way of extracting data that you can give to a customer. Instead of saying ‘Look, this failed once,’ we can say, ‘Look, this is failing a third of the time and that suggests there’s a problem with X,’” says Will.
Sara Kassabian's avatar
Sara Kassabian committed
186 187 188 189 190 191 192 193 194 195 196

In the sample output we see that JSON Stats is working with Gitaly logs, but the tool is nimble enough to work on the logs from all the heavy components of GitLab, including Rails, which runs the UI, and Sidekiq, which works on background tasks.

“Some of our customers are very sophisticated and may have advanced monitoring that could give us this information. But we wanted to build a tool that would help us align and easily standardize on how we can get this performance information for customers that don’t have an advanced monitoring setup,” says Lee.

While this specific tool isn't as helpful for people outside of the GitLab community, hopefully it helps to inspire others to consider how they are drawing conclusions, and how they can speed that process up.

### Benchmarking with JSON Stats

Will is building a future iteration of JSON Stats that will compare the performance of a customer’s GitLab instance with GitLab.com.

197
![JSON benchmarking table](/images/blogimages/support-tools-update.png){: .shadow}
Sara Kassabian's avatar
Sara Kassabian committed
198 199 200 201 202 203 204 205

Benchmarking the performance of GitLab.com (the first row) with the customer environment (second row), and the ratio between the two (third row). We can see that in the worst case, the customer’s 99th percentile FindCommit latency was almost eight times slower than it was on GitLab.com.
{: .note.text-center}

“Our vision here is to give accountability to our customers. We’re going to treat GitLab.com as the pinnacle experience for GitLab,” says Lee. “We want to use JSON Stats with benchmarking to help us understand how far away our customers are from GitLab.com.”

Lee and Will are still assessing how to set the target range for the customer’s instance of GitLab. But considering the wealth of resources allocated to GitLab.com, any self-managed customer that is performing within 5-10% of GitLab.com would be considered hugely successful.

206
## 3. [GitLab SOS](https://gitlab.com/gitlab-com/support/toolbox/gitlabsos)
Sara Kassabian's avatar
Sara Kassabian committed
207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228

When a customer encounters an issue, but they are unsure of what they problem is, they can run [GitLab SOS](https://gitlab.com/gitlab-com/support/toolbox/gitlabsos), created by support engineer [Cody West](/company/team/#codyww), to create a snapshot of different activities happening on their system. It's been so helpful in debugging GitLab that it's being added into our [Omnibus delivery](https://gitlab.com/gitlab-org/omnibus-gitlab/merge_requests/3430).



By capturing so much data about a moment in time during or shortly after encountering a problem, the support team is able to work asynchronously to troubleshoot on behalf of the customer.

```
cpuinfo              getenforce           iotop                netstat              opt                  sestatus             unicorn_stats
df_h                 gitlab_status        lscpu                netstat_i            pidstat              systemctl_unit_files uptime
dmesg                gitlabsos.log        meminfo              nfsiostat            ps                   tainted              var
etc                  hostname             mount                nfsstat              sar_dev              ulimit               vmstat
free_m               iostat               mpstat               ntpq                 sar_tcp              uname
```

GitLab SOS works best if the script is run while an issue is occurring, or moments after, but even if the window of opportunity is missed you can still successfully gather information to diagnose the problem.
{: .note.text-center}

“If a customer is sharp, they may know what problems to look for already,” says Lee. “But if a customer is scared and they don’t know what to look for, then they can lean on a tool like GitLab SOS and learn from GitLab SOS. We even have some sharp customers that will generate the SOS output and begin to troubleshoot themselves because of the comprehensive overview it provides.”

## These new tools drive data-driven decision-making in Support

229
Tools like strace-parser, JSON Stats, and GitLab SOS provide the Support team and GitLab customers with critical evidence about performance. By letting the data drive decision-making, the Support team is able to identify problems faster and quickly start debugging customer environments. Performance is a key feature of GitLab, and by filling our toolbox with data-driven solutions we can ensure greater [transparency](/handbook/values/#transparency) between GitLab and our customers.
Sara Kassabian's avatar
Sara Kassabian committed
230 231 232 233 234 235 236 237 238

Learn more about debugging from a support engineering perspective in a GitLab Unfiltered video.

<figure class="video_container">
  <iframe src="https://www.youtube.com/embed/9W6QnpYewik" frameborder="0" allowfullscreen="true"> </iframe>
</figure>

Cover photo by [Diogo Nunes](https://unsplash.com/@dialex?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/tools?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)
{: .note}