Improve performance of logs_tree action for Projects::RefsController under load into next tier

Related gitlab-org/quality/performance#224 (closed)

A user on HN, ddevault, reported that the project files viewing experience was slow. After a quick investigation this was verified. Our testing was missing out on a specific controller and action that is called on the file trees - Projects::RefsController#log_trees.json that we immediately sought to add.

Once the test was updated to include the controller and point at a larger directory (~1800 files) the testing showed that the controller is indeed performing notably under our target of 500ms.

█ Results summary

* Environment:                10k
* Environment Version:        12.10.0-pre `71c834efe31`
* Option:                     60s_200rps
* Date:                       2020-04-16
* Run Time:                   2m 8.45s (Start: 14:13:32 UTC, End: 14:15:40 UTC)
* GPT Version:                v1.2.6

NAME                            | RPS   | RPS RESULT           | TTFB AVG  | TTFB P90            | REQ STATUS     | RESULT
--------------------------------|-------|----------------------|-----------|---------------------|----------------|-----------------
api_v4_projects_repository_tree | 200/s | 195.57/s (>160.00/s) | 78.69ms   | 93.45ms (<500ms)    | 100.00% (>95%) | Passed
web_project_files               | 20/s  | 9.05/s (>16.00/s)    | 1898.75ms | 2674.71ms (<2000ms) | 100.00% (>95%) | FAILED³

Detailed Stats:
    █ Web - Project Files Tree

    data_received..............: 11 MB   190 kB/s
    data_sent..................: 176 kB  2.9 kB/s
    group_duration.............: avg=12175.10ms min=9955.84ms med=11863.04ms max=15312.28ms p(90)=14206.22ms p(95)=14596.49ms
    http_req_blocked...........: avg=0.05ms     min=0.00ms    med=0.01ms     max=5.57ms     p(90)=0.01ms     p(95)=0.04ms
    http_req_connecting........: avg=0.03ms     min=0.00ms    med=0.00ms     max=5.43ms     p(90)=0.00ms     p(95)=0.00ms
    http_req_duration..........: avg=1899.49ms  min=218.51ms  med=2114.44ms  max=2963.52ms  p(90)=2675.27ms  p(95)=2836.52ms
    http_req_receiving.........: avg=0.70ms     min=0.35ms    med=0.49ms     max=66.75ms    p(90)=0.68ms     p(95)=0.93ms
    http_req_sending...........: avg=0.04ms     min=0.02ms    med=0.04ms     max=0.14ms     p(90)=0.06ms     p(95)=0.08ms
    http_req_tls_handshaking...: avg=0.00ms     min=0.00ms    med=0.00ms     max=0.00ms     p(90)=0.00ms     p(95)=0.00ms
    http_req_waiting...........: avg=1898.75ms  min=217.88ms  med=2110.42ms  max=2963.04ms  p(90)=2674.71ms  p(95)=2836.00ms
    ✗ { endpoint:logs_tree }...: avg=2266.45ms  min=1854.54ms med=2216.62ms  max=2963.04ms  p(90)=2739.00ms  p(95)=2860.86ms
    ✓ { endpoint:tree }........: avg=269.84ms   min=217.88ms  med=256.50ms   max=366.94ms   p(90)=336.81ms   p(95)=349.85ms
  ✗ http_reqs..................: 543     9.049988/s
    ✗ { endpoint:logs_tree }...: 443     7.383323/s
    ✗ { endpoint:tree }........: 100     1.666664/s
    iteration_duration.........: avg=12024.83ms min=0.31ms    med=11839.29ms max=15312.30ms p(90)=14198.41ms p(95)=14587.16ms
    iterations.................: 80      1.333332/s
  ✓ successful_requests........: 100.00% ✓ 543  ✗ 0
    vus........................: 1       min=1  max=20
    vus_max....................: 20      min=20 max=20

This was tested on our 10k Reference Architecture with a RPS target of 20/s. It targets a path on one of two copies of a sanitised version of out gitlab-foss project that has a high number of files (app/assets/images/emoji - 1794 files). The equivalent API test is shown for context but it's performing within expectations as it's paginating.

Of note for this particular endpoint is that it dynamically loads in more logs as the user scrolls the page if there's many files. When this occurs the performance hit happens to the user essentially on each page scroll, degrading the experience further (you can see this live in effect here - https://gitlab.com/gitlab-org/gitlab/tree/master/app/assets/images/emoji).

As shown above the logs_tree action is performing badly here with a TTFB P90 of around 2.7s. As per our performance targets this issue falls into the ~s2 tier.

Edited by Grant Young