Investigate performance regression of show.json action (source) for Projects::BlobController
Recently the performance of the show.json action (source) for Projects::BlobController has degraded notably, from around 1s up to 3s:
* Environment: 10k
* Environment Version: 15.4.0-pre `b5367e7add2`
* Option: 60s_200rps
* Date: 2022-09-06
* Run Time: 1h 38m 30.8s (Start: 04:43:51 UTC, End: 06:22:21 UTC)
* GPT Version: v2.11.0
NAME | RPS | RPS RESULT | TTFB AVG | TTFB P90 | REQ STATUS | RESULT
---------------------------------------------------------|-------|----------------------|-----------|----------------------|----------------|---------
web_project_file_source | 20/s | 12.09/s (>1.60/s) | 1708.70ms | 2901.08ms (<1700ms) | 100.00% (>99%) | FAILED¹²
http_req_waiting..................: avg=1708.70ms min=407.46ms med=2703.92ms max=4775.96ms p(90)=2901.08ms p(95)=3506.09ms
✗ { endpoint:file?format=json }...: avg=2919.21ms min=2700.16ms med=2824.63ms max=4775.96ms p(90)=3502.53ms p(95)=3587.95ms
✓ { endpoint:file }...............: avg=498.20ms min=407.46ms med=440.99ms max=3586.27ms p(90)=501.03ms p(95)=525.99ms
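For reference, the avg and p(90)/p(95) figures above are simple aggregates over per-request TTFB samples. A minimal sketch of how such a summary can be computed (illustrative only, not GPT's or k6's actual implementation; nearest-rank percentiles are assumed):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

def summarize(ttfb_ms):
    """Aggregate a list of per-request TTFB samples (in milliseconds)."""
    return {
        "avg": sum(ttfb_ms) / len(ttfb_ms),
        "min": min(ttfb_ms),
        "max": max(ttfb_ms),
        "p90": percentile(ttfb_ms, 90),
        "p95": percentile(ttfb_ms, 95),
    }

# Synthetic latencies only, not real test data:
stats = summarize(list(range(100, 2100, 20)))  # 100ms..2080ms
```

Note that a high p(95) with a much lower median, as in the numbers above, indicates a bimodal distribution rather than a uniformly slow endpoint.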
Investigations so far suggest this is related to more blame information being shown on the page, but after some changes the endpoint's performance is still degraded and more investigation is required to bring it back to similar levels.
Based on our performance targets, the endpoint is currently performing in the severity2 tier.
Activity
- 🤖 GitLab Bot 🤖 added sectiondev label
- Grant Young mentioned in issue #247878
- Grant Young marked this issue as blocking #247878
- Grant Young mentioned in merge request gitlab-org/quality/performance!451 (merged)
- Author Developer
Looking into this further on our end, the page has gone through some changes recently with a switch to Highlight.js in %15.2.
The page now no longer calls the endpoint with the `?format=json` parameter, but the team tells me this is dynamic in part and some files will still trigger the old method, estimated to be around 8%. Removing this endpoint from our test returns it to the levels seen in #247878, but we have a bit of a quandary here as some users will still see the old behaviour, and our tests did pick up this performance slip with the `format` method.
So bit of a catch-22 here. Do we switch the test to match the majority or still include the `format` method to catch slips on that end? Or do we split the tests to cover both separately perhaps?
- Maintainer
@grantyoung I'm not sure what the maintenance costs are for these tests, but I think it makes sense to split the tests to cover both cases separately. IMO we'd still want to keep the 8% of cases happy while we try to reduce that number even further by adding more syntax highlighting support to HLJS. Eventually, we'll probably drop the legacy fallback and at that point we can remove the test coverage for the `...?format=json` endpoint. Do you think this approach can work or is it not worth the maintenance cost?
- Author Developer
The maintenance cost wouldn't be significant here but in effect our test is already doing this.
It's a bit unorthodox, but the test as it stands is calling both endpoints already, and that's why it picked up the degradation here. It's unusual in that we typically design the tests to only call endpoints that the page would, but this is a bit of a unique situation and we want to keep testing the older endpoint on a file we know is higher percentile (and preferably on a file we have history with).
So I think we can have a slight exception here in our design and just keep it as is until such a time when the endpoint usage is reduced to a negligible amount?
- Developer
@grantyoung I agree with Jacques that it's best to keep the tests separate, otherwise it's hard to track which of the two is causing performance drops (or improvements :)). Or am I missing something?
- Author Developer
Ah the test we have is able to track both so it's effectively the same:
http_req_waiting..................: avg=1708.70ms min=407.46ms med=2703.92ms max=4775.96ms p(90)=2901.08ms p(95)=3506.09ms
✗ { endpoint:file?format=json }...: avg=2919.21ms min=2700.16ms med=2824.63ms max=4775.96ms p(90)=3502.53ms p(95)=3587.95ms
✓ { endpoint:file }...............: avg=498.20ms min=407.46ms med=440.99ms max=3586.27ms p(90)=501.03ms p(95)=525.99ms
The only difference is that the throughput is halved between the two but that's probably a good thing considering that there's now a dynamic element.
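The per-endpoint breakdown that makes this possible works because each request carries an endpoint tag, so one combined test can still attribute a regression to either endpoint. A rough sketch of that idea (the real GPT test is a k6 script, not this Python; names here are hypothetical):

```python
from collections import defaultdict

def per_endpoint_stats(samples):
    """samples: iterable of (endpoint_tag, ttfb_ms) pairs, one per request.
    Returns per-tag average, max, and count, mirroring k6's tagged sub-metrics."""
    groups = defaultdict(list)
    for tag, ms in samples:
        groups[tag].append(ms)
    return {
        tag: {"avg": sum(v) / len(v), "max": max(v), "count": len(v)}
        for tag, v in groups.items()
    }

# Synthetic illustration: the JSON endpoint is the slow one,
# but the combined test still isolates it via its tag.
samples = [("file", 450), ("file?format=json", 2800),
           ("file", 500), ("file?format=json", 2900)]
stats = per_endpoint_stats(samples)
```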
- Author Developer
Just to confirm we'll just keep the test as is for us to monitor any fixes.
- 🤖 GitLab Bot 🤖 added priority2 label
- Author Developer
Just wanted to call out that today we were seeing a further slight drop in performance for this endpoint on several test environments:
* Environment: 10k
* Environment Version: 15.5.0-pre `1d47d809080`
* Option: 60s_200rps
* Date: 2022-10-10
* Run Time: 1h 36m 2.82s (Start: 04:40:32 UTC, End: 06:16:34 UTC)
* GPT Version: v2.11.0
NAME | RPS | RPS RESULT | TTFB AVG | TTFB P90 | REQ STATUS | RESULT
---------------------------------------------------------|-------|----------------------|-----------|-----------------------|----------------|---------
web_project_file_source | 20/s | 11.28/s (>6.40/s) | 1805.97ms | 3091.62ms (<3750ms) | 100.00% (>99%) | FAILED¹²
http_req_waiting..................: avg=1805.97ms min=398.20ms med=2852.30ms max=5477.24ms p(90)=3091.62ms p(95)=3761.07ms
✗ { endpoint:file?format=json }...: avg=3147.24ms min=2851.91ms med=3012.73ms max=5477.24ms p(90)=3761.59ms p(95)=3875.68ms
✓ { endpoint:file }...............: avg=464.69ms min=398.20ms med=431.80ms max=2964.27ms p(90)=477.08ms p(95)=490.10ms
We're only talking about a few hundred further milliseconds though.
- Vasilii Iakliushin changed milestone to %Backlog
- 🤖 GitLab Bot 🤖 added [deprecated] Accepting merge requests label
- Maintainer
@sean_carroll, @tlinz and @a_mcdonald, this groupsource code bug has at most 25% of the SLO duration remaining and is approaching an SLO breach. Please consider taking action before this becomes a ~"missed-SLO" in 14 days (2022-11-05).
- 🤖 GitLab Bot 🤖 added SLONear Miss label
- Developer
Added to %15.7 planning
- Developer
Is this frontend or backend or both? It doesn't show up in the planning board for backend.
cc: @a_mcdonald, @andr3
- Sean Carroll changed milestone to %15.7
- Anastasia McDonald mentioned in epic &6592 (closed)
- 🤖 GitLab Bot 🤖 added SLOMissed label
- Anastasia McDonald mentioned in issue create-stage#13078 (closed)
- André Luís added backend label
- Sean Carroll changed milestone to %Backlog
- Anastasia McDonald changed milestone to %15.9
- Maintainer
Setting label(s) Category:Source Code Management based on groupsource code.
- 🤖 GitLab Bot 🤖 added Category:Source Code Management label
- Developer
Moving to %15.10 planning
- Sean Carroll changed milestone to %15.10
- Developer
Estimating backend-weight3
- Sean Carroll set weight to 3
- Contributor
Moving to %15.11 due to BE being over capacity this milestone
- Anastasia McDonald changed milestone to %15.11
- Sean Carroll changed the description
- Developer
From Grafana
- Contributor
Moving to %16.0 due to a different priority focus this quarter
- Anastasia McDonald changed milestone to %16.0
- Anastasia McDonald changed milestone to %16.1
- Sean Carroll mentioned in issue create-stage#13149 (closed)
- Developer
Moving to %16.2 due to lowered capacity.
- Sean Carroll changed milestone to %16.2
- Jay McCure added candidate16.2 label
- Jay McCure mentioned in issue create-stage#13153 (closed)
- Maintainer
@grantyoung according to these stats, `web_project_file_source` performs within ~1600ms. According to Grafana, it's back under 1s again:
Do you think we can downgrade the Priority/Severity here? Could you please have a look?
/cc @sean_carroll
- Developer
@igor.drozdov Please see GPT results for the test for the latest information (it's also linked in https://gitlab.com/gitlab-org/quality/performance/-/wikis/Benchmarks/Latest/10k?version_id=ce5bda38704a9ebd699ccd7e59a86570bc272588 as the Results History Dashboard). LCP/TBT targets are tracked separately in the browser performance pipeline, which is different from the GPT performance pipeline. Unfortunately the above graph doesn't yet show in-depth endpoint results, so it might be misleading for web tests specifically because they call multiple endpoints. To check actual endpoint results, we need to go to the k6 pipeline - for example https://gitlab.com/gitlab-org/quality/performance/-/jobs/4460977499#L4484 (search for `web_project_file_source`):
█ setup
█ Web - Project File Source
data_received.....................: 4.8 GB 78 MB/s
data_sent.........................: 28 MB 461 kB/s
group_duration....................: avg=1933.70ms min=901.84ms med=1798.86ms max=4144.74ms p(90)=2834.21ms p(95)=2972.95ms
http_req_blocked..................: avg=0.10ms min=0.00ms med=0.00ms max=20.48ms p(90)=0.00ms p(95)=0.00ms
http_req_connecting...............: avg=0.01ms min=0.00ms med=0.00ms max=0.76ms p(90)=0.00ms p(95)=0.00ms
http_req_duration.................: avg=761.04ms min=285.56ms med=860.28ms max=3299.84ms p(90)=1570.68ms p(95)=1985.35ms
  { expected_response:true }......: avg=761.04ms min=285.56ms med=860.28ms max=3299.84ms p(90)=1570.68ms p(95)=1985.35ms
http_req_failed...................: 0.00% ✓ 0 ✗ 1156
http_req_receiving................: avg=16.60ms min=0.19ms med=13.96ms max=107.22ms p(90)=35.88ms p(95)=47.63ms
http_req_sending..................: avg=0.10ms min=0.04ms med=0.09ms max=5.02ms p(90)=0.13ms p(95)=0.14ms
http_req_tls_handshaking..........: avg=0.07ms min=0.00ms med=0.00ms max=7.32ms p(90)=0.00ms p(95)=0.00ms
http_req_waiting..................: avg=744.34ms min=284.77ms med=830.62ms max=3268.80ms p(90)=1569.89ms p(95)=1942.10ms
✓ { endpoint:file?format=json }...: avg=1077.55ms min=786.93ms med=887.36ms max=3268.80ms p(90)=1942.26ms p(95)=2028.06ms
✓ { endpoint:file }...............: avg=411.14ms min=284.77ms med=326.01ms max=1907.71ms p(90)=400.94ms p(95)=1273.33ms
✓ http_reqs.........................: 1156 19.043845/s
✓ { endpoint:file?format=json }...: 578 9.521923/s
✓ { endpoint:file }...............: 578 9.521923/s
iteration_duration................: avg=1930.40ms min=0.15ms med=1798.41ms max=4144.78ms p(90)=2834.08ms p(95)=2972.60ms
iterations........................: 578 9.521923/s
✓ successful_requests...............: 100.00% ✓ 1156 ✗ 0
vus...............................: 4 min=4 max=20
vus_max...........................: 20 min=20 max=20
or https://gitlab.com/gitlab-org/quality/performance/-/jobs/4443899500#L4483
█ setup
█ Web - Project File Source
data_received.....................: 4.8 GB 78 MB/s
data_sent.........................: 28 MB 460 kB/s
group_duration....................: avg=1944.03ms min=949.13ms med=1804.93ms max=4476.06ms p(90)=2869.76ms p(95)=3099.57ms
http_req_blocked..................: avg=0.09ms min=0.00ms med=0.00ms max=22.29ms p(90)=0.00ms p(95)=0.00ms
http_req_connecting...............: avg=0.01ms min=0.00ms med=0.00ms max=0.83ms p(90)=0.00ms p(95)=0.00ms
http_req_duration.................: avg=819.92ms min=327.54ms med=890.36ms max=4272.09ms p(90)=1564.19ms p(95)=2117.07ms
  { expected_response:true }......: avg=819.92ms min=327.54ms med=890.36ms max=4272.09ms p(90)=1564.19ms p(95)=2117.07ms
http_req_failed...................: 0.00% ✓ 0 ✗ 1156
http_req_receiving................: avg=16.59ms min=0.26ms med=16.14ms max=84.74ms p(90)=36.51ms p(95)=46.58ms
http_req_sending..................: avg=0.10ms min=0.04ms med=0.11ms max=0.32ms p(90)=0.14ms p(95)=0.15ms
http_req_tls_handshaking..........: avg=0.06ms min=0.00ms med=0.00ms max=9.20ms p(90)=0.00ms p(95)=0.00ms
http_req_waiting..................: avg=803.23ms min=326.87ms med=861.28ms max=4246.09ms p(90)=1550.83ms p(95)=2083.75ms
✓ { endpoint:file?format=json }...: avg=1131.54ms min=809.29ms med=933.44ms max=4246.09ms p(90)=2021.86ms p(95)=2232.54ms <======= over 2s
✓ { endpoint:file }...............: avg=474.91ms min=326.87ms med=385.58ms max=2587.69ms p(90)=490.71ms p(95)=929.87ms
✓ http_reqs.........................: 1156 19.021573/s
✓ { endpoint:file?format=json }...: 578 9.510786/s
✓ { endpoint:file }...............: 578 9.510786/s
iteration_duration................: avg=1940.71ms min=0.19ms med=1803.96ms max=4476.09ms p(90)=2869.22ms p(95)=3099.17ms
iterations........................: 578 9.510786/s
✓ successful_requests...............: 100.00% ✓ 1156 ✗ 0
vus...............................: 4 min=4 max=20
vus_max...........................: 20 min=20 max=20
As seen, `endpoint:file?format=json` is still periodically slightly over 2s and usually around 1900ms. Would it be possible to take a look if something can be improved to ensure the endpoint is consistently under 2s as part of this investigation for severity2? If not, the severity can be updated to severity3. WDYT?
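The ✗/✓ markers in the k6 output above indicate whether a tagged sub-metric met its configured threshold. A toy pass/fail check in that spirit (the 2000ms limit here is illustrative, taken from the 2s figure discussed above, and is not necessarily the test's exact threshold configuration):

```python
def check_threshold(p95_ms, limit_ms=2000.0):
    """Return a k6-style pass/fail marker for a p(95)-under-limit threshold."""
    return "✓" if p95_ms < limit_ms else "✗"

# Purely synthetic values:
under = check_threshold(1950.0)  # under the limit
over = check_threshold(2100.0)   # over the limit
```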
- Maintainer
@niskhakova thanks for having a look!
Would it be possible to take a look if something can be improved to ensure endpoint is consistent under 2s as part of this investigation for severity2 ?
Hm, I can't think of any quick wins here, and according to the description of Improve performance of show.json action (source... (#247878), it requires additional effort to optimize this endpoint further:
While the endpoint is still performing in the ~"severity::2" tier the dev team has reported that any further gains will need substantial solutions and that this would be better done in a separate issue. Since the endpoint is just outside of the severity3 tier a small exception is made here due to this and this task is to get the endpoint down to the severity4 tier of <1000ms.
It seems that this current issue has been created only to investigate the spike.
Since it's back to normal, I think we can close it and continue with #247878, WDYT?
- Developer
Oh right, this is the spike issue indeed. Sounds good to me to close the issue since it's back to its original performance.
Do you know what might have fixed this?
Just curious, if it's difficult to pinpoint feel free to ignore, a lot of changes happen.
- Maintainer
@niskhakova to be honest, I'm not entirely sure
But I see a couple of discussions in the original issue related to the spikes. So this issue is likely to be related to one (or both) of those cases.
- Igor Drozdov closed
- Nailia Iskhakova mentioned in merge request gitlab-org/quality/performance!511 (merged)
- 🤖 GitLab Bot 🤖 mentioned in issue gitlab-org/quality/triage-reports#13269 (closed)
- John McDonnell mentioned in issue gitlab-org/quality/quality-engineering/team-tasks#3411