Single Issue Details API degrades substantially in certain circumstances

Summary

Through performance testing in the Quality team, we've been repeatedly but sporadically seeing a significant performance degradation in the Single Issue API.

Pinning it down has proven tricky, however: it doesn't happen on all of our environments, and it doesn't even happen consistently on the ones that have shown it.

When it does happen, though, the performance drop is significant and concerning. It was first seen on our 25k reference architecture test environment, where response time jumped massively from around 200ms to over 10s (a quick curl spot check is sketched after the results below):

* Environment:                25k
* Environment Version:        12.10.0-pre `8fc3e503895`
* Option:                     60s_500rps
* Date:                       2020-04-07
* Run Time:                   55m 23.06s (Start: 10:00:44 UTC, End: 10:56:07 UTC)
* GPT Version:                v1.2.6
➤ Overall Results Score: 94.72%
NAME                                                     | RPS   | RPS RESULT           | TTFB AVG  | TTFB P90            | REQ STATUS     | RESULT 
---------------------------------------------------------|-------|----------------------|-----------|---------------------|----------------|--------
api_v4_projects_issues_issue                             | 500/s | 49.95/s (>400.00/s)  | 8436.26ms | 10195.94ms (<500ms) | 100.00% (>95%) | FAILED³
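For a quick spot check outside of GPT, curl's timing variables can be used against the same endpoint. A minimal sketch, assuming a valid access token (the project path and issue IID here match the log example further below):

```shell
# Measure time-to-first-byte and total time for a single issue request.
curl -s -o /dev/null -H "PRIVATE-TOKEN: <TOKEN>" \
  -w "ttfb: %{time_starttransfer}s  total: %{time_total}s\n" \
  "https://25k.testbed.gitlab.net/api/v4/projects/qa-perf-testing%2Fgitlabhq/issues/4218"
```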

We investigated whether it was an environment issue, as it wasn't appearing on any of our other test environments, which are set up in the same manner. It stopped happening one day after an upgrade, and we decided to leave it at that unless it reappeared.

The issue has since reappeared. The results shown above are from our 25k reference architecture, which was rebuilt from scratch yesterday. In addition, we're now seeing it on one of our test GitLab Docker images (built from the official images with some seeded data) that we use for performance comparison testing, specifically the one we built for 12.9.0, which coincides with when we started to see the issue on the 25k environment:

NAME                                                     | TARGET | 12-5-0        | 12-6-0-EE                | 12-7-0                   | 12-8-0                 | 12-9-0                  
---------------------------------------------------------|--------|---------------|--------------------------|--------------------------|------------------------|-------------------------
api_v4_projects_issues_issue                             | 500ms  | 244.63ms (-)  | 258.88ms (_+2.85%_)      | 217.85ms (**-8.21%**)    | 207.33ms (**-2.1%**)   | 2134.71ms (_+385.48%_) 

In both cases it appears to be the Rails workers that are being hit, with their CPU usage spiking heavily:

[Screenshot (2020-04-07 16:49): Rails worker CPU usage spiking]

The logs show, though, that the database is also taking some time to respond; a sketch for pulling similar slow requests out of the logs follows this example:

{"time":"2020-04-07T10:06:44.583Z","severity":"INFO","duration":9089.38,"db":5421.65,"view":3667.7299999999996,"status":200,"method":"GET","path":"/api/v4/projects/qa-perf-testing%2Fgitlabhq/issues/4218","params":[],"host":"25k.testbed.gitlab.net","remote_ip":"10.142.0.6, 127.0.0.1","ua":"GPT/1.2.6","route":"/api/:version/projects/:id/issues/:issue_iid","user_id":1,"username":"root","queue_duration":7929.57,"gitaly_calls":10,"gitaly_duration":1025.17,"redis_calls":1,"redis_duration_ms":63.16,"correlation_id":"UTcEd5TGUL5"}

Further investigation was also carried out on our first environment investigation issue.

In summary, it appears that something changed in this area in 12.9.0 that causes its performance to drop significantly. Anecdotally, this appears to happen after upgrading from a previous version.

Under the new performance targets this issue would strictly be categorised as a ~S1, as response times go over 9000ms. However, since the issue is sporadic and doesn't occur often, it will be raised as a ~S3.

Steps to reproduce

Our GitLab Docker image is static, so it should always show the issue now. You can run and test it as follows (a consolidated script follows the example results below):

  1. Download and run the Quality Performance team's GitLab Docker image for version 12.9.0 as you would the official one, e.g. `docker run -d --publish 80:80 --publish 443:443 --publish 22:22 --name gitlab --hostname gitlab registry.gitlab.com/gitlab-org/quality/performance-images/gitlab-ce-performance:12.9.0-ce.0`
  • To access it, use a hostname alias if you already have one set up for the Docker image, or the standard 127.0.0.1. The password is the standard local password we use.
  • After the container has started, you'll need to run the GitLab Performance Tool to run the specific test. To make this as seamless as possible we provide a Docker image for this too, which you can run as follows: `docker run --net=host -it -e ACCESS_TOKEN=<TOKEN> -e ENVIRONMENT_URL=http://gitlab -v $PWD/k6/results:/results registry.gitlab.com/gitlab-org/quality/performance/gitlab-performance-tool:test -e docker.json -t api_v4_projects_issues_issue`. For the access token, please contact me or @niskhakova.
  • After the test has run, you should see it reporting a typical TTFB P90 response time of over 2 seconds, e.g.:
* Environment:                Docker
* Environment Version:        12.9.0 `9a382ff2c82`
* Option:                     20s_2rps
* Date:                       2020-04-07
* Run Time:                   21.39s (Start: 16:15:44 UTC, End: 16:16:06 UTC)
* GPT Version:                v1.2.6

❯ Overall Results Score: 37.5%

NAME                         | RPS | RPS RESULT       | TTFB AVG  | TTFB P90           | REQ STATUS     | RESULT
-----------------------------|-----|------------------|-----------|--------------------|----------------|-----------------
api_v4_projects_issues_issue | 2/s | 0.75/s (>1.60/s) | 2019.52ms | 2267.64ms (<500ms) | 100.00% (>95%) | FAILED³
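Putting those steps together, a minimal reproduction sketch, assuming Docker is available, `<TOKEN>` is replaced with a valid access token, and a `gitlab` hostname alias points at the container (as noted above):

```shell
#!/bin/sh
# Start the seeded 12.9.0 performance image (same command as step 1 above).
docker run -d --publish 80:80 --publish 443:443 --publish 22:22 \
  --name gitlab --hostname gitlab \
  registry.gitlab.com/gitlab-org/quality/performance-images/gitlab-ce-performance:12.9.0-ce.0

# Wait until GitLab is serving requests before running the test.
until curl -sf http://127.0.0.1/users/sign_in > /dev/null; do sleep 10; done

# Run the Single Issue API test with the GitLab Performance Tool image.
docker run --net=host -it -e ACCESS_TOKEN=<TOKEN> -e ENVIRONMENT_URL=http://gitlab \
  -v $PWD/k6/results:/results \
  registry.gitlab.com/gitlab-org/quality/performance/gitlab-performance-tool:test \
  -e docker.json -t api_v4_projects_issues_issue
```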

Example Project

Performance testing is all done with a sanitised gitlab-foss project, available here: https://gitlab.com/gitlab-org/quality/performance-data/-/blob/master/projects_export/gitlabhq_export.tar.gz
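If you need to import that export into a test instance yourself, a sketch using GitLab's project import API (the target host is illustrative; the path and namespace match the project seen in the log example above):

```shell
# Download the sanitised gitlabhq export used by the performance tests.
curl -L -o gitlabhq_export.tar.gz \
  "https://gitlab.com/gitlab-org/quality/performance-data/-/raw/master/projects_export/gitlabhq_export.tar.gz"

# Import it via the project import API (replace <TOKEN> and the host).
curl --request POST --header "PRIVATE-TOKEN: <TOKEN>" \
  --form "path=gitlabhq" --form "namespace=qa-perf-testing" \
  --form "file=@gitlabhq_export.tar.gz" \
  "http://127.0.0.1/api/v4/projects/import"
```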

What is the current bug behavior?

The Single Issue API endpoint's performance degrades substantially under certain, as yet unidentified, circumstances.

What is the expected correct behavior?

The Single Issue API endpoint performs consistently.

Relevant logs and/or screenshots

Stackprof profile from the previous investigation (a sketch for regenerating a comparable report follows the dump):

==================================
  Mode: wall(1000)
  Samples: 453 (8.48% miss rate)
  GC: 242 (53.42%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
       193  (42.6%)         193  (42.6%)     (marking)
        49  (10.8%)          49  (10.8%)     (sweeping)
        33   (7.3%)          25   (5.5%)     ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#exec_no_cache
        12   (2.6%)          12   (2.6%)     Nokogiri::XML::Document#initialize
         9   (2.0%)           9   (2.0%)     Nokogiri::XML::Document#decorate
         8   (1.8%)           8   (1.8%)     ActiveModel::Attribute#initialize
         5   (1.1%)           5   (1.1%)     Concurrent::Collection::NonConcurrentMapBackend#get_or_default
         4   (0.9%)           4   (0.9%)     ActiveSupport::NumericWithFormat#to_s
         4   (0.9%)           4   (0.9%)     ActiveRecord::Result#hash_rows
        13   (2.9%)           4   (0.9%)     ActiveModel::LazyAttributeHash#[]
         3   (0.7%)           3   (0.7%)     Object#blank?
         3   (0.7%)           3   (0.7%)     block (3 levels) in class_attribute
         3   (0.7%)           3   (0.7%)     ActiveRecord::QueryMethods#check_if_method_has_arguments!
         3   (0.7%)           3   (0.7%)     ActiveRecord::Reflection::ClassMethods#_reflect_on_association
         3   (0.7%)           3   (0.7%)     Note#cached_markdown_fields
         3   (0.7%)           3   (0.7%)     ActiveRecord::Relation#initialize_copy
         3   (0.7%)           3   (0.7%)     Nokogiri::XML::XPathContext#register_namespaces
         3   (0.7%)           2   (0.4%)     ActiveRecord::DynamicMatchers#respond_to_missing?
         2   (0.4%)           2   (0.4%)     #<Module:0x00007f2a3edc5068>.subject_key
         2   (0.4%)           2   (0.4%)     ActiveRecord::ConnectionAdapters::Quoting#_quote
         2   (0.4%)           2   (0.4%)     ActiveRecord::PredicateBuilder#handler_for
         2   (0.4%)           2   (0.4%)     ActiveSupport::PerThreadRegistry#instance
         3   (0.7%)           2   (0.4%)     ActionDispatch::Journey::Path::Pattern#match
         2   (0.4%)           2   (0.4%)     Nokogiri::XML::Searchable#extract_params
         2   (0.4%)           2   (0.4%)     ActiveRecord::ConnectionHandling#connection_specification_name
         2   (0.4%)           2   (0.4%)     Mentionable#extractors
         2   (0.4%)           2   (0.4%)     EE::Note#for_epic?
         6   (1.3%)           2   (0.4%)     ActiveRecord::AttributeMethods::Read#_read_attribute
         2   (0.4%)           2   (0.4%)     ActiveModel::Type::String#cast_value
         2   (0.4%)           2   (0.4%)     GrapePathHelpers::NamedRouteMatcher#route_match?
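Note the GC dominance above (over 53% of samples in marking and sweeping alone). To regenerate a comparable report from a Stackprof dump, the gem's CLI can be used; a sketch, with an illustrative dump path:

```shell
# Summarise the hottest frames, as in the report above (path is illustrative).
stackprof tmp/issue_api_profile.dump --text --limit 20

# Drill into a specific hot frame, e.g. the PostgreSQL adapter call.
stackprof tmp/issue_api_profile.dump \
  --method 'ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#exec_no_cache'
```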