Skip to content

Draft: `GET /api/:version/groups/:id/projects` performance investigation

Chad Woolley requested to merge group-projects-api-perf into master

What does this MR do and why?

Describe in detail what your merge request does and why.

Performance investigation for groups -> projects API endpoint. See #330140 (comment 687343917)

Related to #330140 (closed)

Overview

Copied from Issue comment:


Status

  1. I've looked into this and gotten some context, and also some good advice/insight from others. I've also re-reviewed the info @mksionek provided on a similar issue in this thread, including her suggestion of watching this excellent video.
  2. It seems there aren't any obvious n+1 problems (TODO: we should update the issue title to reflect this), but it is a very complex request with a lot going on.
  3. There seem to be two paths available to improve performance:
    1. DB: One single slow query which takes up a majority of the request time in some cases (more info in this thread above.
      1. Note that there are dozens (hundreds?) of other queries being performed as part of some of these requests, but based on initial investigation, few or none of these seem to be slow (at least not longer than 1 second to show up in Kibana), and many of them are repeated and cached.
    2. CPU (Ruby/Rails): These requests can also spend several seconds in "view duration". Based on this thread above, it seems like the "view duration" for a Grape API call essentially means all "CPU" or non-DB time.

Next Steps

Following the good advice from the video, the next steps should be:

  1. Get a baseline (benchmark)
  2. Look for culprits
  3. Keep benchmarking

So, my specific plans are:

  1. Get a baseline and benchmark
    1. Some of the suggested tools only run on linux, so I've set up a Virtualbox linux instance with a clean GDK install for this purpose.
    2. Then, create enough data to simulate the slowness we see on production. I plan to start with Gosia's example script here, but since this problem is related to nested groups (maybe deeply, maybe widely?), it will need to also be able to create that structure as well.
    3. Then, we can use the appropriate tools to get a repeatable and reliable benchmark.
  2. Look for culprits
    1. As far as the slow DB query, I am not a DBA nor a postgres expert (my recent years of experience are primarily with mysql), and this is a complex query. So, once I have a reproducible benchmarked local environment for the slow query, I will probably reach out to other team members with more DBA experience for any ideas on optimizing this query.
    2. As far as the CPU/Ruby ("View duration") slowness, I will try the various tools suggested in the video to try to narrow down any bottlenecks in the Ruby code, and see if there are any possible optimizations I can identify.

NOTES

Performance Investigation Talk

Tools

Steps

NOTE: These plans may change based on information we find, but will not necessarily be kept up to date.

  1. Set up a linux virtual machine with GDK (some of the tools only work in linux)
  2. Write a script to create a big hierarchy of groups and projects (TODO: We could have used FactoryBot and with_hierarchy for this).
  3. Get an initial benchmark with benchmark-ips
  4. Start looking for culprits
  5. Try to fix any suspicious culprits
  6. Always Be Benchmarking
  7. Profit.

Benchmarking Challenges

It's difficult to benchmark a long-running request in a reliable, repeatable way, especially in a local environment, and especially if the dev machine has thermal throttling issues. See more details in this thread: !71246 (comment 690466371)

RubyProf

Note on profiling in dev

In the development environment, there are several things enabled which confound the results. For example, the Bullet N+1 query detector accounts for a significant percentage of the request time. This must be explicitly disabled (with a code hack, there's no ENV var to override in dev) in order to eliminate this from the request profiling.

RubyProf Results

RubyProf Run 01

Setup:

  1. Local dev environment
  2. Bullet disabled by forcing lib/gitlab/bullet.rb#configure_bullet? to return false
  3. A single request against a group hierarchy 3 deep, 3 wide, with 2 projects per group, with the following HTTP request (RubyMine HTTP Request Format):
GET http://gdk.test:3000/api/v4/groups/test6/projects?include_subgroups=true
Accept: application/json
Content-Type: application/json
PRIVATE-TOKEN: XXX

log/api_json.log entry:

{"time":"2021-10-01T22:29:38.673Z","severity":"INFO","duration_s":5.61192,"db_duration_s":0.49877,"view_duration_s":5.11315,"status":200,"method":"GET","path":"/api/v4/groups/test6/projects","params":[{"key":"include_subgroups","value":"true"}],"host":"gdk.test","remote_ip":"127.0.0.1","ua":"Apache-HttpClient/4.5.13 (Java/11.0.12)","route":"/api/:version/groups/:id/projects","user_id":1,"username":"root","queue_duration_s":1.921156,"gitaly_calls":20,"gitaly_duration_s":0.906728,"redis_calls":24,"redis_duration_s":0.053672000000000004,"redis_read_bytes":11130,"redis_write_bytes":15512,"redis_cache_calls":23,"redis_cache_duration_s":0.051346,"redis_cache_read_bytes":11130,"redis_cache_write_bytes":15464,"redis_shared_state_calls":1,"redis_shared_state_duration_s":0.002326,"redis_shared_state_write_bytes":48,"db_count":224,"db_write_count":0,"db_cached_count":114,"db_replica_count":0,"db_replica_cached_count":0,"db_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_count":224,"db_primary_cached_count":114,"db_primary_wal_count":0,"db_primary_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":1.164,"cpu_s":5.766931,"pid":79417,"correlation_id":"01FGYZM4C1KG7RCH9CQ1SKZVA5","meta.user":"root","meta.caller_id":"GET /api/:version/groups/:id/projects","meta.remote_ip":"127.0.0.1","meta.feature_category":"subgroups","meta.client_id":"user/1"}

Profile HTML file: callstack-01.html

Edited by Chad Woolley

Merge request reports