
Creating project times out when group has many members

Summary

Creating projects in a group with many members times out.

Steps to reproduce

  • Have a group with many members. We see this problem frequently in https://gitlab.com/gitlab-community which has (at the time of issue creation) 3975 members.
  • Create a project in that group.

Example Group

https://gitlab.com/gitlab-community

What is the current bug behavior?

The request takes a very long time or times out. If it times out, follow-up actions such as repository creation or repository import are not triggered.

What is the expected correct behavior?

The request completes normally.

Relevant logs and/or screenshots

Example correlation ID: 01JNS3SZJY4MBAAQYMPWFVXV2B


If someone wants the JSON file from the performance bar, let me know. It is about 70 MB, which is too big to attach here or paste into a private snippet. Provided in internal note #523919 (comment 2395846189)

Output of checks

This bug happens on GitLab.com

Results of GitLab environment info

Expand for output related to GitLab environment info

(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:env:info`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)

Results of GitLab application Check

Expand for output related to the GitLab application check

(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:check SANITIZE=true`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`)

(we will only investigate if the tests are passing)

Possible fixes

Judging by the performance bar output, the deduplication of `AuthorizedProjectUpdate::UserRefreshFromReplicaWorker` seems to be the problem. The deduplication checks the current WAL location for every single job that gets enqueued, so the time spent inside the request grows with the number of jobs.
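To illustrate why this scales badly, here is a toy model in Python, not GitLab code. It assumes one refresh job is enqueued per group member and that each enqueue triggers exactly one WAL location lookup. `FakeReplica`, `enqueue_with_dedup`, and `create_project_in_request` are hypothetical names made up for this sketch.

```python
class FakeReplica:
    """Stand-in for a database connection; it only counts WAL lookups."""

    def __init__(self):
        self.wal_lookups = 0

    def current_wal_location(self):
        self.wal_lookups += 1
        return "wal-placeholder"  # placeholder, not a real WAL location


def enqueue_with_dedup(queue, db, user_id):
    # Assumed dedup bookkeeping: one WAL lookup for every enqueued job.
    wal = db.current_wal_location()
    queue.append({"user_id": user_id, "wal": wal})


def create_project_in_request(member_ids):
    """Model of the request path: enqueue one job per member, inline."""
    db = FakeReplica()
    queue = []
    for user_id in member_ids:
        enqueue_with_dedup(queue, db, user_id)
    return db.wal_lookups, len(queue)


# 3975 is the member count quoted for the example group.
lookups, jobs = create_project_in_request(range(3975))
print(lookups, jobs)
```

Under these assumptions, the number of lookups performed during the request equals the number of members, which would be consistent with the timeouts seen in large groups.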

The immediate problem could be fixed by moving the enqueuing of that worker into a separate worker, which takes it out of the request path. Since the worker is already enqueued with a sizable delay, I don't see a problem if it runs slightly later because it is scheduled through another worker.
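A rough sketch of that idea, again as a Python toy model rather than GitLab code: the request enqueues a single wrapper job, and the per-member fan-out (with its WAL lookups) happens when that wrapper job runs in the background. All names here (`WalCounter`, `handle_request`, `run_wrapper_job`, the job labels, group ID 42) are hypothetical, and the model assumes that enqueuing the wrapper job itself needs no WAL lookup.

```python
from collections import deque


class WalCounter:
    """Stand-in for the database; counts WAL location lookups."""

    def __init__(self):
        self.lookups = 0

    def current_wal_location(self):
        self.lookups += 1
        return "wal-placeholder"  # placeholder, not a real WAL location


def enqueue_refresh(queue, db, user_id):
    # Assumed dedup bookkeeping: one WAL lookup per enqueued refresh job.
    queue.append(("user_refresh", user_id, db.current_wal_location()))


def handle_request(queue, group_id):
    # Proposed request path: a single enqueue, no per-member work.
    queue.append(("enqueue_user_refreshes", group_id))


def run_wrapper_job(queue, db, members_by_group):
    # Background path: the wrapper job performs the per-member fan-out.
    kind, group_id = queue.popleft()
    if kind != "enqueue_user_refreshes":
        raise ValueError(f"unexpected job kind: {kind}")
    for user_id in members_by_group[group_id]:
        enqueue_refresh(queue, db, user_id)


db = WalCounter()
queue = deque()
members_by_group = {42: list(range(3975))}  # hypothetical group ID

handle_request(queue, 42)
lookups_during_request = db.lookups  # nothing looked up in the request

run_wrapper_job(queue, db, members_by_group)
lookups_in_background = db.lookups  # all lookups happen in the wrapper job

print(lookups_during_request, lookups_in_background, len(queue))
```

The total work is unchanged in this model; it only moves out of the request, so the per-job deduplication cost itself would still need a separate fix.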

Edited by Niklas van Schrick