Clear up stale environments

Overview

We should automatically clear up stale environments due to the performance impact they have on busy repositories. This is a major contributor to slowness on gitlab-com/www-gitlab-com and gitlab-org/gitlab.

The Problem

As discovered in #222247 (closed), environments are persisted as refs in the pool repository. Certain git commands tend to read every single ref, so with a few thousand stale environment refs in the repository a lot of commands slow down dramatically.
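To get a rough feel for how many refs sit under each namespace, something like the following could be run against a copy of a repository. This is only a sketch: `count_loose_refs` and its `depth` parameter are names made up for illustration, and it counts loose ref files on disk only, so anything already in `packed-refs` is not included.

```python
import os
import sys
from collections import Counter


def count_loose_refs(git_dir, depth=4):
    """Count loose ref files under <git_dir>/refs.

    Counts are grouped by the first `depth` path components, so with the
    default a file such as
    refs/remotes/origin/environments/production/deployments/1
    is counted under refs/remotes/origin/environments.
    Packed refs (the packed-refs file) are NOT counted.
    """
    refs_root = os.path.join(git_dir, "refs")
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(refs_root):
        if not filenames:
            continue
        # Path of this directory relative to the git dir, e.g. refs/remotes/origin/heads
        rel = os.path.relpath(dirpath, git_dir)
        key = "/".join(rel.split(os.sep)[:depth])
        counts[key] += len(filenames)
    return counts


if __name__ == "__main__":
    # Usage: python count_refs.py /path/to/repository.git
    for namespace, total in count_loose_refs(sys.argv[1]).most_common(10):
        print(f"{total:>8}  {namespace}")
```

On a repository with the problem described here, the environments namespace would be expected to dominate the output.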

Below is a summary of the directories accessed when running git log on a copy of the www-gitlab-com pool repository from a couple of months back. @stanhu grabbed this copy from production and we loaded it onto a test server. We then gathered a trace of git log using the following command:

strace -fttTyyy -s 1024 -o pool.trace git -C @pools/6b/86/6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b.git log

Directories accessed

Summarised using our super-cool strace-parser:

❯ strace-parser ../traces/pool.trace directories -s duration | head -n 50

Directories accessed for files

      pid      dur (ms)      first time          last time          open ct    directory name
  -------    ----------    ---------------    ---------------    ----------    --------------
   439489     14572.087    16:38:19.930060    17:24:36.367978        531277    .
   439489     14483.887    16:38:19.933964    17:24:20.892514        530466    ./refs
   439489     13078.097    16:44:55.051814    17:24:20.849024        477260    ./refs/remotes
   439489     13078.075    16:44:55.052231    17:24:20.849024        477259    ./refs/remotes/origin
   439489     11191.357    16:45:53.286244    17:22:29.256390        404550    ./refs/remotes/origin/environments
   439489      2803.834    16:47:54.689394    17:11:09.703747         92216    ./refs/remotes/origin/environments/production
   439489      2803.811    16:47:54.690005    17:11:09.703747         92215    ./refs/remotes/origin/environments/production/deployments
   439489      1405.458    16:38:20.009563    16:44:55.038956         53198    ./refs/dangling
   439489      1340.257    17:22:29.828930    17:24:20.439701         51161    ./refs/remotes/origin/merge-requests
   439489       276.339    16:45:18.696713    16:45:44.403715         11315    ./refs/remotes/origin/heads
   439489       182.524    17:17:02.602512    17:17:26.665644          7051    ./refs/remotes/origin/environments/staging
   439489       182.465    17:17:02.602993    17:17:26.665644          7050    ./refs/remotes/origin/environments/staging/deployments
   439489        87.978    16:38:19.934628    17:24:36.367978           804    ./objects
   439489        86.909    16:45:44.526967    16:45:53.139417          3318    ./refs/remotes/origin/pipelines
   439489        70.908    16:38:20.002377    17:24:33.024496           679    ./objects/pack
   439489        19.224    17:13:20.292959    17:13:20.332002             8    ./refs/remotes/origin/environments/review-remove-mm-snx3af
   439489        18.696    17:15:46.827908    17:15:48.201167           779    ./refs/remotes/origin/environments/review-release-13-2ke04s
   439489        18.663    17:12:36.161023    17:12:36.248326            32    ./refs/remotes/origin/environments/review-fkurniadi-zivfx1
   439489        18.638    17:15:46.828488    17:15:48.201167           778    ./refs/remotes/origin/environments/review-release-13-2ke04s/deployments
   439489        17.665    17:22:01.950412    17:22:02.719233           468    ./refs/remotes/origin/environments/review-sm-invento-ouxvxi
   439489        17.627    17:22:01.950832    17:22:02.719233           467    ./refs/remotes/origin/environments/review-sm-invento-ouxvxi/deployments
   439489        16.678    17:14:22.689576    17:14:23.956135           721    ./refs/remotes/origin/environments/review-release-13-ne30vp
   439489        16.610    17:14:22.690432    17:14:23.956135           720    ./refs/remotes/origin/environments/review-release-13-ne30vp/deployments
   439489        13.668    17:13:55.642183    17:13:55.897390            62    ./refs/remotes/origin/environments/review-handbook-p-wc99xn
   439489        12.647    17:14:09.238043    17:14:10.194415           529    ./refs/remotes/origin/environments/review-dotcom-cat-2ch6ww
   439489        12.597    17:14:09.239200    17:14:10.194415           528    ./refs/remotes/origin/environments/review-dotcom-cat-2ch6ww/deployments
   439489        10.299    17:15:41.987349    17:15:42.583203           421    ./refs/remotes/origin/environments/review-span-of-in-lgio2x
   439489        10.219    17:15:41.988145    17:15:42.583203           420    ./refs/remotes/origin/environments/review-span-of-in-lgio2x/deployments
   439489         9.473    17:14:03.657118    17:14:04.215349           286    ./refs/remotes/origin/environments/review-julia-lake-v7g6k6
   439489         9.429    17:14:03.657647    17:14:04.215349           285    ./refs/remotes/origin/environments/review-julia-lake-v7g6k6/deployments
   439489         9.273    17:12:16.331148    17:12:17.020067           370    ./refs/remotes/origin/environments/review-whaber-oct-nurdra
   439489         9.250    17:12:16.331864    17:12:17.020067           369    ./refs/remotes/origin/environments/review-whaber-oct-nurdra/deployments
   439489         9.126    17:22:22.846190    17:22:23.380564           363    ./refs/remotes/origin/environments/review-5323-webpa-vqko7l
   439489         9.086    17:22:22.846694    17:22:23.380564           362    ./refs/remotes/origin/environments/review-5323-webpa-vqko7l/deployments
   439489         9.033    17:18:07.619131    17:18:07.724842            72    ./refs/remotes/origin/environments/review-mw-netherl-km32fg
   439489         8.983    17:18:07.619773    17:18:07.724842            71    ./refs/remotes/origin/environments/review-mw-netherl-km32fg/deployments
   439489         8.972    17:21:00.660323    17:21:00.741301            22    ./refs/remotes/origin/environments/review-davis-town-z4xj86
   439489         8.901    17:21:00.660998    17:21:00.741301            21    ./refs/remotes/origin/environments/review-davis-town-z4xj86/deployments
   439489         8.746    16:47:26.375465    16:47:26.552648           133    ./refs/remotes/origin/environments/review-eread-tw-r-15s0is
   439489         8.722    16:47:26.376159    16:47:26.552648           132    ./refs/remotes/origin/environments/review-eread-tw-r-15s0is/deployments
   439489         8.494    17:16:40.035936    17:16:40.697419           364    ./refs/remotes/origin/environments/review-jarv-year-dqr7k8
   439489         8.435    17:16:40.036592    17:16:40.697419           363    ./refs/remotes/origin/environments/review-jarv-year-dqr7k8/deployments
   439489         8.375    17:11:18.547638    17:11:19.224473           337    ./refs/remotes/origin/environments/review-marketing-iltmsc
   439489         8.354    17:11:18.567967    17:11:19.224473           336    ./refs/remotes/origin/environments/review-marketing-iltmsc/deployments
   439489         7.590    17:14:12.211669    17:14:12.841824           252    ./refs/remotes/origin/environments/review-patch-1982-f02a4t

What it shows

Of the overall duration of ~14.6 seconds for git log, ~11.2 seconds are spent reading files in the refs/remotes/origin/environments directory. Of that, ~2.8 seconds are spent in refs/remotes/origin/environments/production (basically all in refs/remotes/origin/environments/production/deployments) and the rest is predominantly in review-* environments.
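As a quick sanity check on those proportions, the durations can be taken straight from the table and turned into shares of the total. The variable names are just for illustration. The table's durations are nested (a parent directory's time includes its children), so the subtraction below assumes production and staging are the only non-review environments among the rows shown.

```python
# Durations in milliseconds, copied from the strace-parser table above.
TOTAL_MS = 14572.087         # "."
ENVIRONMENTS_MS = 11191.357  # ./refs/remotes/origin/environments
PRODUCTION_MS = 2803.834     # ./refs/remotes/origin/environments/production
STAGING_MS = 182.524         # ./refs/remotes/origin/environments/staging


def share(part_ms, total_ms):
    """Return part_ms as a percentage of total_ms."""
    return 100.0 * part_ms / total_ms


# Whatever is left under environments/ after production and staging is
# assumed to be review-* environments (an assumption based on the rows shown).
review_ms = ENVIRONMENTS_MS - PRODUCTION_MS - STAGING_MS

print(f"environments: {share(ENVIRONMENTS_MS, TOTAL_MS):.1f}% of total")
print(f"production:   {share(PRODUCTION_MS, TOTAL_MS):.1f}% of total")
print(f"review-*:     {share(review_ms, TOTAL_MS):.1f}% of total")
```

So roughly three quarters of the traced time goes to environment refs, and more than half of the total to review-* environments alone.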

If we look at the environment dashboard we can see some current numbers for www-gitlab-com:

[image: environments dashboard for www-gitlab-com]

The number of active review apps seems quite high to me relative to the number of open MRs (currently 1193), but the 10095 stopped environments are the major concern, and the vast majority are stopped review app environments.

Suggestions

I have a few suggestions but I'm not overly familiar with the background for the environments system.

  1. Turn off auto-creation of review apps on problematic repositories and clean up all the stale environments
  2. Prune "stopped" environments far more aggressively, or start doing it if we're not already doing so, either in general or specifically for review apps
  3. Make the pruning of "stopped" environments something configurable per project, if it isn't already
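To make suggestions 2 and 3 a bit more concrete, here is a minimal sketch of what a selection rule for prunable environments could look like. It is not how GitLab implements anything: `select_prunable`, the `max_age_days` setting and the `slug_prefix` check are all hypothetical, and the record fields (`slug`, `state`, `updated_at`) are assumed for illustration rather than taken from the actual environments model.

```python
from datetime import datetime, timedelta, timezone


def select_prunable(environments, now, max_age_days=30, slug_prefix="review-"):
    """Pick environments that look safe to prune.

    An environment is selected when it is stopped, its slug marks it as a
    review app, and it has not been updated for more than `max_age_days`.
    `max_age_days` stands in for a (hypothetical) per-project setting.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [
        env
        for env in environments
        if env["state"] == "stopped"
        and env["slug"].startswith(slug_prefix)
        and env["updated_at"] < cutoff
    ]


if __name__ == "__main__":
    now = datetime(2020, 8, 1, tzinfo=timezone.utc)
    # Made-up sample records, purely for illustration.
    sample = [
        {"slug": "review-old-abc123", "state": "stopped",
         "updated_at": now - timedelta(days=90)},
        {"slug": "review-new-def456", "state": "stopped",
         "updated_at": now - timedelta(days=2)},
        {"slug": "production", "state": "available",
         "updated_at": now - timedelta(days=400)},
    ]
    for env in select_prunable(sample, now):
        print(env["slug"])
```

Keeping the age threshold as a parameter is what would let a busy project prune far more aggressively than the default without changing behaviour for everyone else.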

I'm open to other ideas and options we can explore for this.

Edited by Robert May