Clear up stale environments
Overview
We should automatically clear up stale environments due to the performance impact they have on busy repositories. This is a major contributor to slowness on gitlab-com/www-gitlab-com and gitlab-org/gitlab.
The Problem
As discovered in #222247 (closed), environments are persisted as refs in the pool repository. Certain git commands tend to read every single ref, so with a few thousand stale environment refs in the repository a lot of commands slow down dramatically.
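To get a feel for how many refs each namespace contributes, a rough sketch like the following can help. It lists refs with git for-each-ref and groups them by their leading path components. The function names and the default depth of 4 components (which lands on names such as refs/remotes/origin/environments) are assumptions made for illustration, not part of any existing tooling.

```python
import subprocess
import sys
from collections import Counter


def ref_namespace(refname, depth=4):
    """Return the first `depth` path components of a ref name."""
    return "/".join(refname.split("/")[:depth])


def count_refs_by_namespace(refnames, depth=4):
    """Count refs per namespace, e.g. refs/remotes/origin/environments."""
    return Counter(ref_namespace(name, depth) for name in refnames)


def list_refs(repo_path):
    """List every ref name in the repository at `repo_path`."""
    result = subprocess.run(
        ["git", "-C", repo_path, "for-each-ref", "--format=%(refname)"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Usage: python count_refs.py /path/to/repo.git
    counts = count_refs_by_namespace(list_refs(sys.argv[1]))
    for namespace, count in counts.most_common(10):
        print(f"{count:8d}  {namespace}")
```

Running it against a pool repository should show whether the environments namespace dominates the ref count in the same way it dominates the trace.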
The summary below shows the directories accessed when running git log on a copy of the www-gitlab-com pool repository from a couple of months back. @stanhu grabbed the copy from production, we loaded it onto a test server, and then traced git log using the following command:
strace -fttTyyy -s 1024 -o pool.trace git -C @pools/6b/86/6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b.git log
Directories accessed
Summarised using our super-cool strace-parser:
❯ strace-parser ../traces/pool.trace directories -s duration | head -n 50
Directories accessed for files
pid dur (ms) first time last time open ct directory name
------- ---------- --------------- --------------- ---------- --------------
439489 14572.087 16:38:19.930060 17:24:36.367978 531277 .
439489 14483.887 16:38:19.933964 17:24:20.892514 530466 ./refs
439489 13078.097 16:44:55.051814 17:24:20.849024 477260 ./refs/remotes
439489 13078.075 16:44:55.052231 17:24:20.849024 477259 ./refs/remotes/origin
439489 11191.357 16:45:53.286244 17:22:29.256390 404550 ./refs/remotes/origin/environments
439489 2803.834 16:47:54.689394 17:11:09.703747 92216 ./refs/remotes/origin/environments/production
439489 2803.811 16:47:54.690005 17:11:09.703747 92215 ./refs/remotes/origin/environments/production/deployments
439489 1405.458 16:38:20.009563 16:44:55.038956 53198 ./refs/dangling
439489 1340.257 17:22:29.828930 17:24:20.439701 51161 ./refs/remotes/origin/merge-requests
439489 276.339 16:45:18.696713 16:45:44.403715 11315 ./refs/remotes/origin/heads
439489 182.524 17:17:02.602512 17:17:26.665644 7051 ./refs/remotes/origin/environments/staging
439489 182.465 17:17:02.602993 17:17:26.665644 7050 ./refs/remotes/origin/environments/staging/deployments
439489 87.978 16:38:19.934628 17:24:36.367978 804 ./objects
439489 86.909 16:45:44.526967 16:45:53.139417 3318 ./refs/remotes/origin/pipelines
439489 70.908 16:38:20.002377 17:24:33.024496 679 ./objects/pack
439489 19.224 17:13:20.292959 17:13:20.332002 8 ./refs/remotes/origin/environments/review-remove-mm-snx3af
439489 18.696 17:15:46.827908 17:15:48.201167 779 ./refs/remotes/origin/environments/review-release-13-2ke04s
439489 18.663 17:12:36.161023 17:12:36.248326 32 ./refs/remotes/origin/environments/review-fkurniadi-zivfx1
439489 18.638 17:15:46.828488 17:15:48.201167 778 ./refs/remotes/origin/environments/review-release-13-2ke04s/deployments
439489 17.665 17:22:01.950412 17:22:02.719233 468 ./refs/remotes/origin/environments/review-sm-invento-ouxvxi
439489 17.627 17:22:01.950832 17:22:02.719233 467 ./refs/remotes/origin/environments/review-sm-invento-ouxvxi/deployments
439489 16.678 17:14:22.689576 17:14:23.956135 721 ./refs/remotes/origin/environments/review-release-13-ne30vp
439489 16.610 17:14:22.690432 17:14:23.956135 720 ./refs/remotes/origin/environments/review-release-13-ne30vp/deployments
439489 13.668 17:13:55.642183 17:13:55.897390 62 ./refs/remotes/origin/environments/review-handbook-p-wc99xn
439489 12.647 17:14:09.238043 17:14:10.194415 529 ./refs/remotes/origin/environments/review-dotcom-cat-2ch6ww
439489 12.597 17:14:09.239200 17:14:10.194415 528 ./refs/remotes/origin/environments/review-dotcom-cat-2ch6ww/deployments
439489 10.299 17:15:41.987349 17:15:42.583203 421 ./refs/remotes/origin/environments/review-span-of-in-lgio2x
439489 10.219 17:15:41.988145 17:15:42.583203 420 ./refs/remotes/origin/environments/review-span-of-in-lgio2x/deployments
439489 9.473 17:14:03.657118 17:14:04.215349 286 ./refs/remotes/origin/environments/review-julia-lake-v7g6k6
439489 9.429 17:14:03.657647 17:14:04.215349 285 ./refs/remotes/origin/environments/review-julia-lake-v7g6k6/deployments
439489 9.273 17:12:16.331148 17:12:17.020067 370 ./refs/remotes/origin/environments/review-whaber-oct-nurdra
439489 9.250 17:12:16.331864 17:12:17.020067 369 ./refs/remotes/origin/environments/review-whaber-oct-nurdra/deployments
439489 9.126 17:22:22.846190 17:22:23.380564 363 ./refs/remotes/origin/environments/review-5323-webpa-vqko7l
439489 9.086 17:22:22.846694 17:22:23.380564 362 ./refs/remotes/origin/environments/review-5323-webpa-vqko7l/deployments
439489 9.033 17:18:07.619131 17:18:07.724842 72 ./refs/remotes/origin/environments/review-mw-netherl-km32fg
439489 8.983 17:18:07.619773 17:18:07.724842 71 ./refs/remotes/origin/environments/review-mw-netherl-km32fg/deployments
439489 8.972 17:21:00.660323 17:21:00.741301 22 ./refs/remotes/origin/environments/review-davis-town-z4xj86
439489 8.901 17:21:00.660998 17:21:00.741301 21 ./refs/remotes/origin/environments/review-davis-town-z4xj86/deployments
439489 8.746 16:47:26.375465 16:47:26.552648 133 ./refs/remotes/origin/environments/review-eread-tw-r-15s0is
439489 8.722 16:47:26.376159 16:47:26.552648 132 ./refs/remotes/origin/environments/review-eread-tw-r-15s0is/deployments
439489 8.494 17:16:40.035936 17:16:40.697419 364 ./refs/remotes/origin/environments/review-jarv-year-dqr7k8
439489 8.435 17:16:40.036592 17:16:40.697419 363 ./refs/remotes/origin/environments/review-jarv-year-dqr7k8/deployments
439489 8.375 17:11:18.547638 17:11:19.224473 337 ./refs/remotes/origin/environments/review-marketing-iltmsc
439489 8.354 17:11:18.567967 17:11:19.224473 336 ./refs/remotes/origin/environments/review-marketing-iltmsc/deployments
439489 7.590 17:14:12.211669 17:14:12.841824 252 ./refs/remotes/origin/environments/review-patch-1982-f02a4t
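For anyone without strace-parser to hand, a similar per-directory summary can be approximated with a short script. This is only a sketch of the idea and not how strace-parser itself works: it assumes trace lines of the form "pid timestamp open/openat(..., "path", ...) = result <duration>" as written by strace -f -tt -T -o, skips unfinished or resumed calls, and charges each call to every ancestor directory of the opened path, which matches the nested totals in the table above. The function name and regular expression are illustrative.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# pid, timestamp, open/openat call, first quoted path, trailing <seconds>.
OPEN_RE = re.compile(
    r'^(?P<pid>\d+)\s+\S+\s+open(?:at)?\('
    r'.*?"(?P<path>[^"]+)"'
    r'.*<(?P<dur>\d+\.\d+)>\s*$'
)


def aggregate_open_durations(lines):
    """Map directory -> [total duration in ms, number of open calls]."""
    totals = defaultdict(lambda: [0.0, 0])
    for line in lines:
        match = OPEN_RE.match(line)
        if match is None:
            continue  # not a completed open/openat call
        duration_ms = float(match.group("dur")) * 1000.0
        # Charge the call to the parent directory and all of its ancestors.
        for parent in PurePosixPath(match.group("path")).parents:
            entry = totals[str(parent)]
            entry[0] += duration_ms
            entry[1] += 1
    return totals


if __name__ == "__main__":
    with open("pool.trace") as trace:
        summary = aggregate_open_durations(trace)
    ranked = sorted(summary.items(), key=lambda item: item[1][0], reverse=True)
    for directory, (duration_ms, count) in ranked[:20]:
        print(f"{duration_ms:12.3f} {count:10d} {directory}")
```

Directory keys come out relative (for example refs/remotes rather than ./refs/remotes), so the output will not match the table character for character.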
What it shows
Of the overall duration of ~14.6 seconds for git log, ~11.2 seconds are spent reading files in the refs/remotes/origin/environments directory. ~2.8 seconds of that are spent in refs/remotes/origin/environments/production (basically all in refs/remotes/origin/environments/production/deployments) and the rest is predominantly in review-* environments.
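The proportions can be checked directly from the table. The sketch below takes four durations from the strace-parser output and works out what share of the total each accounts for. It assumes that a directory's duration includes its subdirectories, which is how the table appears to nest. The variable names are illustrative.

```python
# Durations in milliseconds, copied from the strace-parser table.
TOTAL_MS = 14572.087        # .
ENVIRONMENTS_MS = 11191.357  # ./refs/remotes/origin/environments
PRODUCTION_MS = 2803.834     # .../environments/production
STAGING_MS = 182.524         # .../environments/staging


def share(part_ms, whole_ms):
    """Return part as a percentage of whole."""
    return 100.0 * part_ms / whole_ms


# Whatever is left under environments after production and staging;
# this is mostly review-* environments (assuming nested totals).
remaining_ms = ENVIRONMENTS_MS - PRODUCTION_MS - STAGING_MS

if __name__ == "__main__":
    print(f"environments:   {share(ENVIRONMENTS_MS, TOTAL_MS):5.1f}% of total")
    print(f"production:     {share(PRODUCTION_MS, TOTAL_MS):5.1f}% of total")
    print(f"other env refs: {share(remaining_ms, TOTAL_MS):5.1f}% of total")
```

On these figures the environments namespace accounts for roughly three quarters of the time spent opening files, and more than half of the total falls outside production and staging.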
If we look at the environment dashboard we can see some current numbers for www-gitlab-com:
The number of active review apps seems quite high to me relative to the number of open MRs (currently 1193), but the 10095 stopped environments are the major concern, and the vast majority are stopped review app environments.
Suggestions
I have a few suggestions but I'm not overly familiar with the background for the environments system.
- Turn off auto-creation of review apps on problematic repositories and clean up all the stale environments
- Prune "stopped" environments far more aggressively, or start doing it if we're not already doing so, either in general or specifically for review apps
- Make the pruning of "stopped" environments something configurable per project, if it isn't already
I'm open to other ideas and options we can explore for this.
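As a starting point for discussion, the pruning rule in the second and third suggestions could look something like the sketch below. Everything in it is hypothetical: the Environment record, the select_prunable function, the 30-day default and the review- name prefix are assumptions for illustration and do not reflect how environments are actually modelled in GitLab. The max_age argument stands in for a per-project setting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Environment:
    """Hypothetical minimal view of an environment record."""
    name: str
    state: str            # "available" or "stopped"
    updated_at: datetime


def select_prunable(envs, now, max_age=timedelta(days=30), name_prefix="review-"):
    """Return stopped environments matching the prefix and older than max_age."""
    cutoff = now - max_age
    return [
        env for env in envs
        if env.state == "stopped"
        and env.name.startswith(name_prefix)
        and env.updated_at < cutoff
    ]


if __name__ == "__main__":
    now = datetime(2020, 9, 1, tzinfo=timezone.utc)
    envs = [
        Environment("review-old-branch", "stopped", now - timedelta(days=90)),
        Environment("review-fresh-branch", "stopped", now - timedelta(days=2)),
        Environment("review-active-branch", "available", now - timedelta(days=90)),
        Environment("production", "stopped", now - timedelta(days=400)),
    ]
    for env in select_prunable(envs, now):
        print(env.name)
```

Keeping the selection separate from the deletion makes it straightforward to do a dry run first and review what would be removed.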