Remove /api/:version/jobs/:id/trace dependence on shared NFS infrastructure
Yesterday's ~P1 ~S1 incident, production#1419, involved the API endpoint `/api/:version/jobs/:id/trace`, which is already a problematic endpoint on GitLab.com (see gitlab-org/gitlab#33658 (closed), https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8096#note_227918740, #39 (closed), https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8076, and https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4667#note_198108674 for more colour on this).
What we learned yesterday was that this endpoint is highly sensitive to NFS latency. Project exports saturated the NFS fleet, which led to this endpoint running 5x slower than normal. Since it already dominates traffic to the API fleet, this quickly inundated the API Unicorn workers, leading to major queuing and latency spikes.
The reason for this is that traces are stored on an NFS volume prior to being written to object storage.
There is work underway to move trace storage from NFS to Redis, but that effort has also stalled on problems: see https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4667#history for details.
Looking at NFS traffic on an API node with nfsslower, the traffic appears to be dominated by trace writes; I found little evidence of other NFS usage from these nodes:
```
andrewn@api-01-sv-gprd.c.gitlab-production.internal:~$ sudo /usr/share/bcc/tools/nfsslower 0
Tracing NFS operations... Ctrl-C to quit
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
08:11:21 bundle 8739 O 0 0 0.27 363608995.log
08:11:21 bundle 8739 W 154 26 0.94 363608995.log
08:11:21 bundle 24124 O 0 0 0.30 363605534.log
08:11:21 bundle 24124 W 9 44 0.01 363605534.log
08:11:21 bundle 16976 O 0 0 0.19 363611080.log
08:11:21 bundle 16976 W 207 0 2.20 363611080.log
08:11:21 bundle 12008 O 0 0 0.23 363586851.log
08:11:21 bundle 12008 W 88 7 0.20 363586851.log
08:11:22 bundle 12008 O 0 0 0.23 363604755.log
08:11:22 bundle 12008 W 73 3 0.26 363604755.log
08:11:22 bundle 29795 O 0 0 0.26 363608595.log
08:11:22 bundle 29795 W 1093 99 0.31 363608595.log
08:11:22 bundle 7263 O 0 0 0.28 363594705.log
08:11:22 bundle 7263 W 1706 150 1.02 363594705.log
08:11:22 bundle 9393 O 0 0 0.22 363594804.log
08:11:22 bundle 9393 W 514 106 0.28 363594804.log
08:11:22 bundle 24124 O 0 0 0.27 363607733.log
08:11:22 bundle 24124 W 217 817 0.25 363607733.log
08:11:22 bundle 4395 O 0 0 0.29 363595694.log
08:11:22 bundle 4395 W 368 92 0.25 363595694.log
08:11:22 bundle 7219 O 0 0 0.26 363596644.log
08:11:22 bundle 7219 W 143 110 0.01 363596644.log
08:11:22 bundle 8739 O 0 0 0.26 363609792.log
08:11:22 bundle 8739 W 42 697 0.25 363609792.log
08:11:22 bundle 12008 O 0 0 0.22 363610772.log
08:11:22 bundle 12008 W 181 14 0.26 363610772.log
08:11:23 bundle 16976 O 0 0 0.26 363596483.log
08:11:23 bundle 16976 W 379 123 0.32 363596483.log
08:11:23 bundle 27727 O 0 0 0.26 363607832.log
08:11:23 bundle 27727 W 215 0 0.27 363607832.log
08:11:23 bundle 18935 O 0 0 0.24 363594717.log
08:11:23 bundle 18935 W 294 199 0.28 363594717.log
08:11:23 bundle 13301 O 0 0 0.22 363609564.log
08:11:23 bundle 13301 W 6192 353 0.31 363609564.log
08:11:23 bundle 18935 O 0 0 0.19 363603165.log
08:11:23 bundle 18935 W 1203 333 0.32 363603165.log
08:11:23 bundle 23273 O 0 0 0.20 363609274.log
08:11:23 bundle 23273 W 1446 79 1.17 363609274.log
08:11:23 bundle 9722 O 0 0 0.23 363611067.log
08:11:23 bundle 9722 W 181 3 0.01 363611067.log
08:11:23 bundle 7219 O 0 0 0.17 363588761.log
08:11:23 bundle 7219 W 148 40 1.78 363588761.log
08:11:23 bundle 25640 O 0 0 0.25 363592164.log
```
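As a rough way to quantify that dominance, a short nfsslower capture can be aggregated per file. The sketch below is illustrative only: the 60-second capture window, the sample path, and the assumption that the tool's two header lines land in the capture are mine, not part of the incident investigation.

```shell
# Capture ~60s of NFS operations on the node, then summarise per file.
# --signal=INT mirrors the Ctrl-C the tool expects so it exits cleanly.
sudo timeout --signal=INT 60 /usr/share/bcc/tools/nfsslower 0 > /tmp/nfsslower.sample

# Columns: TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME.
# Skip the two header lines, sum operations and bytes per filename, and
# list the busiest files (the numeric *.log files are job traces).
awk 'NR > 2 { ops[$NF]++; bytes[$NF] += $5 }
     END    { for (f in ops) printf "%8d ops %12d bytes  %s\n", ops[f], bytes[f], f }' \
    /tmp/nfsslower.sample | sort -rn | head -20
```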
## Proposal
As discussed with @jarv, one stopgap could be to segment `/api/:version/jobs/:id/trace` onto its own isolated fleet, possibly with its own dedicated NFS share (used only for this purpose), while we wait for other solutions.
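To make the idea concrete, the fragment below sketches what the split could look like at the load balancer. All names, ports, and the path regex are hypothetical and not taken from the real gitlab.com HAProxy configuration; it only illustrates routing trace requests to a dedicated pool so that slow NFS writes cannot exhaust the general API workers.

```
# Hypothetical HAProxy fragment (illustrative names/ports only).
frontend api_https
  # Match /api/v4/jobs/:id/trace (regex is an assumption).
  acl is_job_trace path_reg ^/api/v[0-9]+/jobs/[0-9]+/trace$
  use_backend api_trace if is_job_trace
  default_backend api

backend api_trace
  # Isolated workers (and, if desired, an isolated NFS share) just for traces.
  server api-trace-01 api-trace-01.internal:8080 check
  server api-trace-02 api-trace-02.internal:8080 check

backend api
  server api-01 api-01.internal:8080 check
```

Whether the split happens at HAProxy, at another routing layer, or via a separate hostname is an implementation detail; the point is that trace writes get their own saturation domain instead of sharing workers with the rest of the API.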
cc @marin