Remove /api/:version/jobs/:id/trace dependence on shared NFS infrastructure

Yesterday's ~P1 ~S1 incident, production#1419, involved the API endpoint /api/:version/jobs/:id/trace, which is already a problematic endpoint on GitLab.com (see gitlab-org/gitlab#33658 (closed), https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8096#note_227918740, #39 (closed), https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8076, https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4667#note_198108674 for more colour on this).

What we learned yesterday was that this endpoint is highly sensitive to NFS latency. Project exports saturated the NFS fleet, which led to this endpoint running 5x slower than normal. Since the endpoint already dominates traffic to the API fleet, the slowdown quickly inundated the API unicorn workers, leading to major queuing and latency spikes.

The reason for this is that traces are stored on an NFS volume prior to being written to object storage.

There is work being done to move from NFS to Redis, but this has also stalled on problems: see https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4667#history for details.

Looking at NFS traffic on an API node, it seems to be dominated by traces. I found little evidence of other NFS usage from these nodes when using nfsslower:

andrewn@api-01-sv-gprd.c.gitlab-production.internal:~$ sudo /usr/share/bcc/tools/nfsslower 0
Tracing NFS operations... Ctrl-C to quit
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
08:11:21 bundle         8739   O 0       0           0.27 363608995.log
08:11:21 bundle         8739   W 154     26          0.94 363608995.log
08:11:21 bundle         24124  O 0       0           0.30 363605534.log
08:11:21 bundle         24124  W 9       44          0.01 363605534.log
08:11:21 bundle         16976  O 0       0           0.19 363611080.log
08:11:21 bundle         16976  W 207     0           2.20 363611080.log
08:11:21 bundle         12008  O 0       0           0.23 363586851.log
08:11:21 bundle         12008  W 88      7           0.20 363586851.log
08:11:22 bundle         12008  O 0       0           0.23 363604755.log
08:11:22 bundle         12008  W 73      3           0.26 363604755.log
08:11:22 bundle         29795  O 0       0           0.26 363608595.log
08:11:22 bundle         29795  W 1093    99          0.31 363608595.log
08:11:22 bundle         7263   O 0       0           0.28 363594705.log
08:11:22 bundle         7263   W 1706    150         1.02 363594705.log
08:11:22 bundle         9393   O 0       0           0.22 363594804.log
08:11:22 bundle         9393   W 514     106         0.28 363594804.log
08:11:22 bundle         24124  O 0       0           0.27 363607733.log
08:11:22 bundle         24124  W 217     817         0.25 363607733.log
08:11:22 bundle         4395   O 0       0           0.29 363595694.log
08:11:22 bundle         4395   W 368     92          0.25 363595694.log
08:11:22 bundle         7219   O 0       0           0.26 363596644.log
08:11:22 bundle         7219   W 143     110         0.01 363596644.log
08:11:22 bundle         8739   O 0       0           0.26 363609792.log
08:11:22 bundle         8739   W 42      697         0.25 363609792.log
08:11:22 bundle         12008  O 0       0           0.22 363610772.log
08:11:22 bundle         12008  W 181     14          0.26 363610772.log
08:11:23 bundle         16976  O 0       0           0.26 363596483.log
08:11:23 bundle         16976  W 379     123         0.32 363596483.log
08:11:23 bundle         27727  O 0       0           0.26 363607832.log
08:11:23 bundle         27727  W 215     0           0.27 363607832.log
08:11:23 bundle         18935  O 0       0           0.24 363594717.log
08:11:23 bundle         18935  W 294     199         0.28 363594717.log
08:11:23 bundle         13301  O 0       0           0.22 363609564.log
08:11:23 bundle         13301  W 6192    353         0.31 363609564.log
08:11:23 bundle         18935  O 0       0           0.19 363603165.log
08:11:23 bundle         18935  W 1203    333         0.32 363603165.log
08:11:23 bundle         23273  O 0       0           0.20 363609274.log
08:11:23 bundle         23273  W 1446    79          1.17 363609274.log
08:11:23 bundle         9722   O 0       0           0.23 363611067.log
08:11:23 bundle         9722   W 181     3           0.01 363611067.log
08:11:23 bundle         7219   O 0       0           0.17 363588761.log
08:11:23 bundle         7219   W 148     40          1.78 363588761.log
08:11:23 bundle         25640  O 0       0           0.25 363592164.log
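
To put a number on "dominated by traces", a capture like the one above could be summarised with a short script. This is only a sketch: it assumes the eight-column nfsslower layout shown here, and it assumes that any file named `<digits>.log` is a job trace (the `TRACE_FILE` pattern and the `summarize` helper are illustrative names, not part of bcc or GitLab).

```python
import re
from collections import defaultdict

# Assumption: job trace files look like "<numeric job id>.log", as in the capture.
TRACE_FILE = re.compile(r"^\d+\.log$")

# A few rows copied from the capture, including the banner and header lines.
SAMPLE = """\
Tracing NFS operations... Ctrl-C to quit
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
08:11:21 bundle         8739   O 0       0           0.27 363608995.log
08:11:21 bundle         8739   W 154     26          0.94 363608995.log
08:11:21 bundle         16976  W 207     0           2.20 363611080.log
"""


def summarize(lines):
    """Group nfsslower rows into 'trace' vs 'other' and total ops, bytes, max latency."""
    stats = defaultdict(lambda: {"ops": 0, "bytes": 0, "max_lat_ms": 0.0})
    for line in lines:
        fields = line.strip().split(None, 7)
        # Skip the banner, the header row, and any truncated row.
        if len(fields) != 8 or not fields[2].isdigit():
            continue
        _time, _comm, _pid, _op, nbytes, _off_kb, lat_ms, filename = fields
        key = "trace" if TRACE_FILE.match(filename) else "other"
        stats[key]["ops"] += 1
        stats[key]["bytes"] += int(nbytes)
        stats[key]["max_lat_ms"] = max(stats[key]["max_lat_ms"], float(lat_ms))
    return dict(stats)


if __name__ == "__main__":
    for kind, s in summarize(SAMPLE.splitlines()).items():
        print(kind, s)
```

Feeding it a longer capture (for example, nfsslower output redirected to a file) would show what share of NFS operations on an API node are not traces, which is the traffic that would remain on the shared NFS fleet if traces were moved elsewhere.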

Proposal

As discussed with @jarv, one stopgap could be to segment /api/:version/jobs/:id/trace into its own isolated fleet, possibly with its own isolated NFS share (just for this purpose) while we wait for other solutions.
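
As a rough illustration of the segmentation, the routing decision boils down to a path match in front of the API fleet. The sketch below is not how the load balancer is configured; the pool names `api_trace` and `api_default` are hypothetical, and it assumes `:version` takes the form `v<number>` and `:id` is numeric.

```python
import re

# Assumption: :version looks like "v4" and :id is a numeric job id.
TRACE_PATH = re.compile(r"^/api/v\d+/jobs/\d+/trace/?$")


def choose_pool(path: str) -> str:
    """Return the (hypothetical) backend pool name for a request path."""
    # Ignore any query string before matching.
    path_only = path.split("?", 1)[0]
    return "api_trace" if TRACE_PATH.match(path_only) else "api_default"


if __name__ == "__main__":
    for p in ("/api/v4/jobs/363608995/trace", "/api/v4/jobs/363608995"):
        print(p, "->", choose_pool(p))
```

Matching on the exact path keeps every other API endpoint on the existing fleet, so NFS latency on trace requests could only exhaust workers in the isolated pool.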

cc @marin