Tracing: Support force-collecting telemetry traces for a user/project with a feature flag
After the recent improvements to Gitaly distributed tracing, I think it is in good shape now. I can start using it to debug problems in production (this one, for example). The collected traces are informative: they describe the full flow of an RPC.
In the future, after we integrate trace2, we can get an even deeper look into Git's internal processes. Of course, this comes with a cost: overhead. While a request is handled, the process creates span objects. Once the request is done, the library sends the traces to the collector. Enabling tracing for all RPC requests would be a giant burden for both the Gitaly process and the collector. At the moment, we sample requests at a 0.1% rate.
This rate makes debugging in production less effective. We depend entirely on luck, hoping that a troublesome request happens to be sampled. That's particularly true for less common, low-RPS requests. In this issue, I would like to propose force-collecting telemetry traces using a feature flag.
The idea is simple. Someone can turn on an operational feature flag for a user or project, for example: /chatops run feature set gitaly_dangerous_force_collect_traces true --user=qmnguyen0711
Afterward, all Gitaly traces for requests issued by the matched actors are collected. To prevent massive load if someone turns the flag on for a large audience, we can add a simple rate limiter before activating the span.
One possible approach is to implement a gRPC interceptor in Gitaly. This interceptor checks the feature flag, then enables the trace. However, we'll need to add support for this feature to labkit.
Outside the GitLab production environment, this feature is also helpful for debugging a customer problem remotely. We can guide a customer to enable this flag. After they send the traces to us, we can easily visualize the full flow on our side. That adds an ergonomic option alongside logs and pair-programming with customers.